G2Lab/phenotyping-gwas-eval

Overview

This README accompanies the paper 'Multi-domain rule-based phenotyping algorithms enable improved GWAS signal'. The paper assesses how different rule-based phenotyping algorithms affect GWAS outcomes, examining statistical power, heritability, replicability, functional annotations, and polygenic risk score (PRS) prediction accuracy across seven polygenic diseases in the UK Biobank (UKBB).

Phenotyping algorithms and diseases

Each phenotyping algorithm and disease pair is first identified by a defining code: the UKBB field for ADO algorithms, the OHDSI Phenotype Library code for OHDSI algorithms, the Phecode for Phecode algorithms, and the condition code for 2+ condition algorithms. These codes are listed below and in data/input_dict.json. Each defining code is then mapped to an id compatible with the cohort table in our PostgreSQL database. For clarity, both the defining codes and the Postgres ids are listed below:

| Disease | Phenotyping algorithm | Defining code | Postgres id |
|---|---|---|---|
| Alzheimers | ADO | 42020-0.0 | 42020 |
| Alzheimers | OHDSI | 255 | 255 |
| Alzheimers | Phecode | 290.11 | 29011222 |
| Alzheimers | 2+ condition | 378419 | 378419111 |
| Asthma | ADO | 42014-0.0 | 42014 |
| Asthma | OHDSI | 27 | 27 |
| Asthma | Phecode | 495 | 495222 |
| Asthma | 2+ condition | 317009 | 317009111 |
| COPD | ADO | 42016-0.0 | 42016 |
| COPD | OHDSI | 28 | 28 |
| COPD | 2+ condition | 255573 | 255573111 |
| MI | ADO | 42000-0.0 | 42000 |
| MI | OHDSI | 71 | 71 |
| MI | Phecode | 411.2 | 4112222 |
| MI | 2+ condition | 4329847 | 4329847111 |
| RA | OHDSI | 196 | 196 |
| RA | Phecode | 714.1 | 7141222 |
| RA | 2+ condition | 80809 | 80809111 |
| SLE | OHDSI | 119 | 119 |
| SLE | Phecode | 695.42 | 69542222 |
| SLE | 2+ condition | 257628 | 257628111 |
| T2D | OHDSI | 288 | 288 |
| T2D | Phecode | 250.2 | 2502222 |
| T2D | 2+ condition | 201826 | 201826111 |
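
The Postgres ids listed above appear to follow a simple pattern: ADO ids drop the instance suffix from the UKBB field, OHDSI ids are unchanged, Phecode ids drop the decimal point and append 222, and 2+ condition ids append 111. The sketch below encodes that pattern as inferred from the listed values only. The function name `to_postgres_id` is hypothetical and is not taken from the repository, which may implement the mapping differently.

```python
def to_postgres_id(algorithm: str, defining_code: str) -> int:
    """Map a defining code to a cohort-table id (pattern inferred from the table).

    Hypothetical helper for illustration; not the repository's own function.
    """
    if algorithm == "ADO":
        # UKBB field such as "42020-0.0": keep the field id before the suffix.
        return int(defining_code.split("-")[0])
    if algorithm == "OHDSI":
        # OHDSI Phenotype Library codes are used as-is.
        return int(defining_code)
    if algorithm == "Phecode":
        # Drop the decimal point, then append the suffix 222.
        return int(defining_code.replace(".", "") + "222")
    if algorithm == "2+ condition":
        # Append the suffix 111 to the condition code.
        return int(defining_code + "111")
    raise ValueError(f"Unknown phenotyping algorithm: {algorithm!r}")


if __name__ == "__main__":
    print(to_postgres_id("Phecode", "290.11"))
    print(to_postgres_id("2+ condition", "378419"))
```

The suffixes presumably keep Phecode and 2+ condition ids from colliding with ADO and OHDSI ids in the same cohort table, although the README does not state this.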

Project Structure

The main branch contains the original version of the code used prior to journal revisions, while the revisions branch reflects the most up-to-date version. The repository is organized into three folders:

code/

Contains the code used in the project, including:

  • SetupGWASScore.py: performs preliminary setup tasks such as genotype quality control and phenotyping
  • GWASScore.py: runs PLINK GWAS and generates most of the evaluation metrics used in the paper. Supports multiple modes:
    • Mode 0: Run PLINK GWAS and compute evaluation metrics
    • Mode 1: Run PLINK GWAS only
    • Mode 2: Compute metrics only (for results from PLINK GWAS)
    • Mode 3: Compute metrics only (for results from SAIGE GWAS)
  • RunGWASScore.py: deploys GWASScore.py for all GWAS (i.e. for each phenotyping algorithm and disease)
  • RunPhevaluator.py: runs the OHDSI PheValuator tool for all phenotyping algorithms
  • Analysis.py: calculates additional evaluation metrics (e.g. genetic correlation, GWAS-eQTL colocalization) and generates the graphs and tables used in the paper
  • test_gwas.py: testing suite
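
As a reading aid for the GWASScore.py modes listed above, the sketch below spells out which stage each mode number implies. It is a minimal illustration under stated assumptions: the enum, the `planned_steps` function, and the step labels are all hypothetical names, and the script's actual argument handling is not shown in this README.

```python
from enum import IntEnum


class GWASScoreMode(IntEnum):
    """Mode numbers as described for GWASScore.py (names are hypothetical)."""
    GWAS_AND_METRICS = 0    # run PLINK GWAS and compute evaluation metrics
    GWAS_ONLY = 1           # run PLINK GWAS only
    METRICS_FROM_PLINK = 2  # compute metrics from existing PLINK results
    METRICS_FROM_SAIGE = 3  # compute metrics from existing SAIGE results


def planned_steps(mode: int) -> list[str]:
    """Return illustrative step labels for a mode number."""
    mode = GWASScoreMode(mode)  # raises ValueError for an unknown mode
    steps = []
    if mode in (GWASScoreMode.GWAS_AND_METRICS, GWASScoreMode.GWAS_ONLY):
        steps.append("run_plink_gwas")
    if mode != GWASScoreMode.GWAS_ONLY:
        source = "saige" if mode == GWASScoreMode.METRICS_FROM_SAIGE else "plink"
        steps.append(f"compute_metrics[{source}]")
    return steps


if __name__ == "__main__":
    for m in GWASScoreMode:
        print(int(m), planned_steps(m))
```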

Subdirectories:

  • SupportingScripts/: scripts used for setup and evaluation metrics. An important stand-alone script is:

    • PRSMetrics.py: runs GWAS and computes PRS under 5-fold cross-validation, then trains and tests an overall logistic regression (LR) model
      • Mode 0: Run PLINK 5-fold GWAS and LR model evaluation
      • Mode 1: Run PLINK 5-fold GWAS only
      • Mode 2: LR model evaluation only (for results from PLINK GWAS)
      • Mode 3: LR model evaluation only (for results from SAIGE GWAS)
      • Mode 4: Run SAIGE 5-fold GWAS only
  • ldsc/: cloned version of the LDSC (LD Score Regression) repository
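
To make the 5-fold cross-validation used by PRSMetrics.py concrete, the sketch below splits a list of sample ids into five disjoint folds and yields a train/test partition for each one. It is a generic illustration only: the function name `five_fold_splits`, the seed, and the shuffling strategy are assumptions, and the repository's own splitting logic (for example any stratification by case status) may differ.

```python
import random
from typing import Iterable, Iterator


def five_fold_splits(
    sample_ids: Iterable[str], n_folds: int = 5, seed: int = 0
) -> Iterator[tuple[list[str], list[str]]]:
    """Yield (train_ids, test_ids) for each fold of a shuffled k-fold split.

    Hypothetical helper for illustration; not taken from PRSMetrics.py.
    """
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)  # fixed seed keeps the split reproducible
    # Deal the shuffled ids round-robin into n_folds disjoint folds.
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test_ids = folds[k]
        train_ids = [s for j, fold in enumerate(folds) if j != k for s in fold]
        yield train_ids, test_ids


if __name__ == "__main__":
    samples = [f"sample_{i}" for i in range(10)]
    for fold, (train, test) in enumerate(five_fold_splits(samples)):
        print(fold, len(train), len(test))
```

In a setup like this, each sample is held out exactly once, so a score computed on a held-out fold never comes from a GWAS that included that sample.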

data/

Contains some of the data used in the project, including the codes defining the ADO algorithms, the PheValuator algorithms, and input_dict (which defines all diseases and algorithms under study). Many files (e.g. GTEx and ClinVar files) are not pushed due to size limits.

revision/

Contains additional code used in revisions, including:

  • RunSAIGEAssoc.py: runs SAIGE GWAS
  • rev_analysis.ipynb: generates additional supplementary tables (PLINK/SAIGE comparison and index event breakdown)
  • scripts_to_rerun_sle_saige.txt: commands used to redeploy PRSMetrics and GWASScore for SLE after switching to SAIGE results
  • rev_analysis_tests.py: testing suite for additional code from revisions
  • llm_t2d_code.py: code to generate T2D patients from LLM-generated SQL

Original Pipeline

The original pipeline followed these major stages. Steps marked with [parallelizable] can be run concurrently.

  1. SetupGWASScore.py
  2. RunGWASScore.py [parallelizable]
  3. RunPheValuator.py [parallelizable]
  4. PRSMetrics.py [parallelizable]
  5. Analysis.py

Revised Pipeline Overview

The pipeline with revisions follows these major stages. Steps marked with [parallelizable] can be run concurrently.

  1. SetupGWASScore.py
  2. RunSAIGEAssoc.py
  3. RunGWASScore.py [parallelizable] (GWASScore with mode 0, then mode 3 for the SAIGE-based SLE results)
  4. RunPheValuator.py [parallelizable]
  5. PRSMetrics.py [parallelizable] (mode 0, then mode 4 followed by mode 3 for the SAIGE-based SLE results)
  6. Analysis.py
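
The ordering above can be written down as data. The sketch below lists the revised stages with their [parallelizable] flag and groups consecutive flagged stages into one batch. This is one possible reading and rests on a loud assumption: that the flagged stages are independent of each other. The README does not say whether the flag refers to running stages side by side or to fanning out jobs within a stage, so check each script's inputs before relying on it. The names `REVISED_PIPELINE` and `batches` are hypothetical.

```python
# (script name, marked [parallelizable] in the README)
REVISED_PIPELINE = [
    ("SetupGWASScore.py", False),
    ("RunSAIGEAssoc.py", False),
    ("RunGWASScore.py", True),
    ("RunPheValuator.py", True),
    ("PRSMetrics.py", True),
    ("Analysis.py", False),
]


def batches(stages: list[tuple[str, bool]]) -> list[list[str]]:
    """Group consecutive parallelizable stages into a single batch.

    Assumption: flagged stages do not depend on one another's outputs.
    """
    grouped: list[tuple[list[str], bool]] = []
    for name, parallel in stages:
        if parallel and grouped and grouped[-1][1]:
            grouped[-1][0].append(name)  # extend the current parallel batch
        else:
            grouped.append(([name], parallel))  # start a new batch
    return [names for names, _ in grouped]


if __name__ == "__main__":
    for i, batch in enumerate(batches(REVISED_PIPELINE), start=1):
        print(i, batch)
```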

Other Details

Note: the following names are used synonymously:

  • OMOPADO, ADO
  • OHDSIPhenotypeLibrary, OHDSI
  • Rollup, 2+ condition
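
When reading outputs that mix these names, it can help to normalize them to one label per algorithm. The sketch below maps each synonym to the shorter name used elsewhere in this README. The dictionary and the function `canonical_algorithm` are hypothetical helpers, not part of the repository.

```python
# Synonym -> label used in this README (hypothetical helper mapping).
ALGORITHM_ALIASES = {
    "OMOPADO": "ADO",
    "OHDSIPhenotypeLibrary": "OHDSI",
    "Rollup": "2+ condition",
}


def canonical_algorithm(name: str) -> str:
    """Return the README label for an algorithm name, leaving others unchanged."""
    return ALGORITHM_ALIASES.get(name, name)


if __name__ == "__main__":
    print(canonical_algorithm("Rollup"))
```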
