This README accompanies the paper 'Multi-domain rule-based phenotyping algorithms enable improved GWAS signal'. The contribution of the paper is to assess the impact of various rule-based phenotyping algorithms on GWAS outcomes, examining factors such as statistical power, heritability, replicability, functional annotations, and polygenic risk score prediction accuracy across seven polygenic diseases in the UK Biobank.
We initially define each phenotyping algorithm and disease by a defining code: the UKBB field for ADO algorithms, the OHDSI Phenotype Library code for OHDSI algorithms, the Phecode for Phecode algorithms, and the condition code for 2+ condition algorithms. All of these can be found below and in data/input_dict.json. These codes are then mapped to ids compatible with the cohort table in our PostgreSQL database. For clarity, all defining codes and Postgres ids are listed below:
| Disease | Phenotype | Defining code | Postgres id |
|---|---|---|---|
| Alzheimers | ADO | 42020-0.0 | 42020 |
| Alzheimers | OHDSI | 255 | 255 |
| Alzheimers | Phecode | 290.11 | 29011222 |
| Alzheimers | 2+ condition | 378419 | 378419111 |
| Asthma | ADO | 42014-0.0 | 42014 |
| Asthma | OHDSI | 27 | 27 |
| Asthma | Phecode | 495 | 495222 |
| Asthma | 2+ condition | 317009 | 317009111 |
| COPD | ADO | 42016-0.0 | 42016 |
| COPD | OHDSI | 28 | 28 |
| COPD | 2+ condition | 255573 | 255573111 |
| MI | ADO | 42000-0.0 | 42000 |
| MI | OHDSI | 71 | 71 |
| MI | Phecode | 411.2 | 4112222 |
| MI | 2+ condition | 4329847 | 4329847111 |
| RA | OHDSI | 196 | 196 |
| RA | Phecode | 714.1 | 7141222 |
| RA | 2+ condition | 80809 | 80809111 |
| SLE | OHDSI | 119 | 119 |
| SLE | Phecode | 695.42 | 69542222 |
| SLE | 2+ condition | 257628 | 257628111 |
| T2D | OHDSI | 288 | 288 |
| T2D | Phecode | 250.2 | 2502222 |
| T2D | 2+ condition | 201826 | 201826111 |
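The mapping from defining code to Postgres id appears to follow a simple pattern in the table above: ADO drops the `-0.0` instance suffix, OHDSI codes are unchanged, Phecodes lose the decimal point and gain a `222` suffix, and 2+ condition codes gain a `111` suffix. The following is a minimal sketch of that pattern as inferred from the table only. The function name `to_postgres_id` is hypothetical and is not taken from the repository, whose actual mapping code may differ.

```python
def to_postgres_id(algorithm: str, defining_code: str) -> int:
    """Reproduce the id pattern observed in the table (an inferred rule, not the repo's code)."""
    if algorithm == "ADO":
        # UKBB field such as "42020-0.0": keep only the field number.
        return int(defining_code.split("-")[0])
    if algorithm == "OHDSI":
        # OHDSI Phenotype Library codes are used unchanged.
        return int(defining_code)
    if algorithm == "Phecode":
        # Drop the decimal point, then append the suffix "222".
        return int(defining_code.replace(".", "") + "222")
    if algorithm == "2+ condition":
        # Append the suffix "111" to the condition code.
        return int(defining_code + "111")
    raise ValueError(f"unknown phenotyping algorithm: {algorithm!r}")


# Rows taken from the table above.
print(to_postgres_id("Phecode", "290.11"))       # Alzheimers, Phecode
print(to_postgres_id("2+ condition", "378419"))  # Alzheimers, 2+ condition
```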
The main branch contains the original version of the code used prior to journal revisions, while the revisions branch reflects the most up-to-date version. The repository is organized into three folders:
Contains code used in the project, including:
- SetupGWASScore.py: performs preliminary setup tasks such as genotype quality control and phenotyping
- GWASScore.py: runs PLINK GWAS and generates most of the evaluation metrics used in the paper. Supports multiple modes:
  - Mode 0: Run PLINK GWAS and compute evaluation metrics
  - Mode 1: Run PLINK GWAS only
  - Mode 2: Compute metrics only (for results from PLINK GWAS)
  - Mode 3: Compute metrics only (for results from SAIGE GWAS)
- RunGWASScore.py: deploys GWASScore.py for every GWAS (i.e. for each phenotyping algorithm and disease)
- RunPhevaluator.py: Runs OHDSI PheValuator tool for all phenotyping algorithms
- Analysis.py: calculates additional evaluation metrics (e.g. genetic correlation, GWAS-eQTL colocalization) and generates the graphs and tables used in the paper
- test_gwas.py: testing suite
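The four GWASScore.py modes listed above can be summarized as a small lookup table. This is an illustrative sketch only: `ModePlan`, `GWAS_SCORE_MODES`, and `plan_for_mode` are hypothetical names, and the script's real argument handling is not shown in this README.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModePlan:
    run_plink_gwas: bool           # whether the mode runs the PLINK GWAS
    compute_metrics: bool          # whether the mode computes evaluation metrics
    metrics_source: Optional[str]  # which GWAS results the metrics are computed from


# One entry per mode, following the descriptions in the list above.
GWAS_SCORE_MODES = {
    0: ModePlan(run_plink_gwas=True, compute_metrics=True, metrics_source="PLINK"),
    1: ModePlan(run_plink_gwas=True, compute_metrics=False, metrics_source=None),
    2: ModePlan(run_plink_gwas=False, compute_metrics=True, metrics_source="PLINK"),
    3: ModePlan(run_plink_gwas=False, compute_metrics=True, metrics_source="SAIGE"),
}


def plan_for_mode(mode: int) -> ModePlan:
    """Return the plan for a mode, rejecting values outside 0-3."""
    if mode not in GWAS_SCORE_MODES:
        raise ValueError(f"unknown mode {mode}; expected one of {sorted(GWAS_SCORE_MODES)}")
    return GWAS_SCORE_MODES[mode]
```

Modes 2 and 3 differ only in the source of the GWAS results, which is why mode 3 is the one used once SAIGE results are available.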
Subdirectories:
- SupportingScripts/: scripts used for setup and evaluation metrics. An important stand-alone script is:
  - PRSMetrics.py: generates GWAS and PRS from 5-fold cross validation and trains and tests overall logistic regression (LR) model
    - Mode 0: Run PLINK 5-fold GWAS and LR model evaluation
    - Mode 1: Run PLINK 5-fold GWAS only
    - Mode 2: LR model evaluation only (for results from PLINK GWAS)
    - Mode 3: LR model evaluation only (for results from SAIGE GWAS)
    - Mode 4: Run SAIGE 5-fold GWAS only
- ldsc/: cloned version of the LDSC (LD Score Regression) repository
Contains some of the data used in the project, including the codes defining the ADO algorithms, the PheValuator algorithms, and input_dict (which defines all diseases and algorithms under study). A large number of files (e.g. GTEx and ClinVar files) are not pushed due to size limits.
Contains additional code used in revisions, including:
- RunSAIGEAssoc.py: runs SAIGE GWAS
- rev_analysis.ipynb: generates additional supplementary tables (PLINK/SAIGE comparison and index event breakdown)
- scripts_to_rerun_sle_saige.txt: code used to redeploy PRSMetrics and GWASScore for SLE after switching to SAIGE results
- rev_analysis_tests.py: testing suite for additional code from revisions
- llm_t2d_code.py: code to generate T2D patients from LLM SQL
The original pipeline followed these major stages. Steps marked with [parallelizable] can be run concurrently.
- SetupGWASScore.py
- RunGWASScore.py [parallelizable]
- RunPheValuator.py [parallelizable]
- PRSMetrics.py [parallelizable]
- Analysis.py
The pipeline with revisions follows these major stages. Steps marked with [parallelizable] can be run concurrently.
- SetupGWASScore.py
- RunSAIGEAssoc.py
- RunGWASScore.py [parallelizable] (GWASScore in mode 0, then mode 3 for the SLE results from SAIGE)
- RunPheValuator.py [parallelizable]
- PRSMetrics.py [parallelizable] (mode 0, then mode 4 followed by mode 3 for the SLE results from SAIGE)
- Analysis.py
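The stage list above can be turned into an execution order by grouping adjacent [parallelizable] steps into one batch. The sketch below assumes one reading of the marker, namely that adjacent marked steps may run concurrently with each other. The README does not state whether this is the intended meaning or whether each marked step is instead parallel internally across GWAS. `REVISED_PIPELINE` and `to_batches` are hypothetical names, and no scripts are actually launched.

```python
from itertools import groupby

# (script, parallelizable) pairs, in the order given for the revised pipeline.
REVISED_PIPELINE = [
    ("SetupGWASScore.py", False),
    ("RunSAIGEAssoc.py", False),
    ("RunGWASScore.py", True),
    ("RunPheValuator.py", True),
    ("PRSMetrics.py", True),
    ("Analysis.py", False),
]


def to_batches(stages):
    """Group adjacent parallelizable stages; each sequential stage gets its own batch."""
    batches = []
    for parallel, group in groupby(stages, key=lambda stage: stage[1]):
        names = [name for name, _ in group]
        if parallel:
            batches.append(names)  # one batch whose members may run concurrently
        else:
            batches.extend([name] for name in names)  # one batch per sequential stage
    return batches


for step, batch in enumerate(to_batches(REVISED_PIPELINE), start=1):
    print(step, batch)
```

Batches run in order, and a batch starts only after the previous one has finished.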
Note: the following names are synonymous: OMOPADO and ADO; OHDSIPhenotypeLibrary and OHDSI; Rollup and 2+ condition.