GitHub - G2Lab/phenotyping-gwas-eval

Overview

This README accompanies the paper 'Multi-domain rule-based phenotyping algorithms enable improved GWAS signal'. The paper assesses how different rule-based phenotyping algorithms affect GWAS outcomes, examining factors such as statistical power, heritability, replicability, functional annotations, and polygenic risk score prediction accuracy across seven polygenic diseases in the UK Biobank.

Phenotyping algorithms and diseases

Each phenotyping algorithm and disease under study is first identified by a defining code: the UKBB field for ADO algorithms, the OHDSI Phenotype Library code for OHDSI algorithms, the Phecode for Phecode algorithms, and the condition code for 2+ condition algorithms. All of these can be found below and in data/input_dict.json. The defining codes are then mapped to ids compatible with the cohort table in our PostgreSQL database. For clarity, all defining codes and Postgres ids are listed below:

| Disease | Phenotype | Defining code | Postgres id |
| --- | --- | --- | --- |
| Alzheimers | ADO | 42020-0.0 | 42020 |
| Alzheimers | OHDSI | 255 | 255 |
| Alzheimers | Phecode | 290.11 | 29011222 |
| Alzheimers | 2+ condition | 378419 | 378419111 |
| Asthma | ADO | 42014-0.0 | 42014 |
| Asthma | OHDSI | 27 | 27 |
| Asthma | Phecode | 495 | 495222 |
| Asthma | 2+ condition | 317009 | 317009111 |
| COPD | ADO | 42016-0.0 | 42016 |
| COPD | OHDSI | 28 | 28 |
| COPD | 2+ condition | 255573 | 255573111 |
| MI | ADO | 42000-0.0 | 42000 |
| MI | OHDSI | 71 | 71 |
| MI | Phecode | 411.2 | 4112222 |
| MI | 2+ condition | 4329847 | 4329847111 |
| RA | OHDSI | 196 | 196 |
| RA | Phecode | 714.1 | 7141222 |
| RA | 2+ condition | 80809 | 80809111 |
| SLE | OHDSI | 119 | 119 |
| SLE | Phecode | 695.42 | 69542222 |
| SLE | 2+ condition | 257628 | 257628111 |
| T2D | OHDSI | 288 | 288 |
| T2D | Phecode | 250.2 | 2502222 |
| T2D | 2+ condition | 201826 | 201826111 |
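
The rows above follow a regular pattern: ADO ids keep the UKBB field number before the dash, OHDSI ids are unchanged, Phecode ids drop the decimal point and gain the suffix 222, and 2+ condition ids gain the suffix 111. The following is a minimal sketch of that pattern as inferred from the table only. The function name `to_postgres_id` is hypothetical and this is not necessarily how the repository performs the mapping.

```python
def to_postgres_id(phenotype: str, defining_code: str) -> int:
    """Map a defining code to the Postgres id pattern seen in the table above.

    Hypothetical helper: the rules are inferred from the table rows,
    not taken from the repository's code.
    """
    if phenotype == "ADO":
        # UKBB field such as "42020-0.0": keep the field id before the dash.
        return int(defining_code.split("-")[0])
    if phenotype == "OHDSI":
        # OHDSI Phenotype Library ids appear unchanged.
        return int(defining_code)
    if phenotype == "Phecode":
        # Drop the decimal point, then append the suffix 222.
        return int(defining_code.replace(".", "") + "222")
    if phenotype == "2+ condition":
        # Append the suffix 111 to the condition code.
        return int(defining_code + "111")
    raise ValueError(f"unknown phenotype: {phenotype!r}")


print(to_postgres_id("Phecode", "290.11"))       # 29011222
print(to_postgres_id("2+ condition", "378419"))  # 378419111
```

The suffixes keep ids from different algorithm families distinct in a single cohort table, which is presumably why they were introduced.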

Project Structure

The main branch contains the original version of the code used prior to journal revisions, while the revisions branch reflects the most up-to-date version. The repository is organized into three folders:

`code/`

Contains the code used in the project, including:

SetupGWASScore.py: performs preliminary setup tasks such as genotype quality control and phenotyping
GWASScore.py: runs PLINK GWAS and generates most of the evaluation metrics used in the paper. Supports multiple modes:
- Mode 0: Run PLINK GWAS and compute evaluation metrics
- Mode 1: Run PLINK GWAS only
- Mode 2: Compute metrics only (for results from PLINK GWAS)
- Mode 3: Compute metrics only (for results from SAIGE GWAS)
RunGWASScore.py: deploys GWASScore.py for all GWAS (i.e. for each phenotyping algorithm and disease)
RunPhevaluator.py: runs the OHDSI PheValuator tool for all phenotyping algorithms
Analysis.py: calculates additional evaluation metrics (e.g. genetic correlation, GWAS-eQTL colocalization) and generates the graphs and tables used in the paper
test_gwas.py: test suite
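
The four GWASScore.py modes listed above can be read as combinations of two decisions: whether to run PLINK GWAS, and whether to compute metrics (and from which GWAS tool's results). The sketch below encodes that reading as a lookup table. The names `ModePlan`, `GWAS_SCORE_MODES`, and `plan_for_mode` are hypothetical, and how the script actually receives its mode is not described in this README.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ModePlan:
    """What a given mode does (hypothetical structure for illustration)."""
    run_plink_gwas: bool
    compute_metrics: bool
    metrics_source: Optional[str]  # GWAS tool whose results feed the metrics


# Mode numbers and meanings are taken from the list above.
GWAS_SCORE_MODES = {
    0: ModePlan(run_plink_gwas=True, compute_metrics=True, metrics_source="PLINK"),
    1: ModePlan(run_plink_gwas=True, compute_metrics=False, metrics_source=None),
    2: ModePlan(run_plink_gwas=False, compute_metrics=True, metrics_source="PLINK"),
    3: ModePlan(run_plink_gwas=False, compute_metrics=True, metrics_source="SAIGE"),
}


def plan_for_mode(mode: int) -> ModePlan:
    """Return the plan for a mode, rejecting unknown mode numbers."""
    try:
        return GWAS_SCORE_MODES[mode]
    except KeyError:
        raise ValueError(f"unsupported GWASScore mode: {mode}") from None


print(plan_for_mode(3))
```

Note that no mode runs SAIGE itself: per the list above, mode 3 only computes metrics from existing SAIGE results.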

Subdirectories:

SupportingScripts/: scripts used for setup and evaluation metrics. An important stand-alone script is:
- PRSMetrics.py: runs GWAS and computes PRS under 5-fold cross-validation, then trains and tests an overall logistic regression (LR) model
  - Mode 0: Run PLINK 5-fold GWAS and LR model evaluation
  - Mode 1: Run PLINK 5-fold GWAS only
  - Mode 2: LR model evaluation only (for results from PLINK GWAS)
  - Mode 3: LR model evaluation only (for results from SAIGE GWAS)
  - Mode 4: Run SAIGE 5-fold GWAS only
ldsc/: cloned version of the LDSC (LD Score Regression) repository
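
PRSMetrics.py is described above as working with 5-fold cross-validation. As a generic illustration of what a 5-fold partition looks like, the sketch below splits sample ids so that each sample is held out exactly once. The function `k_fold_splits`, the seed, and the sample ids are assumptions for illustration. How PRSMetrics.py actually assigns folds is not stated in this README.

```python
import random
from typing import Iterator, List, Sequence, Tuple


def k_fold_splits(
    sample_ids: Sequence[str], k: int = 5, seed: int = 0
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (train_ids, test_ids) pairs; each sample is held out exactly once."""
    if not 2 <= k <= len(sample_ids):
        raise ValueError("k must be between 2 and the number of samples")
    shuffled = list(sample_ids)
    random.Random(seed).shuffle(shuffled)  # seeded, so the split is reproducible
    # Deal the shuffled ids into k folds of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for held_out in range(k):
        test_ids = folds[held_out]
        train_ids = [s for i, fold in enumerate(folds) if i != held_out for s in fold]
        yield train_ids, test_ids


ids = [f"sample{i}" for i in range(10)]  # made-up ids
for train_ids, test_ids in k_fold_splits(ids):
    print(len(train_ids), len(test_ids))
```

In a workflow of this kind, the association step would typically see only the training ids of each fold, so that scores for the held-out ids are not influenced by their own phenotypes.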

`data/`

Contains some of the data used in the project, including codes defining the ADO algorithms, the PheValuator algorithms, and input_dict (which defines all diseases and algorithms under study). Many files (e.g. GTEx and ClinVar files) are not pushed due to size limits.

`revision/`

Contains additional code used in revisions, including:

RunSAIGEAssoc.py: runs SAIGE GWAS
rev_analysis.ipynb: generates additional supplementary tables (PLINK/SAIGE comparison and index event breakdown)
scripts_to_rerun_sle_saige.txt: code used to redeploy PRSMetrics and GWASScore for SLE after switching to SAIGE results
rev_analysis_tests.py: test suite for the additional code from revisions
llm_t2d_code.py: code to generate T2D patients from LLM SQL

Original Pipeline

The original pipeline followed these major stages. Steps marked with [parallelizable] can be run concurrently.

SetupGWASScore.py
RunGWASScore.py [parallelizable]
RunPheValuator.py [parallelizable]
PRSMetrics.py [parallelizable]
Analysis.py
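
One reading of the stage list above is that setup runs first, the three [parallelizable] steps can then run side by side, and analysis runs last. The sketch below illustrates that ordering with a thread pool and a placeholder stage function. `run_pipeline` and `dry_run` are hypothetical names, and the grouping is an assumption. The marker could equally mean that each step parallelizes internally across diseases and algorithms.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence

# Stage names as listed above; the grouping into three phases is an assumption.
ORIGINAL_PIPELINE: List[List[str]] = [
    ["SetupGWASScore.py"],
    ["RunGWASScore.py", "RunPheValuator.py", "PRSMetrics.py"],
    ["Analysis.py"],
]


def run_pipeline(
    groups: Sequence[Sequence[str]], run_stage: Callable[[str], str]
) -> List[str]:
    """Run groups in order; stages inside one group run concurrently."""
    finished: List[str] = []
    for group in groups:
        with ThreadPoolExecutor(max_workers=len(group)) as pool:
            # pool.map keeps input order and re-raises a stage's exception.
            finished.extend(pool.map(run_stage, group))
    return finished


def dry_run(stage: str) -> str:
    # Placeholder: a real runner would launch the script here.
    return stage


print(run_pipeline(ORIGINAL_PIPELINE, dry_run))
```

Because each group's pool is closed before the next group starts, Analysis.py would only begin once every parallel stage has finished.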

Revised Pipeline Overview

The pipeline with revisions follows these major stages. Steps marked with [parallelizable] can be run concurrently.

SetupGWASScore.py
RunSAIGEAssoc.py
RunGWASScore.py [parallelizable] (GWASScore with mode 0, then mode 3 for SLE results with SAIGE)
RunPheValuator.py [parallelizable]
PRSMetrics.py [parallelizable] (mode 0, then mode 4 + mode 3 for SLE results with SAIGE)
Analysis.py
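
The mode notes in the revised pipeline can be summarized per disease. The sketch below assumes one particular reading: mode 0 is run for every disease, and the SAIGE-related modes are added only for SLE. The names `SAIGE_DISEASES` and `modes_for` are hypothetical, and the actual commands are in scripts_to_rerun_sle_saige.txt rather than reproduced here.

```python
from typing import Dict, List

# The only disease the notes above associate with SAIGE results.
SAIGE_DISEASES = {"SLE"}


def modes_for(disease: str) -> Dict[str, List[int]]:
    """Return the assumed mode sequence per script for one disease."""
    gwas_score = [0]   # PLINK GWAS plus evaluation metrics
    prs_metrics = [0]  # PLINK 5-fold GWAS plus LR model evaluation
    if disease in SAIGE_DISEASES:
        gwas_score.append(3)        # metrics from SAIGE GWAS results
        prs_metrics.extend([4, 3])  # SAIGE 5-fold GWAS, then LR evaluation
    return {"GWASScore.py": gwas_score, "PRSMetrics.py": prs_metrics}


for disease in ["T2D", "SLE"]:
    print(disease, modes_for(disease))
```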

Other Details

Note: the following names are synonymous: OMOPADO and ADO; OHDSIPhenotypeLibrary and OHDSI; Rollup and 2+ condition.
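
When reading outputs that mix these names, a small lookup can normalize them. The sketch below reads the note as three pairs (OMOPADO/ADO, OHDSIPhenotypeLibrary/OHDSI, Rollup/2+ condition) and maps each to the name used in the table of defining codes. Treating the table names as canonical is an assumption made here for illustration, and `canonical_algorithm` is a hypothetical helper.

```python
# Synonym pairs from the note above, keyed by the longer internal name.
CANONICAL_NAME = {
    "OMOPADO": "ADO",
    "OHDSIPhenotypeLibrary": "OHDSI",
    "Rollup": "2+ condition",
}


def canonical_algorithm(name: str) -> str:
    """Return the table-style name; names without a synonym pass through."""
    return CANONICAL_NAME.get(name, name)


print(canonical_algorithm("Rollup"))   # 2+ condition
print(canonical_algorithm("Phecode"))  # Phecode
```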
