Theob0t/GoTChAo

GoT-ChAo ⚡️🦀

Genotyping of Targeted loci with single-cell Chromatin Accessibility (GoT-ChA): Accelerated & Optimized


         ______     _______     ____          ___          
        / ____|  __|__   __|  / ____| |      /   \    __   
        | |  __  /  \  | |    | |     |__   /  ^  \  /  \  
        | | |_ || () | | |  - | |     |  \ /  /_\  \| () | 
        | |__| | \__/  | |    | |____ |  |/  _____  \\__/  
        \______|       |_|    \______ |  /__/     \__\     

              GoT...Ciao! (It's already done). 🤌

📖 Overview

GoT-ChAo is a high-performance reimplementation of the Landau Lab GoT-ChA pipeline. It genotypes single-cell chromatin accessibility (scATAC-seq) data from 10x Genomics, distinguishing Wild Type (WT) from Mutant (MUT) cells based on a target-specific primer and the codon of interest.

🚀 Why GoT-ChAo?

The original pipeline was written in R/Python. While accurate, it faced performance bottlenecks when processing massive datasets (100M+ reads). GoT-ChAo solves this by using a hybrid architecture that reduces runtime from hours to minutes:

  1. Rust Core: Handles heavy FASTQ parsing, barcode matching, and primer searching using SIMD acceleration and Rayon parallelism.
  2. Optimized Python: Re-engineers statistical genotyping with vectorized NumPy operations and analytical moments for noise correction.

| Feature | Original GoT-ChA | GoT-ChAo |
| --- | --- | --- |
| Language | R / Python | Rust + Python |
| Speed (60M Reads) | Hours | < 3 Minutes |
| Memory Usage | High (Loads all reads) | Low (Streaming chunks) |
| Deployment | Conda Env | Docker / Apptainer |
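
To illustrate the "Streaming chunks" idea from the comparison above, here is a minimal Python sketch that reads a FASTQ stream in fixed-size batches, so only one batch is held in memory at a time. It is not GoT-ChAo's Rust implementation: the function names `iter_fastq_chunks` and `count_primer_hits`, the chunk size, and the demo reads are all hypothetical and chosen only for illustration.

```python
import io
from itertools import islice
from typing import Iterator, List, TextIO, Tuple

Record = Tuple[str, str, str]  # (read name, sequence, quality string)


def iter_fastq_chunks(handle: TextIO, chunk_size: int = 100_000) -> Iterator[List[Record]]:
    """Yield lists of up to `chunk_size` FASTQ records from an open text handle."""
    while True:
        # Each FASTQ record spans exactly four lines.
        lines = list(islice(handle, 4 * chunk_size))
        if not lines:
            return
        chunk = []
        # A trailing incomplete record (fewer than four lines) is skipped.
        for i in range(0, len(lines) - 3, 4):
            name = lines[i].rstrip("\n")
            seq = lines[i + 1].rstrip("\n")
            qual = lines[i + 3].rstrip("\n")
            chunk.append((name, seq, qual))
        yield chunk


def count_primer_hits(handle: TextIO, primer: str, chunk_size: int = 100_000) -> int:
    """Count reads containing `primer`, holding one chunk in memory at a time."""
    hits = 0
    for chunk in iter_fastq_chunks(handle, chunk_size):
        hits += sum(1 for _, seq, _ in chunk if primer in seq)
    return hits


if __name__ == "__main__":
    # Three made-up reads; two of them contain the made-up primer.
    demo = io.StringIO(
        "@r1\nACGTTTGACC\n+\nIIIIIIIIII\n"
        "@r2\nGGGGGGGGGG\n+\nIIIIIIIIII\n"
        "@r3\nTTTGACCAAA\n+\nIIIIIIIIII\n"
    )
    print(count_primer_hits(demo, primer="TTGACC", chunk_size=2))  # prints 2
```

For a gzipped file, the same functions accept a handle opened with `gzip.open(path, "rt")`. The exact-substring match is a simplification; the README does not say how the Rust core matches primers.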

🛠 Prerequisites

GoT-ChAo is containerized. You do not need Rust or Python installed. Choose your environment:

🅰️ For HPC Users (Clusters/Supercomputers)

If you are on a shared cluster (e.g., Slurm, LSF), you likely cannot run Docker. Use Apptainer (formerly Singularity).

  • Requirements: apptainer or singularity loaded.
  • NYGC Users: Load squashfuse-default via Conda or Modules before building.

🅱️ For Docker Users (Local Laptop/Cloud)

Use Docker if you have root access or are running locally.


📦 Installation

We automatically build and verify images on GitHub Actions.

Method 1: HPC / Apptainer (Recommended for Clusters)

You can pull the Docker image directly into a Singularity/Apptainer file (.sif).

# This converts the Docker image to a SIF file automatically
apptainer build gotchao.sif docker://ghcr.io/theob0t/gotchao:latest

You now have gotchao.sif (approximately 500 MB), ready to run.

Method 2: Docker Pull

docker pull ghcr.io/theob0t/gotchao:latest

🏃‍♂️ Usage

1. Running on HPC (Apptainer)

The .sif file is a standalone executable. We use -B to bind your data folders so the container can see them.

# Example: Running on a cluster
# Syntax: apptainer run -B <COMMON_PARENT_DIR> gotchao.sif ...

# Example: Binding /gpfs so the container sees inputs AND can write outputs there
apptainer run -B /gpfs ./gotchao.sif \
  --barcode_fastq_path /path/to/data/your_sample_barcode.fastq.gz \
  --sequence_fastq_path /path/to/data/your_sample_gotcha.fastq.gz \
  --whitelist_path /path/to/data/singlecell.csv \
  --primer_sequence <YOUR_PRIMER> \
  --ref_codon <WT> \
  --mutation_codon <MUT> \
  --mutation_start <START> \
  --mutation_end <END> \
  --out /gpfs/project/results/sample_01

2. Running with Docker

You must mount a host directory as a volume so that the container can read your inputs and the results persist after the container stops.

# Syntax: -v <HOST_PATH>:<CONTAINER_PATH>

# Example: We map the current folder $(pwd) to /data inside the container
docker run --rm -v "$(pwd)":/data ghcr.io/theob0t/gotchao:latest \
  --barcode_fastq_path /data/your_sample_barcode.fastq.gz \
  --sequence_fastq_path /data/your_sample_gotcha.fastq.gz \
  --whitelist_path /data/singlecell.csv \
  --primer_sequence <YOUR_PRIMER> \
  --ref_codon <WT> \
  --mutation_codon <MUT> \
  --mutation_start <START> \
  --mutation_end <END> \
  --out /data/output_folder
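
When several samples share the same target, the Docker invocation above can be generated programmatically. The following Python sketch only builds and prints the command lines and does not execute anything. The flags and image name are taken from the example above; the helper `build_docker_command`, the per-sample file naming, the sample names, and the primer and coordinate values are hypothetical placeholders.

```python
import shlex
from typing import List

IMAGE = "ghcr.io/theob0t/gotchao:latest"  # image name from the pull command


def build_docker_command(
    host_dir: str,
    sample: str,
    primer: str,
    ref_codon: str,
    mutation_codon: str,
    mutation_start: int,
    mutation_end: int,
) -> List[str]:
    """Return the argv list for one sample; nothing is executed here."""
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:/data",  # host directory mounted at /data
        IMAGE,
        # Assumed naming scheme: <sample>_barcode / <sample>_gotcha / <sample>_singlecell
        "--barcode_fastq_path", f"/data/{sample}_barcode.fastq.gz",
        "--sequence_fastq_path", f"/data/{sample}_gotcha.fastq.gz",
        "--whitelist_path", f"/data/{sample}_singlecell.csv",
        "--primer_sequence", primer,
        "--ref_codon", ref_codon,
        "--mutation_codon", mutation_codon,
        "--mutation_start", str(mutation_start),
        "--mutation_end", str(mutation_end),
        "--out", f"/data/results/{sample}",
    ]


if __name__ == "__main__":
    for sample in ["sample_01", "sample_02"]:  # hypothetical sample names
        cmd = build_docker_command(
            "/path/to/data", sample, "ACGTACGTACGT", "GAG", "GGA", 31, 33
        )
        # shlex.join quotes each argument so the line can be pasted into a shell.
        print(shlex.join(cmd))
```

To run a command rather than print it, the list can be passed to `subprocess.run(cmd, check=True)`. Keeping the arguments as a list avoids quoting problems with paths that contain spaces.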

🧪 Testing (Demo)

We provide a micro-dataset to verify the pipeline works. Since the test data is stored in this repository, you must clone the repo to run the test.

  1. Get the Test Data:

    git clone https://github.com/theob0t/GoTChAo.git
    cd GoTChAo
  2. Get the Container:

    # HPC Users:
    apptainer build gotchao.sif docker://ghcr.io/theob0t/gotchao:latest

    # Docker Users:
    docker pull ghcr.io/theob0t/gotchao:latest
  3. Run the Test Script: We provide a script that automatically runs the container on the local test data and verifies the results.

    # This script detects if you have Docker or Apptainer and runs accordingly
    ./tests/run_test.sh
    • Expected Output: SUCCESS: Results match!

🔧 Arguments Explained

| Argument | Description |
| --- | --- |
| --barcode_fastq_path | Path to the Cell Barcode FASTQ (usually R1 for 10x, but check your kit). |
| --sequence_fastq_path | Path to the Biological Sequence FASTQ (usually R2). |
| --whitelist_path | Path to CellRanger singlecell.csv (barcodes to analyze). |
| --primer_sequence | The anchor sequence to search for. |
| --ref_codon | Wild Type codon (e.g., GAG). |
| --mutation_codon | Mutant codon (e.g., GGA). |
| --mutation_start | 1-based start index of the codon relative to the read. |
| --mutation_end | 1-based end index of the codon. |
| --out | Directory for results. |
| --no_rc | Disable reverse-complementing of barcodes (reverse-complementing is ON by default). |
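
Two of the arguments above are easy to get wrong: the 1-based coordinates given by --mutation_start and --mutation_end, and the barcode reverse-complementing that --no_rc turns off. The Python sketch below shows how such coordinates map onto a zero-based slice and what reverse-complementing does to a barcode. It is an illustration under assumptions, not the pipeline's code: the helpers `reverse_complement`, `extract_codon`, and `classify_codon` are hypothetical, the end index is assumed to be inclusive, and the per-read labels are a simplification of the statistical per-cell genotyping.

```python
# Translation table for DNA bases; N is kept as N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.translate(_COMPLEMENT)[::-1]


def extract_codon(read: str, mutation_start: int, mutation_end: int) -> str:
    """Slice a codon out of a read using 1-based, inclusive coordinates.

    1-based inclusive [start, end] corresponds to the 0-based,
    half-open Python slice [start - 1, end).
    """
    return read[mutation_start - 1 : mutation_end]


def classify_codon(codon: str, ref_codon: str, mutation_codon: str) -> str:
    """Label a single read's codon; 'OTHER' covers anything unexpected."""
    if codon == ref_codon:
        return "WT"
    if codon == mutation_codon:
        return "MUT"
    return "OTHER"


if __name__ == "__main__":
    read = "TTTTGAGCCC"  # made-up read with GAG at positions 5-7 (1-based)
    codon = extract_codon(read, mutation_start=5, mutation_end=7)
    print(codon, classify_codon(codon, ref_codon="GAG", mutation_codon="GGA"))
    print(reverse_complement("AACCGT"))
```

With these inputs the first line printed is `GAG WT` and the second is `ACGGTT`. An off-by-one error in either coordinate would shift the slice and label every read `OTHER`, which is a quick symptom to look for when results come back empty.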

💻 Performance & Hardware

GoT-ChAo is extremely efficient compared to pure Python/R implementations.

  • CPU: Throughput scales roughly linearly with core count until disk reads become the limiting factor; 4-8 cores is the sweet spot.
  • RAM:
    • Rust Phase: < 1GB (Streaming architecture).
    • Python Phase: Depends on cell count (~4GB for 10k cells).
  • Disk: Read speed is the main bottleneck (~700k reads/sec on GPFS).
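
As a rough consistency check on the figures quoted in this README, a purely I/O-bound pass at about 700k reads/sec would get through 60M reads in under a minute and a half, which fits within the "< 3 Minutes" total from the comparison table. The snippet below is only that arithmetic; it assumes constant throughput and ignores the Python phase, and `io_bound_seconds` is a hypothetical helper name.

```python
def io_bound_seconds(n_reads: int, reads_per_second: float) -> float:
    """Time to stream n_reads at a constant read throughput."""
    return n_reads / reads_per_second


# Figures quoted in this README: 60M reads, ~700k reads/sec on GPFS.
seconds = io_bound_seconds(60_000_000, 700_000)
print(f"{seconds:.0f} s (~{seconds / 60:.1f} min)")  # prints: 86 s (~1.4 min)
```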

📊 Outputs

  1. {SAMPLE}_counts.csv: Raw aggregated counts (Rust output).
  2. {SAMPLE}_genotype_labels.csv: Final Result. Genotypes (WT, MUT, HET, NA) and confidence scores.
  3. *_kde_mixture.pdf: QC plot of noise distribution.
  4. cluster_genotype.pdf: Visualization of cell clusters.
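
A quick way to sanity-check a run is to tally the genotype calls in the labels file. The Python sketch below does this with the standard library only. The README does not document the CSV header, so the column name `genotype` is an assumption; inspect the first line of your own `{SAMPLE}_genotype_labels.csv` and pass the actual column name. The helper `count_genotypes` and the demo rows are hypothetical.

```python
import csv
import io
from collections import Counter
from typing import TextIO


def count_genotypes(handle: TextIO, genotype_column: str = "genotype") -> Counter:
    """Tally the values of one column in a CSV file with a header row."""
    reader = csv.DictReader(handle)
    if reader.fieldnames is None or genotype_column not in reader.fieldnames:
        # Fail loudly if the assumed column name does not match the file.
        raise KeyError(f"column {genotype_column!r} not found in {reader.fieldnames}")
    return Counter(row[genotype_column] for row in reader)


if __name__ == "__main__":
    # Made-up rows standing in for a real labels file.
    demo = io.StringIO(
        "barcode,genotype\n"
        "AAAC,WT\n"
        "AAAG,MUT\n"
        "AAAT,WT\n"
        "AACA,NA\n"
    )
    for label, n in count_genotypes(demo).most_common():
        print(label, n)
```

For a real file, open it with `open(path, newline="")` and pass the handle in place of the in-memory demo.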

📝 Attribution
