Genotyping of Targeted loci with single-cell Chromatin Accessibility (GoT-ChA): Accelerated & Optimized
GoT...Ciao! (It's already done).
GoT-ChAo is a high-performance reimplementation of the Landau Lab GoTChA pipeline. It genotypes single-cell ATAC-seq data (10x Genomics), distinguishing Wild Type (WT) from Mutant (MUT) cells based on a user-supplied primer sequence and target codon.
The original pipeline was written in R/Python. While accurate, it faced performance bottlenecks when processing massive datasets (100M+ reads). GoT-ChAo solves this by using a hybrid architecture that reduces runtime from hours to minutes:
- Rust Core: Handles heavy FASTQ parsing, barcode matching, and primer searching using SIMD acceleration and Rayon parallelism.
- Optimized Python: Re-engineers statistical genotyping with vectorized NumPy operations and analytical moments for noise correction.
| Feature | Original GoT-ChA | GoT-ChAo |
|---|---|---|
| Language | R / Python | Rust + Python |
| Speed (60M Reads) | Hours | < 3 Minutes |
| Memory Usage | High (Loads all reads) | Low (Streaming chunks) |
| Deployment | Conda Env | Docker / Apptainer |
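To illustrate the "Streaming chunks" row above, here is a minimal Python sketch of chunked FASTQ reading. It is not the Rust core's code; the names `read_fastq`, `chunked`, and `count_reads` are hypothetical and only show why memory stays flat when one chunk is held at a time.

```python
import gzip
import io
from itertools import islice
from typing import Iterator, List, TextIO, Tuple

FastqRecord = Tuple[str, str, str]  # (header, sequence, quality)


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield one FASTQ record at a time from an open text handle."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield header, sequence, quality


def chunked(records: Iterator[FastqRecord], size: int) -> Iterator[List[FastqRecord]]:
    """Group records into lists of at most `size`; only one chunk is alive at a time."""
    while True:
        chunk = list(islice(records, size))
        if not chunk:
            return
        yield chunk


def count_reads(path: str, chunk_size: int = 100_000) -> int:
    """Count reads in a gzipped FASTQ without loading the whole file."""
    total = 0
    with gzip.open(path, "rt") as handle:
        for chunk in chunked(read_fastq(handle), chunk_size):
            total += len(chunk)
    return total


# Tiny in-memory demo: three records, read two at a time.
demo = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCC\n+\nIIII\n@r3\nTTAA\n+\nIIII\n")
print([len(chunk) for chunk in chunked(read_fastq(demo), 2)])  # [2, 1]
```

Peak memory in this pattern depends on `chunk_size`, not on the total number of reads.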
GoT-ChAo is containerized. You do not need Rust or Python installed. Choose your environment:
If you are on a shared cluster (e.g., Slurm, LSF), you likely cannot run Docker. Use Apptainer (formerly Singularity).
- Requirements: `apptainer` or `singularity` loaded.
- NYGC Users: Load `squashfuse-default` via Conda or Modules before building.
If you have root access or are running locally, use Docker.
- Requirements: Docker Desktop or Docker Engine.
We automatically build and verify images on GitHub Actions.
You can pull the Docker image directly into a Singularity/Apptainer file (.sif).
    # This converts the Docker image to a SIF file automatically
    apptainer build gotchao.sif docker://ghcr.io/theob0t/gotchao:latest

You now have gotchao.sif (approx. 500 MB) ready to run.

With Docker, pull the image directly:

    docker pull ghcr.io/theob0t/gotchao:latest

The .sif file is a standalone executable. We use -B to bind your data folders so the container can see them.
    # Example: Running on a cluster
    # Syntax: apptainer run -B <COMMON_PARENT_DIR> gotchao.sif ...
    # Example: Binding /gpfs so the container sees inputs AND can write outputs there
    apptainer run -B /gpfs ./gotchao.sif \
      --barcode_fastq_path /path/to/data/your_sample_barcode.fastq.gz \
      --sequence_fastq_path /path/to/data/your_sample_gotcha.fastq.gz \
      --whitelist_path /path/to/data/singlecell.csv \
      --primer_sequence <YOUR_PRIMER> \
      --ref_codon <WT> \
      --mutation_codon <MUT> \
      --mutation_start <START> \
      --mutation_end <END> \
      --out /gpfs/project/results/sample_01

With Docker, you must mount a host directory as a volume (here, your current directory) so results persist after the container stops.
    # Syntax: -v <HOST_PATH>:<CONTAINER_PATH>
    # Example: We map the current folder $(pwd) to /data inside the container
    # (the quotes keep paths containing spaces intact)
    docker run --rm -v "$(pwd)":/data ghcr.io/theob0t/gotchao:latest \
      --barcode_fastq_path /data/your_sample_barcode.fastq.gz \
      --sequence_fastq_path /data/your_sample_gotcha.fastq.gz \
      --whitelist_path /data/singlecell.csv \
      --primer_sequence <YOUR_PRIMER> \
      --ref_codon <WT> \
      --mutation_codon <MUT> \
      --mutation_start <START> \
      --mutation_end <END> \
      --out /data/output_folder
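If you need to run several samples, it can help to assemble the Docker command programmatically. The sketch below is a hypothetical helper (`build_docker_command` is not part of GoT-ChAo). It uses only the flags shown above and assumes your files follow the `<sample>_barcode.fastq.gz` / `<sample>_gotcha.fastq.gz` naming from the example.

```python
import shlex
from typing import List


def build_docker_command(
    host_dir: str,
    sample: str,
    primer: str,
    ref_codon: str,
    mutation_codon: str,
    mutation_start: int,
    mutation_end: int,
    image: str = "ghcr.io/theob0t/gotchao:latest",
) -> List[str]:
    """Assemble the `docker run` argument list for one sample (hypothetical helper)."""
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:/data",  # host directory mounted at /data
        image,
        # Assumed file naming, copied from the example command above.
        "--barcode_fastq_path", f"/data/{sample}_barcode.fastq.gz",
        "--sequence_fastq_path", f"/data/{sample}_gotcha.fastq.gz",
        "--whitelist_path", "/data/singlecell.csv",
        "--primer_sequence", primer,
        "--ref_codon", ref_codon,
        "--mutation_codon", mutation_codon,
        "--mutation_start", str(mutation_start),
        "--mutation_end", str(mutation_end),
        "--out", f"/data/{sample}_results",
    ]


cmd = build_docker_command("/home/me/run1", "sample_01", "ACGTACGTA", "GAG", "GGA", 10, 12)
print(shlex.join(cmd))  # shell-safe string for logging or copy-paste
```

Passing the list to `subprocess.run(cmd, check=True)` would execute it without going through a shell, which avoids quoting problems in paths.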
We provide a micro-dataset to verify the pipeline works. Since the test data is stored in this repository, you must clone the repo to run the test.
- Get the Test Data:

      git clone https://github.com/theob0t/GoTChAo.git
      cd GoTChAo

- Get the Container:

      # HPC Users:
      apptainer build gotchao.sif docker://ghcr.io/theob0t/gotchao:latest

- Run the Test Script: We provide a script that automatically runs the container on the local test data and verifies the results.

      # This script detects if you have Docker or Apptainer and runs accordingly
      ./tests/run_test.sh

  - Expected Output: `SUCCESS: Results match!`
| Argument | Description |
|---|---|
| `--barcode_fastq_path` | Path to the Cell Barcode FASTQ (usually R1 for 10x, but check your kit). |
| `--sequence_fastq_path` | Path to the Biological Sequence FASTQ (usually R2). |
| `--whitelist_path` | Path to CellRanger singlecell.csv (barcodes to analyze). |
| `--primer_sequence` | The anchor sequence to search for. |
| `--ref_codon` | Wild Type codon (e.g., GAG). |
| `--mutation_codon` | Mutant codon (e.g., GGA). |
| `--mutation_start` | 1-based start index of the codon relative to the read. |
| `--mutation_end` | 1-based end index of the codon. |
| `--out` | Directory for results. |
| `--no_rc` | Disable reverse-complementing of barcodes (reverse-complementing is ON by default). |
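To make the index and codon arguments concrete, here is a minimal Python sketch of how a single read could be classified. It is an illustration under stated assumptions, not the pipeline's actual logic: it assumes exact primer matching, treats `--mutation_end` as inclusive, and the function names are hypothetical. The real implementation may tolerate mismatches or handle ambiguous reads differently.

```python
from typing import Optional

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Reverse-complement a DNA string (the barcode step that --no_rc disables)."""
    return seq.translate(_COMPLEMENT)[::-1]


def classify_read(
    read: str,
    primer: str,
    ref_codon: str,
    mutation_codon: str,
    mutation_start: int,
    mutation_end: int,
) -> Optional[str]:
    """Return "WT", "MUT", or None for one read.

    Assumptions (not confirmed by the source): the primer must match exactly,
    and mutation_end is an inclusive 1-based index.
    """
    if primer not in read:
        return None  # read does not contain the anchor sequence
    # Convert 1-based inclusive [start, end] to a Python slice.
    codon = read[mutation_start - 1 : mutation_end]
    if codon == ref_codon:
        return "WT"
    if codon == mutation_codon:
        return "MUT"
    return None  # neither codon observed at this position


print(classify_read("ACGTACGTAGAGTT", "ACGTACGTA", "GAG", "GGA", 10, 12))  # WT
print(classify_read("ACGTACGTAGGATT", "ACGTACGTA", "GAG", "GGA", 10, 12))  # MUT
print(reverse_complement("AACG"))  # CGTT
```

For a three-base codon, `--mutation_end` is `--mutation_start + 2` under the inclusive-index assumption used here.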
GoT-ChAo is extremely efficient compared to pure Python/R implementations.
- CPU: Scales linearly with core count. 4-8 cores is the sweet spot.
- RAM:
  - Rust Phase: < 1GB (Streaming architecture).
  - Python Phase: Depends on cell count (~4GB for 10k cells).
- Disk: Read speed is the main bottleneck (~700k reads/sec on GPFS).
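As a rough planning aid, the read-throughput figure above can be turned into a back-of-the-envelope estimate for the I/O-bound phase. The helper below is hypothetical and uses the ~700k reads/sec GPFS figure as its default. Your storage may be faster or slower, and the Python phase is not included.

```python
def estimated_parse_seconds(n_reads: int, reads_per_second: float = 700_000) -> float:
    """Estimate seconds spent reading FASTQ records at a given throughput."""
    return n_reads / reads_per_second


for n_reads in (60_000_000, 100_000_000):
    seconds = estimated_parse_seconds(n_reads)
    print(f"{n_reads:,} reads: ~{seconds:.0f} s (~{seconds / 60:.1f} min)")
```

At the default rate, 60M reads take about 86 seconds to read, which is consistent with the "< 3 Minutes" figure in the comparison table.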
- `{SAMPLE}_counts.csv`: Raw aggregated counts (Rust output).
- `{SAMPLE}_genotype_labels.csv`: Final Result. Genotypes (`WT`, `MUT`, `HET`, `NA`) and confidence scores.
- `*_kde_mixture.pdf`: QC plot of noise distribution.
- `cluster_genotype.pdf`: Visualization of cell clusters.
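A quick way to sanity-check a run is to tally the genotype calls in the labels file. The sketch below assumes the CSV has a header row with a column named `genotype`. That column name is a guess, not taken from the pipeline, so check your file's header and pass the real name via `column`.

```python
import csv
import io
from collections import Counter
from typing import TextIO


def count_genotypes(handle: TextIO, column: str = "genotype") -> Counter:
    """Tally genotype labels from an open CSV handle.

    `column` is a hypothetical column name; adjust it to match your file.
    """
    reader = csv.DictReader(handle)
    return Counter(row[column] for row in reader)


# In-memory demo with made-up rows (not real pipeline output).
demo = io.StringIO("barcode,genotype\nAAAC,WT\nAAAG,MUT\nAAAT,WT\nAACA,NA\n")
print(count_genotypes(demo))
```

For a real run, open the file with `open("sample_01_genotype_labels.csv", newline="")` and pass the handle in the same way.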
- Original Concept: Landau Lab GoTChA