| Component | Status |
|---|---|
| Inference | ✅ Functional |
| Training | 🚧 Experimental |
| Diffusion | 🧪 Research Prototype |
uv run rna_predict/predict.py \
input_csv=rna_predict/dataset/examples/kaggle_minimal_index.csv \
checkpoint_path=outputs/2025-04-28/16-07-58/outputs/checkpoints/last.ckpt \
output_dir=outputs/predict_M2_test/ \
fast_dev_run=true \
> dev_run_output.txt 2>&1

- `uv run rna_predict/predict.py`: Runs the main prediction pipeline using the project's preferred Python runner (never use `python` directly).
- `input_csv=...`: Path to the CSV file listing RNA sequences for prediction.
- `checkpoint_path=...`: Path to the model checkpoint to use for inference.
- `output_dir=...`: Where to save all prediction outputs (CSV, PDB, and .pt files).
- `fast_dev_run=true`: (Optional) Runs a single sequence for quick debugging.
- `> dev_run_output.txt 2>&1`: Redirects all output (including errors) to a log file for later inspection.
- prediction_{i}.csv: Atom-level coordinates (atom name, residue index, x, y, z) for each prediction.
- prediction_{i}.pdb: Standard PDB file for molecular visualization.
- prediction_{i}.pt: (Optional) PyTorch dictionary for internal use.
- summary.csv: Atom counts and summary for all predictions.
- CSV: Easy to inspect, analyze, or import into data tools.
- PDB: Standard for 3D structure visualization (PyMOL, Chimera, VMD, etc).
- .pt: For advanced PyTorch workflows/debugging.
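As a minimal sketch of inspecting a `prediction_{i}.csv` file with only the Python standard library, the snippet below groups atoms by residue. The column names (`atom_name`, `residue_index`, `x`, `y`, `z`) and the helper `load_atoms` are assumptions made for illustration; check the header of a real output file and adjust them.

```python
import csv
import io
from collections import defaultdict


def load_atoms(handle):
    """Group atom coordinates by residue index.

    Assumes hypothetical column names: atom_name, residue_index, x, y, z.
    """
    residues = defaultdict(dict)
    for row in csv.DictReader(handle):
        xyz = (float(row["x"]), float(row["y"]), float(row["z"]))
        residues[int(row["residue_index"])][row["atom_name"]] = xyz
    return dict(residues)


# Tiny in-memory stand-in for a real prediction file (values are made up).
example = io.StringIO(
    "atom_name,residue_index,x,y,z\n"
    "P,1,0.0,0.0,0.0\n"
    "C4',1,1.5,0.2,-0.3\n"
    "P,2,5.9,1.1,0.4\n"
)
atoms = load_atoms(example)
print(sorted(atoms))      # residue indices present in the file
print(atoms[1]["C4'"])    # coordinates of a single atom
```

For a real file, pass an open file handle (opened with `newline=""`) instead of the `StringIO` object.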
- Always use `uv run` for all project scripts for correct environment handling.
- You can customize `input_csv`, `checkpoint_path`, and `output_dir` as needed.
- For full pipeline or batch prediction, set `fast_dev_run=false` (or omit it).
- All outputs are saved in the specified `output_dir`.
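When predictions are launched from another script, the command shown earlier can be assembled programmatically. The sketch below only builds the argument list from the overrides documented above; `build_predict_command` is a hypothetical helper, not part of the project.

```python
import shlex


def build_predict_command(input_csv, checkpoint_path, output_dir, fast_dev_run=False):
    """Return the argv list for the documented prediction command."""
    cmd = [
        "uv", "run", "rna_predict/predict.py",
        f"input_csv={input_csv}",
        f"checkpoint_path={checkpoint_path}",
        f"output_dir={output_dir}",
    ]
    if fast_dev_run:
        # Omitted entirely for full or batch prediction.
        cmd.append("fast_dev_run=true")
    return cmd


cmd = build_predict_command(
    "rna_predict/dataset/examples/kaggle_minimal_index.csv",
    "outputs/2025-04-28/16-07-58/outputs/checkpoints/last.ckpt",
    "outputs/predict_M2_test/",
    fast_dev_run=True,
)
print(shlex.join(cmd))
# To execute it: subprocess.run(cmd, check=True)
```

Passing the list to `subprocess.run` avoids shell quoting problems with paths that contain spaces.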
- Network Access: First run requires internet to download `sayby/rna_torsionbert` from Hugging Face (~328MB).
- Stage A Weights: RFold requires manual download of checkpoint files (see `docs/pipeline/stageA/StageA_RFold.md`).
- Dependencies: All Python dependencies are managed via `pyproject.toml` and installed automatically with `uv`.
Refer to the CONTRIBUTING.md file.
RNA molecules fold into intricate 3D structures that determine their biological functions. This repository provides a three-stage pipeline:
- Stage A: RNA 2D adjacency (secondary structure).
- Stage B: Neural torsion-angle prediction.
- Stage C: Forward kinematics from angles to 3D coordinates, plus optional energy-based refinement.
Advanced modules include diffusion-based refinement (AlphaFold 3 inspired), isosteric base substitutions, and potential HPC integration (Kaggle competitions).
The pipeline, inspired by AlphaFold but specialized for RNA, includes:
- Predict or import base-pair matrix (e.g., RFold, ViennaRNA, RNAfold).
- Detailed documentation: `StageA_RFold.md`, `RFold_paper.md`.
- Neural approaches: `AtomAttentionEncoder` (`atom_encoder.py`, `atom_transformer.py`, `block_sparse.py`) or `RNA-TorsionBERT` (`torsionBert_full_paper.md`, `torsionBert.md`).
- Predicts torsion angles ($\alpha, \beta, \gamma, \delta, \epsilon, \zeta, \chi$).
- Benchmarks GPU latency and memory use (`rna_predict/benchmarks/benchmark.py`).
- Stepwise conversion of torsion angles into 3D coordinates (pseudo-algorithm provided in `Stage_C.md`, `Stage_C_Refinement_Plan.md`).
- Optional energy minimization or molecular dynamics (MD) via GROMACS, AMBER, OpenMM.
- Diffusion-based refinement (`s4_diffusion.md`, `AlphaFold3_progress.md`).
- Isosteric base-substitution logic for redesign (`RNA_isostericity.md`).
- Uses RFold or external tools. No single "StageA" Python file; adjacency computed externally.
- Outputs: `[N × N]` adjacency or partial contact probabilities.
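To make the `[N × N]` adjacency output concrete, here is a minimal sketch that converts a dot-bracket secondary structure into a symmetric 0/1 matrix. It assumes the external tool reports its prediction in dot-bracket notation (as ViennaRNA's RNAfold does) and it does not handle pseudoknots; `dot_bracket_to_adjacency` is a hypothetical helper, not a function from this repository.

```python
def dot_bracket_to_adjacency(structure):
    """Convert a dot-bracket string into a symmetric N x N 0/1 matrix."""
    n = len(structure)
    adjacency = [[0] * n for _ in range(n)]
    stack = []  # positions of currently unmatched '('
    for j, symbol in enumerate(structure):
        if symbol == "(":
            stack.append(j)
        elif symbol == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {j}")
            i = stack.pop()
            adjacency[i][j] = adjacency[j][i] = 1  # base pair (i, j)
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return adjacency


adj = dot_bracket_to_adjacency("((..))")
for row in adj:
    print(row)
```

Contact probabilities would use the same shape with floating-point entries instead of 0/1.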
- Approaches: AtomAttentionEncoder (local adjacency), RNA-TorsionBERT (sequence-only).
- Outputs: `[N_res, 7]` angles or `[N_res, 2×7]` sin/cos representations.
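The two output shapes are related by a simple encoding: each angle is replaced by its sine and cosine, which removes the discontinuity at ±180°. The sketch below shows one possible layout (seven sines followed by seven cosines per residue); the actual column order used by the models is an assumption here, and the function names are illustrative.

```python
import math


def angles_to_sincos(angles):
    """Map rows of 7 angles (radians) to rows of 14 values: sines, then cosines."""
    return [[math.sin(a) for a in row] + [math.cos(a) for a in row] for row in angles]


def sincos_to_angles(encoded):
    """Invert angles_to_sincos; recovered angles lie in (-pi, pi]."""
    decoded = []
    for row in encoded:
        half = len(row) // 2
        decoded.append([math.atan2(s, c) for s, c in zip(row[:half], row[half:])])
    return decoded


# One residue with made-up torsion values in radians.
angles = [[-1.0, 3.0, 0.9, 1.4, -2.6, -1.2, -2.8]]
encoded = angles_to_sincos(angles)
decoded = sincos_to_angles(encoded)
print(len(encoded[0]))
print(decoded[0])
```

Because `atan2` uses both components, the decoding also works when a network's predicted sin/cos pair is not exactly unit length.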
- Forward kinematics builds 3D coordinates residue by residue from the predicted torsion angles (pseudo-algorithm in `Stage_C.md`).
- Optional MD refinements recommended (short minimization steps).
- Iterative denoising, inspired by AlphaFold 3 (`AngleDiffusionModule`).
- Sequence redesign preserving geometry, detailed logic provided.
- Competitive scenarios explained (`kaggle_competition.md`).
- `rna_predict/models/attention/`
- `rna_predict/models/encoder/`
- `rna_predict/scripts/`
- `rna_predict/benchmarks/`
cd rna_predict
python runners/demo_entry.py

- Local Block-Sparse Optimization: Significantly reduces GPU memory/time complexity.
- Benchmark specifics provided for large RNA handling.
- Explicit recommendation for chunking or dimension reduction for very large RNA sequences.
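The two ideas in this list can be illustrated with plain Python. The sketch below counts how many attention entries a local (banded) pattern keeps compared with a dense `N × N` pattern, and splits a long sequence into overlapping chunks. It is not the implementation in `block_sparse.py`; the function names, window size, and chunk parameters are assumptions chosen for illustration.

```python
def local_pair_count(n_tokens, window):
    """Count (i, j) pairs with |i - j| <= window, i.e. entries a banded mask keeps."""
    return sum(
        min(n_tokens - 1, i + window) - max(0, i - window) + 1
        for i in range(n_tokens)
    )


def chunk_ranges(n_tokens, chunk_size, overlap):
    """Return (start, end) ranges that cover 0..n_tokens with the given overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    ranges = []
    start = 0
    while True:
        end = min(start + chunk_size, n_tokens)
        ranges.append((start, end))
        if end == n_tokens:
            break
        start += step
    return ranges


n = 4000
print("dense entries:", n * n)
print("local entries:", local_pair_count(n, window=32))
print(chunk_ranges(10, chunk_size=4, overlap=1))
```

The banded count grows linearly with sequence length for a fixed window, which is where the memory saving of a local pattern comes from.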
- Clearly defined pseudo-algorithmic logic.
- Sugar pucker handling: Standard (`C3′-endo`) or predicted angles.
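# Pseudo-code: place_first_residue, build_next_residue and standard_geom
# are placeholders for the routines described in the Stage C documentation.
coords = [None] * N_res  # one coordinate block per residue
coords[0] = place_first_residue(torsion_angles[0])
for i in range(1, N_res):
    anchor = coords[i - 1]        # previous residue provides the reference frame
    angles = torsion_angles[i]    # torsion angles of the residue being built
    coords[i] = build_next_residue(anchor, angles, standard_geom)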
- Implement full pipeline (`rna_predict/pipeline.py`) combining stages explicitly.
- Create explicit `forward_kinematics.py`.
- Add small MLP torsion-head (`torsion_head.py`).
- Partial diffusion refinement module and confidence metrics.
Conclusion:
This documentation is structured for MkDocs and explicitly names the relevant files, pipeline stages, algorithms, and performance guidelines.
This project implements and adapts methods from the following publications:
- TorsionBERT (Stage B)
  Bernard C, Postic G, Ghannay S, Tahi F. (2025). RNA-TorsionBERT: leveraging language models for RNA 3D torsion angles prediction.
  Bioinformatics 41(1): btaf004. doi:10.1093/bioinformatics/btaf004
  Model: `sayby/rna_torsionbert` (Hugging Face)
- MP-NeRF (Stage C)
  Massively Parallel Natural Extension of Reference Frame method.
  bioRxiv doi:10.1101/2021.06.08.446214
  Based on: Parsons et al. (2005), AlQuraishi (2019), Bayati et al. (2020)
  Implementation: github.com/EleutherAI/mp_nerf
- AlphaFold 3 Diffusion Module (Stage D)
  Abramson J, et al. (2024). Accurate structure prediction of biomolecular interactions with AlphaFold 3.
  Nature 630: 493–500. doi:10.1038/s41586-024-07487-w
  Note: Stage D adapts AF3's coordinate-space diffusion (Algorithms 18-21) for RNA-specific refinement.
See docs/ directory for detailed implementation notes and paper summaries.