An experimental research-driven LLM system focused on scientific reasoning, symbolic mathematics, and formal logic
Ananta aims to construct the first open-source, verifiable, self-learning scientific LLM, capable of:
- Solving PhD-level mathematics and physics problems
- Generating fully correct, stepwise derivations
- Discovering new lemmas and mathematical relations
- Autoformalizing scientific papers into formal logic
- Creating new scientific hypotheses grounded in symbolic verification
Core Philosophy: Go beyond typical transformer-based language models by constructing a hybrid architecture that understands symbolic math deeply, produces correct reasoning steps, verifies its own reasoning using logic, reduces hallucination, learns recursively, and evolves toward self-improving AI.
Ananta consists of three major subsystems working in harmony:
HMTT (Hybrid Math-Text Tokenizer): a specialized tokenizer that addresses fundamental limitations of BPE, SentencePiece, and GPT-style tokenizers on mathematical content.
Key Features:
- Separates digits, operators, math symbols, variables, and LaTeX expressions
- Provides better chain-of-thought stability
- Improves arithmetic reasoning accuracy
- Reduces token fragmentation
- Increases symbolic precision (illustrated in the sketch below)
Status: ✅ Complete implementation (see HMTT/ branch)
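To illustrate the idea (a minimal sketch, not HMTT's actual implementation), math-aware pre-tokenization splits digits, operators, and LaTeX commands into separate units before BPE merges are applied, so multi-digit numbers never fuse into arbitrary tokens:

```python
import re

# Minimal sketch of math-aware pre-tokenization (illustrative only;
# the real implementation lives in the HMTT/ branch).
MATH_SPLIT = re.compile(
    r"(\\[A-Za-z]+"            # LaTeX commands, e.g. \frac, \alpha
    r"|\d"                     # individual digits
    r"|[+\-*/^=<>(){}\[\]_]"   # operators and delimiters
    r"|[A-Za-z]+)"             # variable names / words
)

def pre_tokenize(text: str) -> list[str]:
    """Partition text so each digit, operator, and LaTeX command
    becomes its own unit before BPE merges are applied."""
    return MATH_SPLIT.findall(text)

print(pre_tokenize(r"x^2 + 12\alpha = 0"))
# ['x', '^', '2', '+', '1', '2', '\\alpha', '=', '0']
```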
RLS (Recursive Logic Subsystem): the symbolic "brainstem" of Ananta, a verification engine that ensures logical correctness.
Components:
- Verification pipeline
  - Syntax checking
  - Axiom/theorem matching
  - Semantic entailment (SMT/ATP solver integration; see the sketch after this list)
  - Stepwise validity verification
- Autoformalization
  - Converts natural language scientific text into formal logic
  - Validates formalized output
  - Adds verified statements to symbolic memory
- Lemma discovery
  - Detects frequently used proof fragments
  - Generalizes them into reusable lemmas
  - Dynamically expands the knowledge base
- Symbolic memory
  - Grows recursively with validated proofs
  - Stores axioms, theorems, and discovered lemmas
  - Increases reasoning efficiency over time
Status: 🚧 In development
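To illustrate the semantic-entailment component, the sketch below checks a single derivation step with the Z3 SMT solver: a step is accepted only if its premises entail its conclusion, i.e. premises ∧ ¬conclusion is unsatisfiable. The helper function is hypothetical and is not the RLS API, which is still in development.

```python
from z3 import And, Not, Real, Solver, unsat

def step_is_valid(premises, conclusion) -> bool:
    """A derivation step is valid iff (premises AND NOT conclusion)
    is unsatisfiable, i.e. no counterexample exists."""
    s = Solver()
    s.add(And(*premises), Not(conclusion))
    return s.check() == unsat

x, y = Real("x"), Real("y")

# Valid step: from x = 2y and y = 3, conclude x = 6.
print(step_is_valid([x == 2 * y, y == 3], x == 6))  # True

# Invalid step: the same premises do not entail x = 5.
print(step_is_valid([x == 2 * y, y == 3], x == 5))  # False
```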
EB-SLE (Energy-Based Self-Learning Engine): a fundamentally new reasoning paradigm that replaces next-token prediction with energy-based optimization.
Innovation: Instead of:
Predict next token → maximize likelihood
EB-SLE uses:
Assign energy to reasoning traces → minimize symbolic inconsistency
Key Elements:
- Energy functional with symbolic penalty terms
- Global coherence constraints
- Contrastive divergence sampling
- Verifier-guided gradients
- Recursive self-learning loop (see the toy energy sketch below)
Status: 🚧 Research phase
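To make the contrast concrete, here is a toy sketch of an energy functional with symbolic penalty terms; the weights and penalty definitions are assumptions for illustration, since the actual functional is an open research question. The energy combines the model's negative log-likelihood with verifier-reported violations, and a hinge-style contrastive loss pushes verified traces below refuted ones.

```python
# Toy sketch of an EB-SLE-style energy functional (research-phase idea,
# not the actual engine). nll = model negative log-likelihood of the
# trace; violations = steps rejected by the verifier; incoherence =
# penalty for steps that ignore earlier conclusions. The weights
# lam_v and lam_c are hypothetical hyperparameters.

def trace_energy(nll: float, violations: int, incoherence: float,
                 lam_v: float = 10.0, lam_c: float = 1.0) -> float:
    """Energy = likelihood term + symbolic penalty + coherence penalty."""
    return nll + lam_v * violations + lam_c * incoherence

# Contrastive signal: a verified trace should sit at lower energy than
# a refuted trace of similar likelihood.
e_pos = trace_energy(nll=42.0, violations=0, incoherence=0.1)  # verified
e_neg = trace_energy(nll=40.0, violations=3, incoherence=2.5)  # refuted
loss = max(0.0, e_pos - e_neg + 1.0)  # hinge loss with margin 1
print(e_pos, e_neg, loss)  # 42.1 72.5 0.0
```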
- Base Model: deepseek-ai/deepseek-math-7b
- Fine-Tuning: LoRA (Low-Rank Adaptation; illustrative PEFT setup below)
- Infrastructure: Dual H100 GPUs (upgraded from RTX 3050)
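For orientation, a LoRA setup with Hugging Face PEFT looks roughly like the sketch below; the rank, alpha, and target modules shown are common defaults, not the values from the project's configs/train_config.json.

```python
# Illustrative LoRA configuration with Hugging Face PEFT; the
# hyperparameters are common defaults, not Ananta's actual settings.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("deepseek-ai/deepseek-math-7b")

lora_config = LoraConfig(
    r=16,                                 # low-rank update dimension
    lora_alpha=32,                        # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only a tiny fraction of 7B weights
```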
Datasets:
- DeepMind Mathematics Dataset
- GSM8K (Grade School Math)
- MATH (Hendrycks et al.)
- Custom symbolic derivation datasets
- Physics equation derivation sets (in progress)
Data Format:
- Block-level reasoning chains
- Structured step-by-step derivations
- Formal verification annotations (hypothetical example record below)
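A record in this block format might look like the following; the field names are hypothetical, not the actual schema used by the data processors:

```python
# Hypothetical block-level training record (illustrative field names).
example_block = {
    "problem": "Solve x^2 - 5x + 6 = 0.",
    "steps": [
        {"text": "Factor: (x - 2)(x - 3) = 0.", "verified": True},
        {"text": "So x = 2 or x = 3.",          "verified": True},
    ],
    "answer": "x in {2, 3}",
}
```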
Evaluation Metrics:
- SV: Symbolic Validity (computation sketched below)
- DC: Derivation Completeness
- AFA: Autoformalization Accuracy
- EG: Energy Gap
- STR: Symbolic Truthfulness Rate
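As a sketch, SV and DC could be computed from verifier output as follows; the field names are hypothetical and mirror the example record above:

```python
# Sketch of computing Symbolic Validity (SV) and Derivation
# Completeness (DC). Field names ("steps", "verified") are hypothetical.

def symbolic_validity(traces) -> float:
    """SV: fraction of individual steps the verifier accepts."""
    steps = [s for t in traces for s in t["steps"]]
    return sum(s["verified"] for s in steps) / len(steps)

def derivation_completeness(traces) -> float:
    """DC: fraction of traces whose every step is verified."""
    return sum(all(s["verified"] for s in t["steps"]) for t in traces) / len(traces)

traces = [
    {"steps": [{"verified": True}, {"verified": True}]},
    {"steps": [{"verified": True}, {"verified": False}]},
]
print(symbolic_validity(traces))        # 0.75
print(derivation_completeness(traces))  # 0.5
```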
```mermaid
graph TD
    A[Raw Scientific Text] --> B[HMTT Tokenization]
    B --> C[Base LLM Processing]
    C --> D[Reasoning Generation]
    D --> E[RLS Verification]
    E --> F{Valid?}
    F -->|Yes| G[Accept & Store]
    F -->|No| H[EB-SLE Energy Update]
    H --> D
    G --> I[Symbolic Memory Update]
    I --> J[Lemma Discovery]
```
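In code, the loop in the diagram corresponds roughly to the skeleton below. The component functions are injected, and the stand-ins shown are trivial placeholders for the real HMTT, LLM, RLS, and EB-SLE subsystems:

```python
# Skeleton of the generate -> verify -> update loop from the diagram.
# Component functions are injected; the real subsystems are separate.

def reasoning_loop(text, encode, generate, verify, energy_update, store,
                   max_attempts=5):
    state = encode(text)                     # HMTT tokenization
    for _ in range(max_attempts):
        trace = generate(state)              # base LLM reasoning
        if verify(trace):                    # RLS validation
            store(trace)                     # symbolic memory update
            return trace
        state = energy_update(state, trace)  # EB-SLE feedback
    return None  # no verified trace within budget

# Trivial stand-ins so the skeleton runs end to end:
result = reasoning_loop(
    "2 + 2",
    encode=str.split,
    generate=lambda s: " ".join(s) + " = 4",
    verify=lambda t: t.endswith("= 4"),
    energy_update=lambda s, t: s,
    store=print,
)
```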
```bash
# Extract and process mathematical datasets
python src/data/data_processor.py --dataset deepmind_math
python src/data/flexible_data_processor.py --format blocks
```

```bash
# Fine-tune with LoRA
python src/training/easy_train.py --config configs/train_config.json
```

Inference then flows through the full pipeline: Input Text → HMTT → Model Generation → RLS Verification → EB-SLE Update
```bash
# Run evaluation suite
python src/evaluation/evaluate_model.py --benchmark math
```

```bash
# Clone the repository
git clone https://github.com/Prigoistic/ananta-oss.git
cd ananta-oss

# Install dependencies
pip install -r requirements.txt

# Optional: Install HMTT
git checkout HMTT
cd HMTT && pip install -e .
```

```python
from src.training.easy_train import train_model

# Train the model
train_model(
    base_model="deepseek-ai/deepseek-math-7b",
    dataset="deepmind_math",
    output_dir="./models/ananta-v1"
)
```

```python
from HMTT import HMTTEncoder, HMTTDecoder

# Initialize
encoder = HMTTEncoder("path/to/tokenizer.json")
decoder = HMTTDecoder("path/to/tokenizer.json")

# Encode mathematical text
text = "The equation $E = mc^2$ represents mass-energy equivalence"
token_ids = encoder.encode(text)

# Decode
reconstructed = decoder.decode(token_ids)
```

```
ananta/
├── HMTT/              # Hybrid Math-Text Tokenizer (branch)
│   ├── preprocessing/ # Text partitioning & tokenization
│   ├── training/      # BPE vocabulary training
│   ├── inference/     # Encoding & decoding
│   ├── evaluation/    # TFS metrics
│   └── examples/      # Usage examples
├── RLS/               # Recursive Logic Subsystem (planned)
├── EBSL-Engine/       # Energy-Based Self-Learning (planned)
├── src/
│   ├── training/      # Model training scripts
│   ├── data/          # Data processing utilities
│   ├── evaluation/    # Evaluation metrics
│   └── utils/         # Helper functions
├── configs/           # Configuration files
├── demos/             # Demo applications
├── deployment/        # Deployment scripts
│   └── huggingface/   # HF Spaces deployment
├── docs/              # Documentation
│   ├── QUICK_START.md
│   ├── CONTRIBUTING.md
│   └── SIMPLE_README.md
└── tests/             # Test suite
```
| Model | Weakness | Ananta Solution |
|---|---|---|
| GPT/Llama/Gemini | Predict tokens, not logic | Energy + symbolic correctness |
| Diffusion LLMs | No symbolic grounding | Verifier-constrained gradients |
| GAN-style models | Adversarial instability | Cooperative verifier loop |
| DeepSeek-R1 | No formal verification | RLS step-by-step validation |
| o3/DeepMind | Strong but black-box | Transparent & self-formalizing |
- Replace LoRA with RLHF + PPO
- Full EB-SLE engine integration
- Multi-agent scientific reasoning (research agents)
- Continuous-time reasoning models
- Differentiable theorem solvers
- Scientific hypothesis generation engine
- Breaking compute frontier optimization (Moosbauer-Poole inspired)
This project builds upon research in:
- Symbolic AI and automated theorem proving
- Energy-based models (LeCun et al.)
- Chain-of-thought reasoning (Wei et al.)
- Mathematical language models (Lewkowycz et al.)
- Formal verification systems (Lean, Coq, Isabelle)
We welcome contributions! This is an active research project.
Please see CONTRIBUTING.md for guidelines.
- Dataset curation and preprocessing
- Symbolic verification algorithms
- Energy functional design
- Benchmark development
- Documentation improvements
This project is licensed under the MIT License - see the LICENSE file for details.
Project Lead: Priyam Ghosh
Repository: github.com/Prigoistic/ananta-oss
- DeepSeek AI for the base mathematical reasoning model
- The open-source AI research community
- Contributors to symbolic mathematics libraries
- Automated theorem proving research community
Built with the goal of advancing scientific reasoning through symbolic AI
⭐ Star this repo if you believe in verifiable, transparent AI reasoning!