A RAM-centric architecture that stores parameters in host memory and treats GPUs as transient compute engines, enabling full-precision training of 100B+ models on a single GPU.
Quick Start | Supported Models | Data Preparation | Performance | Citation
- 4/17/2025: Multi-GPU data parallelism — 4x H100 achieves 4.7x super-linear speedup over single GPU (1290 vs 272 TFLOPS on Qwen2.5-7B). No NCCL required — workers read from shared memory independently. Thanks to @ckgresla for the initial multi-GPU PR that inspired this implementation!
- 4/12/2025: Fully integrated with the VERL framework — single H100 GPU GRPO training for Qwen3.5-27B. See RL Training.
- Single GPU, Massive Models -- Train 120B+ models on one GPU by leveraging CPU RAM for parameter storage
- Multi-GPU Data Parallelism -- Scale to multiple GPUs with super-linear speedup via spawn-based workers (no NCCL)
- Universal Model Support -- Any HuggingFace decoder-only model works out of the box via `AutoModelForCausalLM`
- Hybrid Architecture -- Automatic handling of mixed attention (linear + full) and MoE layers
- LlamaFactory-style Data -- Flexible `dataset_info.json` registry with alpaca/sharegpt format support
- 1.84x Faster -- Outperforms DeepSpeed ZeRO-3 on 14B models through pipelined, double-buffered execution
- YAML Configuration -- Easy model/dataset/hyperparameter setup with 25+ ready-made configs
```bash
# Install
git clone https://github.com/DLYuanGod/MegaTrain.git
cd MegaTrain
pip install -e .

# SFT: Train with built-in demo data
python examples/sft/train.py --config examples/sft/configs/llama3_8b.yaml

# SFT: Train any supported model
python examples/sft/train.py --config examples/sft/configs/qwen3_5_27b.yaml

# SFT: Multi-GPU data parallel (no NCCL needed)
python examples/sft/train.py --config examples/sft/configs/qwen_7b_4gpu.yaml

# RL (GRPO): Single-GPU GRPO via VERL + MegaTrain + SGLang
CUDA_VISIBLE_DEVICES=0 bash examples/rl/run_qwen2_5_7b_megatrain.sh
```

| Model Family | Model Sizes | Architecture |
|---|---|---|
| Qwen2/Qwen2.5 | 0.5B/1.5B/3B/7B/14B/32B/72B | Dense |
| Qwen3 | 0.6B/1.7B/4B/8B/14B/32B | Dense |
| Qwen3.5 | 0.8B/2B/4B/9B/27B | Hybrid (linear+full attn) |
| Qwen3.5 MoE | 35B-A3B/122B-A10B/397B-A17B | Hybrid + MoE |
| Qwen3-Next | 80B-A3B | Hybrid + MoE |
| Llama 2 | 7B/13B/70B | Dense |
| Llama 3/3.1/3.2/3.3 | 1B/3B/8B/70B | Dense |
| Llama 4 | Scout-17B-16E/Maverick | MoE |
| Mistral | 7B | Dense |
| Mixtral | 8x7B/8x22B | MoE |
| DeepSeek (LLM/Code/R1) | 7B/16B/67B | Dense |
| Phi-3/Phi-4 | 3.8B/14B | Dense |
| Gemma 2/3 | 2B/7B/9B/27B | Dense |
| GLM-4/GLM-4.5 | 9B/32B | Dense |
| InternLM 2/2.5 | 7B/20B | Dense |
| Yi 1.5 | 6B/9B/34B | Dense |
| Baichuan 2 | 7B/13B | Dense |
| GPT-OSS | 20B/120B | MoE |
| **Vision-Language Models (VLM)** | | |
| Qwen2-VL/Qwen2.5-VL | 2B/7B/72B | VLM (ViT + LLM) |
| Qwen3-VL | 2B/4B/8B/32B | VLM (ViT + LLM) |
| Qwen3.5-VL | 7B+ | VLM (ViT + Hybrid LLM) |
| LLaVA/LLaVA-NeXT | 7B/13B/34B | VLM |
| InternVL 2/2.5 | 2B/8B/26B/76B | VLM |
| Gemma 3 VL | 4B/12B/27B | VLM |
| GLM-4V | 9B | VLM |
| MiniCPM-V | 2B/8B | VLM |
| Llama 4 VL | Scout/Maverick | VLM + MoE |
| Any HF decoder-only model | Any size | Auto-detected |
| Any HF VLM model | Any size | Auto-detected |
MegaTrain uses HuggingFace's `AutoModelForCausalLM`/`AutoModelForImageTextToText` with automatic model structure discovery. Both LLM and VLM models are supported without code changes. Vision encoders are CPU-offloaded just like decoder layers — GPU only holds what's currently computing.
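The "structure discovery" step can be pictured as a duck-typed walk over the module tree. The sketch below is purely illustrative (the function and class names are made up, not MegaTrain's real API, and real HF models use `nn.ModuleList` rather than plain lists): it finds the first attribute that looks like a stack of repeated decoder blocks.

```python
# Hypothetical sketch of automatic decoder-layer discovery (names are
# illustrative, not MegaTrain's actual API): walk the module tree and
# return the first list-like stack of identically-typed blocks.
def find_decoder_layers(model):
    stack = [model]
    while stack:
        module = stack.pop()
        for name, child in vars(module).items():
            # a decoder stack looks like a list of >1 same-typed blocks
            if isinstance(child, list) and len(child) > 1 \
                    and len({type(c) for c in child}) == 1:
                return name, child
            if hasattr(child, "__dict__"):
                stack.append(child)
    raise ValueError("no decoder layer stack found")

# Toy stand-in for the usual HF layout: model.model.layers = [Block, ...]
class Block: pass

class Inner:
    def __init__(self):
        self.layers = [Block(), Block(), Block()]

class Toy:
    def __init__(self):
        self.model = Inner()

name, layers = find_decoder_layers(Toy())
print(name, len(layers))  # layers 3
```

Because discovery keys on the shape of the module tree rather than on class names, new decoder-only architectures need no per-model code.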
MegaTrain supports a LlamaFactory-compatible data system with flexible format support.
Register datasets in `data/dataset_info.json` and reference them by name:
```yaml
dataset:
  name: "alpaca_en_demo"   # name from dataset_info.json
  dataset_dir: "data"
  max_seq_len: 1024
```

Supports alpaca format, sharegpt format, local JSON/JSONL files, and HuggingFace Hub datasets. See `data/README.md` for details.
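A registry entry might look like the following. This is a sketch in the LlamaFactory style that the registry follows; the exact field names may differ between versions, so treat it as illustrative rather than the authoritative schema:

```python
import json

# Illustrative dataset_info.json entries (LlamaFactory-style; field names
# are an assumption, check data/README.md for the real schema).
dataset_info = {
    "alpaca_en_demo": {                  # local alpaca-format file under data/
        "file_name": "alpaca_en_demo.json",
        "formatting": "alpaca",
    },
    "metamath": {                        # pulled from the HuggingFace Hub
        "hf_hub_url": "meta-math/MetaMathQA",
        "formatting": "alpaca",
        "columns": {"prompt": "query", "response": "response"},
    },
}

print(json.dumps(dataset_info, indent=2))
```

Once registered, a dataset is referenced in YAML configs purely by its key, so the same config works whether the data lives locally or on the Hub.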
```yaml
dataset:
  path: "/path/to/arrow/dataset"
  query_field: "query"
  response_field: "response"
```

| Dataset | Source | Format |
|---|---|---|
| alpaca_en_demo | Built-in | Alpaca |
| MetaMathQA | HuggingFace Hub | Alpaca |
| Open-Platypus | HuggingFace Hub | Alpaca |
| MathInstruct | HuggingFace Hub | Alpaca |
| CodeAlpaca-20k | HuggingFace Hub | Alpaca |
| ShareGPT4 | HuggingFace Hub | ShareGPT |
| UltraChat-200k | HuggingFace Hub | ShareGPT |
| OpenThoughts-114k | HuggingFace Hub | ShareGPT |
| OpenR1-Math-94k | HuggingFace Hub | ShareGPT |
> [!CAUTION]
> Do NOT guess the `batch_size`! Use our resource calculator to find the optimal batch size for your hardware. A wrong batch size leads to OOM or an underutilized GPU.
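The intuition behind the calculator can be sketched in a few lines. Since weights stream through the GPU layer by layer, resident GPU memory is dominated by activations, which grow linearly with batch size. The function below is NOT the actual `scripts/calc_resource.py` logic, and the hidden size in the example is a made-up number for illustration:

```python
# Back-of-envelope activation-memory sketch (illustrative only, not the
# real scripts/calc_resource.py; hidden=5120 is a made-up example value).
def activation_gb(batch_size, seq_len, hidden, ckpt_interval, bytes_per_elem=2):
    # With gradient checkpointing every ckpt_interval layers, roughly that
    # many layers of bf16 activations are live between checkpoints.
    per_layer = batch_size * seq_len * hidden * bytes_per_elem
    return ckpt_interval * per_layer / 1e9

# e.g. batch 64, seq 1024, hidden 5120, checkpoint every 4 layers:
print(round(activation_gb(64, 1024, 5120, 4), 2))  # 2.68
```

The real calculator also has to account for attention buffers, optimizer transfer slabs, and fragmentation, which is exactly why guessing is discouraged.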
```bash
python scripts/calc_resource.py
```

```yaml
model:
  name: "Qwen/Qwen3.5-27B"
  dtype: "bfloat16"
  attn_implementation: "flash_attention_2"

dataset:
  name: "metamath"
  max_seq_len: 1024

training:
  batch_size: 64        # <-- Use calc_resource.py to determine this!
  num_steps: 500
  learning_rate: 1.0e-5

optimizer:
  type: "deepspeed_adam"
```

See `examples/sft/configs/` for ready-made SFT configurations.
See examples/rl/ for GRPO training scripts (VERL + MegaTrain).
| Config | Model | Architecture |
|---|---|---|
| `qwen_7b.yaml` | Qwen 2.5 7B | Dense |
| `qwen3_8b.yaml` | Qwen 3 8B | Dense |
| `qwen3_5_27b.yaml` | Qwen 3.5 27B | Hybrid (linear+full attn) |
| `qwen3_next_80b.yaml` | Qwen3-Next 80B-A3B | Hybrid + MoE |
| `glm4_flash.yaml` | GLM-4.7-Flash | MoE |
| `llama3_8b.yaml` | Llama 3.1 8B | Dense |
| `gpt_oss_20b.yaml` | GPT-OSS 20B | MoE |
MegaTrain supports single-GPU RL post-training via GRPO (Group Relative Policy Optimization), fully integrated with the VERL framework.
On a single GPU, three components coexist without weight reloading:
| Component | Where | GPU Memory |
|---|---|---|
| SGLang (FP8) | GPU — rollout inference | ~3.5 GB/B params (FP8 weights + KV cache) |
| MegaTrain | CPU→GPU — actor & ref training | ~4-9 GB transient (layer-by-layer streaming) |
| VERL | Orchestration — data, advantages, logging | Minimal |
MegaTrain stores all parameters and optimizer states in CPU RAM (~12 GB per 1B params). The GPU only holds one layer at a time during training, while SGLang's FP8 model weights stay resident for fast rollout generation.
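The "~12 GB per 1B params" figure follows from simple per-parameter accounting. A sketch, under the assumption that CPU RAM holds fp32 master weights plus the two fp32 Adam moment buffers (and ignoring the transient bf16 shard streamed to the GPU):

```python
# Where "~12 GB of CPU RAM per 1B parameters" comes from (sketch; assumes
# fp32 master weights + fp32 Adam moments, excludes the streamed bf16 copy).
def cpu_ram_gb(billions_of_params):
    fp32_master = 4   # bytes per parameter
    adam_m = 4        # Adam first moment
    adam_v = 4        # Adam second moment
    return billions_of_params * (fp32_master + adam_m + adam_v)

print(cpu_ram_gb(7))   # Qwen2.5-7B  -> 84 GB
print(cpu_ram_gb(27))  # Qwen3.5-27B -> 324 GB
```

This is why host RAM, not GPU memory, is the binding constraint when sizing a machine for large models.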
```bash
# Qwen2.5-7B — recommended starting point (fast iteration, fits easily on 80GB)
CUDA_VISIBLE_DEVICES=0 bash examples/rl/run_qwen2_5_7b_megatrain.sh

# Qwen3.5-27B — full-scale single-GPU GRPO
CUDA_VISIBLE_DEVICES=0 bash examples/rl/run_qwen3_5_27b_megatrain.sh
```

Use a local model path instead of downloading from HuggingFace:
```bash
MODEL_PATH=/path/to/Qwen2.5-7B \
CUDA_VISIBLE_DEVICES=0 bash examples/rl/run_qwen2_5_7b_megatrain.sh
```

Override any parameter via the Hydra CLI:
```bash
CUDA_VISIBLE_DEVICES=0 bash examples/rl/run_qwen2_5_7b_megatrain.sh \
    data.train_batch_size=16 \
    actor_rollout_ref.rollout.n=4 \
    actor_rollout_ref.actor.optim.lr=5e-7
```

| Model | Batch Size | n | GPU Memory | Time/Step | Throughput |
|---|---|---|---|---|---|
| Qwen2.5-7B | 8 | 2 | ~62 GB | ~60s | ~120 tok/s |
| Qwen3.5-27B | 2 | 2 | ~50 GB | ~230s | ~24 tok/s |
All parameters are standard VERL Hydra configs. Key MegaTrain-specific knobs:
```bash
# Use MegaTrain as training backend (actor + reference)
model_engine=megatrain
actor_rollout_ref.actor.strategy=megatrain
actor_rollout_ref.ref.strategy=megatrain

# MegaTrain engine tuning
actor_rollout_ref.actor.megatrain.checkpoint_interval=4   # gradient checkpoint every N layers
actor_rollout_ref.actor.megatrain.num_grad_slabs=12       # async gradient buffer count
actor_rollout_ref.actor.megatrain.max_seq_len=1536        # max sequence length

# SGLang FP8 rollout
actor_rollout_ref.rollout.name=sglang
actor_rollout_ref.rollout.quantization=fp8
actor_rollout_ref.rollout.gpu_memory_utilization=0.5      # fraction for KV cache (after model weights)
```

See `examples/rl/` for ready-to-run scripts, and `verl/workers/engine/megatrain/` for the engine implementation.
- Double buffering for overlapped CPU→GPU weight transfer
- Per-layer structure grouping for hybrid/MoE architectures
- Gradient checkpointing every K layers to reduce GPU memory
- Async gradient collection with slab pool and worker thread
- FP8 quantized rollout via SGLang for 2x memory savings on inference
- ref_in_actor pointer swap — zero-cost reference log-prob computation
- HuggingFace native Flash Attention integration
- DeepSpeed CPUAdam for 5-7x faster optimizer steps
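The double buffering at the top of this list can be sketched with a bounded prefetch queue: while "the GPU" computes layer i, a background thread copies layer i+1 into the spare buffer, so transfer and compute overlap. This is illustrative Python only (the real implementation would use pinned memory and CUDA streams, not threads and a queue):

```python
import threading
import queue

# Illustrative double-buffering sketch (not the actual CUDA-stream code):
# a bounded queue of size num_buffers plays the role of the two GPU-side
# weight buffers; the producer thread is the async host-to-device copy.
def stream_layers(layers, compute, prefetch, num_buffers=2):
    q = queue.Queue(maxsize=num_buffers)   # at most num_buffers prefetched

    def producer():
        for layer in layers:
            q.put(prefetch(layer))         # "CPU -> GPU copy" of next layer
        q.put(None)                        # sentinel: no more layers

    threading.Thread(target=producer, daemon=True).start()

    outputs = []
    while (buf := q.get()) is not None:
        outputs.append(compute(buf))       # "GPU forward" on the ready buffer
    return outputs

# Toy usage: identity as the copy, squaring as the per-layer compute.
print(stream_layers([1, 2, 3], lambda x: x * x, lambda x: x))  # [1, 4, 9]
```

The bounded queue is the key design point: it caps GPU-side buffer memory at `num_buffers` layers while still letting copy and compute proceed concurrently.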
```bash
git clone https://github.com/DLYuanGod/MegaTrain.git
cd MegaTrain
pip install -e .

# Optional: faster attention & optimizer
pip install flash-attn
pip install flash-linear-attention causal-conv1d   # for Qwen3.5 linear attention
pip install deepspeed                              # for CPUAdam optimizer

# Required for RL (GRPO) training
pip install verl                                   # VERL framework
pip install "sglang[all]"                          # SGLang rollout engine
```

Out of Memory?
- Reduce `batch_size` in config
- Increase `checkpoint_interval`
- Reduce `max_seq_len`
Slow Training?
- Use the `deepspeed_adam` optimizer (5-7x faster than PyTorch AdamW)
- Install Flash Attention
- Install `flash-linear-attention` + `causal-conv1d` for Qwen3.5 models
- Increase `num_workers` for data loading
New Model Not Working?
- Ensure it's a decoder-only model (not encoder-decoder like T5)
- Set `trust_remote_code: true` in the config if the model requires it
- Try `attn_implementation: "sdpa"` or `"eager"` if flash attention fails
If you use MegaTrain in your research, please cite:
```bibtex
@misc{yuan2026megatrainprecisiontraining100b,
      title={MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU},
      author={Zhengqing Yuan and Hanchi Sun and Lichao Sun and Yanfang Ye},
      year={2026},
      eprint={2604.05091},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2604.05091},
}
```

This project benefits from the following open-source works:
- LLaMA-Factory -- Our data loading system (`dataset_info.json` registry, alpaca/sharegpt format support) is inspired by LlamaFactory's elegant dataset management design. Thanks to @hiyouga and all contributors.
- VERL -- RL post-training framework. MegaTrain integrates as a VERL training backend for single-GPU GRPO/PPO/DPO training.
- HuggingFace Transformers -- Universal model loading and native Flash Attention integration.
- DeepSpeed -- SIMD-accelerated CPUAdam optimizer.
- Flash Attention -- Memory-efficient attention and cross-entropy loss.
- Flash Linear Attention -- Efficient linear attention kernels for hybrid models like Qwen3.5.
This repository is licensed under the Apache-2.0 License.