📄 Paper • 🤗 Models • 💻 Code
Official PyTorch implementation of "Masked Diffusion Language Models with Frequency-Informed Training"
This repository implements a masked diffusion language modeling framework for data-efficient training under strict data constraints. Our approach combines:
- Masked Diffusion Language Models (MDLMs): A discrete diffusion approach that uses masked language modelling (bidirectional context) to generate text
- Novel Noise Schedules: Bimodal Gaussian (our best model) and cosine (the model used in our BabyLM submission)
- Frequency-Informed Masking: Progressive prioritization (curriculum) of rare tokens during training, which integrates seamlessly with the MDLM framework
- NELBO Reweighting: Exploration of different weighting schemes to optimize performance across schedules
Our method achieves competitive performance with state-of-the-art baselines (GPT-BERT) on the BabyLM benchmarks, demonstrating that diffusion-based training offers a viable alternative for data-restricted language learning.
# Clone the repository
git clone https://github.com/DespoinaKK/babylm-diffusion
cd babylm-diffusion
# Install dependencies
pip install -r requirements.txt
For our models, we use the BabyLM corpus for the strict track (100M words). You can find more information here.
First, train a BPE tokenizer on your corpus:
python tokenization/create_tokenizer.py \
--input_path /path/to/data \
--vocab_size 16384 \
--vocab_path ./tokenizers/tokenizer.json
Then tokenize your dataset:
python tokenization/tokenize_corpus.py \
--data_folder /path/to/data \
--train_file data.train \
--valid_file data.valid \
--tokenizer_folder ./tokenizers \
--tokenizer_file tokenizer.json \
--name tokenized
For distributed training, adapt the scripts in slurm-scripts. Our models are trained on a single node with 4 A100 64GB GPUs (you can adapt the setup to use a single GPU by modifying the slurm*.sh scripts). To specify the hyperparameters and noise schedule you want, modify the config files in slurm-scripts.
Cosine Schedule (masking/noise_schedules.py):
- Focuses on lower masking rates
- Average masking rate: 0.36
Gaussian Schedules:
- Unimodal
- Bimodal: Mixture distribution combining low and high masking modes
- Requires derivative softening (γ < 1.0) for stable training (see the sketch below)
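To make the schedules concrete, here is a minimal, self-contained sketch of how a cosine and a bimodal Gaussian masking-rate schedule can be written, together with a softened NELBO-style weight. It is illustrative only: the reference implementation is masking/noise_schedules.py, and the mixture means, widths, weights, and the γ value below are placeholder assumptions rather than the paper's settings.

```python
# Illustrative sketch only -- the reference implementation is masking/noise_schedules.py.
# Mixture means/widths/weights and gamma below are placeholders, not the paper's values.
import torch
from torch.distributions import Normal


def cosine_mask_rate(t: torch.Tensor) -> torch.Tensor:
    """Masking rate 1 - alpha_t with alpha_t = cos(pi * t / 2).
    Averaged over t in [0, 1] this gives 1 - 2/pi ~= 0.36, consistent with the rate above."""
    return 1.0 - torch.cos(0.5 * torch.pi * t)


def bimodal_gaussian_mask_rate(t: torch.Tensor,
                               mu=(0.25, 0.85), sigma=(0.10, 0.10), w=(0.5, 0.5)) -> torch.Tensor:
    """Masking rate given by the (numerically inverted) CDF of a two-component Gaussian
    mixture, so sampled rates cluster around a low mode and a high mode in (0, 1)."""
    grid = torch.linspace(1e-3, 1.0 - 1e-3, 2048)
    cdf = sum(wi * Normal(mi, si).cdf(grid) for wi, mi, si in zip(w, mu, sigma))
    cdf = (cdf - cdf[0]) / (cdf[-1] - cdf[0])              # renormalize on the clipped support
    idx = torch.searchsorted(cdf, t.clamp(0.0, 1.0)).clamp(max=grid.numel() - 1)
    return grid[idx]


def softened_weight(t: torch.Tensor, mask_rate_fn, gamma: float = 0.7,
                    dt: float = 1e-3, eps: float = 1e-3) -> torch.Tensor:
    """MDLM-style NELBO weight |alpha'_t| / (1 - alpha_t), estimated with a finite
    difference and raised to gamma < 1 -- the 'derivative softening' mentioned above."""
    m0 = mask_rate_fn(t)
    m1 = mask_rate_fn((t + dt).clamp(max=1.0))
    d_alpha = -(m1 - m0) / dt                              # alpha_t = 1 - mask_rate(t)
    return (d_alpha.abs() / m0.clamp_min(eps)) ** gamma


# Example: sample per-sequence diffusion times, then masking rates and loss weights.
t = torch.rand(8)
rates = bimodal_gaussian_mask_rate(t)                      # per-sequence masking probabilities
weights = softened_weight(t, bimodal_gaussian_mask_rate)   # per-sequence loss reweighting
```

In an MDLM-style training loop, the sampled rate would set each token's Bernoulli masking probability and the weight would scale the masked cross-entropy term.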
The frequency-informed masking strategy assigns higher masking probabilities to rare tokens. Implementation in masking/frequency_masking.py:
- Token Ranking: Tokens ranked by global corpus frequency
- Min-Max Normalization: Per-sequence normalization to [0,1]
- Softening: Weights raised to power p < 1 to prevent over-emphasis on extremely rare tokens
- Conditional Scaling: Ensures the mean masking probability matches the target rate (1 - α_t)
Optional curriculum learning progressively increases p from 0 to 0.02 across epochs.
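As a rough illustration of the four steps above (ranking, normalization, softening, scaling), the sketch below computes per-token masking probabilities from a vector of global corpus counts. It is a simplified stand-in, not the code in masking/frequency_masking.py; in particular, the final clamp is a simplification of the conditional scaling step, and special-token handling is omitted.

```python
# Illustrative sketch only -- see masking/frequency_masking.py for the reference implementation.
import torch


def frequency_informed_mask_probs(token_ids: torch.Tensor,      # (batch, seq_len) token ids
                                   corpus_counts: torch.Tensor,  # (vocab_size,) global counts
                                   target_rate: torch.Tensor,    # (batch, 1) = 1 - alpha_t
                                   p: float = 0.02) -> torch.Tensor:
    """Per-token masking probabilities that favor rare tokens while keeping each
    sequence's mean masking probability at the schedule's target rate 1 - alpha_t."""
    # 1) Token ranking: rank 0 = most frequent token, so rarer tokens get larger values.
    ranks = corpus_counts.argsort(descending=True).argsort().float()
    rarity = ranks[token_ids]                                    # (batch, seq_len)
    # 2) Min-max normalization per sequence to [0, 1].
    lo = rarity.amin(dim=-1, keepdim=True)
    hi = rarity.amax(dim=-1, keepdim=True)
    norm = (rarity - lo) / (hi - lo).clamp_min(1e-8)
    # 3) Softening: raise to a small power p < 1 so extremely rare tokens are not
    #    over-emphasized. A curriculum would grow p from 0 toward 0.02 over epochs.
    weights = (norm + 1e-8) ** p
    # 4) Scaling: rescale so the per-sequence mean matches the target rate, then clamp
    #    to valid probabilities (a simplification of the conditional scaling step).
    probs = weights * target_rate / weights.mean(dim=-1, keepdim=True)
    return probs.clamp(0.0, 1.0)


# Example: draw the actual mask from the per-token probabilities.
ids = torch.randint(0, 16384, (4, 128))
counts = torch.randint(1, 10_000, (16384,))
rate = torch.full((4, 1), 0.36)
mask = torch.bernoulli(frequency_informed_mask_probs(ids, counts, rate))
```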
Models are evaluated using the BabyLM Challenge evaluation pipeline with an MLM pseudo-likelihood backend:
- Zero-shot: BLiMP, BLiMP Supplement, EWoK, Entity Tracking, COMPS
- Finetuning: GLUE and SuperGLUE subsets. For finetuning, we provide the eval-utils/classifier_model.py helper file, with minor tensor-shape changes to match our models.
- Human-likeness: Reading task, Age of Acquisition, and morphology tasks (correlation with human performance)
Evaluation can be run with or without time conditioning (see paper Section 3.2).
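For reference, here is a minimal sketch of MLM pseudo-likelihood scoring, assuming a generic masked LM that maps token ids to per-position vocabulary logits (our models additionally accept a time input when evaluated with time conditioning). The actual scoring is handled by the BabyLM evaluation pipeline, not this code.

```python
# Illustrative sketch of MLM pseudo-log-likelihood scoring; the BabyLM evaluation pipeline
# implements the real backend (special-token handling, batching, etc.).
import torch


@torch.no_grad()
def pseudo_log_likelihood(model, token_ids: torch.Tensor, mask_id: int) -> torch.Tensor:
    """Sum over positions of log p(x_i | rest of the sentence): mask one position at a time
    and score the original token there. Zero-shot tasks such as BLiMP compare this score
    between the acceptable and unacceptable sentence of a minimal pair."""
    seq_len = token_ids.numel()
    # Build seq_len copies of the sequence, each with a different position masked.
    batch = token_ids.repeat(seq_len, 1)              # (seq_len, seq_len)
    pos = torch.arange(seq_len)
    batch[pos, pos] = mask_id
    logits = model(batch)                             # assumed shape: (seq_len, seq_len, vocab)
    log_probs = logits.log_softmax(dim=-1)
    # Log-probability of each original token at its own masked position.
    return log_probs[pos, pos, token_ids].sum()
```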
See the paper for full results and ablations. We also include results for our new 512-sequence-length model, trained with the bimodal Gaussian noise schedule described in the paper:
Performance comparison on BabyLM Challenge zero-shot tasks:
| Task | Baseline (GPT-BERT) | Submission (Cosine) | Best (Bimodal Gaussian) |
|---|---|---|---|
| BLiMP | 80.5 | 76.9 | 78.2 |
| BLiMP Supplement | 73.0 | 72.4 | 73.6 |
| EWoK | 52.4 | 51.8 | 52.5 |
| COMPS | 59.7 | 56.4 | 56.6 |
| Entity Tracking | 39.9 | 40.8 | 39.7 |
| Model | Downloads |
|---|---|
| Cosine - Submission | |
| Bimodal Gaussian - Best | |
If you find this work useful for your research, please consider citing our paper:
@article{kosmopoulou2025masked,
title={Masked Diffusion Language Models with Frequency-Informed Training},
author={Kosmopoulou, Despoina and Georgiou, Efthymios and Dorovatas, Vaggelis and Paraskevopoulos, Georgios and Potamianos, Alexandros},
journal={arXiv preprint arXiv:2509.05056},
year={2025}
}
This repo is based on work from:
- Simple and Effective Masked Diffusion Language Models
- GPT or BERT: why not both?
- Architecture adapted from LTG-BERT
.
├── main.py                       # Main training entry point
├── model.py                      # Transformer model with diffusion
├── config/                       # Configuration and arguments
│   ├── arguments.py              # CLI argument parsing
│   └── model_configuration.py    # Model architecture config
├── data/                         # Data loading and processing
│   ├── dataset.py                # Dataset implementation
│   ├── dataset_manager.py        # Dataset manager (loading)
│   └── dataset_utils.py          # Utilities for data
├── eval-utils/
│   └── classifier_model.py       # File to use for finetuning
├── masking/                      # Masking strategies
│   ├── noise_schedules.py        # Diffusion noise schedules
│   ├── masking_processor.py      # Token masking logic
│   ├── frequency_masking.py      # Frequency-informed masking
│   └── batch_processing.py       # Batch preparation
├── training/                     # Training infrastructure
│   ├── training_loop.py          # Main training loop
│   ├── ema.py                    # Exponential moving average
│   ├── checkpoint_manager.py     # Checkpoint saving
│   ├── validation.py             # Validation during training
│   ├── model_setup.py            # Model loading and optimizer setup
│   └── distributed_setup.py      # DDP setup
├── tokenization/                 # Tokenization scripts
│   ├── create_tokenizer.py       # Train BPE tokenizer
│   └── tokenize_corpus.py        # Tokenize datasets
├── optimization/                 # Optimizers
│   └── lamb.py                   # LAMB optimizer
└── slurm-scripts/                # Scripts
    ├── slurm-train.sh            # SLURM job script
    ├── launch-train.sh           # Local launch script
    ├── config-cosine.json        # Cosine schedule config example
    └── config-gauss.json         # Gaussian schedule config example
For questions or issues, please open a GitHub issue or contact:
