🎭 Masked Diffusion Language Models
with Frequency-Informed Training

⭐ Winners of the Strict Track (NLP tasks) of the BabyLM Challenge 2025

Oral presentation - BabyLM Workshop @ EMNLP 2025

📄 Paper • 🤗 Models • 💻 Code


Official PyTorch implementation of "Masked Diffusion Language Models with Frequency-Informed Training"

🌟 Overview

This repository implements a masked diffusion language modeling framework for data-efficient training under strict data constraints. Our approach combines:

  • Masked Diffusion Language Models (MDLMs): A discrete diffusion approach that generates text using masked language modeling with bidirectional context
  • Novel Noise Schedules: Bimodal Gaussian (our best model) and cosine (the submitted model)
  • Frequency-Informed Masking: Progressive prioritization (curriculum) of rare tokens during training, which integrates seamlessly with the MDLM framework
  • NELBO Reweighting: Exploration of different weighting schemes to optimize performance across schedules
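
For orientation, the MDLM objective reduces to a cross-entropy over masked positions, reweighted by a factor that depends on the noise schedule α_t. The snippet below is a minimal sketch of that weighted loss, not the repository's training code; the function name and tensor layout are our own illustration.

import torch
import torch.nn.functional as F

def mdlm_weighted_loss(logits, x0, masked, alpha_t, d_alpha_t):
    # logits:    (B, L, V) predictions for the clean tokens x0
    # x0:        (B, L)    clean token ids
    # masked:    (B, L)    True where tokens were replaced by [MASK] at time t
    # alpha_t:   (B,)      keep probability alpha(t) per sequence
    # d_alpha_t: (B,)      derivative alpha'(t) (negative for a decreasing schedule)
    ce = F.cross_entropy(logits.transpose(1, 2), x0, reduction="none")   # (B, L)
    # Standard MDLM weight -alpha'(t) / (1 - alpha(t)); NELBO reweighting explores alternatives to this term.
    weight = (-d_alpha_t / (1.0 - alpha_t)).unsqueeze(1)                 # (B, 1)
    # Only masked positions contribute to the objective.
    return (weight * ce * masked.float()).sum() / masked.float().sum().clamp(min=1)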

Noise Schedule Evolution

🎯 Results

Our method achieves performance competitive with the state-of-the-art GPT-BERT baseline on the BabyLM benchmarks, demonstrating that diffusion-based training offers a viable alternative for data-restricted language learning.

🚀 Installation

# Clone the repository
git clone https://github.com/DespoinaKK/babylm-diffusion
cd babylm-diffusion

# Install dependencies
pip install -r requirements.txt

⚡ Quick Start

1. Data Preparation

Our models are trained on the BabyLM corpus for the strict track (100M words). More information is available on the BabyLM Challenge website.

First, train a BPE tokenizer on your corpus:

python tokenization/create_tokenizer.py \
    --input_path /path/to/data \
    --vocab_size 16384 \
    --vocab_path ./tokenizers/tokenizer.json

Then tokenize your dataset:

python tokenization/tokenize_corpus.py \
    --data_folder /path/to/data \
    --train_file data.train \
    --valid_file data.valid \
    --tokenizer_folder ./tokenizers \
    --tokenizer_file tokenizer.json \
    --name tokenized
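
Assuming the tokenizer is saved in the Hugging Face tokenizers JSON format (which the .json output suggests), it can be loaded for a quick sanity check. A usage sketch; the sample sentence is arbitrary:

from tokenizers import Tokenizer

# Load the BPE tokenizer produced by create_tokenizer.py.
tok = Tokenizer.from_file("./tokenizers/tokenizer.json")

# Encode a sample sentence and inspect the subword pieces.
enc = tok.encode("The child is learning to speak.")
print(enc.tokens)            # subword strings
print(enc.ids)               # integer token ids
print(tok.get_vocab_size())  # should match --vocab_size (16384)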

2. Training

For distributed training, adapt the scripts in slurm-scripts. Our models are trained on a single node with 4 A100 64GB GPUs; you can adapt the setup to use a single GPU by modifying the slurm*.sh scripts. To set the hyperparameters and noise schedules you want, modify the config files in slurm-scripts.

🔑 Key Components

🌊 Noise Schedules

Cosine Schedule (masking/noise_schedules.py):

  • Focuses on lower masking rates
  • Average masking rate: 0.36

Gaussian Schedules:

  • Unimodal
  • Bimodal: Mixture distribution combining low and high masking modes
  • Requires derivative softening (γ < 1.0) for stable training
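
For intuition (not the code in masking/noise_schedules.py): the cosine schedule can be written as α_t = cos(πt/2), whose masking rate 1 - α_t averages 1 - 2/π ≈ 0.36 over t ~ U(0, 1), and a bimodal schedule can be sketched by drawing per-sequence masking rates from a two-component Gaussian mixture. The mixture means, widths, and weights below are illustrative placeholders, not the paper's values:

import math
import torch

def cosine_alpha(t):
    # t: tensor of diffusion times in [0, 1].
    # Keep probability alpha(t); the masking rate is 1 - alpha(t).
    # Averaged over t ~ U(0, 1), the masking rate is 1 - 2/pi ≈ 0.36.
    return torch.cos(0.5 * math.pi * t)

def bimodal_masking_rate(batch_size, lo=0.3, hi=0.85, sigma=0.1, p_high=0.5):
    # Draw each sequence's masking rate from a mixture of a low and a high Gaussian mode.
    # lo, hi, sigma, and p_high are illustrative placeholders, not the paper's parameters.
    use_high = torch.rand(batch_size) < p_high
    means = torch.where(use_high,
                        torch.full((batch_size,), hi),
                        torch.full((batch_size,), lo))
    rates = torch.normal(means, sigma)
    return rates.clamp(1e-3, 1.0 - 1e-3)   # keep rates strictly inside (0, 1)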

📊 Frequency-Informed Masking

The frequency-informed masking strategy assigns higher masking probabilities to rare tokens. Implementation in masking/frequency_masking.py:

  1. Token Ranking: Tokens ranked by global corpus frequency
  2. Min-Max Normalization: Per-sequence normalization to [0,1]
  3. Softening: Weights raised to power p < 1 to prevent over-emphasis on extremely rare tokens
  4. Conditional Scaling: Ensures mean masking probability matches target rate (1 - α_t)

Optional curriculum learning progressively increases p from 0 to 0.02 across epochs.
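
A minimal sketch of these four steps (an illustration of the idea, not the exact logic in masking/frequency_masking.py; the function name and epsilon constants are ours):

import torch

def frequency_informed_probs(token_ids, corpus_freq, target_rate, p=0.02):
    # token_ids:   (B, L) batch of token ids
    # corpus_freq: (V,)   global corpus frequency of each vocabulary item
    # target_rate: (B, 1) desired mean masking probability 1 - alpha_t per sequence
    # p:           softening exponent (p < 1), optionally scheduled from 0 to 0.02 across epochs

    # 1. Token ranking: rank the vocabulary by corpus frequency (rarest tokens get the largest rank).
    rank = torch.argsort(torch.argsort(corpus_freq, descending=True)).float()
    scores = rank[token_ids]                                   # (B, L)

    # 2. Min-max normalization per sequence to [0, 1].
    lo = scores.amin(dim=1, keepdim=True)
    hi = scores.amax(dim=1, keepdim=True)
    weights = (scores - lo) / (hi - lo + 1e-8)

    # 3. Softening: p < 1 limits the extra emphasis on extremely rare tokens.
    weights = weights ** p

    # 4. Conditional scaling: match the mean masking probability to the target rate.
    probs = weights * target_rate / (weights.mean(dim=1, keepdim=True) + 1e-8)
    return probs.clamp(max=1.0)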

📈 Evaluation

Models are evaluated using the BabyLM Challenge evaluation pipeline with the MLM pseudo-likelihood backend:

  • Zero-shot: BLiMP, BLiMP Supplement, EWoK, Entity Tracking, COMPS
  • Finetuning: GLUE and SuperGLUE subsets. For finetuning, we provide the eval-utils/classifier_model.py helper file, with minor tensor-shape changes to match our models.
  • Human-likeness: Reading task, Age of Acquisition, and morphology tasks, scored by correlation with human performance

Evaluation can be run with or without time conditioning (see paper Section 3.2).
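
For context, MLM pseudo-likelihood scores a sentence by masking one position at a time and summing the log-probability the model assigns to each held-out token. A bare-bones sketch, assuming a model that maps token ids to per-position logits and a mask_token_id (this is not the evaluation pipeline's code):

import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_log_likelihood(model, token_ids, mask_token_id):
    # token_ids: (L,) a single tokenized sentence; model returns (1, L, V) logits.
    total = 0.0
    for i in range(token_ids.size(0)):
        masked = token_ids.clone()
        masked[i] = mask_token_id                  # hide position i
        logits = model(masked.unsqueeze(0))        # (1, L, V)
        log_probs = F.log_softmax(logits[0, i], dim=-1)
        total += log_probs[token_ids[i]].item()    # log p(true token | rest)
    return total

In BLiMP-style minimal pairs, the sentence of the pair with the higher pseudo-log-likelihood is taken as the model's choice.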

🎯 Updated Results - Bimodal Gaussian Schedule

See the paper for full results and ablations. We also include results for our new 512-sequence-length model, trained with the bimodal Gaussian noise schedule described in the paper:

Performance comparison on BabyLM Challenge zero-shot tasks:

Task                 Baseline (GPT-BERT)   Submission (Cosine)   Best (Bimodal Gaussian)
BLiMP                80.5                  76.9                  78.2
BLiMP Supplement     73.0                  72.4                  73.6
EWoK                 52.4                  51.8                  52.5
COMPS                59.7                  56.4                  56.6
Entity Tracking      39.9                  40.8                  39.7

🤗 Hugging Face Pretrained Models

Model checkpoints:

  • Cosine - Submission
  • Bimodal Gaussian - Best

πŸ“ Citation

If you find this work useful for your research, please consider citing our paper:

@article{kosmopoulou2025masked,
  title={Masked Diffusion Language Models with Frequency-Informed Training},
  author={Kosmopoulou, Despoina and Georgiou, Efthymios and Dorovatas, Vaggelis and Paraskevopoulos, Georgios and Potamianos, Alexandros},
  journal={arXiv preprint arXiv:2509.05056},
  year={2025}
}

📚 References

This repo is based on work from:

πŸ“ Code Structure

.
├── main.py                      # Main training entry point
├── model.py                     # Transformer model with diffusion
├── config/                      # Configuration and arguments
│   ├── arguments.py             # CLI argument parsing
│   └── model_configuration.py   # Model architecture config
├── data/                        # Data loading and processing
│   ├── dataset.py               # Dataset implementation
│   ├── dataset_manager.py       # Dataset management (loading)
│   └── dataset_utils.py         # Data utilities
├── eval-utils/
│   └── classifier_model.py      # File to use for finetuning
├── masking/                     # Masking strategies
│   ├── noise_schedules.py       # Diffusion noise schedules
│   ├── masking_processor.py     # Token masking logic
│   ├── frequency_masking.py     # Frequency-informed masking
│   └── batch_processing.py      # Batch preparation
├── training/                    # Training infrastructure
│   ├── training_loop.py         # Main training loop
│   ├── ema.py                   # Exponential moving average
│   ├── checkpoint_manager.py    # Checkpoint saving
│   ├── validation.py            # Validation during training
│   ├── model_setup.py           # Model loading and optimizer setup
│   └── distributed_setup.py     # DDP setup
├── tokenization/                # Tokenization scripts
│   ├── create_tokenizer.py      # Train BPE tokenizer
│   └── tokenize_corpus.py       # Tokenize datasets
├── optimization/                # Optimizers
│   └── lamb.py                  # LAMB optimizer
└── slurm-scripts/               # Training and launch scripts
    ├── slurm-train.sh           # SLURM job script
    ├── launch-train.sh          # Local launch script
    ├── config-cosine.json       # Cosine schedule config example
    └── config-gauss.json        # Gaussian schedule config example

📧 Contact

For questions or issues, please open a GitHub issue or contact the authors.
