
Bolmo

The first family of competitive fully open byte-level language models.

[Figure: Bolmo architecture overview]



Bolmo is the first fully open byte-level language model to achieve performance on par with state-of-the-art subword-level language models. Unlike traditional language models that rely on subword tokenizers (like BPE or WordPiece), Bolmo operates directly on raw UTF-8 bytes, making it:

  • Free of subword tokenization: No need for language-specific tokenizers or vocabulary management.
  • Universally applicable: Works seamlessly across all languages, scripts, and domains.
  • Fully open: Complete training code, model weights, data processing pipeline, and paper.
  • Competitive performance: Comes close to matching (and in some cases exceeding) subword-based state-of-the-art models across a wide range of tasks.
  • Better character understanding: Superior performance on tasks requiring character-level knowledge.

See our technical report for details: https://allenai.org/papers/bolmo.

This repository is a fork of OLMo-core that implements the complete Bolmo architecture and training pipeline through byteifying, our approach to converting existing subword models into byte-level models using <1% of the original pretraining budget.

Models

We release Bolmo models in two sizes:

| Model | Parameters | Base Model | HuggingFace |
|---|---|---|---|
| Bolmo-7B | 7.6B | Olmo 3 7B | allenai/Bolmo-7B |
| Bolmo-1B | 1.5B | OLMo 2 1B | allenai/Bolmo-1B |

Training data is available via HuggingFace at allenai/bolmo_mix.

Installation

First install PyTorch according to the instructions specific to your operating system and hardware.

Using UV (Recommended)

git clone https://github.com/allenai/bolmo-core.git
cd bolmo-core
uv venv --python 3.12.12
. .venv/bin/activate
uv sync --frozen --extra xlstm --extra wandb

This installs the exact package versions used during development (as pinned in uv.lock). Installation typically takes less than five minutes.

Using pip

Alternatively, via pip:

pip install -e .[xlstm,wandb]

Optional Dependencies

For full functionality, additional optional dependencies may be required; see the OLMo-core documentation for complete installation details.

We tested Bolmo installation on Ubuntu 24.04 and Rocky Linux release 8.10.

Demo

Inference with HuggingFace

from transformers import AutoModelForCausalLM, AutoTokenizer

device = "cuda"
bolmo = AutoModelForCausalLM.from_pretrained("allenai/Bolmo-7B", trust_remote_code=True).to(device)
tokenizer = AutoTokenizer.from_pretrained("allenai/Bolmo-7B", trust_remote_code=True)

message = ["Language modeling is "]
input_ids = tokenizer(message, return_tensors="pt")["input_ids"].to(device)

# `max_new_tokens` is the number of bytes to generate
response = bolmo.generate(input_ids, max_new_tokens=256, do_sample=True, temperature=0.1)
print(tokenizer.decode(response[0], skip_special_tokens=True))

This should quickly generate a completion on a standard GPU (usually <5min, including downloading the weights from the internet). It should generate something like:

Language modeling is a fundamental task in natural language processing (NLP), aiming to predict the probability of a sequence of words occurring in a given context. This prediction is crucial for various applications, including speech recognition, machine translation, and text generation. [...]

HuggingFace checkpoints vs. olmo-core checkpoints

This codebase uses the olmo-core checkpoint format. Bolmo models can be converted from this format to the HuggingFace format via:

python3 src/examples/huggingface/convert_checkpoint_to_hf.py \
    -i /path/to/bolmo/checkpoint \
    -o /path/to/bolmo/checkpoint/in/hf/format \
    -s 65536 \
    --dtype float32 \
    --skip-validation

Here `-s` sets the maximum sequence length. (A comment after a trailing backslash would break the line continuation, so it is given separately.)

Converting from HF format back to olmo-core is not implemented at the moment. However, we provide the original olmo-core checkpoints for Bolmo 1B and Bolmo 7B in the olmo_core/ subdirectory on HF: 1B, 7B.

Training

Bolmo training uses a two-stage byteifying procedure to convert existing subword models to byte-level:

Stage 1: Subword-to-Byte Distillation

Quickly learn weights for local models while freezing the global model (9.8B tokens ≈ 43B bytes). Training scripts for this stage are available at bolmo_scripts/launch_stage1_*.

To run the Stage 1 training script, first prepare your training data (for example, the bolmo_mix dataset) via the Dolma toolkit; see here for details.

Stage 2: End-to-End Training

Train the entire model to utilize byte-level information (39.3B tokens ≈ 173B bytes). Training scripts for this stage are available at bolmo_scripts/launch_stage2_*.
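The token and byte counts quoted for the two stages imply a compression ratio of roughly 4.4 bytes per subword token. A quick sanity check in plain Python:

```python
# Bytes-per-token ratios implied by the stage budgets quoted above.
stage1_tokens, stage1_bytes = 9.8e9, 43e9
stage2_tokens, stage2_bytes = 39.3e9, 173e9

print(f"Stage 1: {stage1_bytes / stage1_tokens:.2f} bytes/token")  # ~4.39
print(f"Stage 2: {stage2_bytes / stage2_tokens:.2f} bytes/token")  # ~4.40
```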

To run the Stage 2 training script, first prepare your training data (for example, the bolmo_mix dataset) via the Dolma toolkit; see here for details.

Post-Training via Task Arithmetic

Existing post-trained checkpoints can be byteified without additional training using Task Arithmetic:

python3 src/examples/bolmo/instructify.py \
    --output=/path/to/output/ \
    --checkpoint-dir=/path/to/bolmo/checkpoint \
    --base-checkpoint-dir=/path/to/base-olmo/checkpoint \
    --instruct-checkpoint-dir=/path/to/post-trained-olmo/checkpoint \
    --alpha=1.0
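Conceptually, this applies task arithmetic in weight space: the post-training "task vector" (instruct weights minus base weights) is scaled by alpha and added to the byteified checkpoint for the parameters the models share. A minimal sketch of that update, using plain dicts of scalar "weights" standing in for tensors (the actual script operates on olmo-core checkpoints):

```python
def instructify(bolmo, base, instruct, alpha=1.0):
    """Apply task arithmetic: add alpha * (instruct - base), the
    post-training task vector, to the byteified weights."""
    merged = dict(bolmo)
    for name, w in bolmo.items():
        if name in base and name in instruct:
            merged[name] = w + alpha * (instruct[name] - base[name])
    return merged

# Toy example: post-training shifted "w" by +1.0 and "head" by +0.5.
base = {"w": 0.0, "head": 2.0}
instruct = {"w": 1.0, "head": 2.5}
bolmo = {"w": 0.5, "head": 2.0}
print(instructify(bolmo, base, instruct, alpha=1.0))
# {'w': 1.5, 'head': 2.5}
```

Smaller alpha values interpolate toward the pretrained byteified checkpoint; alpha=0 leaves it unchanged.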

Evaluation

Bolmo 7B Results

Bolmo 7B matches or exceeds the performance of state-of-the-art byte-level models and comes close to the source Olmo 3 7B model:

| Category | Bolmo 7B | Olmo 3 7B | BLT 7B |
|---|---|---|---|
| Character Understanding (CUTE) | 78.6 | 56.9 | 52.3 |
| Multilingual Char (EXECUTE) | 71.6 | 55.1 | 46.3 |
| Code | 41.0 | 40.1 | 31.6 |
| Math | 48.9 | 55.3 | 15.7 |
| MC Stem | 65.5 | 66.3 | 49.0 |
| MC Non-Stem | 75.8 | 77.7 | 56.6 |
| GenQA | 70.9 | 72.4 | 68.4 |

Full evaluation results available in the paper.

Reproducing Evaluations

We use olmes for all evaluations.

You can evaluate Bolmo like any subword LLM, except for one gotcha: for loglikelihood evals, the boundary+byte logits (the second half of the vocabulary) need to be folded into the plain byte logits (the first half of the vocabulary):

# Inside the model's forward pass (torch, with torch.nn.functional as F):
# fold boundary+byte probabilities (second half of the vocab) into the
# plain byte probabilities (first half), then mask out the second half.
probs = F.softmax(logits.float(), dim=-1)
probs[..., :self.vocab_size_bolmo] += probs[..., self.vocab_size_bolmo:self.vocab_size_bolmo*2]
logits = torch.log(probs)
logits[..., self.vocab_size_bolmo:self.vocab_size_bolmo*2] = -100_000
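The same folding on a toy distribution, written as a standalone helper in plain Python (a hypothetical `fold_logits` for illustration; the real code operates on batched torch tensors inside the model):

```python
import math

def fold_logits(logits, vocab_size):
    """Fold boundary+byte logits (indices vocab_size..2*vocab_size-1)
    into the plain byte logits (indices 0..vocab_size-1)."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    for i in range(vocab_size):
        probs[i] += probs[vocab_size + i]          # fold second half in
    folded = [math.log(p) if p > 0 else -100_000.0 for p in probs]
    for i in range(vocab_size, 2 * vocab_size):
        folded[i] = -100_000.0                     # mask out second half
    return folded

# Toy vocab of 2 bytes + 2 boundary+byte entries, all logits equal:
# each byte ends up with probability 0.25 + 0.25 = 0.5 after folding.
out = fold_logits([0.0, 0.0, 0.0, 0.0], vocab_size=2)
print([round(math.exp(x), 3) for x in out[:2]])  # [0.5, 0.5]
```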

The fork of OLMES at http://github.com/bminixhofer/olmes implements this option. For example, you can evaluate Bolmo like this:

olmes \
    --model allenai/Bolmo-1B \
    --model-args '{"max_length": 16384, "trust_remote_code": "true", "model_type": "hf_bolmo", "add_bos_token": "true"}' \
    --task bolmo1b \
    --batch-size 32 \
    --output-dir workspace

This will reproduce the evals for Bolmo 1B. See also #6. The olmes task suites for Bolmo 1B and Bolmo 7B are defined here: link.

Citation

To cite Bolmo:

@misc{bolmo,
      title={Bolmo: Byteifying the Next Generation of Language Models}, 
      author={Benjamin Minixhofer and Tyler Murray and Tomasz Limisiewicz and Anna Korhonen and Luke Zettlemoyer and Noah A. Smith and Edoardo M. Ponti and Luca Soldaini and Valentin Hofmann},
      year={2025},
      eprint={2512.15586},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2512.15586}, 
}

For the underlying OLMo-core framework:

@misc{olmo20242olmo2furious,
  title={{2 OLMo 2 Furious}},
  author={{Team OLMo} and Pete Walsh and Luca Soldaini and Dirk Groeneveld and Kyle Lo and Shane Arora and Akshita Bhagia and Yuling Gu and Shengyi Huang and Matt Jordan and Nathan Lambert and Dustin Schwenk and Oyvind Tafjord and Taira Anderson and David Atkinson and Faeze Brahman and Christopher Clark and Pradeep Dasigi and Nouha Dziri and Michal Guerquin and Hamish Ivison and Pang Wei Koh and Jiacheng Liu and Saumya Malik and William Merrill and Lester James V. Miranda and Jacob Morrison and Tyler Murray and Crystal Nam and Valentina Pyatkin and Aman Rangapur and Michael Schmitz and Sam Skjonsberg and David Wadden and Christopher Wilhelm and Michael Wilson and Luke Zettlemoyer and Ali Farhadi and Noah A. Smith and Hannaneh Hajishirzi},
  year={2024},
  eprint={2501.00656},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2501.00656},
}

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
