Improving Symbolic Music Generation with Inference-Time Alignment
Text2midi-InferAlign is an inference-time technique that enhances symbolic music generation by improving alignment between generated compositions and textual prompts. It is designed to extend autoregressive models—like Text2Midi—without requiring any additional training or fine-tuning.
Our method introduces two lightweight but effective alignment-based objectives into the generation process:
- 🎵 Text-Audio Consistency: Encourages the temporal structure of the music to reflect the rhythm and pacing implied by the input caption.
- 🎵 Harmonic Consistency: Penalizes musically inconsistent notes (e.g., out-of-key or dissonant phrases), promoting tonal coherence.
By incorporating these alignment signals into the decoding loop, Text2midi-InferAlign produces music that is not only more faithful to textual descriptions but also harmonically robust.
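To make this concrete, the sketch below shows one way such reward-guided decoding can be structured: extend several candidate sequences a block of tokens at a time, score the partial results with the two alignment objectives, and keep the best candidate. The callables and defaults here are illustrative placeholders, not the exact API of this repository; the `reward` callable is assumed to close over the input caption.

```python
# Minimal, self-contained sketch of inference-time reward-guided decoding.
# `step` and `reward` stand in for the repository's model sampling and
# alignment scoring; names and defaults are illustrative placeholders.
from typing import Callable, List

def guided_decode(
    step: Callable[[List[int], int], List[int]],  # samples n new tokens onto a sequence
    reward: Callable[[List[int]], float],         # text-audio + harmonic consistency score
    beams: int = 4,
    batch_size: int = 32,
    max_tokens: int = 512,
) -> List[int]:
    """Generate tokens in blocks, score partial sequences, and keep the best beam."""
    candidates: List[List[int]] = [[] for _ in range(beams)]
    generated = 0
    while generated < max_tokens:
        # Extend every candidate by a block of tokens (sampling inside `step`
        # keeps the beams diverse), then rank the partial sequences by reward.
        candidates = [step(c, batch_size) for c in candidates]
        generated += batch_size
        best = max(candidates, key=reward)
        # Branch the next round of candidates from the best partial sequence.
        candidates = [list(best) for _ in range(beams)]
    return max(candidates, key=reward)
```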
We evaluate our technique on Text2Midi, a state-of-the-art text-to-MIDI generation model, and report improvements in both objective metrics and human evaluations.
This repository contains the implementation of the Inference-Time Alignment module. Follow the steps below to get started.
```bash
git clone https://github.com/AMAAI-Lab/t2m-inferalign.git
cd t2m-inferalign
```

We recommend using Python 3.10 and conda for environment management.

```bash
conda create -n alignment python=3.10
conda activate alignment
pip install -r requirements.txt
```

Please export your API key:

```bash
export ANTHROPIC_API_KEY=<your key>
```

Alternatively, you can set the key directly in your environment from code.
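If you prefer not to export the variable in your shell, a minimal way to set it from Python before running the scripts:

```python
# Set the Anthropic API key for the current process instead of exporting it in the shell.
import os

os.environ["ANTHROPIC_API_KEY"] = "<your key>"
```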
- Download the pretrained Text2Midi model from HuggingFace:
  🔗 https://huggingface.co/amaai-lab/text2midi
- Also download the corresponding tokenizer and soundfonts:
  🔗 https://huggingface.co/amaai-lab/text2midi/tree/main/
You may choose to organize them like this:
```
t2m-inferalign/
├── checkpoints/
│   └── pytorch_model.bin
├── tokenizer/
│   └── vocab_remi.pkl
├── soundfonts/
│   └── soundfont.sf2
```
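If you prefer to fetch the files from a script, the sketch below uses the `huggingface_hub` library; the repository id comes from the links above, and the filenames are assumed from the layout above, so verify them on the model page.

```python
# Sketch: download the checkpoint and tokenizer with huggingface_hub.
# Filenames are assumed from the directory layout above; verify them on the model page.
from huggingface_hub import hf_hub_download

hf_hub_download(repo_id="amaai-lab/text2midi", filename="pytorch_model.bin", local_dir="checkpoints")
hf_hub_download(repo_id="amaai-lab/text2midi", filename="vocab_remi.pkl", local_dir="tokenizer")
# A soundfont hosted in the same repo could be fetched the same way.
```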
Before generating, make sure the soundfont path used for audio rendering points to your downloaded soundfont (e.g., `soundfonts/soundfont.sf2`).
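For reference only, rendering a generated MIDI file to audio with a soundfont typically looks like the sketch below. It uses the `midi2audio` wrapper around FluidSynth purely as an illustration; this is not necessarily the exact mechanism used in this repository.

```python
# Illustration only: render a generated MIDI file to WAV with a soundfont.
# Requires FluidSynth installed on the system and `pip install midi2audio`.
from midi2audio import FluidSynth

fs = FluidSynth("soundfonts/soundfont.sf2")  # soundfont path from the layout above
fs.midi_to_audio("outputs/lullaby.mid", "outputs/lullaby.wav")
```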
```bash
python progressive_explorer.py \
    --caption "A gentle piano lullaby with soft melodies" \
    --model_path checkpoints/pytorch_model.bin \
    --tokenizer_path tokenizer/vocab_remi.pkl \
    --output_path outputs/lullaby.mid
```

Optional arguments:

- `--max_tokens`: Maximum number of tokens in the generated sequence.
- `--batch_size`: Number of tokens to generate before checking rewards.
- `--beams`: Number of parallel sequences to generate.
We evaluate on the MidiCaps dataset using six standard metrics. Our approach outperforms the Text2Midi baseline in all key alignment and tonal consistency metrics.
| Metric | Text2Midi | Text2midi-InferAlign |
|---|---|---|
| CR (Compression Ratio) ↑ | 2.16 | 2.31 |
| CLAP (Text-Audio Consistency) ↑ | 0.17 | 0.22 |
| TB (Tempo Bin %) ↑ | 29.73 | 35.41 |
| TBT (Tempo Bin w/ Tolerance %) ↑ | 60.06 | 62.59 |
| CK (Correct Key %) ↑ | 13.59 | 29.80 |
| CKD (Correct Key w/ Duplicates %) ↑ | 16.66 | 32.54 |
All results are averaged over the MidiCaps test set.
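As an illustration of what the key-related metrics capture, a detected key for a generated MIDI file can be compared against the key mentioned in its caption. The snippet below uses `music21` for key estimation and is only a sketch, not the evaluation code behind the numbers above.

```python
# Illustrative sketch of a correct-key style check (not the official evaluation code).
# Requires `pip install music21`.
from music21 import converter

def detected_key(midi_path: str) -> str:
    """Estimate the key of a MIDI file, e.g. 'C major' or 'A minor'."""
    score = converter.parse(midi_path)
    key = score.analyze("key")
    return f"{key.tonic.name} {key.mode}"

# Example: compare against a key named in the caption.
print(detected_key("outputs/lullaby.mid") == "C major")
```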
A user study was conducted with 24 participants, comparing outputs from Text2Midi and Text2midi-InferAlign. Participants rated musical quality and text-audio alignment.
| Evaluation Criteria | Text2Midi (%) | Text2midi-InferAlign (%) |
|---|---|---|
| Music Quality | 31.25 | 68.75 |
| Text-Audio Match | 41.67 | 58.33 |

Preference by caption type:

| Caption Type | Text2Midi (%) | Text2midi-InferAlign (%) |
|---|---|---|
| MidiCaps Caption | 48.33 | 51.67 |
| Free Text Caption | 27.78 | 72.22 |
These results indicate that Text2midi-InferAlign improves both musical quality and text-audio alignment, especially for free-form, open-ended prompts.
If you find this work useful in your research, please cite:
```bibtex
@article{text2midi-inferalign,
  title={Text2midi-InferAlign: Improving Symbolic Music Generation with Inference-Time Alignment},
  author={Roy, Abhinaba and Puri, Geeta and Herremans, Dorien},
  journal={arXiv preprint arXiv:2505.12669},
  year={2025}
}
```