This repository contains the code and data for our NeurIPS 2025 paper:
For Better or for Worse, Transformers Seek Patterns for Memorization
Madhur Panwar, Gail Weiss, Navin Goyal, Antoine Bosselut
@inproceedings{
panwar2025for,
title={For Better or for Worse, Transformers Seek Patterns for Memorization},
author={Madhur Panwar and Gail Weiss and Navin Goyal and Antoine Bosselut},
booktitle={The Thirty-ninth Annual Conference on Neural Information Processing Systems},
year={2025},
url={https://openreview.net/forum?id=98NrkXPRZ9}
}

To set up the environment, create it from neurips.yml:
conda env create -f neurips.yml
For synthetic data, we use custom tokenizers, chosen according to whether the data contains only digits or both digits and letters. The tokenizer files, along with the code that creates them, are in the tokenizer_local directory.
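As a rough illustration of the two vocabularies described above (digits only, or digits and letters), here is a minimal character-level sketch. It is not the code in tokenizer_local: the function names, the special tokens `<pad>` and `<unk>`, the id layout, and the restriction to lowercase letters are all assumptions made for this example.

```python
import string


def build_vocab(use_letters: bool) -> dict[str, int]:
    """Map each allowed character to an integer id (hypothetical layout)."""
    specials = ["<pad>", "<unk>"]  # assumed special tokens, not from the repo
    chars = list(string.digits)
    if use_letters:
        # Assumption: lowercase ASCII letters only.
        chars += list(string.ascii_lowercase)
    return {tok: i for i, tok in enumerate(specials + chars)}


def encode(text: str, vocab: dict[str, int]) -> list[int]:
    """Turn a string into ids, one per character; unknown characters map to <unk>."""
    unk = vocab["<unk>"]
    return [vocab.get(ch, unk) for ch in text]


def decode(ids: list[int], vocab: dict[str, int]) -> str:
    """Invert encode by looking each id up in the reversed vocabulary."""
    inverse = {i: tok for tok, i in vocab.items()}
    return "".join(inverse[i] for i in ids)


digit_vocab = build_vocab(use_letters=False)
alnum_vocab = build_vocab(use_letters=True)

print(len(digit_vocab), len(alnum_vocab))  # 12 38
print(encode("2025", digit_vocab))         # [4, 2, 4, 7]
print(decode(encode("a1b2", alnum_vocab), alnum_vocab))  # a1b2
```

Keeping the digit-only vocabulary separate means a model trained on purely numeric data carries no unused letter embeddings; whether the repository's tokenizers are built this way should be checked against the files in tokenizer_local.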
The synthetic datasets, along with the code that generates them, are in the synthetic_data directory. To generate the synthetic datasets, run python generate_synthetic_data.py.
To run the WikiText experiments, adjust the appropriate parameters and run run_wikitext.sh:
./run_wikitext.sh
To run the experiments on synthetic datasets, adjust the appropriate parameters and run run_synthetic.sh:
./run_synthetic.sh
The code that produces the plots in the paper is in the notebook plots.ipynb.