This repository contains the code and results for our EMNLP 2025 paper on incomplete tokens in byte-level tokenizers.
If you find this work useful, please cite the following paper:
Eugene Jang, Kimin Lee, Jin-Woo Chung, Keuntae Park, and Seungwon Shin. 2025. Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 18220–18227, Suzhou, China. Association for Computational Linguistics.