Code and data from the Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers paper

genesith/ImprobableBigrams

Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers

This repository contains the code and results for our EMNLP 2025 paper on incomplete tokens in byte-level tokenizers.

Citation

If you find these ideas useful, please cite the following paper:

Eugene Jang, Kimin Lee, Jin-Woo Chung, Keuntae Park, and Seungwon Shin. 2025. Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 18220–18227, Suzhou, China. Association for Computational Linguistics.