Data and code for the following papers:
ACL'24 Findings (Full-Length Paper): Teaching Large Language Models an Unseen Language on the Fly
ICLR'24 Tiny Paper: Can LLMs Learn a New Language on the Fly? A Case Study on Zhuang
We present ZhuangBench, a collection of NLP resources for Zhuang (壮语), a low-resource language spoken in China.
It consists of a Zhuang-Chinese dictionary, a Zhuang-Chinese parallel corpus, and a Zhuang-Chinese machine translation test set.
Important: Preventing Test Set Contamination
We encrypted the source files of ZhuangBench in data.zip to prevent test set contamination.
The password is zhuangbench.
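As a minimal sketch, the archive can be unpacked from Python with the standard-library zipfile module. The helper name extract_archive and the output directory are illustrative assumptions, not part of this repository. The sketch also assumes the archive uses legacy ZIP encryption: zipfile cannot decrypt AES-encrypted archives, so if extraction fails, an external tool such as 7-Zip may be needed instead.

```python
import zipfile
from pathlib import Path


def extract_archive(zip_path, out_dir, password):
    """Extract a password-protected ZIP archive and return its member names.

    Hypothetical helper for illustration. Only legacy ZIP encryption is
    supported by the standard library.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as zf:
        # The password must be passed as bytes.
        zf.extractall(path=out, pwd=password.encode("utf-8"))
        return sorted(zf.namelist())


# Example (assumes data.zip sits in the current directory):
# extract_archive("data.zip", "data", "zhuangbench")
```

Keeping the extracted files out of any public, crawlable location preserves the purpose of the encryption.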
List of files:
dictionary_za2zh.jsonl: Zhuang-Chinese dictionary.
dictionary_zh2za.jsonl: Chinese-Zhuang dictionary.
parallel_corpus.json: Zhuang-Chinese parallel corpus.
test_translation_set.json: Zhuang-Chinese machine translation test set.
preprocessed/dictionary_za2zh_web+giza.jsonl: Zhuang-Chinese dictionary augmented with BLI from Giza++.
preprocessed/dictionary_zh2za_web+giza+synonym.jsonl: Chinese-Zhuang dictionary augmented with BLI from Giza++ and synonyms.
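The files above come in two formats, .jsonl (one JSON object per line) and .json (a single JSON document). The following sketch shows one way to load each; the helper names load_jsonl and load_json and the data/ directory in the example are assumptions for illustration. The field names inside each record are not documented here, so it is worth printing the first record to inspect the schema before relying on any particular key.

```python
import json


def load_jsonl(path):
    """Read a JSON Lines file into a list, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


def load_json(path):
    """Read a single JSON document."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


# Example (assumes the archive was extracted into ./data):
# dictionary = load_jsonl("data/dictionary_za2zh.jsonl")
# corpus = load_json("data/parallel_corpus.json")
# print(len(dictionary), dictionary[0])
```

Opening the files explicitly as UTF-8 avoids platform-dependent decoding errors on the Chinese text.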
Our ICLR'24 Tiny Paper uses a beta version of the dataset, ZhuangBench-Beta. We provide the data in data-beta-version.zip (password: zhuangbench-beta).
This data is for archival purposes only. We recommend using the newer data in data.zip, which is larger and includes typo corrections.
We provide the code for DiPMT++ to reproduce the results in the paper.
Install the dependencies:
pip install -r requirements.txt
Use the scripts in ./scripts to run the LLMs and evaluate the results.
The license for the code and data is MIT.
@inproceedings{zhang2024teaching,
title={Teaching Large Language Models an Unseen Language on the Fly},
author={Zhang, Chen and Liu, Xiao and Lin, Jiuheng and Feng, Yansong},
booktitle={Findings of the Association for Computational Linguistics: ACL 2024},
pages={8783--8800},
year={2024}
}
@inproceedings{zhang2024can,
title={Can {LLM}s Learn a New Language on the Fly? A Case Study on Zhuang},
author={Chen Zhang and Mingxu Tao and Quzhe Huang and Zhibin Chen and Yansong Feng},
booktitle={The Second Tiny Papers Track at ICLR 2024},
year={2024},
}
