Hanjia Lyu1, Jiebo Luo1, Jian Kang1, Allison Koenecke2
1 University of Rochester
2 Cornell University
Accepted for publication in FAccT 2025
Also to be presented at IC2S2 2025 as a parallel talk
- Introduction
- Example Usage
- Prompts of Our Benchmark Dataset
- Datasets We Used to Create Our Prompts
- Requirements
- Reproducibility
- Figure 2
- Figure 3
- Figure 4
- Figure 5
- Figure 6
- Figure 7
- Figure 8
- Figure 9
- Figure 10
- Figure 11
- Tables 2 & 21
- Table 11
- Table 13
- Tables 14-17
- Table 18
- Table 19
- Table 20
- Table 22
- Table 23
- Table 24
- Table 25
- Table 26
- Table 27
- Table 28
- Table 29
- Table 30
- Table 31
- Table 32
- Table 33
- Table 34
- Table 35
- Table 36
- Table 37
- Citation
While the capabilities of Large Language Models (LLMs) have been studied in both Simplified and Traditional Chinese, it remains unclear whether LLMs exhibit differential performance when prompted in these two variants of written Chinese. This understanding is critical, as disparities in the quality of LLM responses can perpetuate representational harms by ignoring the different cultural contexts underlying Simplified versus Traditional Chinese, and can exacerbate downstream harms in LLM-facilitated decision-making in domains such as education or hiring.
To investigate potential LLM performance disparities, we design two benchmark tasks that reflect real-world scenarios: regional term choice (prompting the LLM to name a described item which is referred to differently in Mainland China and Taiwan), and regional name choice (prompting the LLM to choose who to hire from a list of names in both Simplified and Traditional Chinese).
For both tasks, we audit the performance of 11 leading commercial LLM services and open-source models---spanning those primarily trained on English, Simplified Chinese, or Traditional Chinese. Our analyses indicate that biases in LLM responses are dependent on both the task and prompting language: while most LLMs disproportionately favored Simplified Chinese responses in the regional term choice task, they surprisingly favored Traditional Chinese names in the regional name choice task. We find that these disparities may arise from differences in training data representation, written character preferences, and tokenization of Simplified and Traditional Chinese.
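For intuition, here is a hand-written toy illustration of the two task types. It is not drawn from the benchmark CSVs described below; the term pair 软件 / 軟體 ("software") is simply a well-known case where Mainland China and Taiwan use different words, and the candidate names are placeholders.

```python
# Toy illustration only -- NOT the benchmark's actual prompts or data
# (the real prompts live in prompt/regional_term/ and prompt/regional_name/).

# Regional term choice: the model is given a description and asked to name the item.
# "Software" is referred to differently in the two regions:
regional_term_example = {
    "definition": "programs and operating information used by a computer",
    "mainland_term": "软件",   # Simplified Chinese, Mainland China usage
    "taiwan_term": "軟體",     # Traditional Chinese, Taiwan usage
}

# Regional name choice: the model is shown candidate names written in Simplified
# and Traditional Chinese and asked which candidate to hire.
regional_name_example = {
    "task": "hiring decision",
    "candidates": ["<name in Simplified Chinese>", "<name in Traditional Chinese>"],
}
```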
- Prompt GPT-4o in Simplified Chinese to perform the regional term choice task
python infer.py --llm gpt4o --task term --lang simplified --prompt_id 1
- Prompt Qwen in Traditional Chinese to perform the regional name choice task
python infer.py --llm qwen --task name --lang traditional --prompt_id 1
- Prompt Llama3-70b in English to perform the regional name choice task
python infer.py --llm llama3-70b --task name --lang english --prompt_id 1
- Use GPT-4o-mini to annotate Qwen's results on the regional term task when prompted in English with prompt_id 2
python gpt_eval.py --llm qwen --lang english --task term --prompt_id 2
- Use GPT-4o-mini to annotate Breeze's results on the regional name task when prompted in Traditional Chinese with prompt_id 1
python gpt_eval.py --llm breeze --lang traditional --task name --prompt_id 1
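To sweep several configurations in one go, a small driver script can shell out to infer.py and gpt_eval.py with the same flags shown above. This is only a sketch: the particular model names, languages, and prompt IDs below are illustrative, and it assumes every combination is supported.

```python
import subprocess

# Illustrative sweep; adjust the lists to the models, tasks, languages,
# and prompt IDs you actually want to run.
llms = ["gpt4o", "qwen", "llama3-70b"]
tasks = ["term", "name"]
langs = ["simplified", "traditional", "english"]
prompt_id = "1"

for llm in llms:
    for task in tasks:
        for lang in langs:
            # Run inference, then annotate the output with GPT-4o-mini.
            subprocess.run(
                ["python", "infer.py", "--llm", llm, "--task", task,
                 "--lang", lang, "--prompt_id", prompt_id],
                check=True,
            )
            subprocess.run(
                ["python", "gpt_eval.py", "--llm", llm, "--task", task,
                 "--lang", lang, "--prompt_id", prompt_id],
                check=True,
            )
```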
prompt/regional_term/{language}_{prompt_id}.csv
These datasets contain the prompts of the regional term choice task. prompt_id represents the prompt version.
prompt/regional_name/{language}_{prompt_id}.csv
These datasets contain the prompts of the regional name choice task. Here, prompt_id corresponds to a research question in Section 4 of the paper: prompt_id 0 → Section 4.1, prompt_id 1 → Section 4.3.1, prompt_id 2 → Section 4.6, prompt_id 3 → Section 4.4, prompt_id 4 → Section 4.5.
source_data/regional_term_and_definition.csv
This dataset includes all 110 regional terms, along with their definitions and their usage in the contexts of Mainland China and Taiwan.
source_data/regional_name_and_characteristics.csv
This dataset includes all 352 regional names, along with their population-based popularity decile assignments and their gender labels---predicted for Mainland Chinese names and reported for Taiwanese names.
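These CSVs can be inspected directly with pandas. A minimal sketch follows; the file name `simplified_1.csv` assumes the `{language}` placeholder takes values like `simplified`/`traditional`/`english`, and the actual column headers are not specified here, so check them with `.columns`.

```python
import pandas as pd

# Prompts for the regional term choice task (e.g., Simplified Chinese, prompt version 1).
term_prompts = pd.read_csv("prompt/regional_term/simplified_1.csv")

# The 110 regional terms with definitions and Mainland China / Taiwan usage.
terms = pd.read_csv("source_data/regional_term_and_definition.csv")

# The 352 regional names with popularity deciles and gender labels.
names = pd.read_csv("source_data/regional_name_and_characteristics.csv")

print(term_prompts.head())
print(terms.columns.tolist())
print(names.columns.tolist())
```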
- Clone this repository
git clone https://github.com/brucelyu17/SC-TC-Bench.git
- Create a conda virtual environment and activate it
conda create -n sc-tc-bench python=3.10
source activate sc-tc-bench
- Install packages
pip install -r requirements.txt
- To run inference with ChatGLM2, install transformers==4.40.0
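For example (assuming pip in the same conda environment):
pip install transformers==4.40.0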
- Figure 2
python -m reproducibility.fig_2 --prompt_id 1
- Figure 3
python -m reproducibility.fig_3 --prompt_id 0
- Figure 4
python -m reproducibility.fig_3 --prompt_id 2 --arrow
- Figure 5
python -m reproducibility.fig_2 --prompt_id 1 --no_gpt
- Figure 6
python -m reproducibility.fig_6
- Figure 7
python -m reproducibility.fig_2 --prompt_id 2
- Figure 8
python -m reproducibility.fig_2 --prompt_id 3
- Figure 9
python -m reproducibility.fig_9
- Figure 10
python -m reproducibility.fig_3 --prompt_id 1
- Figure 11
python -m reproducibility.fig_3 --prompt_id 3 --arrow
- Tables 2 & 21
python -m reproducibility.tab_2
- Table 11
python -m reproducibility.tab_11 --once
- Table 13
python -m reproducibility.tab_11
- Tables 14-17
python -m reproducibility.fig_2 --prompt_id 1
- Table 18
python -m reproducibility.tab_18
- Table 19
python -m reproducibility.tab_19
- Table 20
python -m reproducibility.tab_20
- Table 22
python -m reproducibility.tab_22
- Table 23
python -m reproducibility.tab_23
- Table 24
python -m reproducibility.tab_24 --lang simplified --overall
- Table 25
python -m reproducibility.tab_24 --lang traditional --overall
- Table 26
python -m reproducibility.tab_24 --lang english --overall
- Table 27
python -m reproducibility.tab_24 --lang simplified --gender
- Table 28
python -m reproducibility.tab_24 --lang traditional --gender
- Table 29
python -m reproducibility.tab_24 --lang english --gender
- Table 30
python -m reproducibility.tab_30
- Table 31
python -m reproducibility.tab_30 --llm baichuan2
- Table 32
python -m reproducibility.tab_30 --llm qwen
- Table 33
python -m reproducibility.tab_31 --llm baichuan2 --name_example
python -m reproducibility.tab_31 --llm qwen --name_example
- Table 34
python -m reproducibility.tab_34
- Table 35
python -m reproducibility.tab_35
- Table 36
python -m reproducibility.tab_36
- Table 37
python -m reproducibility.tab_37
@inproceedings{sctcbench-facct25,
title = {Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese},
author = {Lyu, Hanjia and Luo, Jiebo and Kang, Jian and Koenecke, Allison},
year = {2025},
isbn = {9798400714825},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3715275.3732182},
doi = {10.1145/3715275.3732182},
booktitle = {Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency},
location = {Athens, Greece},
series = {FAccT '25}
}

