
Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese

Hanjia Lyu1, Jiebo Luo1, Jian Kang1, Allison Koenecke2

1 University of Rochester

2 Cornell University

Accepted for publication in FAccT 2025

Will also be presented at IC2S2 2025 as a Parallel Talk

Table of Contents

  • Introduction
  • Example Usage
  • Prompts of Our Benchmark Dataset
  • Datasets We Used to Create Our Prompts
  • Requirements
  • Reproducibility
  • Citation

Introduction

While the capabilities of Large Language Models (LLMs) have been studied in both Simplified and Traditional Chinese, it remains unclear whether LLMs exhibit differential performance when prompted in these two variants of written Chinese. This understanding is critical: disparities in the quality of LLM responses can perpetuate representational harms by ignoring the different cultural contexts underlying Simplified versus Traditional Chinese, and can exacerbate downstream harms in LLM-facilitated decision-making in domains such as education or hiring.


To investigate potential LLM performance disparities, we design two benchmark tasks that reflect real-world scenarios: regional term choice (prompting the LLM to name a described item which is referred to differently in Mainland China and Taiwan), and regional name choice (prompting the LLM to choose who to hire from a list of names in both Simplified and Traditional Chinese).
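
For intuition, the regional term choice task hinges on vocabulary that diverges between Mainland China and Taiwan. The pairs below are well-known examples of such divergence, sketched here for illustration only; they are not necessarily among the benchmark's 110 terms (which live in source_data/regional_term_and_definition.csv).

```python
# Illustrative Mainland-vs-Taiwan term pairs (English gloss -> (Mainland, Taiwan)).
# For intuition only; the benchmark's own terms are in the source_data CSV.
REGIONAL_TERM_EXAMPLES = {
    "software": ("软件", "軟體"),
    "network": ("网络", "網路"),
    "information": ("信息", "資訊"),
    "printer": ("打印机", "印表機"),
}

for gloss, (mainland, taiwan) in REGIONAL_TERM_EXAMPLES.items():
    print(f"{gloss}: Mainland China = {mainland}, Taiwan = {taiwan}")
```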


For both tasks, we audit the performance of 11 leading commercial LLM services and open-source models, spanning those trained primarily on English, Simplified Chinese, or Traditional Chinese. Our analyses indicate that biases in LLM responses depend on both the task and the prompting language: while most LLMs disproportionately favored Simplified Chinese responses in the regional term choice task, they surprisingly favored Traditional Chinese names in the regional name choice task. We find that these disparities may arise from differences in training data representation, written character preferences, and the tokenization of Simplified and Traditional Chinese.
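
One of the mechanisms above, tokenization, is easy to probe directly: the same term can split into different numbers of tokens in its Simplified versus Traditional rendering. A minimal sketch using OpenAI's tiktoken library with the cl100k_base encoding (an assumption for illustration; the paper's own tokenizer analysis may use different tokenizers, and the term pairs here are illustrative rather than drawn from the benchmark):

```python
import tiktoken  # pip install tiktoken

# cl100k_base is the encoding used by GPT-4-era OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

# Illustrative Simplified/Traditional pairs; not necessarily benchmark terms.
pairs = [("软件", "軟體"), ("网络", "網路"), ("信息", "資訊")]

for simplified, traditional in pairs:
    n_sc = len(enc.encode(simplified))
    n_tc = len(enc.encode(traditional))
    print(f"{simplified} -> {n_sc} tokens | {traditional} -> {n_tc} tokens")
```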

Example Usage

  • Prompt GPT-4o in Simplified Chinese to perform the regional term choice task
python infer.py --llm gpt4o --task term --lang simplified --prompt_id 1
  • Prompt Qwen in Traditional Chinese to perform the regional name choice task
python infer.py --llm qwen --task name --lang traditional --prompt_id 1
  • Prompt Llama3-70b in English to perform the regional name choice task
python infer.py --llm llama3-70b --task name --lang english --prompt_id 1
  • Use GPT-4o-mini to annotate Qwen's responses on the regional term choice task when prompted in English with prompt_id 2
python gpt_eval.py --llm qwen --lang english --task term --prompt_id 2
  • Use GPT-4o-mini to annotate Breeze's responses on the regional name choice task when prompted in Traditional Chinese with prompt_id 1
python gpt_eval.py --llm breeze --lang traditional --task name --prompt_id 1
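
To run the full grid of models, tasks, and prompting languages rather than one command at a time, a minimal Python sweep sketch (assuming infer.py accepts exactly the flags shown above):

```python
import itertools
import subprocess

# Sweep models, tasks, and prompting languages by shelling out to infer.py.
# The model/task/language names mirror the examples above; adjust as needed.
llms = ["gpt4o", "qwen", "llama3-70b"]
tasks = ["term", "name"]
langs = ["simplified", "traditional", "english"]

for llm, task, lang in itertools.product(llms, tasks, langs):
    cmd = ["python", "infer.py", "--llm", llm, "--task", task,
           "--lang", lang, "--prompt_id", "1"]
    print("Running:", " ".join(cmd))
    subprocess.run(cmd, check=True)
```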

Prompts of Our Benchmark Dataset

Regional Term Choice

  • prompt/regional_term/{language}_{prompt_id}.csv

These files contain the prompts for the regional term choice task; prompt_id denotes the prompt version.

Regional Name Choice

  • prompt/regional_name/{language}_{prompt_id}.csv

These files contain the prompts for the regional name choice task; prompt_id corresponds to the research questions in Section 4 of the paper: prompt_id 0 to Section 4.1, prompt_id 1 to Section 4.3.1, prompt_id 2 to Section 4.6, prompt_id 3 to Section 4.4, and prompt_id 4 to Section 4.5.
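
A minimal sketch for locating and loading one prompt file from the naming scheme above (the column layout is not documented here, so the sketch only peeks at it):

```python
import pandas as pd

# Build the path from the documented {language}_{prompt_id} naming scheme.
task, language, prompt_id = "regional_term", "simplified", 1
path = f"prompt/{task}/{language}_{prompt_id}.csv"

df = pd.read_csv(path)
print(df.shape)    # number of prompts x number of columns
print(df.head())   # column names are whatever the CSV defines
```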

Datasets We Used to Create Our Prompts

  • source_data/regional_term_and_definition.csv

This dataset includes all 110 regional terms, along with their definitions and their usage in the contexts of Mainland China and Taiwan.

  • source_data/regional_name_and_characteristics.csv

This dataset includes all 352 regional names, along with their population-based popularity decile assignments and their gender labels (predicted for Mainland Chinese names, reported for Taiwanese names).
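
To sanity-check these files, a minimal sketch that prints the row counts and the actual schemas (no column names are assumed here, since they are not documented above):

```python
import pandas as pd

# Peek at both source files; the printed columns reveal the real schema.
terms = pd.read_csv("source_data/regional_term_and_definition.csv")
names = pd.read_csv("source_data/regional_name_and_characteristics.csv")

print(len(terms), "regional terms;", len(names), "regional names")
print("term columns:", list(terms.columns))
print("name columns:", list(names.columns))
```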

Requirements

  • Clone this repository
git clone https://github.com/brucelyu17/SC-TC-Bench.git
  • Create a conda virtual environment and activate it
conda create -n sc-tc-bench python=3.10
conda activate sc-tc-bench
  • Install packages
pip install -r requirements.txt
  • To run inference with ChatGLM2, pin the transformers version
pip install transformers==4.40.0

Reproducibility

Figure 2

python -m reproducibility.fig_2 --prompt_id 1

Figure 3

python -m reproducibility.fig_3 --prompt_id 0

Figure 4

python -m reproducibility.fig_3 --prompt_id 2 --arrow

Figure 5

python -m reproducibility.fig_2 --prompt_id 1 --no_gpt

Figure 6

python -m reproducibility.fig_6

Figure 7

python -m reproducibility.fig_2 --prompt_id 2

Figure 8

python -m reproducibility.fig_2 --prompt_id 3

Figure 9

python -m reproducibility.fig_9

Figure 10

python -m reproducibility.fig_3 --prompt_id 1

Figure 11

python -m reproducibility.fig_3 --prompt_id 3 --arrow

Tables 2 & 21

python -m reproducibility.tab_2

Table 11

python -m reproducibility.tab_11 --once

Table 13

python -m reproducibility.tab_11

Tables 14-17

python -m reproducibility.fig_2 --prompt_id 1

Table 18

python -m reproducibility.tab_18

Table 19

python -m reproducibility.tab_19

Table 20

python -m reproducibility.tab_20

Table 22

python -m reproducibility.tab_22

Table 23

python -m reproducibility.tab_23

Table 24

python -m reproducibility.tab_24 --lang simplified --overall

Table 25

python -m reproducibility.tab_24 --lang traditional --overall

Table 26

python -m reproducibility.tab_24 --lang english --overall

Table 27

python -m reproducibility.tab_24 --lang simplified --gender

Table 28

python -m reproducibility.tab_24 --lang traditional --gender

Table 29

python -m reproducibility.tab_24 --lang english --gender

Table 30

python -m reproducibility.tab_30

Table 31

python -m reproducibility.tab_30 --llm baichuan2

Table 32

python -m reproducibility.tab_30 --llm qwen

Table 33

python -m reproducibility.tab_31 --llm baichuan2 --name_example
python -m reproducibility.tab_31 --llm qwen --name_example

Table 34

python -m reproducibility.tab_34

Table 35

python -m reproducibility.tab_35

Table 36

python -m reproducibility.tab_36

Table 37

python -m reproducibility.tab_37

Citation

@inproceedings{sctcbench-facct25,
    title = {Characterizing Bias: Benchmarking Large Language Models in Simplified versus Traditional Chinese},
    author = {Lyu, Hanjia and Luo, Jiebo and Kang, Jian and Koenecke, Allison},
    year = {2025},
    isbn = {9798400714825},
    publisher = {Association for Computing Machinery},
    address = {New York, NY, USA},
    url = {https://doi.org/10.1145/3715275.3732182},
    doi = {10.1145/3715275.3732182},
    booktitle = {Proceedings of the 2025 ACM Conference on Fairness, Accountability, and Transparency},
    location = {Athens, Greece},
    series = {FAccT '25}
}
