Skip to content

Demon702/WorldValuesBench

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

13 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

World Values Benchmark Dataset

WorldValuesBench is a global-scale benchmark dataset for studying multi-cultural human value awareness of language models, derived from an impactful social science project called the World Values Survey (WVS) Wave 7.

Dataset Creation

Step 1: Download Raw Data

Due to licensing issues, we can't distribute the raw data. But you can easily download it from the official website.

  • Navigate to the WVS Wave 7 website.

  • Go to the Statistical Data Files section and click on WVS Cross-National Wave 7 csv v6 0.zip

  • Fill out the form and the raw csv data will be automatically downloaded.

Step 2: Reproduce WorldValuesBench

python dataset_construction/data_preparation.py --raw-dataset-path <path_to_raw_csv_downloaded_in_step_1>

This parses the raw CSV and creates the dataset inside WorldValuesBench folder. It also splits the data into train, valid and test splits which you can leverage for your experiments. Refer to this section for more details about the generated dataset.

NOTE: We created our data processing pipeline according to WVS Wave 7 v5.0. WVS has since released Wave 7 v6.0. Our data processing pipeline is perfectly compatible with Wave 7 v6.0. The answer data for the questions that we used for the experiments in our paper are also unchanged from Wave 7 v5.0 to v6.0. However, v6.0 has some extra questions which were not present in v5.0. If you want to analyze those questions, please feel free to edit the codebook and question_metadata accordingly.

Task

A safe and personalized language model should be aware of multi-cultural values and the answer distribution that people from various backgrounds may provide.

We study the task of (demographic attributes, value question) -> answer. For example,

((US, Rural, Bachelor, ...), "On a scale of 1 to 4, 1 meaning 'Very important' and 4 meaning 'Not at all important', how important is leisure time in your life?") -> 3

An example prompt to the model is as follows:

Person X provided the following demographic information in an interview:
1. Question: In what country was the interview conducted?
 Answer: ...
2. Question: What is the type of settlement in which the interview was conducted? Urban or Rural?
 Answer: ...
3. Question: What is the highest educational level that you have attained?
 Answer: ...
...

What would Person X answer to the following question and why?
Question: On a scale of 1 to 4, 1 meaning 'Very important' and 4 meaning 'Not at all important',
how important is leisure time in your life?

Your output should be in the following format:
{
  "thoughtful explanation": "... upto 30 words (keep their demographics in mind) ...",
  "answer as a score": "... score ..."
}

Repository Overview

dataset_construction

This directory contains a few intermediate and potentially reusable files that are generated during our dataset construction procedure.

  • data_preparation.py: It contains the end-to-end code to process the raw data and produce our structured dataset present under WorldValuesBench, including the full, train, valid, and test splits.
  • question_metadata.json: It contains useful metadata for each question present in the dataset.
  • codebook.json: It contains the mapping between the numerical answer in raw data and the natural language answer according to the WVS Codebook Variables report that is available at the World Values Survey (WVS) Wave 7 webpage.
  • answer_adjustment.json: It contains the mapping required for some of the answer choices to make them monotonic and ordinal.
  • probe_set_construction:
    • probe_set_prepration.py: It contains the code to construct the probe data split that we used to run experiments in our paper. You don't need to run this, we have already provided the split in samples.tsv.

WorldValuesBench

After you have created the dataset by following the instructions in Dataset Creation, this directory will contain the benchmark dataset.

  • full This contains data for all participants present in the survey.
    • full_demographic_qa.tsv
      • Each row is a participant with a unique D_INTERVIEW identifier.
      • Each column is a demographic question/variable that can be studied.
      • The sheet has text answers of 93278 participants (rows) for 42 demographic questions (columns).
    • full_value_qa.tsv
      • Each row is a participant with a unique D_INTERVIEW identifier.
      • Each column is a value question that can be studied.
      • The sheet has text answers of 93278 participants (rows) for 240 value questions (columns).
  • train: similar to the above, 65294 (70%) participants
  • valid: similar to the above, 13991 (15%) participants
  • test: similar to the above, 13993 (15%) participants
  • probe: we use a subset of the valid set examples to conduct a case study in our paper, with a focus on 36 value questions and 3 demographic variables.
    • 36 value questions: listed in value_questions.json. Are models aware of the human answer distribution for each value question?
    • 3 demographic variables. A demographic group is a group of people that share a combination of demographic attributes, e.g., (North America, rural, college degree). Are models aware of the answer distributions of each demographic group?
      • 6 continents: We map each B_COUNTRY (the country where the survey took place) in the survey to a continent using country2continent.csv, due to computational constraints and the fact that there are many countries. Future study could be more extensively conducted on all countries.
      • urban vs. rural: H_URBRURAL in the survey.
      • 4 education levels: mapped from Q275 according to education2level.csv.
    • samples.tsv: each row corresponds to an example of type (demographic attributes, value question) -> answer.
      • We use stratified sampling. For each value question, for each demographic group, we randomly sample 5 participants, using their demographic attributes and answers to the value question in the case study. Two demographic groups don't have survey participants. So in total we have 36 x (6 x 2 x 4 - 2) x 5 = 8280 samples.
      • The Question column contains the IDs of the value questions that we study and can be used to retrieve the question from value_questions.json or question_metadata.json.
      • The Continent, Urban / Rural, Education columns cluster participants into demographic groups that we study.
      • The D_INTERVIEW column can be used to uniquely identify a participant and retrieve the participant demographic attributes and answer to the question from the valid set tsv files.
Split #participants #examples
train 65,294 15,042,191
valid 13,993 3,225,712
test 13,991 3,224,490
full 93,278 21,492,393
probe 4,860 8,280

evaluation

This directory contains our evaluation script and visualizations.

  • evaluate.py: It contains our evaluation script. It calculates the Earth Mover's Distance or the Wasserstein-1 distance between the ground truth distribution and model distribution for each question. It expects the input to be a csv / tsv file with the following three columns

    • QUESTION_ID (The question key or ID from the dataset)
    • PARTICIPANT_ID (The participant ID or D_INTERVIEW column from the dataset)
    • SCORE (Model output for participant PARTICIPANT_ID and question QUESTION_ID)

    You can provide any subset of questions and participants in the input file in any order. A sample input file looks like the following.

    QUESTION_ID PARTICIPANT_ID SCORE
    Q1 156070594 4
    Q1 586070648 4
    Q2 608070097 3
    Q2 702070734 4
    Q3 410070959 3
    Q3 364072161 3

    Our evaluation script outputs a dictionary and optionally writes it to a json file (if the --output-file flag is set). Each key is a QUESTION_ID and each value is the Earth Mover's Distance between the model prediction distribution and the ground truth distribution for that question. A sample output json looks like the following:

     {
     	"Q1": 1.0,
     	"Q2": 0.6666666666666667,
     	"Q3": 0.6666666666666666
     }
    

    To run evaluation, run

    python evaluate.py --input-file <input-file> --output-file <path_to_write_output_json>

    You can also specify the input file separator if it's not a comma. For example, to run evaluation on our GPT-3.5 model (with demographic) outputs, run

    python evaluate.py --input-file model_outputs/gpt-3.5_with_demographics.tsv --output-file gpt-3.5_with_demographics-output.json --input-file-separator \\t

  • evaluation_and_plot.ipynb: Jupyter notebook that has all the visualization in the paper.

  • model_outputs: All the model outputs from our experiments for reproducibility and to facilitate further research.

Citation

Wenlong Zhao*, Debanjan Mondal*, Niket Tandon, Danica Dillion, Kurt Gray, and Yuling Gu. 2024. WorldValuesBench: A Large-Scale Benchmark Dataset for Multi-Cultural Value Awareness of Language Models. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 17696–17706, Torino, Italia. ELRA and ICCL.

Haerpfer, C., Inglehart, R., Moreno, A., Welzel, C., Kizilova, K., Diez-Medrano J., M. Lagos, P. Norris, E. Ponarin & B. Puranen (eds.). 2022. World Values Survey: Round Seven – Country-Pooled Datafile Version 6.0. Madrid, Spain & Vienna, Austria: JD Systems Institute & WVSA Secretariat. doi:10.14281/18241.24

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published