This is the repository for the data and analysis code of the WorldSense benchmark presented in the paper:
WorldSense: A Synthetic Benchmark for Grounded Reasoning in Large Language Models. Youssef Benchekroun, Megi Dervishi, Mark Ibrahim, Jean-Baptiste Gaya, Xavier Martinet, Grégoire Mialon, Thomas Scialom, Emmanuel Dupoux, Dieuwke Hupkes, Pascal Vincent. November 2023.
Clone the repo, `cd` into it, and install the requirements:

```
pip install -r requirements.txt
```
From inside the repo run:
```
python analyse_results.py -t data/worldsense/test_set
```

This should produce the following output:

```
**************************************************************************
*
* ANALYSING RESULTS OF TESTSET data/worldsense/test_set
*
**************************************************************************
Loading trials data from data/worldsense/test_set/trials.jsonl.bz2
shape: (87048, 15)
- Loading results file data/worldsense/test_set/results/basic___GPT3.5___results.jsonl
shape: (87048, 4)
- Loading results file data/worldsense/test_set/results/basic___GPT4___results.jsonl
shape: (87048, 4)
- Loading results file data/worldsense/test_set/results/basic___Llama2-chat___results.jsonl
shape: (87048, 4)
- Loading results file data/worldsense/test_set/results/basic___Llama2-FT+1M___results.jsonl
shape: (87048, 4)
Assembling all into a large dataframe.
Computing accuracies and biases...
--------------------------------------------------------------------------
AVERAGE ACCURACY ACROSS ALL PROBLEMS (with 95% confidence interval)
--------------------------------------------------------------------------
prompting basic
modelname
GPT3.5 55.6 (0.4)
GPT4 75.6 (0.4)
Llama2-chat 56.2 (0.3)
Llama2-FT+1M 77.4 (0.4)
--------------------------------------------------------------------------
ACCURACY for each of the problems (with 95% confidence interval)
--------------------------------------------------------------------------
problemname Infer.trivial Infer.normal Consist.trivial Consist.normal Compl.trivial Compl.normal
prompting modelname
basic GPT3.5 64.8 (1.1) 55.1 (0.6) 52.2 (0.7) 49.9 (0.7) 59.9 (1.0) 51.8 (1.0)
GPT4 90.3 (0.7) 75.6 (0.6) 71.2 (0.6) 65.0 (0.6) 93.1 (0.6) 58.5 (0.7)
Llama2-chat 62.2 (0.8) 59.1 (0.5) 54.2 (0.4) 49.4 (0.3) 60.5 (0.7) 51.5 (0.5)
Llama2-FT+1M 79.7 (0.9) 80.6 (0.5) 54.9 (0.3) 52.8 (0.3) 98.5 (0.2) 97.8 (0.3)
--------------------------------------------------------------------------
BIAS for each of the problems (with 95% confidence interval)
--------------------------------------------------------------------------
problemname Infer.trivial Infer.normal Consist.trivial Consist.normal Compl.trivial Compl.normal
prompting modelname
basic GPT3.5 0.08 (0.02) -0.34 (0.01) -0.15 (0.01) -0.00 (0.01) -0.02 (0.02) 0.20 (0.02)
GPT4 -0.14 (0.01) -0.20 (0.01) -0.12 (0.01) 0.00 (0.01) 0.09 (0.01) 0.78 (0.01)
Llama2-chat -0.43 (0.02) -0.63 (0.01) 0.52 (0.01) 0.79 (0.01) 0.52 (0.02) 0.83 (0.01)
 Llama2-FT+1M 0.26 (0.02) 0.04 (0.01) 0.84 (0.01) 0.90 (0.01) -0.02 (0.00) 0.02 (0.01)
```
The repo contains several test sets, all located inside data/worldsense/.
| Name | Testset directory | Description | Number of records (lines) |
|---|---|---|---|
| test_set | data/worldsense/test_set | The official WorldSense benchmark test set | 87048 |
| memorisation | data/worldsense/other_tests/memorisation | To test to what degree fine-tuned models memorized their training data | 11232 |
| ood-size | data/worldsense/other_tests/ood-size | Standard problems but with size 6 | 9360 |
| ood-query | data/worldsense/other_tests/ood-query | An additional problem named Infer.extrema probing the first/last relations | 6840 |
| ood-problem | data/worldsense/other_tests/ood-problem | An additional problem named Infer.parallel requiring 2-dimensional reasoning | 5400 |
Each record contains a question text that can be given to an LLM to answer.
The official test_set contains questions for the 6 standard problem variants described in the paper, with lengths 3, 4 and 5.
The `analyse_results.py` script can analyse the results included for any of these test sets via its `-t` option (which defaults to the official test set), e.g.:

```
python analyse_results.py -t data/worldsense/other_tests/ood-size
```

Each test-set directory listed above contains a file named `trials.jsonl.bz2`.
This is a bzip2-compressed file in JSON-Lines (.jsonl) format.
Each line contains a record as a dictionary mapping fieldnames to values. These records will also be called trials.
It can easily be read in as a pandas dataframe as follows:
```python
from worldsense.benchmark import load_testset

testset_dir = "data/worldsense/test_set"
trials_df = load_testset(testset_dir)  # load all trials as a pandas dataframe
```

Under the hood, `load_testset` calls `pandas.read_json('trials.jsonl.bz2', orient='records', lines=True)` and does minimal cleaning (such as deobfuscating the `goldresp` field).
The result is a pandas dataframe with several columns, the most important being:
- `Key`: a unique integer identifier for the record
- `text`: the question text to ask a language model
- `expectedresp`: the list of acceptable responses
- `goldresp`: the correct response
- `problemname`: the problem that the question is an instance of
- `skin`: the skin that was used for rendering the question
Additional fields are also present, allowing for more or less fine-grained analysis of accuracy results.
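For instance, here is a minimal sketch of how one might peek at these fields after loading the official test set (it relies only on `load_testset` and the column names listed above):

```python
from worldsense.benchmark import load_testset

trials_df = load_testset("data/worldsense/test_set")

# Most important columns, for a handful of trials
print(trials_df[["Key", "problemname", "skin", "expectedresp", "goldresp"]].head())

# Full question text of the first trial
print(trials_df["text"].iloc[0])

# Number of trials per problem variant
print(trials_df["problemname"].value_counts())
```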
Each test-set directory also contains, besides the `trials.jsonl.bz2` file, a `results/` subdirectory. It contains one result file for each language model that was tested on that test set. E.g. running

```
ls data/worldsense/test_set/results
```

will show:

```
basic___GPT3.5___results.jsonl
basic___GPT4___results.jsonl
basic___Llama2-FT+1M___results.jsonl
basic___Llama2-chat___results.jsonl
```
Result files follow a standard naming scheme:
`<prompting>___<modelname>___results.jsonl`
- `modelname` should indicate a specific language model and version (e.g. GPT4) and should also indicate if/how it has been fine-tuned.
- `prompting` should indicate the prompting strategy employed. It should be set to `basic` to indicate that the question text was asked as is. Otherwise, if experimenting with different prompting strategies (i.e. modifying or complementing the basic question text, such as inserting chain-of-thought instructions or providing few-shot examples as additional context), it should be used to specify which prompting strategy was employed.
Note: in the filename, it is important to respect the triple underscores and the ending in `___results.jsonl`, so that the `analyse_results.py` script can find these files and parse their names.
`modelname` and `prompting` are only used in reporting (for grouping results). Experimenters can set them to whatever they wish to indicate their model and prompting strategy.
Result files are (uncompressed) .jsonl files in JSON-Lines format, with one dictionary record per line. These records are very simple as illustrated here:
Running `head -10 data/worldsense/test_set/results/basic___GPT4___results.jsonl` will show:

```
{"Key":-276741083417243227,"resp":"1"}
{"Key":2747235547611487721,"resp":"1"}
{"Key":2917042815647934077,"resp":"3"}
{"Key":4803862128065633392,"resp":"1"}
{"Key":-2266038382228944547,"resp":"1"}
{"Key":-8972465100086251897,"resp":"3"}
{"Key":3540930086427752797,"resp":"TRUE"}
{"Key":633310297737009100,"resp":"FALSE"}
{"Key":6223161607606023176,"resp":"IMPOSSIBLE"}
{"Key":-5701584567800374128,"resp":"IMPOSSIBLE"}
```
Each record has only 2 fields:
- `Key` matches the `Key` in the dataset's trials (the dataframe returned by `load_testset` above).
- `resp` is the response given by the model to the corresponding trial's question `text`. It must be one of the possible responses from the `expectedresp` list for that question. If the model failed to provide one of these (even after being insistently requeried), `resp` should be the empty string "".
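Because results files are plain JSON-Lines keyed by `Key`, you can also inspect them yourself with pandas and join them back to the trials. A minimal sketch (note that the raw per-row agreement computed here is not the official accuracy, which `analyse_results.py` computes with proper reweighting):

```python
import pandas as pd
from worldsense.benchmark import load_testset

trials_df = load_testset("data/worldsense/test_set")
results_df = pd.read_json(
    "data/worldsense/test_set/results/basic___GPT4___results.jsonl",
    orient="records",
    lines=True,
)

# Attach each model response to its trial via the shared Key field
merged = trials_df.merge(results_df, on="Key", how="inner")

# Raw (unweighted) fraction of responses equal to the gold response;
# use analyse_results.py for the properly weighted accuracy and bias
print((merged["resp"].astype(str) == merged["goldresp"].astype(str)).mean())
```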
Testing another model is thus a simple matter of:

- Loading one of the test sets with `load_testset` as shown above.
- Opening a results file following the naming scheme `<prompting>___<modelname>___results.jsonl` for writing inside that test set's `results/` subdirectory. `modelname` should indicate your model's name and version, and `prompting` should indicate your prompting strategy (use `basic` if asking the question directly). Make sure you use triple underscores in the filename to separate these.
- Looping over the test-set records (see the sketch below) and, for each:
  - ask the model the question `text`;
  - extract the model's response;
  - verify that the model's response matches one of the acceptable responses in `expectedresp`. If not, try reprompting the model, e.g. with `Only respond with one of these options: [expectedresp]`. If the answer is still not a valid one, set the response to the empty string "";
  - append this response as `resp`, together with the associated `Key`, to the results file.
- To get the WorldSense accuracy score and a basic analysis of the results, simply run `python analyse_results.py -t <testset-directory>`.
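Concretely, the loop could look like the following minimal sketch. The function `my_model_answer` and the filename component `MyModel-v1` are placeholders for your own model and naming; the repo's `test_random_model.py` (described next) implements the same pattern with a random answerer:

```python
import json
from worldsense.benchmark import load_testset

def my_model_answer(question: str) -> str:
    # Placeholder: replace this with a call to your actual language model.
    raise NotImplementedError

testset_dir = "data/worldsense/test_set"
trials_df = load_testset(testset_dir)

results_path = f"{testset_dir}/results/basic___MyModel-v1___results.jsonl"
with open(results_path, "w") as f:
    for _, trial in trials_df.iterrows():
        resp = my_model_answer(trial["text"])
        if resp not in trial["expectedresp"]:
            # Re-prompt once with the list of acceptable options, then give up
            reprompt = (
                trial["text"]
                + f"\nOnly respond with one of these options: {trial['expectedresp']}"
            )
            resp = my_model_answer(reprompt)
            if resp not in trial["expectedresp"]:
                resp = ""
        f.write(json.dumps({"Key": int(trial["Key"]), "resp": resp}) + "\n")
```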
This procedure is implemented in the simple `test_random_model.py` script. It employs a silly "language model" that just randomly picks an answer amongst the allowed `expectedresp`. Thus, if you run

```
python test_random_model.py -t data/worldsense/test_set
```

it will create a new results file called `basic___random___results.jsonl` inside `data/worldsense/test_set/results/`. That file will contain the responses of the random model.
If you then re-run

```
python analyse_results.py -t data/worldsense/test_set
```

the reported accuracy and bias tables will now have a new row for model `random` (under prompting=`basic`).
Remark: the expected chance-level accuracy is 50% for all problems. The expected bias of this random LLM is 0 for both the inference and the consistency problems. But for the completeness problems (which have 3 possible responses) it is expected to be around 0.33, not 0. Since that model uniformly samples one of the 3 acceptable responses, it will pretend it knows the answer 2/3 of the time and only say it is not possible to decide 1/3 of the time. For this problem, bias measures KNOWN (folding responses 1 and 2) vs. UNKNOWN (response 3), so the random model will predominantly pretend it knows the answer rather than saying it is not decidable, hence the positive bias.
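For the completeness problems, the arithmetic behind that 0.33 figure (assuming bias is simply the rate of KNOWN responses minus the rate of UNKNOWN responses, as the folding described above suggests) is:

```python
# Random model on a completeness problem: 3 equally likely responses,
# of which responses 1 and 2 mean KNOWN and response 3 means UNKNOWN.
p_known = 2 / 3
p_unknown = 1 / 3
print(p_known - p_unknown)  # 0.333... -> expected bias of roughly 0.33
```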
Notes:
- The code of `test_random_model.py` can easily be adapted to query your real language model.
- The simple results file format makes it easily amenable to pausing/interrupting and resuming appending results to it. `analyse_results.py` will also happily analyse partial results files: it outputs average accuracy estimates even if not all the results are in yet (or if you don't want to run a costly language model on all 87048 rows of the official test set). In this case you'll obtain larger confidence intervals, which should shrink as more results are appended to the file.
Training sets are also provided for fine-tuning LLMs on the set of standard problems we test. These training sets use different "skins" than the standard test set.
| Name | Training-set file | Number of records (lines) |
|---|---|---|
| trials_10k | data/worldsense/training_set/trials_10k.jsonl.bz2 | 12096 |
| trials_100k | data/worldsense/training_set/trials_100k.jsonl.bz2 | 108864 |
| trials_1M | Downloadable 38 MB file: http://dl.fbaipublicfiles.com/worldsense/trials_1M.jsonl.bz2 | 1091664 |
Important note: these training-set files are distributed in bzip2-compressed JSON-Lines format, with one dictionary record per line. But the format of each record is quite different from that of the test-set files. Instead, each file conforms to a typical format used for fine-tuning Llama-2 chat LLMs.
Each record is made of 2 fields:
- `dialog_history`: the dialogue, as an array of messages, each containing a `role` and a `content`.
- `target_message`: specifies the target, for example `POSSIBLE`, `IMPOSSIBLE`, `TRUE`, `FALSE`, etc.

Example of such a record (pretty-printed JSON):
```json
{
  "dialog_history":
  {
    "messages":
    [
      {
        "role":"user",
        "content":"Over his lifetime, Grandpa planted 3 trees in his garden... \nOnly respond with one of these 2 options: 'TRUE', 'FALSE' without any explanation."
      }
    ]
  },
  "target_message":"TRUE"
}
```

The training sets can be loaded in memory via e.g.:
```python
from worldsense.benchmark import load_trainset

trainset = load_trainset("data/worldsense/training_set/trials_100k.jsonl.bz2")
```
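If you want to look at the raw records without the helper, the training files can also be read directly with pandas; a minimal sketch, relying only on the bzip2-compressed JSON-Lines format and the two fields described above:

```python
import pandas as pd

# pandas decompresses .bz2 transparently based on the file extension
train_df = pd.read_json(
    "data/worldsense/training_set/trials_100k.jsonl.bz2",
    orient="records",
    lines=True,
)

first = train_df.iloc[0]
for msg in first["dialog_history"]["messages"]:
    print(msg["role"], ":", msg["content"][:100], "...")
print("target:", first["target_message"])
```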
Remarks on analysing results, in case you want to do a more fine-grained analysis or fancier display of statistics:

The basic analysis printed by the `analyse_results.py` script is done by calling functions defined in `worldsense/benchmark.py`:
- `analyse_results_in_testset_dir` loads and assembles (joins) the test set's trials file and the associated results files into a single large pandas dataframe. It then calls `analyse_results_df`.
- `analyse_results_df` computes the basic errors and statistics and prints them.
`analyse_results_df` is the function you should take as starting inspiration if you want to display or plot result statistics differently (e.g. within a Jupyter notebook) or do a finer-grained analysis (by filtering and grouping the data differently). For this it is important to follow similar steps to what `analyse_results_df` does, namely always:
1. Call `compute_acc_table`: this will reduce each dependent (non-i.i.d.) tuple of trials to a single row and compute the appropriately weighted accuracy (`acc` field) and `bias`.
2. Call either `accuracy_by_group` to compute a table containing the average accuracy within each group, or `bias_by_group` to get the corresponding bias. It is important to call these functions rather than doing a normal `groupby` (or `pivot`) aggregation by yourself: they do a proper equal reweighting before aggregating across `problemsize` and before aggregating across `problemname`. They also correctly compute the corresponding confidence intervals.
3. [Optionally] call `pivot_pretty` to get a display-friendly pivoted version of the accuracy or bias table obtained in step 2. `pivot_pretty` allows you to choose what to put in rows (index) and columns, and to have each cell contain a nicely formatted string that combines both value and confidence interval (alternative content formatting for each cell is available by specifying a different `cell_style` argument; see the documentation of the `format_cell` function in `worldsense/analysis.py` for possibilities).
Please see the license file for information about usage.
```bibtex
@article{benchekroun2023worldsense,
  title={WorldSense: A Synthetic Benchmark for Grounded Reasoning in Large Language Models},
  author={Youssef Benchekroun and Megi Dervishi and Mark Ibrahim and Jean-Baptiste Gaya and Xavier Martinet and Grégoire Mialon and Thomas Scialom and Emmanuel Dupoux and Dieuwke Hupkes and Pascal Vincent},
  year={2023},
  eprint={2311.15930},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```