I am a little confused by your two evaluation methods.
May I know whether I can directly use the Minilongbench as the test samples for evaluating LLMs,
and then use minilongbench_scorer.py to obtain the final scores?
Or is any other post-processing necessary?
It is really confusing.
Are the final scores (for one of the two methods) stored in eval_data//example_minilongbench_scores.pkl?