Evaluation reproducing issues #12

Thanks for the great work. I'm trying to reproduce the results and have run into the following issues:
1. Can I use the lm-evaluation-harness script instead of yours to evaluate the results? When I ran lm-evaluation-harness on the ammlu dataset, I got 34.1 accuracy, compared to your 37. What could explain the difference?

2. How can I use this script to evaluate another model?
i. When I changed the model to jais-13b, it gave 0% accuracy on Ammlu (all the responses are empty strings).
ii. On any other model, such as Phi-2 or MobiLlama-1B, I get the following error:
<img width="980" alt="image" src="https://github.com/FreedomIntelligence/AceGPT/assets/56957881/3349f8b2-b631-45f2-a383-6f692312a76a">

Below are the changes I made to config.yaml:
<img width="418" alt="image" src="https://github.com/FreedomIntelligence/AceGPT/assets/56957881/369fe79c-61db-44be-a4e6-3eca7eb02ba9">

In ArabicMMLU_few_shots.sh, I changed the model id to Phi-2B-base. Can you please tell me how to fix this?
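
To illustrate the 0% case in 2.i: if a scorer pulls an answer letter out of the generated text, an empty response can never match the gold label, so all-empty outputs give exactly 0%. The snippet below is only a minimal sketch of that idea and is not the repository's scoring code; `extract_choice`, `accuracy`, and `empty_rate` are hypothetical helpers, and the A-D letter format is an assumption.

```python
import re
from typing import List, Optional


def extract_choice(response: str) -> Optional[str]:
    """Return the first standalone A-D letter in a response, or None.

    Hypothetical parser: a real evaluation script may use different rules.
    """
    match = re.search(r"\b([A-D])\b", response)
    return match.group(1) if match else None


def accuracy(responses: List[str], gold: List[str]) -> float:
    """Fraction of responses whose extracted letter equals the gold label."""
    if not gold:
        return 0.0
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return correct / len(gold)


def empty_rate(responses: List[str]) -> float:
    """Fraction of responses that are empty or whitespace only."""
    if not responses:
        return 0.0
    return sum(not r.strip() for r in responses) / len(responses)


# All-empty generations can never match a gold label.
all_empty = ["", "", ""]
labels = ["A", "B", "C"]
print(accuracy(all_empty, labels), empty_rate(all_empty))
```

Checking the share of empty responses before looking at accuracy helps separate a generation problem (prompt template, stop tokens, maximum new tokens) from a scoring problem.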

