See our demo on the project website.
- Release the Benchmark and the Dataset
- Gather code for auto-evaluation on our benchmark
- Release the human-annotated ground truth
- Construct the OmniArena platform
We have released our benchmark and the model responses as a dataset; please download them into dataset/.
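If the release is hosted on the Hugging Face Hub, a minimal download sketch looks like the following (this assumes the huggingface_hub client; the repo id is a placeholder, so substitute the actual dataset id from the release):

```python
# Placeholder sketch: replace repo_id with the actual dataset id from the release.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="<org>/<judge-anything-dataset>",  # placeholder, not the real id
    repo_type="dataset",
    local_dir="dataset",                       # matches the dataset/ layout used by scripts/
)
```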
You can use scripts under scripts/ to re-run the evaluation on existing data.
We now provide the judging code for Gemini-1.5-Pro, Gemini-2.0-Flash, and Gemini-2.0-Flash-Lite.
We now provide the judging code for Phi4Multimodal.
We will provide the judging code for Qwen2.5-Omni and InternOmni-7B.
We also provide an extensible Hugging Face-compatible interface for evaluation. The default prompt template follows Phi4Multimodal; you can modify the local special-token template in scripts/utils/config.py, or change how the local prompt template is built in scripts/utils/prompt_builder.py (a hedged sketch is shown below).
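As a rough illustration of what such a local template involves, the sketch below assembles a single-turn prompt with modality placeholders. The token strings and the build_prompt() helper are assumptions based on the public Phi-4-Multimodal chat format, not the repository's actual values; the real definitions live in scripts/utils/config.py and scripts/utils/prompt_builder.py.

```python
# Illustrative sketch of a local special-token template for a Phi-4-Multimodal-style judge.
# Token strings and helper names are assumptions for illustration only.

SPECIAL_TOKENS = {
    "user": "<|user|>",
    "assistant": "<|assistant|>",
    "end": "<|end|>",
    "image_placeholder": "<|image_{i}|>",   # numbered per attached image
    "audio_placeholder": "<|audio_{i}|>",   # numbered per attached audio clip
}

def build_prompt(instruction: str, num_images: int = 0, num_audios: int = 0) -> str:
    """Assemble a single-turn judging prompt with modality placeholders."""
    placeholders = [SPECIAL_TOKENS["image_placeholder"].format(i=i + 1) for i in range(num_images)]
    placeholders += [SPECIAL_TOKENS["audio_placeholder"].format(i=i + 1) for i in range(num_audios)]
    body = "".join(placeholders) + instruction
    return f'{SPECIAL_TOKENS["user"]}{body}{SPECIAL_TOKENS["end"]}{SPECIAL_TOKENS["assistant"]}'
```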
To evaluate your own privately generated data, convert it into the following format for score evaluation. (We are still working on gathering general code for OmniArena's pair-comparison evaluation.) We encourage you to take the prompt from scripts/utils/prompt_builder.py and run the evaluation yourself; a minimal conversion sketch follows the schema below.
```json
[
  {
    "uniq_id": "the uniq_id of task",
    "response_id": "create a response_id for your response",
    "task_name": "task_name of task",
    "model_name": "your model name",
    "response": {
      "type": "text/image/video/audio",
      "content": "string or list, if non-text type, provide the abspath here"
    }
  }
]
```
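As a minimal conversion sketch (the input list, output file name, and model name below are placeholders; only the record fields come from the schema above), you could build the JSON file like this:

```python
import json
import uuid

# Placeholder: your own generations, stored per task as (uniq_id, task_name, output).
my_outputs = [
    ("0001", "text_to_image", "/abs/path/to/generated_0001.png"),
]

records = []
for uniq_id, task_name, content in my_outputs:
    records.append({
        "uniq_id": uniq_id,
        "response_id": str(uuid.uuid4()),   # any unique identifier works
        "task_name": task_name,
        "model_name": "my-omni-model",      # placeholder model name
        "response": {
            "type": "image",                # one of text/image/video/audio
            "content": content,             # abspath for non-text modalities
        },
    })

with open("my_model_responses.json", "w") as f:
    json.dump(records, f, indent=2)
```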
```bibtex
@article{pu2025judge,
  title={Judge Anything: MLLM as a Judge Across Any Modality},
  author={Pu, Shu and Wang, Yaochen and Chen, Dongping and Chen, Yuhang and Wang, Guohao and Qin, Qi and Zhang, Zhongyi and Zhang, Zhiyuan and Zhou, Zetong and Gong, Shuang and others},
  journal={arXiv preprint arXiv:2503.17489},
  year={2025}
}
```