ToolArena is a benchmark for evaluating how well Large Language Model (LLM) agents can create "tools" from GitHub repositories. Each tool (or task) corresponds to a Python function with defined inputs and outputs, wrapped in a containerised environment for reproducibility and testability.
ToolArena contains multiple tasks under the `tasks/` directory. Each task is:
- Self-contained: it has its own `task.yaml` describing its name, inputs, outputs, example invocation, and test invocations.
- Implementation-agnostic: we provide a reference implementation in `implementation.py` with a companion `install.sh` that sets up dependencies.
- Tested: a `tests.py` script ensures that any candidate implementation is correct.
By default, each task's folder contains:
```
tasks/<TASK_NAME>/
├── task.yaml
├── implementation.py
├── install.sh
├── tests.py
└── data/
    ├── download.sh
    └── ...
```
You can use this benchmark to evaluate how well your LLM agent can create implementations for the tasks defined in ToolArena. You can also propose your own new tasks to contribute to the benchmark.
We welcome new tasks and improvements to existing ones. See CONTRIBUTING.md for a full guide on how to contribute a new task.
- Install `uv`:

  ```
  curl -LsSf https://astral.sh/uv/install.sh | sh
  ```

- Sync dependencies for all environment groups. In the root of this repo, run:

  ```
  uv sync --all-groups
  ```

  This creates a virtual environment at `.venv/` and installs all necessary Python dependencies.

- Activate the environment:

  ```
  source .venv/bin/activate
  ```

- Verify the installation by checking that the `toolarena` command is available:

  ```
  toolarena --help
  ```
- Install Docker, then pull the latest ToolArena images:

  ```
  docker pull ghcr.io/katherlab/toolarena:cpu
  docker pull ghcr.io/katherlab/toolarena:cuda
  ```
Create a `.env` file in the repository’s root directory with at least these variables:

```
CUDA_VISIBLE_DEVICES=0  # If you have a GPU, set the device ID here; otherwise, you can leave it blank
HF_TOKEN=hf_...         # Replace with your HuggingFace token
OPENAI_API_KEY=sk-...   # Replace with your OpenAI API key
```

You will need to request access to the following HuggingFace repositories, which are required by some of the tasks:
- MahmoodLab/UNI (for `uni_extract_features`)
- MahmoodLab/CONCH (for `conch_extract_features`)
- xiangjx/musk (for `musk_extract_features`)
- KatherLab/COBRA (for `cobra_extract_features` and `cobra_heatmaps`)
- pixas/MedSSS_Policy (for `medsss_generate`)
- YukunZhou/RETFound_mae_natureCFP (for `retfound_feature_vector`)
- KatherLab/MoPaDi (for `mopadi_generate_counterfactuals`)
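If you want to confirm that a granted access request actually works before running the corresponding tasks, one optional check (not part of ToolArena itself) is to query one of the gated repositories with the `huggingface_hub` library, for example:

```python
import os

from huggingface_hub import HfApi

# Assumes HF_TOKEN is exported in your shell (or paste the token directly).
# Raises an error if the token is missing, invalid, or access to the gated
# repository has not been granted yet; otherwise returns the repo metadata.
HfApi().model_info("MahmoodLab/UNI", token=os.environ["HF_TOKEN"])
print("Access to MahmoodLab/UNI confirmed.")
```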
An OpenAI API key is required for the `textgrad_medical_qa_optimize` task.
All tasks live under the `tasks/` directory. Each contains:

- A task definition (`task.yaml`) describing:
  - Name
  - Repository (source GitHub repo & commit)
  - Inputs and outputs
  - Example and test invocations, including any data supplied as input for these invocations
- A tests file (`tests.py`) with `pytest` tests verifying correctness of the implementation (an illustrative sketch follows this list).
- A reference implementation consisting of:
  - An installation script (`install.sh`) that clones the repository and installs necessary dependencies.
  - A Python implementation (`implementation.py`) that provides a reference function for the given task.
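To make the role of `tests.py` concrete, here is a purely illustrative, self-contained sketch of the kind of check such a test performs; it does not use ToolArena's actual harness or fixtures. The `run_tool` fixture and the `count_words` task below are invented stand-ins for running a real tool and collecting its declared outputs.

```python
# Hypothetical sketch only -- NOT ToolArena's real test machinery. The actual
# tests.py files execute the tool inside its container; conceptually, each
# test then inspects the tool's declared outputs, as below.
import pytest


@pytest.fixture
def run_tool():
    """Stand-in fixture for running an imaginary "count_words" tool."""

    def _run(invocation: str) -> dict:
        # In reality the outputs would come from executing the tool on the
        # inputs of the named invocation; here we just return a stub.
        return {"num_words": 3}

    return _run


def test_num_words_is_a_nonnegative_int(run_tool):
    result = run_tool("example")  # "example" refers to the example invocation
    assert isinstance(result["num_words"], int)
    assert result["num_words"] >= 0
```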
Some tasks require large external data files as input (e.g. images or datasets).
Each task may specify a `data/download.sh` script, which downloads all required external data from publicly available repositories.
While you may execute these download scripts yourself, ToolArena provides a convenient command to download all data for you:

```
toolarena download
```

If you want to download the data for one specific task, simply run `toolarena download <TASK_NAME>` instead.
A candidate implementation must have:

- `install.sh`: installs all necessary dependencies for the tool (including cloning the associated repository).
- `implementation.py`: contains the Python function that matches the signature defined in the `task.yaml` (a minimal sketch follows the folder example below).
Suppose you have created a folder `<IMPLEMENTATION_DIR>` (outside of `tasks/`) that looks like:

```
<IMPLEMENTATION_DIR>/
└── <TASK_NAME>/
    ├── install.sh
    └── implementation.py
```
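For illustration, here is a minimal sketch of what `<IMPLEMENTATION_DIR>/<TASK_NAME>/implementation.py` might contain, assuming an imaginary `count_words` task whose `task.yaml` declares a string input `text` and an integer output `num_words`. The task name, its inputs and outputs, and the dict-of-outputs return convention are all assumptions made for this example; consult the reference implementations in `tasks/` for the exact conventions of a real task.

```python
def count_words(text: str) -> dict:
    """Hypothetical tool: count the words in `text`.

    The function must match the signature declared in the task's `task.yaml`.
    A real implementation would typically import and call code from the
    repository cloned and installed by install.sh, rather than working inline.
    """
    # Return the declared outputs, assumed here to be keyed by output name.
    return {"num_words": len(text.split())}
```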
Here's how to run it:
```
toolarena run <TASK_NAME> <INVOCATION> --implementation <IMPLEMENTATION_DIR>
```

Where:

- `<TASK_NAME>` is the name of your task (its folder name in `tasks/`).
- `<INVOCATION>` is either `example` (the example invocation in `task.yaml`) or the name of any test invocation.
- `<IMPLEMENTATION_DIR>` is the path to the implementations directory you created above.
> [!NOTE]
> If you omit `<INVOCATION>`, ToolArena runs all invocations (the example plus every test invocation) for that task.
Because ToolArena caches results, repeated runs with the same inputs do not require rerunning the entire tool.
To run the benchmark (i.e., the entire battery of tests) on your candidate implementations:
```
pytest tasks --implementation <IMPLEMENTATION_DIR>
```

> [!TIP]
> You can refine your test runs:
>
> - Skip uncached invocations (only tests for which you already have cached results will run):
>
>   ```
>   pytest tasks --implementation <IMPLEMENTATION_DIR> --skip-uncached
>   ```
>
> - Run only one task:
>
>   ```
>   pytest tasks/<TASK_NAME> --implementation <IMPLEMENTATION_DIR>
>   ```
Each task provides a human-generated reference implementation to prove that the task is possible.
The reference implementations are supplied alongside the task definitions in the `tasks/` directory.
To run the reference implementation for any task, simply omit the `--implementation` flag:
```
# Run a single invocation (the example invocation):
toolarena run <TASK_NAME> example

# Or run all invocations for that task:
toolarena run <TASK_NAME>
```

To run all tasks' tests (the entire benchmark) with the reference implementations:

```
pytest tasks
```

If you need to inspect a running container or attach a debugger:
- Debug a specific invocation:

  ```
  toolarena debug <TASK_NAME> <INVOCATION_NAME> --implementation <IMPLEMENTATION_DIR>
  ```

  This starts the container and provides instructions to attach VS Code or open a bash session in the container.
- Check logs directly in Docker:

  ```
  toolarena run <TASK_NAME> <INVOCATION> --implementation <IMPLEMENTATION_DIR>
  docker logs -f <TASK_NAME>
  ```

  This streams any output from the tool in real time. Under the hood, the `toolarena run` command creates a Docker container with the same name as the task.
Feel free to open Issues or Pull Requests if you encounter problems or want to propose improvements. Happy tool-building!