Find your agent's errors before your real users do.
Documentation · Examples · Report a Bug
ArkSim simulates realistic multi-turn conversations between LLM-powered users and your agent, then evaluates performance across built-in and custom metrics. You define the scenarios (goals, profiles, knowledge) and ArkSim handles simulation and evaluation. Works with any agent that exposes a Chat Completions API or A2A protocol endpoint, or any Python agent loaded directly as a class.
- Realistic simulations: LLM-powered users with distinct profiles, goals, and personality traits
- Comprehensive evaluation: 7 built-in metrics covering helpfulness, coherence, faithfulness, goal completion, and more
- Custom metrics: Define your own quantitative and qualitative metrics with full access to conversation context
- Error detection: Automatically categorize agent failures (false information, disobeying requests, repetition) with severity levels
- Protocol-agnostic: Works with Chat Completions API, A2A protocol, or any Python agent class directly
- Multi-provider: Use OpenAI, Anthropic, or Google as the evaluation LLM
- Parallel execution: Configurable concurrency for both simulation and evaluation
- Visual reports: Interactive HTML reports with score breakdowns, error analysis, and full conversation viewer
```bash
pip install arksim
```

For additional LLM providers:

```bash
pip install "arksim[all]"        # All providers
pip install "arksim[anthropic]"  # Anthropic only
pip install "arksim[google]"     # Google only
```

Set your API key:

```bash
export OPENAI_API_KEY="your-key"
```

Then download the example projects:

```bash
arksim examples
```

This creates an `examples/` folder with ready-to-use projects (e-commerce, bank-insurance, openclaw), each containing a `config.yaml` and `scenarios.json`.
To create your own scenarios, see the Scenarios documentation.
```bash
cd examples/e-commerce
arksim simulate-evaluate config.yaml
```

Open the generated HTML report in `./results/evaluation/`, or launch the web UI:

```bash
arksim ui
```

Agent configuration tells ArkSim how to connect to your agent. It is specified directly in your YAML config file. ArkSim supports three agent types:
For an agent exposing a Chat Completions endpoint:

```yaml
agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8888/chat/completions
    headers:
      Content-Type: application/json
      Authorization: "Bearer ${AGENT_API_KEY}"
    body:
      messages:
        - role: system
          content: "You are a helpful assistant."
```

For an agent exposing an A2A protocol endpoint:

```yaml
agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9999/agent
```

Environment variables in headers are resolved at runtime using `${VAR_NAME}` syntax.
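As a rough illustration of the `${VAR_NAME}` substitution semantics (not ArkSim's actual implementation), Python's standard library performs the same shell-style expansion:

```python
import os

# Normally set in your shell; set here so the example is self-contained.
os.environ["AGENT_API_KEY"] = "sk-test-123"

# Header values as they might appear in config.yaml
headers = {
    "Content-Type": "application/json",
    "Authorization": "Bearer ${AGENT_API_KEY}",
}

# os.path.expandvars resolves ${VAR_NAME} references against the environment
resolved = {k: os.path.expandvars(v) for k, v in headers.items()}
print(resolved["Authorization"])  # Bearer sk-test-123
```

Unset variables are left untouched by `os.path.expandvars`, which is a common convention for this syntax; check the ArkSim docs for how it handles missing variables.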
Load your agent directly as a Python class; no HTTP server is required.

```yaml
agent_config:
  agent_type: custom
  agent_name: my-agent
  custom_config:
    module_path: ./my_agent.py
```

Your agent must subclass `BaseAgent` and implement `get_chat_id()` and `execute()`:
```python
from arksim.config import AgentConfig
from arksim.simulation_engine.agent.base import BaseAgent


class MyAgent(BaseAgent):
    def __init__(self, agent_config: AgentConfig) -> None:
        super().__init__(agent_config)
        # Initialize your agent here

    async def get_chat_id(self) -> str:
        return "unique-conversation-id"

    async def execute(self, user_query: str, **kwargs: object) -> str:
        # Your agent logic here
        return "agent response"
```

For code-based usage (no YAML needed), pass the class directly:
```python
from arksim.config import AgentConfig, CustomConfig

agent_config = AgentConfig(
    agent_type="custom",
    agent_name=MyAgent.__name__,
    custom_config=CustomConfig(agent_class=MyAgent),
)
```

See the bank-insurance and e-commerce examples for full end-to-end Python scripts.
| Metric | Type | Scale | What it measures |
|---|---|---|---|
| Helpfulness | Quantitative | 1-5 | How effectively the agent addresses user needs |
| Coherence | Quantitative | 1-5 | Logical flow and consistency of responses |
| Relevance | Quantitative | 1-5 | How on-topic the agent's responses are |
| Faithfulness | Quantitative | 1-5 | Accuracy against provided knowledge (penalizes contradictions only) |
| Verbosity | Quantitative | 1-5 | Whether response length is appropriate |
| Goal Completion | Quantitative | 0/1 | Whether the user's stated goal was achieved |
| Agent Behavior Failure | Qualitative | Category | Classifies errors: false information, disobeying requests, repetition, lack of specificity, failure to clarify |
Define quantitative metrics (numeric scores) by subclassing QuantitativeMetric:
```python
from arksim.evaluator import QuantitativeMetric, QuantResult, ScoreInput


class ToneMetric(QuantitativeMetric):
    def __init__(self):
        super().__init__(
            name="tone_appropriateness",
            score_range=(0, 5),
            description="Evaluates whether the agent uses an appropriate tone",
        )

    def score(self, score_input: ScoreInput) -> QuantResult:
        # Access: score_input.chat_history, score_input.knowledge,
        # score_input.user_goal, score_input.profile
        return QuantResult(
            name=self.name,
            value=4.0,
            reason="Agent maintained professional tone throughout",
        )
```

Define qualitative metrics (categorical labels) by subclassing `QualitativeMetric`:
```python
from arksim.evaluator import QualitativeMetric, QualResult, ScoreInput


class SafetyCheckMetric(QualitativeMetric):
    def __init__(self):
        super().__init__(
            name="safety_check",
            description="Flags whether the agent produced unsafe content",
        )

    def evaluate(self, score_input: ScoreInput) -> QualResult:
        # Access: score_input.chat_history, score_input.knowledge,
        # score_input.user_goal, score_input.profile
        return QualResult(
            name=self.name,
            value="safe",  # categorical label
            reason="No unsafe content detected",
        )
```

Add to your config:

```yaml
custom_metrics_file_paths:
  - ./my_metrics.py
```

See the bank-insurance example for a full implementation with LLM-as-judge custom metrics.
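Custom metrics need not call an LLM. As a sketch in plain Python (independent of the ArkSim base classes above; the chat-history shape of role/content dicts is an assumption), a deterministic metric over the conversation context might score response length directly:

```python
def brevity_score(chat_history: list[dict], max_words: int = 120) -> float:
    """Score 0-5: 5.0 if every assistant reply stays under max_words,
    scaled down by the fraction of replies that exceed it."""
    agent_turns = [t["content"] for t in chat_history if t["role"] == "assistant"]
    if not agent_turns:
        return 5.0
    over = sum(1 for c in agent_turns if len(c.split()) > max_words)
    return round(5.0 * (1 - over / len(agent_turns)), 2)


history = [
    {"role": "user", "content": "What is your return policy?"},
    {"role": "assistant", "content": "Returns are accepted within 30 days."},
]
print(brevity_score(history))  # 5.0
```

Logic like this would sit inside a `QuantitativeMetric.score()` method, reading the conversation from `score_input.chat_history`.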
All settings can be specified in YAML and overridden via CLI flags (--key value).
| Setting | Type | Default | Description |
|---|---|---|---|
| `agent_config` | object | required | Inline agent config (`agent_type`, `agent_name`, `api_config` or `custom_config`) |
| `scenario_file_path` | string | required | Path to scenarios JSON |
| `model` | string | `gpt-5.1` | LLM model for simulated users |
| `provider` | string | `openai` | LLM provider: `openai`, `anthropic`, `google` |
| `num_conversations_per_scenario` | int | `5` | Conversations to generate per scenario |
| `max_turns` | int | `5` | Maximum turns per conversation |
| `num_workers` | int/string | `50` | Parallel workers |
| `output_file_path` | string | `./simulation.json` | Where to save simulation results |
| `simulated_user_prompt_template` | string | null | Custom Jinja2 template for simulated user prompt |
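Putting the settings above together, a minimal simulation config might look like this (paths and values are illustrative):

```yaml
# Illustrative config.yaml for `arksim simulate` -- adjust paths and values
agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8888/chat/completions
scenario_file_path: ./scenarios.json
model: gpt-5.1
provider: openai
num_conversations_per_scenario: 3
max_turns: 8
output_file_path: ./results/simulation.json
```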
| Setting | Type | Default | Description |
|---|---|---|---|
| `simulation_file_path` | string | required | Path to simulation output |
| `output_dir` | string | required | Directory for evaluation results |
| `model` | string | `gpt-5.1` | LLM model for evaluation |
| `provider` | string | `openai` | LLM provider |
| `metrics_to_run` | list | all metrics | Which metrics to run |
| `custom_metrics_file_paths` | list | `[]` | Paths to custom metric files |
| `generate_html_report` | bool | `true` | Generate an HTML report |
| `numeric_thresholds` | dict | null | Per-metric minimum scores on native scale. Built-in turn-level metrics use 1–5 (mean across turns per conversation); `goal_completion` and `overall_score` use 0–1. Unknown metric names are skipped with a warning. |
| `qualitative_failure_labels` | dict | null | Failure labels per qualitative metric. Any evaluated turn whose label appears in the list fails the run; turns where the metric didn't run are skipped. |
| `num_workers` | int/string | `50` | Parallel workers |
All threshold types are independent and optional (default null). Any failure exits with code 1.
| Threshold | Key | How it works |
|---|---|---|
| Overall score | `numeric_thresholds.overall_score` | Fails if any conversation's `overall_agent_score` (0–1) is below the threshold |
| Per-metric numeric | `numeric_thresholds` | Fails if any conversation's mean score for a listed metric falls below its threshold. Use native scale: 1–5 for built-in turn-level metrics, 0–1 for `goal_completion` and `overall_score` |
| Qualitative | `qualitative_failure_labels` | Fails if any evaluated turn returns a label in the failure list |
```yaml
numeric_thresholds:
  overall_score: 0.6
  helpfulness: 3.5
  goal_completion: 0.7
qualitative_failure_labels:
  agent_behavior_failure: ["false information", "disobey user request"]
  prohibited_statements: ["violated"]
```

Deprecated: `score_threshold` is deprecated. Use `numeric_thresholds: {overall_score: <value>}` instead. The old key still works but logs a warning.
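The per-conversation numeric gating described above can be sketched in plain Python (a simplified illustration of the documented behavior, not ArkSim's implementation):

```python
def passes_numeric_thresholds(conversation_means: dict, thresholds: dict) -> bool:
    """Return True if every thresholded metric meets its minimum for one conversation.

    conversation_means maps metric name -> mean score across turns (native scale).
    Unknown metric names in thresholds are skipped, mirroring the documented behavior.
    """
    for metric, minimum in thresholds.items():
        if metric not in conversation_means:
            continue  # unknown metric: skipped (ArkSim logs a warning here)
        if conversation_means[metric] < minimum:
            return False
    return True


means = {"overall_score": 0.72, "helpfulness": 3.8, "goal_completion": 0.8}
thresholds = {"overall_score": 0.6, "helpfulness": 3.5, "goal_completion": 0.7}
print(passes_numeric_thresholds(means, thresholds))  # True
```

Note the mixed scales: `helpfulness` is compared on 1–5, while `overall_score` and `goal_completion` are compared on 0–1, exactly as in the table above.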
Exit codes:
| Code | Meaning |
|---|---|
| `0` | Success |
| `1` | Evaluation failed: threshold not met |
| `2` | Configuration error |
| `3` | Internal error |
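In a wrapper script these exit codes can be branched on explicitly. A sketch (the exit-code meanings are documented above; the wrapper itself is illustrative, and the real invocation would be the `arksim` CLI):

```python
import subprocess
import sys

# Documented ArkSim exit codes
EXIT_MEANINGS = {
    0: "success",
    1: "evaluation failed: threshold not met",
    2: "configuration error",
    3: "internal error",
}


def run_gate(cmd: list[str]) -> int:
    """Run a command and print the ArkSim meaning of its exit code."""
    code = subprocess.run(cmd).returncode
    print(f"exit {code}: {EXIT_MEANINGS.get(code, 'unknown')}")
    return code


# In CI you would call: run_gate(["arksim", "simulate-evaluate", "config.yaml"])
# Simulated here with a command that exits 1 (threshold failure):
run_gate([sys.executable, "-c", "raise SystemExit(1)"])
```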
```
arksim --version                         Show version and exit
arksim simulate <config.yaml>            Run agent simulations
arksim evaluate <config.yaml>            Evaluate simulation results
arksim simulate-evaluate <config.yaml>   Simulate then evaluate
arksim show-prompts [--category NAME]    Display evaluation prompts
arksim examples                          Download examples folder
arksim ui [--port PORT]                  Launch web UI (default: 8080)
```
Any config setting can be passed as a CLI flag:
```bash
arksim simulate config_simulate.yaml --max-turns 10 --num-workers 4 --verbose
arksim evaluate config_evaluate.yaml --score-threshold 0.7
```

To launch the web UI:

```bash
arksim ui
```

This opens a local web app at http://localhost:8080 where you can browse config files, run simulations with live log streaming, launch evaluations, and view interactive HTML reports.

Note: Provider credentials (e.g. `OPENAI_API_KEY`) must be set as environment variables before launching.
| Example | Description |
|---|---|
| bank-insurance | Financial services agent with custom compliance metrics, adversarial scenarios, and a Chat Completions server |
| e-commerce | E-commerce product recommendation agent with custom metrics |
| openclaw | Integration with the OpenClaw agent framework |
| claude-agent-sdk | Integration with the Claude Agent SDK |
| google-adk | Integration with Google ADK |
| openai-agents-sdk | Integration with the OpenAI Agents SDK |
| langchain | Integration with LangChain |
| langgraph | Integration with LangGraph |
| crewai | Integration with CrewAI |
| autogen | Integration with Microsoft AutoGen |
| llamaindex | Integration with LlamaIndex |
| pydantic-ai | Integration with Pydantic AI |
| rasa | Integration with Rasa |
| smolagents | Integration with Hugging Face Smolagents |
| mastra | Integration with Mastra (TypeScript) |
| vercel-ai-sdk | Integration with Vercel AI SDK (TypeScript) |
Run ArkSim as a quality gate on every pull request so regressions are caught before they ship.
This is the simplest path if your agent is a Python class: CI runs pytest directly, with no server needed.
```bash
# Copy templates into your repo
arksim examples ci
mkdir -p .github/workflows tests
cp examples/ci/pytest/arksim-pytest.yml .github/workflows/arksim-pytest.yml
cp examples/ci/pytest/test_agent_quality.py tests/test_agent_quality.py
```

Edit `tests/test_agent_quality.py` to import your agent class, set your thresholds, and add any custom metrics. The test simulates conversations, evaluates them, generates an HTML report, and asserts your quality gates, all in one pytest run.
If your agent runs as an HTTP server exposing a Chat Completions or A2A endpoint:
```bash
arksim examples ci
mkdir -p .github/workflows
cp examples/ci/github-actions/arksim.yml .github/workflows/arksim.yml
```

The workflow starts your server, waits for it to be healthy, runs `arksim simulate-evaluate`, and exits non-zero if any threshold is not met.
Both approaches upload two artifacts after every run (pass or fail):
- `arksim-html-report` - download, unzip, and open `final_report.html` in your browser
- `arksim-full-results` - raw simulation and evaluation JSONs for programmatic analysis

See `examples/ci/` for full templates, and the CI Integration docs for a step-by-step setup guide.
```bash
git clone https://github.com/arklexai/arksim.git
cd arksim
pip install -e ".[dev]"
pytest tests/
```

Linting and formatting:

```bash
ruff check .
ruff format .
```

See CONTRIBUTING.md for guidelines.
Apache-2.0. See LICENSE.
```bibtex
@misc{shea2026sage,
  title={SAGE: A Top-Down Bottom-Up Knowledge-Grounded User Simulator for Multi-turn AGent Evaluation},
  author={Ryan Shea and Yunan Lu and Liang Qiu and Zhou Yu},
  year={2026},
  eprint={2510.11997},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2510.11997},
}
```