⛵️ ArkSim

Find your agent's errors before your real users do.

Documentation · Examples · Report a Bug

What is ArkSim?

ArkSim simulates realistic multi-turn conversations between LLM-powered users and your agent, then evaluates performance across built-in and custom metrics. You define the scenarios (goals, profiles, knowledge) and ArkSim handles simulation and evaluation. Works with any agent that exposes a Chat Completions API or A2A protocol endpoint, or any Python agent loaded directly as a class.

ArkSim flow: Scenarios → Simulation → Evaluation → Reports

Why ArkSim?

  • Realistic simulations: LLM-powered users with distinct profiles, goals, and personality traits
  • Comprehensive evaluation: 7 built-in metrics covering helpfulness, coherence, faithfulness, goal completion, and more
  • Custom metrics: Define your own quantitative and qualitative metrics with full access to conversation context
  • Error detection: Automatically categorize agent failures (false information, disobeying requests, repetition) with severity levels
  • Protocol-agnostic: Works with Chat Completions API, A2A protocol, or any Python agent class directly
  • Multi-provider: Use OpenAI, Anthropic, or Google as the evaluation LLM
  • Parallel execution: Configurable concurrency for both simulation and evaluation
  • Visual reports: Interactive HTML reports with score breakdowns, error analysis, and full conversation viewer

Quickstart

Install

pip install arksim

For additional LLM providers:

pip install "arksim[all]"        # All providers
pip install "arksim[anthropic]"  # Anthropic only
pip install "arksim[google]"     # Google only

Set up credentials

export OPENAI_API_KEY="your-key"

Download examples

arksim examples

This creates an examples/ folder with ready-to-use projects (e-commerce, bank-insurance, openclaw), each containing a config.yaml and scenarios.json.

To create your own scenarios, see the Scenarios documentation.
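
A scenario pairs a user goal with a profile and optional knowledge. As a rough illustration only (the field names below are guesses for orientation; the Scenarios documentation is the authoritative schema):

```json
[
  {
    "goal": "Return a pair of shoes bought last week and get a refund",
    "profile": "Impatient customer who prefers short answers",
    "knowledge": "Order #1234 was placed within the 30-day return window"
  }
]
```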

Run

cd examples/e-commerce
arksim simulate-evaluate config.yaml

View results

Open the generated HTML report in ./results/evaluation/, or launch the web UI:

arksim ui

Agent Configuration

Agent configuration tells ArkSim how to connect to your agent. It is specified directly in your YAML config file. ArkSim supports three agent types:

Chat Completions API

agent_config:
  agent_type: chat_completions
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:8888/chat/completions
    headers:
      Content-Type: application/json
      Authorization: "Bearer ${AGENT_API_KEY}"
    body:
      messages:
        - role: system
          content: "You are a helpful assistant."
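
An agent behind this config receives standard Chat Completions requests. Assuming ArkSim appends each simulated user turn to the configured `messages` (an illustration of the request shape, not a guaranteed wire format), a request body might look like:

```json
{
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hi, I'd like to return a pair of shoes."}
  ]
}
```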

A2A (Agent-to-Agent) Protocol

agent_config:
  agent_type: a2a
  agent_name: my-agent
  api_config:
    endpoint: http://localhost:9999/agent

Environment variables in headers are resolved at runtime using ${VAR_NAME} syntax.
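
The `${VAR_NAME}` substitution can be pictured with a few lines of Python (an illustrative stand-in, not ArkSim's actual implementation):

```python
import os
import re

def resolve_env(value: str) -> str:
    """Replace ${VAR_NAME} placeholders with environment variable values."""
    return re.sub(r"\$\{(\w+)\}", lambda m: os.environ.get(m.group(1), ""), value)

os.environ["AGENT_API_KEY"] = "sk-test"  # stand-in value for the demo
print(resolve_env("Bearer ${AGENT_API_KEY}"))  # Bearer sk-test
```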

Custom Agent (Python)

Load your agent directly as a Python class - no HTTP server required.

agent_config:
  agent_type: custom
  agent_name: my-agent
  custom_config:
    module_path: ./my_agent.py

Your agent must subclass BaseAgent and implement get_chat_id() and execute():

from arksim.config import AgentConfig
from arksim.simulation_engine.agent.base import BaseAgent

class MyAgent(BaseAgent):
    def __init__(self, agent_config: AgentConfig) -> None:
        super().__init__(agent_config)
        # Initialize your agent here

    async def get_chat_id(self) -> str:
        return "unique-conversation-id"

    async def execute(self, user_query: str, **kwargs: object) -> str:
        # Your agent logic here
        return "agent response"
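
Independent of ArkSim, the contract above boils down to two coroutines: one that opens a conversation and one that handles a turn. A stdlib-only sketch of the same shape (using a local base class in place of `arksim`'s `BaseAgent`):

```python
import asyncio
import uuid
from abc import ABC, abstractmethod

class ChatAgent(ABC):
    """Stand-in for BaseAgent: one coroutine opens a chat, one handles a turn."""

    @abstractmethod
    async def get_chat_id(self) -> str: ...

    @abstractmethod
    async def execute(self, user_query: str, **kwargs: object) -> str: ...

class EchoAgent(ChatAgent):
    async def get_chat_id(self) -> str:
        return str(uuid.uuid4())  # fresh id per conversation

    async def execute(self, user_query: str, **kwargs: object) -> str:
        return f"You said: {user_query}"

async def main() -> None:
    agent = EchoAgent()
    chat_id = await agent.get_chat_id()  # a fresh id for this conversation
    reply = await agent.execute("hello")
    print(reply)  # You said: hello

asyncio.run(main())
```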

For code-based usage (no YAML needed), pass the class directly:

from arksim.config import AgentConfig, CustomConfig

agent_config = AgentConfig(
    agent_type="custom",
    agent_name=MyAgent.__name__,
    custom_config=CustomConfig(agent_class=MyAgent),
)

See the bank-insurance and e-commerce examples for full end-to-end Python scripts.

Evaluation Metrics

Built-in metrics

| Metric | Type | Scale | What it measures |
|---|---|---|---|
| Helpfulness | Quantitative | 1–5 | How effectively the agent addresses user needs |
| Coherence | Quantitative | 1–5 | Logical flow and consistency of responses |
| Relevance | Quantitative | 1–5 | How on-topic the agent's responses are |
| Faithfulness | Quantitative | 1–5 | Accuracy against provided knowledge (penalizes contradictions only) |
| Verbosity | Quantitative | 1–5 | Whether response length is appropriate |
| Goal Completion | Quantitative | 0/1 | Whether the user's stated goal was achieved |
| Agent Behavior Failure | Qualitative | Category | Classifies errors: false information, disobeying requests, repetition, lack of specificity, failure to clarify |

Custom metrics

Define quantitative metrics (numeric scores) by subclassing QuantitativeMetric:

from arksim.evaluator import QuantitativeMetric, QuantResult, ScoreInput

class ToneMetric(QuantitativeMetric):
    def __init__(self):
        super().__init__(
            name="tone_appropriateness",
            score_range=(0, 5),
            description="Evaluates whether the agent uses an appropriate tone",
        )

    def score(self, score_input: ScoreInput) -> QuantResult:
        # Access: score_input.chat_history, score_input.knowledge,
        #         score_input.user_goal, score_input.profile
        return QuantResult(
            name=self.name,
            value=4.0,
            reason="Agent maintained professional tone throughout",
        )

Define qualitative metrics (categorical labels) by subclassing QualitativeMetric:

from arksim.evaluator import QualitativeMetric, QualResult, ScoreInput

class SafetyCheckMetric(QualitativeMetric):
    def __init__(self):
        super().__init__(
            name="safety_check",
            description="Flags whether the agent produced unsafe content",
        )

    def evaluate(self, score_input: ScoreInput) -> QualResult:
        # Access: score_input.chat_history, score_input.knowledge,
        #         score_input.user_goal, score_input.profile
        return QualResult(
            name=self.name,
            value="safe",  # categorical label
            reason="No unsafe content detected",
        )

Add to your config:

custom_metrics_file_paths:
  - ./my_metrics.py

See the bank-insurance example for a full implementation with LLM-as-judge custom metrics.

Configuration Reference

All settings can be specified in YAML and overridden via CLI flags (--key value).

Simulation settings

| Setting | Type | Default | Description |
|---|---|---|---|
| agent_config | object | required | Inline agent config (agent_type, agent_name, api_config or custom_config) |
| scenario_file_path | string | required | Path to scenarios JSON |
| model | string | gpt-5.1 | LLM model for simulated users |
| provider | string | openai | LLM provider: openai, anthropic, google |
| num_conversations_per_scenario | int | 5 | Conversations to generate per scenario |
| max_turns | int | 5 | Maximum turns per conversation |
| num_workers | int/string | 50 | Parallel workers |
| output_file_path | string | ./simulation.json | Where to save simulation results |
| simulated_user_prompt_template | string | null | Custom Jinja2 template for the simulated user prompt |

Evaluation settings

| Setting | Type | Default | Description |
|---|---|---|---|
| simulation_file_path | string | required | Path to simulation output |
| output_dir | string | required | Directory for evaluation results |
| model | string | gpt-5.1 | LLM model for evaluation |
| provider | string | openai | LLM provider |
| metrics_to_run | list | all metrics | Which metrics to run |
| custom_metrics_file_paths | list | [] | Paths to custom metric files |
| generate_html_report | bool | true | Generate an HTML report |
| numeric_thresholds | dict | null | Per-metric minimum scores on the native scale. Built-in turn-level metrics use 1–5 (mean across turns per conversation); goal_completion and overall_score use 0–1. Unknown metric names are skipped with a warning. |
| qualitative_failure_labels | dict | null | Failure labels per qualitative metric. Any evaluated turn whose label appears in the list fails the run; turns where the metric didn't run are skipped. |
| num_workers | int/string | 50 | Parallel workers |

Thresholds & exit codes

All threshold types are independent and optional (default null). Any failure exits with code 1.

| Threshold | Key | How it works |
|---|---|---|
| Overall score | numeric_thresholds.overall_score | Fails if any conversation's overall_agent_score (0–1) is below the threshold |
| Per-metric numeric | numeric_thresholds | Fails if any conversation's mean score for a listed metric falls below its threshold. Use the native scale: 1–5 for built-in turn-level metrics, 0–1 for goal_completion and overall_score |
| Qualitative | qualitative_failure_labels | Fails if any evaluated turn returns a label in the failure list |

numeric_thresholds:
  overall_score: 0.6
  helpfulness: 3.5
  goal_completion: 0.7

qualitative_failure_labels:
  agent_behavior_failure: ["false information", "disobey user request"]
  prohibited_statements: ["violated"]
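
The gating described above can be sketched as follows. This is an illustrative reimplementation that assumes per-conversation mean scores and per-turn labels are already computed, not ArkSim's internal code:

```python
def check_thresholds(
    conversations: list[dict],
    numeric_thresholds: dict[str, float],
    qualitative_failure_labels: dict[str, list[str]],
) -> bool:
    """Return True if every conversation clears every configured threshold."""
    for conv in conversations:
        # Numeric: each listed metric's mean score must meet its minimum.
        # Metrics absent from the results are skipped, mirroring the
        # "unknown metric names are skipped" behavior.
        for metric, minimum in numeric_thresholds.items():
            score = conv["mean_scores"].get(metric)
            if score is not None and score < minimum:
                return False
        # Qualitative: no evaluated turn may carry a configured failure label.
        for metric, bad_labels in qualitative_failure_labels.items():
            for label in conv["turn_labels"].get(metric, []):
                if label in bad_labels:
                    return False
    return True

convs = [
    {
        "mean_scores": {"overall_score": 0.72, "helpfulness": 4.1, "goal_completion": 1.0},
        "turn_labels": {"agent_behavior_failure": ["none", "repetition"]},
    }
]
ok = check_thresholds(
    convs,
    {"overall_score": 0.6, "helpfulness": 3.5, "goal_completion": 0.7},
    {"agent_behavior_failure": ["false information", "disobey user request"]},
)
print(ok)  # True: all scores clear their thresholds and no failure label appears
```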

Deprecated: score_threshold has been replaced by numeric_thresholds: {overall_score: <value>}. The old key still works but logs a warning.

Exit codes:

| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | Evaluation failed (threshold not met) |
| 2 | Configuration error |
| 3 | Internal error |

CLI Reference

arksim --version                        Show version and exit
arksim simulate <config.yaml>           Run agent simulations
arksim evaluate <config.yaml>           Evaluate simulation results
arksim simulate-evaluate <config.yaml>  Simulate then evaluate
arksim show-prompts [--category NAME]   Display evaluation prompts
arksim examples                         Download examples folder
arksim ui [--port PORT]                 Launch web UI (default: 8080)

Any config setting can be passed as a CLI flag:

arksim simulate config_simulate.yaml --max-turns 10 --num-workers 4 --verbose
arksim evaluate config_evaluate.yaml --score-threshold 0.7

Web UI

arksim ui

Opens a local web app at http://localhost:8080 where you can browse config files, run simulations with live log streaming, launch evaluations, and view interactive HTML reports.

Note: Provider credentials (e.g. OPENAI_API_KEY) must be set as environment variables before launching.

Examples

| Example | Description |
|---|---|
| bank-insurance | Financial services agent with custom compliance metrics, adversarial scenarios, and a Chat Completions server |
| e-commerce | E-commerce product recommendation agent with custom metrics |
| openclaw | Integration with the OpenClaw agent framework |
| claude-agent-sdk | Integration with the Claude Agent SDK |
| google-adk | Integration with Google ADK |
| openai-agents-sdk | Integration with the OpenAI Agents SDK |
| langchain | Integration with LangChain |
| langgraph | Integration with LangGraph |
| crewai | Integration with CrewAI |
| autogen | Integration with Microsoft AutoGen |
| llamaindex | Integration with LlamaIndex |
| pydantic-ai | Integration with Pydantic AI |
| rasa | Integration with Rasa |
| smolagents | Integration with Hugging Face Smolagents |
| mastra | Integration with Mastra (TypeScript) |
| vercel-ai-sdk | Integration with Vercel AI SDK (TypeScript) |

CI Integration

Run ArkSim as a quality gate on every pull request so regressions are caught before they ship.

pytest (custom agent)

The simplest path if your agent is a Python class. CI runs pytest (no server needed).

# Copy templates into your repo
arksim examples ci
mkdir -p .github/workflows tests
cp examples/ci/pytest/arksim-pytest.yml .github/workflows/arksim-pytest.yml
cp examples/ci/pytest/test_agent_quality.py tests/test_agent_quality.py

Edit tests/test_agent_quality.py to import your agent class, set your thresholds, and add any custom metrics. The test simulates conversations, evaluates them, generates an HTML report, and asserts your quality gates, all in one pytest run.

HTTP server (any language or framework)

If your agent runs as an HTTP server exposing a Chat Completions or A2A endpoint:

arksim examples ci
mkdir -p .github/workflows
cp examples/ci/github-actions/arksim.yml .github/workflows/arksim.yml

The workflow starts your server, waits for it to be healthy, runs arksim simulate-evaluate, and exits non-zero if any threshold is not met.

Both approaches upload two artifacts after every run (pass or fail):

  • arksim-html-report - download, unzip, and open final_report.html in your browser
  • arksim-full-results - raw simulation and evaluation JSONs for programmatic analysis

See examples/ci/ for full templates and CI Integration docs for a step-by-step setup guide.


Development

git clone https://github.com/arklexai/arksim.git
cd arksim
pip install -e ".[dev]"
pytest tests/

Linting and formatting:

ruff check .
ruff format .

See CONTRIBUTING.md for guidelines.

License

Apache-2.0. See LICENSE.

Citation

@misc{shea2026sage,
      title={SAGE: A Top-Down Bottom-Up Knowledge-Grounded User Simulator for Multi-turn AGent Evaluation},
      author={Ryan Shea and Yunan Lu and Liang Qiu and Zhou Yu},
      year={2026},
      eprint={2510.11997},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2510.11997},
}
