Summary
We'd like to propose an abstraction that enables EVA to benchmark hosted voice agent APIs (native voice agents beyond Pipecat) — directly addressing the Extended Leaderboard roadmap item:
> More cascade and speech-to-speech systems evaluated and added to the findings including native voice agent APIs beyond Pipecat (ElevenAgents, Gemini Live, OpenAI Realtime, Deepgram Voice Agent, etc.)
We've built a working implementation in our fork, run a full 50 records × 3 trials evaluation, and wanted to discuss the approach before submitting a PR.
The Problem
EVA currently evaluates voice agents that run inside a Pipecat pipeline (STT → LLM → TTS or S2S). But many production voice agents are hosted black boxes — you call them via a phone number or WebSocket and interact over audio. There's no Pipecat pipeline to instrument.
Examples: Telnyx AI Assistants, ElevenLabs Conversational AI, Deepgram Voice Agent, OpenAI Realtime API, Google Gemini Live.
Proposed Architecture
An External Agent Provider plugin interface that lets EVA evaluate any hosted voice agent:
┌──────────────────────────────────────────────────────────────────┐
│ EVA Runner │
│ │
│ ┌──────────────┐ ┌──────────────────────┐ ┌────────────┐ │
│ │ ElevenLabs │ WS │ External Agent │ │ Tool │ │
│ │ User Sim │◄───►│ Bridge (generic) │ │ Webhook │ │
│ └──────────────┘ └──────────┬───────────┘ └──────┬─────┘ │
│ Provider │ │ │
│ Transport│ │ │
└──────────────────────────────────┼───────────────────────┼───────┘
│ │
┌────────▼─────────┐ ┌────────▼───────┐
│ Provider │ │ External │
│ Platform │───►│ Voice Agent │
└──────────────────┘ └────────────────┘
Core abstraction: ExternalAgentProvider
```python
from abc import ABC, abstractmethod

from fastapi import FastAPI


class ExternalAgentProvider(ABC):
    @abstractmethod
    def create_transport(self, conversation_id: str, webhook_base_url: str) -> "BaseTelephonyTransport":
        """Create a transport connection to the external agent."""

    @abstractmethod
    async def setup(self) -> None:
        """One-time setup before a benchmark run (e.g., configure assistant via API)."""

    @abstractmethod
    async def teardown(self) -> None:
        """Clean up after a benchmark run."""

    @abstractmethod
    async def fetch_intended_speech(self, transport: "BaseTelephonyTransport") -> list[dict]:
        """Post-call enrichment: fetch what the agent intended to say (for metrics)."""

    def register_webhook_routes(self, app: FastAPI) -> None:
        """Optional: register provider-specific webhook endpoints."""
```
Adding a new provider requires only a new `providers/xxx/` directory with config, transport, and setup. No changes to the bridge, worker, runner, or metrics pipeline.
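To make that plugin surface concrete, here's a minimal sketch of what a new provider package might implement. Everything here is illustrative: the stub base classes mirror the proposal's interfaces, and `EchoTransport`/`EchoProvider` are hypothetical stand-ins, not real EVA classes.

```python
from abc import ABC, abstractmethod


# Stub bases mirroring the proposal's interfaces (illustrative signatures).
class BaseTelephonyTransport(ABC):
    @abstractmethod
    async def start(self) -> None: ...

    @abstractmethod
    async def stop(self) -> None: ...

    @abstractmethod
    async def send_audio(self, chunk: bytes) -> None: ...


class ExternalAgentProvider(ABC):
    @abstractmethod
    def create_transport(self, conversation_id: str, webhook_base_url: str) -> BaseTelephonyTransport: ...

    @abstractmethod
    async def setup(self) -> None: ...

    @abstractmethod
    async def teardown(self) -> None: ...


class EchoTransport(BaseTelephonyTransport):
    """Stand-in transport that just records the audio it is asked to send."""

    def __init__(self) -> None:
        self.sent: list[bytes] = []
        self.running = False

    async def start(self) -> None:
        self.running = True

    async def stop(self) -> None:
        self.running = False

    async def send_audio(self, chunk: bytes) -> None:
        self.sent.append(chunk)


class EchoProvider(ExternalAgentProvider):
    """Minimal provider: no platform API calls, just hands out transports."""

    def create_transport(self, conversation_id: str, webhook_base_url: str) -> EchoTransport:
        return EchoTransport()

    async def setup(self) -> None:
        pass  # a real provider would configure the hosted assistant here

    async def teardown(self) -> None:
        pass  # ...and clean it up here
```

A real `providers/xxx/` package would fill `setup()`/`teardown()` with the platform's configuration API calls; the runner only ever sees the abstract interface.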
Two-layer transport design
The transport layer has two independent abstractions that can be mixed and matched:
- `BaseTelephonyTransport` — the audio pipe (`start`/`stop`/`send_audio`/`emit_audio`). Different agents connect differently: SIP, WebSocket, WebRTC. Each is a different transport.
- `ExternalAgentProvider` — the lifecycle layer that creates transports and manages agent configuration.

These are independent: any SIP-reachable agent could reuse the Telnyx Call Control transport with a different provider. A WebSocket-based transport (for ElevenLabs) just needs `start()`, `stop()`, `send_audio()`.
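A WebSocket transport really can be that small. Here's a sketch of the minimal surface, with the connection factory injected so any WS client library could back it — this is an assumed shape, not EVA's actual class:

```python
class WebSocketAgentTransport:
    """Sketch of a WebSocket-backed transport exposing only the minimal
    start/stop/send_audio surface described above (illustrative)."""

    def __init__(self, connect):
        # `connect` is an async factory returning an object with
        # send()/close() — e.g. a wrapper around a real WS client.
        self._connect = connect
        self._conn = None

    async def start(self) -> None:
        self._conn = await self._connect()

    async def send_audio(self, chunk: bytes) -> None:
        # Providers expecting mulaw rather than pcm would transcode first.
        await self._conn.send(chunk)

    async def stop(self) -> None:
        if self._conn is not None:
            await self._conn.close()
            self._conn = None
```

Because the provider owns the factory, swapping SIP for WebSocket (or mocking the connection in tests) never touches the bridge.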
Why a client-side bridge?
The bridge sits between the user simulator and the external agent. This is intentional:
- Metrics compatibility — produces `pipecat_logs.jsonl`-format events (VAD events, turn boundaries, timestamps) from the raw audio, reconstructing signals the external agent's API doesn't expose
- Platform-independent latency measurement — a third-party observer, not self-reported by the provider, making cross-provider comparisons fair
Important caveat: Real production calls don't have this intermediary. Benchmark latency will be higher than real-world latency. EVA measures relative performance across providers under identical conditions — not absolute production latency.
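To illustrate how a third-party observer can measure turn latency from its own clock, here's a sketch of the normalized event log the bridge might emit. The event and field names are assumptions for illustration — not the actual `pipecat_logs.jsonl` schema:

```python
import io
import json
import time


def write_event(fp, event_type: str, **fields) -> None:
    """Append one normalized JSONL event, timestamped by the observer's
    own monotonic clock (field names are illustrative)."""
    fp.write(json.dumps({"ts": time.monotonic(), "type": event_type, **fields}) + "\n")


# The bridge would emit events like these as it observes raw audio:
buf = io.StringIO()
write_event(buf, "user_speech_end", speaker="user")
write_event(buf, "agent_speech_start", speaker="agent")

events = [json.loads(line) for line in buf.getvalue().splitlines()]

# Turn latency as the bridge sees it — including the bridge's own
# overhead, which is exactly the caveat noted above.
latency = events[1]["ts"] - events[0]["ts"]
```

Because every provider's latency is measured by the same observer with the same overhead, relative comparisons stay fair even though absolute numbers run high.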
Changes to Core EVA
We've kept upstream changes minimal. Here's every file we touch and why:
| Area | Files | Lines | Why |
|------|-------|-------|-----|
| User simulator | `client.py`, `audio_interface.py` | +63 | `audio_codec` param (pcm/mulaw) — external agents may use different formats |
| Config | `models/config.py` | +131 | `ExternalAgentConfig` base + discriminator (same pattern as existing config hierarchy) |
| Orchestrator | `runner.py`, `worker.py` | +171 | Provider lifecycle + bridge instantiation (parallel to existing `AssistantServer` logic) |
| Metrics base | `metrics/base.py` | +20 | `message_trace` field + helper methods for turn counting |
| Metrics processor | `metrics/processor.py` | +168 | Control event filtering, sort priority, `message_trace` loading (general correctness improvements) |
| Individual metrics | 5 files | +30 | Use base class helpers instead of direct trace access |
| Tools | `airline_tools.py` | +5 | `end_call` stub (webhook intercepts it; stub passes tool validation) |
| Audit log | `audit_log.py` | +8 | `replace_transcript()` for post-call enrichment |
Everything else (~2000 lines) is purely additive in a new assistant/external/ package.
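On the config discriminator: the real change follows EVA's existing config hierarchy, but the pattern itself can be sketched in plain Python. The class names and fields below are assumptions for illustration:

```python
from dataclasses import dataclass

# Registry mapping discriminator values to config classes (illustrative).
_PROVIDERS: dict[str, type] = {}


def register(kind: str):
    """Class decorator registering a config subclass under its kind."""
    def deco(cls):
        _PROVIDERS[kind] = cls
        return cls
    return deco


@dataclass
class ExternalAgentConfig:
    kind: str  # the discriminator field


@register("telnyx")
@dataclass
class TelnyxConfig(ExternalAgentConfig):
    phone_number: str = ""


def load_config(raw: dict) -> ExternalAgentConfig:
    """Dispatch on the discriminator to build the right config subclass."""
    cls = _PROVIDERS[raw["kind"]]
    return cls(**raw)
```

A Pydantic-style discriminated union achieves the same dispatch declaratively; the point is that new providers register a config class without touching the loader.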
Relationship to PR #14
We noticed the concern in PR #14 about Pipecat log coupling. Our approach addresses this by having the bridge write normalized output files (message_trace.jsonl, pipecat_logs.jsonl) before the metrics processor sees them — the processor doesn't need to know about bridge mode.
Benchmark Results
We ran the full EVA benchmark (50 records × 3 trials, 150 conversations) using our Telnyx provider. The Telnyx AI Assistant was configured with all components hosted on the Telnyx platform:
- LLM: Kimi-K2.5 (Moonshot AI)
- STT: Deepgram Flux
- TTS: Telnyx Ultra
| Metric | Score |
|--------|-------|
| EVA-A_pass@1 (Accuracy) | 0.48 |
| EVA-X_pass@1 (Experience) | 0.67 |
| EVA-A_mean | 0.71 |
| EVA-X_mean | 0.63 |
| EVA-overall_mean | 0.67 |
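For readers comparing the pass@1 and mean rows: one common way to aggregate repeated trials is sketched below. This is an assumption about the general pattern, not EVA's exact formula — pass@1 averages a strict per-trial pass/fail, while the mean averages continuous scores, which is why pass@1 can sit well below the mean:

```python
def pass_at_1(trial_passes: list[list[bool]]) -> float:
    """Per-record probability that a single random trial fully passes,
    averaged over records (illustrative aggregation)."""
    per_record = [sum(trials) / len(trials) for trials in trial_passes]
    return sum(per_record) / len(per_record)


def mean_score(trial_scores: list[list[float]]) -> float:
    """Mean of continuous per-trial scores, averaged over records."""
    per_record = [sum(trials) / len(trials) for trials in trial_scores]
    return sum(per_record) / len(per_record)
```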
Here's how Telnyx (Kimi-K2.5) places on the EVA leaderboard at pass@1:

Telnyx sits above the current Pareto frontier — meaningfully ahead of existing cascade and audio-native systems on Experience, with competitive Accuracy.
Full metrics: metrics_summary.json
Why this matters for the Extended Leaderboard: Hosted voice agent APIs that own the full stack (telephony + STT + LLM + TTS) can optimize end-to-end in ways that component-level Pipecat cascades can't. Adding support for external providers doesn't just expand coverage — it pushes the Pareto frontier and raises the bar for what "good" looks like on the leaderboard.
Resources
Next Steps
We'd love feedback on:
- Does this abstraction align with your vision for the Extended Leaderboard?
- Are the core changes acceptable? We've tried to minimize them, but we're open to alternative approaches (e.g., making the metrics processor fully pluggable).
- Would you prefer a single PR or a series of smaller ones?
Happy to iterate on the design before submitting any code. Thanks for building EVA — it's been great to work with.