- Palo Alto, CA
- https://jianguoz.github.io/
- @JianguoZhang3
Stars
τ²-Bench-Verified is a corrected and verified version of the original τ²-bench benchmark. This release addresses issues discovered in the original dataset where task definitions, expected actions, …
Some reading notes edited in LaTeX.
MCP-based Agent Deep Evaluation System
Build, Evaluate, and Optimize AI Systems. Includes evals, RAG, agents, fine-tuning, synthetic data generation, dataset management, MCP, and more.
τ²-Bench: Evaluating Conversational Agents in a Dual-Control Environment
An extremely fast Python package and project manager, written in Rust.
Designing Multi-Agent Systems with Zero Supervision
xLAM: A Family of Large Action Models to Empower AI Agent Systems
An extensible benchmark for evaluating large language models on planning
A reading list on LLM-based Synthetic Data Generation 🔥
Chat with any codebase in under two minutes | Fully local or via third-party APIs
Official code for Paper "Mantis: Multi-Image Instruction Tuning" [TMLR 2024 Best Paper]
LLMs built upon Evol-Instruct: WizardLM, WizardCoder, WizardMath
🐫 CAMEL: The first and best multi-agent framework. Finding the Scaling Law of Agents. https://www.camel-ai.org
AndroidWorld is an environment and benchmark for autonomous agents
m&ms: A Benchmark to Evaluate Tool-Use for multi-step multi-modal tasks
Human preference data for "Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback" (loading sketch after this list)
ToolkenGPT: Augmenting Frozen Language Models with Massive Tools via Tool Embeddings [NeurIPS 2023 Oral]
[ACL 2024 Findings] Agent-FLAN: Designing Data and Methods of Effective Agent Tuning for Large Language Models
Python SDK, Proxy Server (AI Gateway) to call 100+ LLM APIs in OpenAI (or native) format, with cost tracking, guardrails, load balancing, and logging (usage sketch after this list). [Bedrock, Azure, OpenAI, VertexAI, Cohere, Anthr…
Chat language model that can use tools and interpret the results
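
From the preference-data entry above: a minimal loading sketch, assuming the `datasets` library and the `Anthropic/hh-rlhf` dataset ID on the Hugging Face Hub.

```python
# Minimal sketch: load Anthropic's HH-RLHF human preference data.
# Assumes the `datasets` library and the `Anthropic/hh-rlhf` Hub dataset ID.
from datasets import load_dataset

ds = load_dataset("Anthropic/hh-rlhf", split="train")

# Each record pairs a preferred ("chosen") and a dispreferred ("rejected")
# human-assistant dialogue for the same prompt.
example = ds[0]
print(example["chosen"][:200])
print(example["rejected"][:200])
```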
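
From the LiteLLM entry above: a minimal sketch of one OpenAI-format call routed through LiteLLM. The model name is an arbitrary example, and a provider API key is assumed to be set in the environment.

```python
# Minimal sketch: one OpenAI-format call routed through LiteLLM.
# Assumes a provider API key (e.g. OPENAI_API_KEY) is set in the environment;
# the model name below is an arbitrary example.
from litellm import completion

response = completion(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize tool calling in one sentence."}],
)
print(response.choices[0].message.content)
```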