Projects

StringSight

Turning Model Conversations and Agentic Traces into Actionable Insights

DS-Serve

A Framework for Efficient and Scalable Neural Retrieval

SkyLight

Advancing the Frontier of Sparse Attention Research

Interruptible LRMs

Are Large Reasoning Models Interruptible?

rLLM

Democratizing Reinforcement Learning for LLMs

ADRS

AI-Driven Research Systems

kvcached

Elastic KV Cache for Dynamic GPU Sharing and Efficient Multi-LLM Inference

GEPA

System Optimization through Reflective Text Evolution

MiniScope

A Least Privilege Framework for Authorizing Tool Calling Agents

AgentThink

A Systematic Evaluation Framework that Automatically Identifies Failure Patterns in LLMs

vCache

Reliable and Efficient Semantic Prompt Caching

DeepScholar-Bench

A Live Benchmark for Generative Research Synthesis

LEANN

Fast, Accurate, and 100% Private RAG on your Laptop

REVERSE

Retrospective Verification and Self-Correction

GSO

Challenging Software Optimization Tasks for Evaluating SWE-Agents

Matryoshka

Semantic-Aware Parsing for Security Logs

Search Arena

A Crowdsourced In-The-Wild Evaluation Platform for Search-Augmented LLM Systems Based on Human Preference

SkyRL

Online RL Training for Real-World Long-Horizon Agents

SkyServe

Serving AI Models across Regions and Clouds with Spot Instances

Myco

Unlocking Polylogarithmic Accesses in Metadata-Private Messaging

R2E-Gym

Procedural Environment Generation and Hybrid Verifiers for Scaling Open-Weights SWE Agents

MAST

Multi-Agent System Failure Taxonomy

Agentica Project

Building Generalist Agents That Scale

BARE

A Method for Combining Base Language Models and Instruction-Tuned Language Models for Better Synthetic Data Generation

Ember

A Compositional Framework for Building and Deploying Large Inference-Time Scaling Architectures and Strategies

TAG-Bench

A Benchmark for Table-Augmented Generation

UCCL

An Efficient Collective Communication Library for GPUs

VibeCheck

Give Your Generative Models a Vibe Check 😀

NovaSky

Next-Generation Open Vision and AI

Sky-T1

Train Your Own o1 Preview Model Within $450

multilspy

LSP Client Library in Python to Build Applications Around Language Servers

Spatialyze

A New Framework for End-to-End Querying of Geospatial Videos

Compass

Encrypted Semantic Search with High Accuracy

DSPy

The Framework for Programming—Not Prompting—Foundation Models

LOTUS

Easily Build Knowledge-Intensive LLM Applications That Reason Over Your Data With LOTUS!

S-LoRA

Serving Thousands of Concurrent LoRA Adapters

RouteLLM

A Framework for Serving and Evaluating LLM Routers – Save LLM Costs Without Compromising Quality!

Stylus

Automatic Adapter Selection for Diffusion Models

VideoArena

The First Dynamic Leaderboard for SOTA Text-To-Video Generation Models

Gorilla OpenFunctions

Elevating LLM Function Calling with Versatile API Integration

Berkeley Function-Calling Leaderboard

Measuring Function-Calling Capabilities of Different LLMs

R2E

A Dynamic Framework for Evaluating AI Coding Systems

Rollbaccine

A General Solution to Rollback Attacks in TEEs

Auto-Whittaker

Automatically Rewriting Distributed Protocols for Scalability

Scrooge

Enabling Replicated State Machines to Communicate Efficiently

SVR3

Secret Key Recovery in a Global-Scale End-to-End Encryption System

Skydentity

Let Orchestrators Run Your Workloads on Your Cloud Resources Without Handing Over Your Cloud Credentials and Data

Flock

A Framework for Deploying On-Demand Distributed Trust

POET

Training Neural Networks on Tiny Devices with Integrated Rematerialization and Paging

SkyPIE

A Fast & Accurate Oracle for Object Placement

LiveCodeBench

Holistic and Contamination Free Evaluation of Large Language Models for Code

Skyplane

Blazing Fast Bulk Data Transfers Between Any Cloud

RAFT

“Retrieval-Augmented Fine-Tuning” combines the benefits of Retrieval-Augmented Generation and Fine-Tuning for better domain adaptation

Arena Hard

An Automatic Pipeline to Build High-Quality LLM Benchmarks with High Separability and Agreement to Human Preference from Live Data

SGLang

A Fast Serving Framework For Large Language Models and Vision Language Models

vLLM

Building the fastest and easiest-to-use inference engine for LLMs

Vicuna

An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality

Chatbot Arena

An Open Platform for Evaluating LLMs by Human Preference

GoEx

A Runtime for LLM-Generated Actions like Code, API Calls, and More

Embarcadero

A Totally Ordered, High Throughput, Pub/Sub System with Disaggregated Memory