huggingface/upskill

UPskill

Generate and evaluate agent skills with agents, from task descriptions or previous traces. Create skills with teacher models (expensive/slow) that student models (cheap/fast) can use to perform harder tasks reliably.

Quick Start

Install upskill:

pip install upskill
# or run it directly with uv
uvx upskill

Create a new skill

upskill generate "write good git commit messages"
# or based on previous agent traces
upskill generate "document the pattern" --from ./trace.md
# Skills are saved to ./skills/{skill-name}/ by default

Generate a skill with a teacher model and evaluate it on a student model.

upskill generate "write good git commit messages" --model sonnet --eval-model haiku

Benchmark a set of models against a skill.

upskill eval ./skills/git-commit-messages/ -m haiku -m sonnet
# logs pretty printed to the terminal

View the results later.

upskill runs --skill git-commit-messages

Model Handling Overview

upskill uses distinct phases with explicit model roles:

  • Skill generation: create/refine SKILL.md
  • Test generation: create synthetic evaluation cases
  • Evaluation: run tests against evaluator model(s)
  • Benchmark: repeated evaluation across multiple runs/models

Model flags by command:

Command       Flag               Meaning
generate      --model            Skill generation/refinement model
generate      --test-gen-model   Test generation model override
generate      --eval-model       Optional extra cross-model eval pass
eval          -m/--model         Evaluation model(s) (repeatable)
eval          --test-gen-model   Test generation model override (when tests are generated)
benchmark     -m/--model         Evaluation model(s) to benchmark
benchmark     --test-gen-model   Test generation model override (when tests are generated)
runs / plot   -m/--model         Historical results filter only

upskill eval enters benchmark mode whenever you pass multiple -m values or --runs > 1. In benchmark mode, baseline comparison is always off; --no-baseline is redundant.
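The mode-selection rule above can be sketched as a small predicate (illustrative only; `is_benchmark_mode`, `models`, and `runs` are stand-ins for upskill's parsed CLI flags, not its actual internals):

```python
def is_benchmark_mode(models: list[str], runs: int = 1) -> bool:
    # Benchmark mode: more than one -m value, or --runs > 1.
    return len(models) > 1 or runs > 1

# Single model, single run -> simple eval with baseline comparison
print(is_benchmark_mode(["haiku"]))            # False
# Two models -> benchmark mode, baseline comparison is off
print(is_benchmark_mode(["haiku", "sonnet"]))  # True
# One model but repeated runs -> also benchmark mode
print(is_benchmark_mode(["haiku"], runs=5))    # True
```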

Commands

upskill generate

Generate a skill from a task description with automatic evaluation and refinement.

upskill generate TASK [OPTIONS]

Arguments:

  • TASK - Description of what the skill should teach

Options:

  • -e, --example - Input -> output example (can be repeated)
  • --tool - Generate from MCP tool schema (path#tool_name)
  • -f, --from PATH - Improve from existing skill dir or agent trace file (auto-detected)
  • -m, --model MODEL - Skill generation model (e.g., 'sonnet', 'haiku', 'anthropic.claude-sonnet-4-20250514')
  • --test-gen-model MODEL - Override test generation model for this run
  • -o, --output PATH - Output directory for skill
  • --no-eval - Skip evaluation and refinement
  • --eval-model MODEL - Different model to evaluate skill on
  • --runs-dir PATH - Directory for run logs (default: ./runs)
  • --log-runs / --no-log-runs - Log run data (default: enabled)

Examples:

# Basic usage
upskill generate "parse JSON Schema files"

# Make and evaluate skills for less powerful models
upskill generate "write git commits" --model sonnet --eval-model haiku

# Improve an existing skill (auto-detected as directory)
upskill generate "add more error handling examples" --from ./skills/api-errors/

# Generate from an agent trace file (auto-detected as file)
upskill generate "document the pattern" --from ./trace.json

# Skip evaluation during generation (evaluate separately with upskill eval)
upskill generate "parse YAML" --no-eval

Output:

Generating skill with sonnet...
Generating test cases...
Evaluating on sonnet... (attempt 1)
  60% -> 100% (+40%) OK

  git-commit-messages
  Write clear, conventional commit messages that follow best practices.

  SKILL.md              ~450 tokens

  baseline   ████████████░░░░░░░░   60%
  with skill ████████████████████  100%  (+40%)

  tokens: 1200 → 800  (-33%)

Saved to ./skills/git-commit-messages

upskill eval

Evaluate an existing skill against test cases. Supports single-model evaluation with baseline comparison, or multi-model benchmarking.

upskill eval SKILL_PATH [OPTIONS]

Arguments:

  • SKILL_PATH - Path to skill directory containing SKILL.md

Options:

  • -t, --tests PATH - Test cases JSON file
  • -m, --model MODEL - Model(s) to evaluate against (repeatable for multi-model benchmarking)
  • --test-gen-model MODEL - Override test generation model when tests must be generated
  • --runs N - Number of runs per model (default: 1)
  • --no-baseline - Skip baseline comparison (simple eval mode only; ignored in benchmark mode)
  • -v, --verbose - Show per-test results
  • --log-runs / --no-log-runs - Log run data (default: enabled)
  • --runs-dir PATH - Directory for run logs

Examples:

# Basic evaluation with baseline comparison
upskill eval ./skills/my-skill/

# With verbose output
upskill eval ./skills/my-skill/ -v

# Custom test cases
upskill eval ./skills/my-skill/ --tests ./tests.json

# Evaluate on specific model
upskill eval ./skills/my-skill/ -m haiku

# Multi-model benchmarking (compare models)
upskill eval ./skills/my-skill/ -m haiku -m sonnet

# Multiple runs per model for statistical significance
upskill eval ./skills/my-skill/ -m haiku -m sonnet --runs 5

# Evaluate a local model configured in fast-agent
upskill eval ./skills/my-skill/ -m generic.my-model

# Skip baseline (just test with skill)
upskill eval ./skills/my-skill/ --no-baseline

# Benchmark mode is triggered by multiple models OR --runs > 1
upskill eval ./skills/my-skill/ -m haiku --runs 5

# Disable run logging
upskill eval ./skills/my-skill/ --no-log-runs

Benchmark output:

Evaluating my-skill across 2 model(s)
  3 test case(s), 5 run(s) per model

haiku
  Pass rate: 4/5 (80%)  Avg assertions: 2.8/3

sonnet
  Pass rate: 5/5 (100%)  Avg assertions: 3.0/3

┏━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Model  ┃ Pass Rate ┃ Avg Assertions ┃ Avg Tokens ┃
┡━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ haiku  │ 4/5       │ 2.8/3          │ 1250       │
│ sonnet │ 5/5       │ 3.0/3          │ 1890       │
└────────┴───────────┴────────────────┴────────────┘

Test cases JSON format:

[
  {"input": "Write a commit for adding login", "expected": {"contains": ["feat", "login"]}},
  {"input": "Fix the null pointer bug", "expected": {"contains": ["fix", "bug"]}}
]
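Reading the `contains` assertion as plain substring matching, a test case check might look like the sketch below (the matching semantics are an assumption for illustration, not upskill's exact implementation):

```python
import json

def passes(output: str, expected: dict) -> bool:
    # A test passes when every `contains` string appears in the model output.
    return all(s in output for s in expected.get("contains", []))

tests = json.loads('''[
  {"input": "Write a commit for adding login",
   "expected": {"contains": ["feat", "login"]}}
]''')

print(passes("feat(auth): add login form", tests[0]["expected"]))  # True
print(passes("Added a login form", tests[0]["expected"]))          # False
```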

upskill list

List all generated skills in a tree view.

upskill list [OPTIONS]

Options:

  • -d, --dir PATH - Skills directory to list
  • -v, --verbose - Show skill contents preview

Examples:

# List skills in default directory
upskill list

# List from custom directory
upskill list -d ./my-skills/

# Show preview of skill contents
upskill list -v

Output:

./skills
├── git-commit-messages
│   ├── Write clear, conventional commit messages...
│   └── files
│       └── SKILL.md
├── api-error-handling
│   ├── Handle API errors gracefully with proper logging...
│   └── files
│       ├── SKILL.md
│       └── references/error-codes.md
└── yaml-parsing
    ├── Parse YAML files safely with schema validation...
    └── files
        ├── SKILL.md
        └── scripts/validate.py

upskill runs

View run results as a plot, or export to CSV. By default, shows a visual comparison of baseline vs with-skill performance.

upskill runs [OPTIONS]

Options:

  • -d, --dir PATH - Runs directory
  • -s, --skill TEXT - Filter by skill name(s) (repeatable)
  • -m, --model TEXT - Filter historical run data by model(s) (repeatable)
  • --metric [success|tokens] - Metric to display (default: success)
  • --csv PATH - Export to CSV instead of plot

Examples:

# View results plot (default)
upskill runs

# Filter by skill and models
upskill runs -s my-skill -m haiku -m sonnet

# Show token usage instead of success rate
upskill runs --metric tokens

# Export to CSV
upskill runs --csv ./results.csv

# Custom runs directory
upskill runs -d ./my-runs/

Plot output:

skill: git-commit-messages

haiku
  baseline   ████████████░░░░░░░░   60%
  with skill ████████████████░░░░   80%  (+20%)

sonnet
  baseline   ████████████░░░░░░░░   60%
  with skill ████████████████████  100%  (+40%)

Matrix view (multiple skills and models):

┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━┓
┃ skill               ┃ haiku        ┃ sonnet       ┃
┡━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━┩
│ git-commit-messages │ 60%→80%      │ 60%→100%     │
│ api-error-handling  │ 40%→70%      │ 50%→90%      │
│ yaml-parsing        │ 70%→90%      │ 80%→100%     │
└─────────────────────┴──────────────┴──────────────┘

Skill Output Format

Skills are saved in a standard directory format:

./skills/{skill-name}/
├── SKILL.md          # Main skill instructions
├── references/       # Supporting documents (optional)
└── scripts/          # Executable scripts (optional)

Example SKILL.md:

# git-commit-messages

Write clear, conventional commit messages that follow best practices.

## Instructions

This skill teaches how to write effective git commit messages
following the Conventional Commits specification.

## Format

Commit messages should follow this structure:

<type>(<scope>): <subject>

<body>

<footer>

## Types

- `feat`: New feature
- `fix`: Bug fix
- `docs`: Documentation changes
...

## Examples

### Simple feature commit
feat(auth): add password reset functionality

### Bug fix with explanation
fix(api): handle null response from user service

The user service can return null when not found.
Added proper null checking to prevent crashes.

Closes #123

Run Logging

By default, upskill logs all runs to ./runs/. Each run creates:

./runs/
├── 2025_01_21_15_30/           # Batch folder (timestamp)
│   ├── run_1/
│   │   ├── run_metadata.json   # Model, task, timing
│   │   └── run_result.json     # Pass/fail, assertions, tokens
│   ├── run_2/
│   │   └── ...
│   └── batch_summary.json      # Aggregate results
└── results.csv                 # Summary CSV (after `upskill runs`)

Disable with --no-log-runs.

Configuration

upskill config (./upskill.config.yaml)

skill_generation_model: sonnet   # Default skill generation model
eval_model: haiku               # Default evaluation model (optional)
test_gen_model: null            # Optional test generation model
skills_dir: ./skills            # Where to save skills
runs_dir: ./runs                # Where to save run logs
max_refine_attempts: 3          # Refinement iterations

test_gen_model fallback behavior:

  • CLI --test-gen-model overrides config for a single run.
  • If set, test generation uses test_gen_model.
  • If unset, test generation falls back to skill_generation_model.
  • For eval/benchmark, this intentionally uses skill_generation_model (not eval_model) so generated tests stay stable when sweeping multiple evaluation models.

Backward compatibility: model is still accepted in config files as a legacy alias for skill_generation_model.

Config lookup order:

  1. UPSKILL_CONFIG environment variable (path)
  2. ./upskill.config.yaml (project local)
  3. ~/.config/upskill/config.yaml (legacy fallback)

FastAgent config (fastagent.config.yaml)

Place in your project directory to customize FastAgent settings:

default_model: sonnet

logger:
  progress_display: true
  show_chat: false
  streaming: markdown

# MCP servers (optional)
mcp:
  servers:
    fetch:
      command: "uvx"
      args: ["mcp-server-fetch"]

Environment Variables

# Required for Anthropic models
ANTHROPIC_API_KEY=sk-ant-...

# Required for OpenAI models
OPENAI_API_KEY=sk-...

# Optional: custom endpoints
ANTHROPIC_BASE_URL=http://localhost:8080
OPENAI_API_BASE=http://localhost:11434/v1

# For local models (generic provider)
GENERIC_BASE_URL=http://localhost:8080/v1
GENERIC_API_KEY=local  # Optional, defaults to "local"

Python API

from upskill import (
    generate_skill,
    generate_tests,
    evaluate_skill,
    refine_skill,
    Config,
)

# Load configuration
config = Config.load()

# Generate a skill
skill = await generate_skill(
    "parse JSON Schema files",
    model="sonnet",
    config=config,
)

# Generate test cases
tests = await generate_tests("parse JSON Schema files")

# Evaluate the skill
results = await evaluate_skill(
    skill,
    tests,
    model="haiku",
    config=config,
)

print(f"Skill lift: {results.skill_lift:.0%}")
print(f"Token savings: {results.token_savings:.0%}")
print(f"Is beneficial: {results.is_beneficial}")

# Refine based on failures
if not results.is_beneficial:
    from upskill.evaluate import get_failure_descriptions
    failures = get_failure_descriptions(results)
    improved_skill = await refine_skill(skill, failures)

Model Format

upskill uses FastAgent model format:

<provider>.<model>.<reasoning_effort?>

Examples:

  • sonnet - Anthropic Claude Sonnet (alias)
  • haiku - Anthropic Claude Haiku (alias)
  • opus - Anthropic Claude Opus (alias)
  • anthropic.claude-sonnet-4-20250514 - Full model name
  • openai.gpt-4.1 - OpenAI GPT-4.1
  • openai.o3-mini.low - OpenAI o3-mini with low reasoning effort
  • generic.llama3.2:latest - Local model via Ollama
  • generic.my-model - Local model via llama.cpp or other OpenAI-compatible server

Local Models

upskill supports local models through any OpenAI-compatible endpoint (Ollama, llama.cpp, vLLM, etc.).

Quick start with Ollama:

# Start Ollama (default port 11434)
ollama serve

# Configure endpoint via fast-agent config/env, then evaluate
upskill eval ./skills/my-skill/ --model generic.llama3.2:latest

With llama.cpp server:

# Start llama.cpp server
./llama-server -m model.gguf --port 8080

# Configure endpoint via fast-agent config/env, then evaluate
upskill eval ./skills/my-skill/ --model generic.my-model

About

Generate and evaluate agent skills for code agents like Claude Code, Open Code, and OpenAI Codex.