This project evaluates the performance of various Vision Language Models (VLMs) in performing precise image measurement tasks on technical drawings. Specifically, it tests the models' ability to measure dimensions from piping drawings by identifying pixel coordinates, performing scale conversions, and calculating real-world measurements.
What it tests:
- Accuracy in identifying two pixel coordinates in an image and calculating the distance between them (compares VLM output to actual values)
Evaluation Method: The system uses Promptfoo to run automated evaluations across multiple AI providers, comparing their outputs against known actual values with a defined tolerance threshold.
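The tolerance comparison can be sketched as below. The function name and the 5% relative tolerance are illustrative assumptions, not the project's actual `compare_values.py` logic or its configured threshold:

```python
# Sketch of a relative-tolerance check; the 5% default is a hypothetical
# value, not the threshold defined in this project's config.
def within_tolerance(measured: float, actual: float, tolerance: float = 0.05) -> bool:
    """Return True if measured is within a relative tolerance of actual."""
    return abs(measured - actual) <= tolerance * abs(actual)

print(within_tolerance(10.3, 10.0))  # True: off by 0.3, allowed up to 0.5
print(within_tolerance(11.0, 10.0))  # False: off by 1.0
```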
For detailed information about the measurement methodology, see docs/process.md.
Before getting started, ensure you have:
- Node.js (required for Promptfoo)
- API keys for at least one of the following providers:
- OpenAI (GPT models)
- Anthropic (Claude models)
- Google (Gemini models)
Follow the official installation guide:
```bash
npm install -g promptfoo
```

Verify the installation:
```bash
promptfoo --version
```

Create a `.env` file in the project root with your API keys:
```
# .env file
OPENAI_API_KEY=your_openai_key_here
ANTHROPIC_API_KEY=your_anthropic_key_here
GEMINI_API_KEY=your_google_key_here
```

Note: If you don't have access to all providers, comment out the unavailable providers in `promptfooconfig.yaml`.
The evaluation configuration is defined in promptfooconfig.yaml. Current providers being tested:
| Provider | Model |
|---|---|
| Anthropic | claude-opus-4-5 |
| Anthropic | claude-sonnet-4-5 |
| Anthropic | claude-haiku-4-5 |
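In Promptfoo, providers are declared as a list in `promptfooconfig.yaml`. The sketch below shows the general shape; the exact provider-id strings are assumptions, so match them against the ids already present in your config, and comment out any provider you lack a key for:

```yaml
# Illustrative fragment; provider-id spellings may differ in the real config.
providers:
  - anthropic:messages:claude-opus-4-5
  - anthropic:messages:claude-sonnet-4-5
  - anthropic:messages:claude-haiku-4-5
  # - openai:gpt-4o   # commented out: no OpenAI key available
```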
```
measure-eval/
├── README.MD                         # This file
├── promptfooconfig.yaml              # Evaluation configuration
├── prompt.py                         # Image prompt formatting for different providers
├── compare_values.py                 # Validation logic (work in progress)
├── .env                              # API keys (create this)
├── docs/
│   └── process.md                    # Detailed measurement process documentation
├── image/
│   └── piping-red-full-896x1344.png  # Test image
└── pdf/
    └── piping-red-full.pdf           # Original PDF drawing
```
Execute the evaluation across all configured providers:
```bash
promptfoo eval
```

What happens during evaluation:
- Promptfoo loads the test configuration
- For each provider, it sends the image with measurement instructions
- Models analyze the image and return measurements in JSON format
- Results are collected and can be compared across providers
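The measurement the models are asked to perform can be sketched in Python. The JSON field names (`point_a`, `point_b`, `scale`) are hypothetical, not the exact schema this project's prompt requests:

```python
import json
import math

# Hypothetical model response: two pixel coordinates plus a scale factor
# (real-world units per pixel). Field names are illustrative only.
response = json.dumps({
    "point_a": [120, 340],
    "point_b": [520, 340],
    "scale": 0.25,
})

data = json.loads(response)
(x1, y1), (x2, y2) = data["point_a"], data["point_b"]
pixel_distance = math.hypot(x2 - x1, y2 - y1)   # Euclidean distance in pixels
real_distance = pixel_distance * data["scale"]  # convert to real-world units
print(pixel_distance, real_distance)
```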
After the evaluation completes, launch the interactive results viewer:
```bash
promptfoo view
```

This opens a web interface where you can:
- Compare outputs from different models
- Review individual responses
- Analyze accuracy and consistency
- Export results for further analysis
The evaluation tests the VLM's ability to execute critical steps in the measurement process.
For the complete process with detailed examples, see docs/process.md.
- Configure assertion logic in `promptfooconfig.yaml`
- Update `compare_values.py` to match the new JSON output format
- Implement comprehensive accuracy metrics
- Add additional test images with varying complexity