The Scarf (short for Self-Contained Application Refactoring) benchmark is a suite of Java applications implemented in three frameworks, Jakarta EE, Quarkus, and Spring, for evaluating agentic transformation between them. The suite enables systematic assessment of AI agents' ability to migrate enterprise Java applications while preserving functionality, idiomatic patterns, and architectural integrity across different runtime environments.
The benchmark includes comprehensive examples ranging from focused layer-specific demonstrations to complete production-grade applications, each with verified implementations across all supported frameworks.
Note: All applications in this benchmark have been meticulously converted and verified by experienced developers. Each implementation has undergone rigorous testing to ensure functional correctness, adherence to framework-specific idioms, and preservation of architectural integrity across Jakarta EE, Quarkus, and Spring frameworks.
This benchmark suite comes with almost everything needed to run the benchmark applications already set up!
Each application comes with:
- Dockerfile - Pre-configured container with all dependencies installed
- justfile - Simple commands to build and run everything
- smoke.py or smoke/ - Automated tests to verify the application works
You don't need to install Maven, Java, or any dependencies. Docker handles it all!
You only need:
- Docker installed on your machine
- The just command runner (you can install it via Cargo or your package manager)
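The two prerequisites above can be checked before running anything. The following is a minimal sketch, assuming both tools are invoked as `docker` and `just` on your PATH; the helper name `missing_tools` is hypothetical and not part of the benchmark.

```python
import shutil

# Tools this README lists as prerequisites (executable names are assumed).
REQUIRED_TOOLS = ("docker", "just")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of tools that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing prerequisites: " + ", ".join(missing))
    else:
        print("All prerequisites found.")
```

This only confirms that the executables exist; it does not verify that the Docker daemon is running.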
Browse the directory structure and choose any application. For example:
business_domain/counter/spring/
dependency_injection/encoder/jakarta/
presentation/mood/quarkus/
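If you would rather list the available applications programmatically, a sketch along these lines may help. It assumes every application follows the three-level layout shown in the example paths (layer, application, framework) and keeps a file named `justfile` in the framework directory; directories that are laid out differently would not be found. The function name `find_applications` is hypothetical.

```python
from pathlib import Path

# Framework directory names used throughout the benchmark.
FRAMEWORKS = ("jakarta", "quarkus", "spring")


def find_applications(root):
    """Return sorted (layer, app, framework) tuples for every
    <layer>/<app>/<framework>/ directory that contains a justfile."""
    root = Path(root)
    found = []
    for justfile in root.glob("*/*/*/justfile"):
        framework_dir = justfile.parent
        if framework_dir.name in FRAMEWORKS:
            layer, app, framework = framework_dir.relative_to(root).parts
            found.append((layer, app, framework))
    return sorted(found)


if __name__ == "__main__":
    for layer, app, framework in find_applications("."):
        print(f"{layer}/{app}/{framework}/")
```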
cd business_domain/counter/spring
just up
That's it! The just up command will:
- Build your application
- Build the Docker container
- Start everything up
just logs
just down
Every application supports these commands (via the justfile):
| Command | What it does |
|---|---|
| just | Shows all available commands |
| just up | Builds and starts the application |
| just down | Stops the application |
| just logs | Shows application logs |
| just build | Builds the application (Maven) |
| just docker-build | Builds the Docker image |
| just clean | Removes build artifacts |
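When driving these commands from a script, for example from an evaluation harness, it helps to guarantee that `just down` runs even if a check fails. The sketch below is one way to do that. It assumes `just up` returns once the application has been started rather than blocking in the foreground, which this README does not state; `run_lifecycle` and its `check` callback are hypothetical names.

```python
import subprocess


def run_lifecycle(app_dir, check, run=subprocess.run):
    """Start the app with `just up`, call check(), and always stop it
    with `just down`, even when check() raises."""
    run(["just", "up"], cwd=app_dir, check=True)
    try:
        return check()
    finally:
        run(["just", "down"], cwd=app_dir, check=True)
```

A call such as `run_lifecycle("business_domain/counter/spring", my_check)` would start the application, run `my_check`, and tear the application down afterwards.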
Most applications include automated tests. To run them:
# If smoke.py exists
python3 smoke.py
# If smoke/ folder exists
cd smoke && ./verify-all.sh
Each application type comes in three flavors:
- jakarta/ - Jakarta EE (enterprise Java)
- quarkus/ - Quarkus (cloud-native Java)
- spring/ - Spring Boot (popular Java framework)
Pick whichever framework you want to test!
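Once you have picked a framework directory, the two smoke-test conventions described earlier (a `smoke.py` file or a `smoke/` folder with `verify-all.sh`) can be resolved automatically. This is a minimal sketch; the function name `smoke_command` is hypothetical, and the choice to prefer `smoke.py` when both exist is an assumption.

```python
from pathlib import Path


def smoke_command(app_dir):
    """Return (argv, working_directory) for an application's smoke tests,
    or None if the application ships neither smoke.py nor smoke/."""
    app_dir = Path(app_dir)
    if (app_dir / "smoke.py").is_file():
        return ["python3", "smoke.py"], app_dir
    if (app_dir / "smoke").is_dir():
        return ["./verify-all.sh"], app_dir / "smoke"
    return None
```

The returned pair can be passed to `subprocess.run(argv, cwd=working_directory)` while the application is running.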
Port already in use?
just down
# Wait a few seconds
just up
Want to rebuild from scratch?
just clean
just docker-build
just up
Need to see what's happening?
just logs
This benchmark contains self-contained applications demonstrating core Java EE functionalities and their framework-specific implementations. Each example has been manually converted and verified across all target frameworks, with smoke tests included to verify application behavior after transformation.
The benchmark includes two types of examples:
Application examples organized per layer, where each example demonstrates a specific technology within that layer (e.g., persistence, presentation, integration).
Core business logic implementations using Enterprise JavaBeans (EJBs). Demonstrates stateful, stateless, and singleton session beans for shopping carts, currency conversion, hit counters, web services, and standalone EJB usage.
Examples:
- cart - Stateful session bean with shopping cart lifecycle management and @Remove methods
- converter - Stateless session bean demonstrating currency conversion business logic
- counter - Singleton session bean with shared state for tracking web page hits
- helloservice - JAX-WS web service implemented as a stateless session bean
- standalone - Stateless session bean for standalone EJB container usage
CDI and dependency injection patterns including custom qualifiers, interceptors, decorators, producer methods, event observers, and alternative implementations for conditional bean selection.
Enterprise features including managed executors for concurrency, asynchronous EJB methods, interceptors for cross-cutting concerns, and timer services for scheduled task execution.
Integration technologies featuring Jakarta Batch processing, JMS messaging patterns, message-driven beans, JAX-WS web services, and Java Connector Architecture for enterprise system integration.
Data persistence patterns using JPA entities with CRUD operations, complex entity relationships, composite keys, inheritance strategies, and JPQL queries for database interactions.
Web tier implementations including servlets, JAX-RS REST APIs, WebSocket endpoints, server-sent events, file uploads, filters, listeners, and real-time communication patterns.
Authentication and authorization patterns featuring Jakarta Security identity stores, form-based and basic authentication, EJB security, role-based access control, and password hashing.
Complete, functioning applications that demonstrate the coordination and interaction between multiple layers.
Domain-Driven Design cargo shipping tracker with Jakarta Faces, CDI, Enterprise Beans, JPA, REST, Batch, and JMS. Showcases aggregates, repositories, and domain events following Eric Evans' DDD patterns.
Demonstrates Jakarta Faces, CDI, Enterprise Beans, JPA, REST, Batch, JSON Binding, Bean Validation, and JMS. Showcases end-to-end application architecture with multiple interfaces (web UI, REST API, file scanning) and complex domain modeling including aggregates, repositories, and domain events. Implements the cargo tracking example from Eric Evans' DDD book.
Event-driven microservices with Orders, Barista, and Kitchen services via Kafka. Demonstrates MicroProfile stack, reactive messaging, distributed transactions, and eventual consistency.
Microservices architecture with Orders, Barista, and Kitchen services communicating via Apache Kafka. Demonstrates MicroProfile (Config, Health, OpenAPI, Metrics), JPA with PostgreSQL, JAX-RS REST APIs, reactive messaging patterns, and distributed transaction coordination. Shows event-driven architecture with asynchronous inter-service communication and eventual consistency.
High-performance stock trading benchmark with stateless session beans, JPA optimistic locking, transaction management, and connection pooling. Used for measuring server performance.
Online stock trading benchmark application demonstrating real-world Java EE workload patterns. Implements user authentication, portfolio management, stock quote lookup, and buy/sell transactions. Showcases performance-oriented design with stateless session beans, JPA entities with optimistic locking, transaction management, connection pooling, and web service interfaces.
Veterinary clinic management with Jakarta Faces (PrimeFaces), complex JPA relationships, CDI, and Bean Validation. Complete workflows for owners, pets, visits, and veterinarians.
Full-featured veterinary clinic management system using Jakarta Faces (PrimeFaces) for the UI layer. Demonstrates CRUD operations with JPA entities showing one-to-many, many-to-one, and many-to-many relationships (owners-pets, pets-visits, vets-specialties). Includes CDI beans, Bean Validation, JSF navigation, complex forms, and master-detail views.
Medium.com clone with MicroProfile JWT, JAX-RS REST API, article management, comments, favorites, tags, and user following. Includes Testcontainers integration tests.
Medium.com clone (Conduit) implementing the RealWorld specification with full CRUD operations, JWT authentication, article management, comments, favorites, tags, and user following. Demonstrates MicroProfile JWT, JAX-RS REST API design, JPA with PostgreSQL, password hashing (BCrypt), slug generation, pagination, filtering, and comprehensive exception handling. Includes integration tests with Testcontainers and MicroShed testing framework.
ScarfBench is actively maintained and continuously evolving to support the research community. We are committed to expanding the benchmark's capabilities and improving its utility for evaluating AI-driven application transformation. Here's what's coming:
We are developing an extensive suite of automated smoke tests to validate functional equivalence across framework migrations. These tests will ensure that transformed applications maintain their original behavior, catching subtle regressions and framework-specific issues that may arise during migration.
A live leaderboard will track and compare the performance of different AI agents and transformation tools across the benchmark suite. This will provide transparent, reproducible metrics for the research community and help identify best practices in automated application migration.
The leaderboard will provide:
- Performance Metrics: Track success rates, compilation success, test pass rates, and functional equivalence scores
- Agent Comparison: Side-by-side comparison of different AI agents and transformation tools
- Framework-Specific Results: Detailed breakdown of performance across Jakarta EE, Quarkus, and Spring migrations
- Application Categories: Results organized by focused examples vs. whole applications
- Transparency: Reproducible metrics and open evaluation methodology
- Community Contributions: Submit your own agent results for inclusion
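To make the planned metrics concrete, the sketch below aggregates compilation success and test pass rates per migration direction. The leaderboard's actual schema and scoring are not yet published, so the `MigrationResult` record, its fields, and the `summarize` function are all hypothetical illustrations rather than the official methodology.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class MigrationResult:
    """One agent run on one application (hypothetical record format)."""
    source: str        # framework migrated from, e.g. "jakarta"
    target: str        # framework migrated to, e.g. "spring"
    compiled: bool     # did the migrated application build?
    tests_passed: int  # smoke tests that passed
    tests_total: int   # smoke tests that were run


def summarize(results):
    """Group results by (source, target) and compute simple rates."""
    grouped = defaultdict(list)
    for result in results:
        grouped[(result.source, result.target)].append(result)

    summary = {}
    for pair, runs in grouped.items():
        total_tests = sum(r.tests_total for r in runs)
        passed_tests = sum(r.tests_passed for r in runs)
        summary[pair] = {
            "runs": len(runs),
            "compile_rate": sum(r.compiled for r in runs) / len(runs),
            # Pooled over all tests in the group; 0.0 if no tests ran.
            "test_pass_rate": passed_tests / total_tests if total_tests else 0.0,
        }
    return summary
```

Pooling tests across runs weights larger test suites more heavily; averaging per-application pass rates instead would be an equally reasonable choice.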
We are building a comprehensive taxonomy that categorizes transformation errors, anti-patterns, and common pitfalls. This taxonomy will help researchers understand where current approaches struggle and guide development of more robust transformation strategies.
ScarfBench will continue to receive regular updates with new applications, enhanced documentation, and improved tooling. We welcome community contributions and feedback to make this benchmark more valuable for advancing the state of automated application transformation.
For any questions, feedback, or suggestions, or to submit your own agent results for the leaderboard, please contact the authors:
| Name | Email |
|---|---|
| Rahul Krishna | i.m.ralk@gmail.com |
| Raju Pavuluri | pavuluri@us.ibm.com |
If you use this benchmark in your research, please cite our paper:
[Placeholder: BibTeX citation will be added when paper is published]
See the LICENSE file for details.