Midscene Java is an AI-powered automation SDK that allows you to control web browsers using natural language instructions. It integrates with standard Selenium WebDriver (and Playwright) to serve as an intelligent agent layer on top of your existing test automation framework.
- Natural Language Control: "Search for 'Headphones' and click the first result."
- Advanced Interaction: Click, type, scroll, drag-and-drop, and more using simple commands.
- Multimodal Understanding: Uses screenshots to understand page context (Visual Grounding).
- Smart Planning: Automatically plans, executes, and retries actions.
- Service Layer: Low-level AI capabilities for locating, extracting, and describing elements.
- YAML Script Support: Execute declarative test scripts defined in YAML.
- Caching: Built-in caching for improved performance and reduced API costs.
- Framework Agnostic: Works with Selenium and Playwright.
- Flexible Configuration: Supports OpenAI (GPT-4o) and Google Gemini (1.5 Pro) models.
- Visual Reports: Generates detailed HTML reports with execution traces, screenshots, and reasoning.
The project is split into three modules:
- midscene-core: The brain of the agent. Contains Agent, Service, ScriptPlayer, and core logic.
- midscene-web: Adapters for browser automation tools (Selenium, Playwright).
- midscene-visualizer: Generates visual HTML reports from execution contexts.
Add the necessary dependencies to your project's pom.xml:
<dependency>
    <groupId>io.github.alstafeev</groupId>
    <artifactId>midscene-web</artifactId>
    <version>0.1.9-SNAPSHOT</version>
</dependency>
<dependency>
    <groupId>io.github.alstafeev</groupId>
    <artifactId>midscene-visualizer</artifactId>
    <version>0.1.9-SNAPSHOT</version>
</dependency>
The Midscene Agent is the primary way to interact with your application. It handles planning and execution.
// 1. Configure
MidsceneConfig config = MidsceneConfig.builder()
    .provider(ModelProvider.GEMINI) // or OPENAI
    .apiKey(System.getenv("GEMINI_API_KEY"))
    .modelName("gemini-1.5-pro")
    .build();
// 2. Initialize (Selenium example)
WebDriver driver = new ChromeDriver();
SeleniumDriver pageDriver = new SeleniumDriver(driver);
Agent agent = Agent.create(config, pageDriver);
// 3. Interact
agent.aiAction("Search for 'Headphones' and click the first result");
agent.aiAssert("Price should be under $200");
// 4. Generate Report
Visualizer.generateReport(agent.getContext(), Paths.get("report.html"));
The Agent class provides specific methods for precise control:
// Interactions
agent.aiTap("Submit button");
agent.aiInput("Username field", "admin");
agent.aiScroll(ScrollOptions.down());
agent.aiHover("User profile icon");
// Assertions & Waits
agent.aiAssert("The login button should be visible");
agent.aiWaitFor("Welcome message to appear");
// Data Query
String price = agent.aiString("What is the price of the first item?");
boolean isLoggedIn = agent.aiBoolean("Is the user logged in?");
Use the Service class for direct AI tasks without full agent planning:
Service service = new Service(pageDriver, agent.getAiModel());
// Locate element coordinates
LocateResult result = service.locate("The blue checkout button");
System.out.println("Button at: " + result.getRect());
// Extract data
ExtractResult<String> price = service.extract("Price of the main item");
// Describe element
DescribeResult desc = service.describe(100, 200); // describe item at x=100, y=200
Define test flows declaratively in YAML:
target:
  url: "https://saucedemo.com"
tasks:
  - name: "Login Flow"
    flow:
      - aiAction: "Type 'standard_user' into username field"
      - aiAction: "Type 'secret_sauce' into password field"
      - aiAction: "Click Login"
      - aiAssert: "User should be on the inventory page"
      - logScreenshot: "Inventory Page"
Run it with Java:
ScriptPlayer player = new ScriptPlayer("login_script.yaml", agent);
ScriptResult result = player.run();
Midscene caches planning results to speed up execution and save tokens.
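To make the idea of plan caching concrete, here is a minimal, self-contained sketch. It is not Midscene's implementation: the class `PlanCacheSketch`, its methods, and the choice of cache key (page URL plus instruction text) are all assumptions made for illustration. It only shows why a repeated instruction on the same page does not need a second model call.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: none of these names come from the Midscene API.
final class PlanCacheSketch {
    // Assumed cache key: page URL + instruction text.
    private final Map<String, String> plans = new LinkedHashMap<>();
    private int misses = 0;

    // Returns the cached plan, invoking the (expensive) planner only on a miss.
    String planFor(String pageUrl, String instruction, Function<String, String> planner) {
        String key = pageUrl + "\n" + instruction;
        String cached = plans.get(key);
        if (cached != null) {
            return cached; // cache hit: no model call, no tokens spent
        }
        misses++;
        String plan = planner.apply(instruction);
        plans.put(key, plan);
        return plan;
    }

    // Number of times the planner was actually invoked.
    int misses() {
        return misses;
    }

    public static void main(String[] args) {
        PlanCacheSketch cache = new PlanCacheSketch();
        // Stand-in for a model call; a real planner would query the AI provider.
        Function<String, String> fakePlanner = instruction -> "plan for: " + instruction;

        cache.planFor("https://saucedemo.com", "Click Login", fakePlanner);
        cache.planFor("https://saucedemo.com", "Click Login", fakePlanner);

        System.out.println("planner calls: " + cache.misses()); // prints "planner calls: 1"
    }
}
```

A persistent cache along these lines would additionally serialize the map to a file, which is presumably what a stable cache identifier is for.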
// Cache is enabled by default (memory + file)
// Configure cache behavior:
MidsceneConfig config = MidsceneConfig.builder()
    // ...
    .cacheId("my_test_cache") // persistent cache file
    .build();
- Selenium: new SeleniumDriver(webDriver)
- Playwright: new PlaywrightDriver(page)
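The two driver wrappers above suggest an adapter design: the agent talks to one small interface, and each browser tool supplies its own implementation. The sketch below illustrates that pattern with plain Java only. The names `PageDriver`, `RecordingDriver`, and `MiniAgent` are hypothetical and are not taken from Midscene; the real driver interface may look quite different.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the adapter pattern; not Midscene's actual types.
final class DriverAdapterSketch {

    // The minimal surface an agent would need from any browser backend.
    interface PageDriver {
        String name();
        void click(int x, int y);
    }

    // Stand-in backend that records calls instead of driving a browser.
    static final class RecordingDriver implements PageDriver {
        private final String name;
        final List<String> log = new ArrayList<>();

        RecordingDriver(String name) {
            this.name = name;
        }

        @Override
        public String name() {
            return name;
        }

        @Override
        public void click(int x, int y) {
            log.add(name + " click " + x + "," + y);
        }
    }

    // Agent code depends only on the interface, never on a concrete backend.
    static final class MiniAgent {
        private final PageDriver driver;

        MiniAgent(PageDriver driver) {
            this.driver = driver;
        }

        void tapAt(int x, int y) {
            driver.click(x, y);
        }
    }

    public static void main(String[] args) {
        for (String backend : List.of("selenium", "playwright")) {
            RecordingDriver driver = new RecordingDriver(backend);
            new MiniAgent(driver).tapAt(100, 200);
            System.out.println(driver.log);
        }
    }
}
```

Under this design, switching between Selenium and Playwright only changes which wrapper is constructed; the agent calls stay the same.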
Detailed configuration options:
MidsceneConfig config = MidsceneConfig.builder()
    .provider(ModelProvider.OPENAI)
    .apiKey("sk-...")
    .modelName("gpt-4o")
    .baseUrl("https://api.openai.com/v1") // optional custom base URL
    .timeoutMs(120000) // AI timeout
    .build();
Build from source:
git clone https://github.com/alstafeev/midscene-java.git
cd midscene-java
mvn clean install