Document Type: This is a technical implementation plan for evaluating the feasibility of introducing AI Autonomous Testing into legacy Java projects. The architectural design is based on current tool capabilities and is intended as a blueprint. It is recommended to perform a PoC to validate core assumptions before full-scale adoption.
Table of Contents
- Project Objectives and Scope
- Core Concept: From Automation to Autonomy
- Technology Stack and Versions
- Project Setup
- Core Component Implementation
- The 3-Loop Verification Strategy: State Machine Design
- Prompt Engineering
- Integration Layer Implementation
- Evaluation Framework and Metrics
- Adoption Plan and Milestones
- Risk Assessment and Mitigation
- Cost Estimation
- Decision Checkpoints
1. Project Objectives and Scope
1.1 The Problem
In large-scale legacy Java projects, we face the following testing dilemmas:
| Issue | Current Status | Impact |
|---|---|---|
| Insufficient Coverage | Unit Test coverage ~60%, E2E tests only cover Happy Paths | Frequent edge-case bugs in production |
| Fragile Scripts | Frontend DOM changes break 30% of Selenium tests | 2-3 days spent fixing tests after every UI update |
| Inefficient Diagnosis | Avg. 2 hours to locate root cause after failure | Developer time wasted on debugging |
| Flaky Tests | ~15% of tests fail intermittently | Low confidence in CI/CD; frequent manual re-runs |
1.2 Project Goals
Phase 1 Goal (PoC, 8 Weeks):
- Build an AI Diagnosis Assistant to automatically analyze root causes of test failures.
- Target: Diagnosis accuracy > 80%, Avg. diagnosis time < 30s.
Phase 2 Goal (MVP, 12 Weeks):
- Implement Visual Location capabilities to reduce test breakage from DOM changes.
- Target: Increase test script survival rate from 70% to 95%.
Phase 3 Goal (Production, 16 Weeks):
- Achieve Autonomous Exploratory Testing to discover edge cases that human testers have missed.
- Target: New bugs discovered > 10 per month.
1.3 Out of Scope
- Load Testing / Performance Testing
- Security Penetration Testing
- Mobile App Testing (Web only)
- Replacing existing Unit and Integration Tests
2. Core Concept: From Automation to Autonomy
2.1 Traditional Automation vs. AI Autonomy
2.2 The 3-Loop Verification Concept
3. Technology Stack and Versions
3.1 Core Stack
| Component | Tech Choice | Version | Rationale |
|---|---|---|---|
| LLM Orchestration | LangChain4j | 0.35.0 | Java-native, excellent Spring Boot integration, robust Tool Calling |
| LLM Model | GPT-4o | 2024-08-06 | Superior vision, stable reasoning, best Function Calling support |
| Backup Model | GPT-4o-mini | 2024-07-18 | Lower cost, used for simple judgments |
| Browser Automation | Playwright | 1.48.0 | More stable than Selenium, multi-browser support, official Java API |
| Test Containers | Testcontainers | 1.20.3 | Database isolation, consistent environment |
| Observability | Micrometer + OTLP | 1.13.0 | Spring Boot integration, TraceId propagation support |
3.2 Dependency Compatibility Matrix
Spring Boot 3.3.x
├── Java 21 (required)
├── LangChain4j 0.35.0
│   ├── langchain4j-open-ai 0.35.0
│   └── langchain4j-spring-boot-starter 0.35.0
├── Playwright 1.48.0
│   └── playwright-java 1.48.0
└── Testcontainers 1.20.3
    └── postgresql 1.20.3
4. Project Setup
4.1 Maven pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.example</groupId>
<artifactId>ai-qa-agent</artifactId>
<version>1.0.0-SNAPSHOT</version>
<packaging>jar</packaging>
<parent>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-parent</artifactId>
<version>3.3.5</version>
<relativePath/>
</parent>
<properties>
<java.version>21</java.version>
<langchain4j.version>0.35.0</langchain4j.version>
<playwright.version>1.48.0</playwright.version>
<testcontainers.version>1.20.3</testcontainers.version>
</properties>
<dependencies>
<!-- Spring Boot -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-web</artifactId>
</dependency>
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-actuator</artifactId>
</dependency>
<!-- LangChain4j -->
<dependency>
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j-spring-boot-starter</artifactId>
<version>${langchain4j.version}</version>
</dependency>
<dependency>
<groupId>dev.langchain4j</groupId>
<artifactId>langchain4j-open-ai</artifactId>
<version>${langchain4j.version}</version>
</dependency>
<!-- Playwright -->
<dependency>
<groupId>com.microsoft.playwright</groupId>
<artifactId>playwright</artifactId>
<version>${playwright.version}</version>
</dependency>
<!-- Testcontainers -->
<dependency>
<groupId>org.testcontainers</groupId>
<artifactId>testcontainers</artifactId>
<version>${testcontainers.version}</version>
</dependency>
<dependency>
<groupId>org.testcontainers</groupId>
<artifactId>postgresql</artifactId>
<version>${testcontainers.version}</version>
</dependency>
<!-- Observability -->
<dependency>
<groupId>io.micrometer</groupId>
<artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
<groupId>io.opentelemetry</groupId>
<artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>
<!-- Utilities -->
<dependency>
<groupId>org.projectlombok</groupId>
<artifactId>lombok</artifactId>
<optional>true</optional>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
</dependency>
<!-- Testing -->
<dependency>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-starter-test</artifactId>
<scope>test</scope>
</dependency>
</dependencies>
<build>
<plugins>
<plugin>
<groupId>org.springframework.boot</groupId>
<artifactId>spring-boot-maven-plugin</artifactId>
</plugin>
<!-- Install Playwright Browsers -->
<plugin>
<groupId>org.codehaus.mojo</groupId>
<artifactId>exec-maven-plugin</artifactId>
<version>3.1.0</version>
<executions>
<execution>
<id>install-playwright-browsers</id>
<phase>generate-resources</phase>
<goals>
<goal>java</goal>
</goals>
<configuration>
<mainClass>com.microsoft.playwright.CLI</mainClass>
<arguments>
<argument>install</argument>
<argument>chromium</argument>
</arguments>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
4.2 application.yml
spring:
  application:
    name: ai-qa-agent
langchain4j:
  open-ai:
    chat-model:
      api-key: ${OPENAI_API_KEY}
      model-name: gpt-4o
      temperature: 0.1 # Stability is key for testing
      timeout: PT60S # Vision analysis can be slow
      max-retries: 3
      log-requests: true
      log-responses: true
ai-qa:
  browser:
    headless: true
    viewport-width: 1280
    viewport-height: 720
    timeout-ms: 30000
  loops:
    stability:
      max-retries: 3
      retry-delay-ms: 1000
      flakiness-threshold: 0.8 # success rate above 80% (but below 100%) is deemed flaky
    diagnosis:
      collect-screenshot: true
      collect-console-logs: true
      collect-network-logs: true
      max-log-lines: 500
    exploration:
      max-depth: 10
      max-actions-per-page: 20
  cost:
    budget-per-test-usd: 0.50
    budget-per-day-usd: 100.00
  reporting:
    output-dir: ./test-reports
    screenshot-format: png
# Target System Configuration
target:
  base-url: ${TARGET_BASE_URL:http://localhost:8080}
  api-base-url: ${TARGET_API_URL:http://localhost:8080/api}
# Actuator (For collecting backend logs in Diagnosis Loop)
management:
  endpoints:
    web:
      exposure:
        # Spring Boot 3.x has no "trace" endpoint; HTTP traces are exposed as "httpexchanges"
        include: health,info,loggers,httpexchanges
  tracing:
    sampling:
      probability: 1.0
5. Core Component Implementation
5.1 OpenAI Configuration
package com.example.aiqaagent.config;

import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.openai.OpenAiChatModel;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import java.time.Duration;

@Configuration
public class OpenAiConfig {
    @Value("${langchain4j.open-ai.chat-model.api-key}")
    private String apiKey;

    /**
     * Primary Model: GPT-4o, for complex reasoning and vision.
     */
    @Bean
    public ChatLanguageModel primaryChatModel() {
        return OpenAiChatModel.builder()
                .apiKey(apiKey)
                .modelName("gpt-4o")
                .temperature(0.1)
                .timeout(Duration.ofSeconds(60))
                .maxRetries(3)
                .logRequests(true)
                .logResponses(true)
                .build();
    }

    /**
     * Lightweight Model: GPT-4o-mini, for cost-saving simple tasks.
     */
    @Bean
    public ChatLanguageModel lightweightChatModel() {
        return OpenAiChatModel.builder()
                .apiKey(apiKey)
                .modelName("gpt-4o-mini")
                .temperature(0.1)
                .timeout(Duration.ofSeconds(30))
                .maxRetries(3)
                .build();
    }
}
5.2 Playwright Config & Lifecycle
package com.example.aiqaagent.config;

import com.microsoft.playwright.*;
import jakarta.annotation.PreDestroy;
import lombok.Getter;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Configuration;

// Note: Playwright for Java is not thread-safe. Call these methods from the
// same thread that created the Playwright instance.
@Slf4j
@Configuration
public class PlaywrightConfig {
    @Value("${ai-qa.browser.headless:true}")
    private boolean headless;
    // ... viewport properties ...

    private Playwright playwright;
    private Browser browser;
    @Getter
    private volatile BrowserContext currentContext;
    @Getter
    private volatile Page currentPage;

    public synchronized void initialize() {
        if (playwright == null) {
            log.info("Initializing Playwright...");
            playwright = Playwright.create();
            browser = playwright.chromium().launch(
                    new BrowserType.LaunchOptions()
                            .setHeadless(headless)
            );
            log.info("Playwright initialized successfully");
        }
    }

    public Page createNewPage() {
        initialize();
        // Close the previous context so each test starts with clean cookies and storage
        if (currentContext != null) currentContext.close();
        currentContext = browser.newContext(
                new Browser.NewContextOptions().setViewportSize(1280, 720)
        );
        currentPage = currentContext.newPage();
        // Setup listeners
        currentPage.onConsoleMessage(msg ->
                log.debug("[Browser Console] {}: {}", msg.type(), msg.text())
        );
        return currentPage;
    }

    public String captureScreenshotBase64() {
        if (currentPage == null) throw new IllegalStateException("No active page");
        byte[] screenshot = currentPage.screenshot();
        return java.util.Base64.getEncoder().encodeToString(screenshot);
    }
    // ... cleanup methods ...
}
5.3 Browser Tools
package com.example.aiqaagent.tools;

import com.example.aiqaagent.config.PlaywrightConfig;
import com.microsoft.playwright.Locator;
import com.microsoft.playwright.Page;
import com.microsoft.playwright.TimeoutError;
import com.microsoft.playwright.options.LoadState;
import dev.langchain4j.agent.tool.Tool;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Component;
// ... imports ...

@Component
@RequiredArgsConstructor
public class BrowserTools {
    private final PlaywrightConfig playwrightConfig;

    @Tool("Open specified URL. Returns page title.")
    public String navigateTo(String url) {
        Page page = playwrightConfig.getCurrentPage();
        page.navigate(url);
        page.waitForLoadState(LoadState.NETWORKIDLE);
        return "Loaded page: " + page.title();
    }

    @Tool("Click button/link containing text. Matches exact or partial.")
    public String clickByText(String text) {
        Page page = playwrightConfig.getCurrentPage();
        try {
            Locator locator = page.getByText(text);
            locator.first().waitFor();
            locator.first().click();
            return "Clicked element containing: " + text;
        } catch (TimeoutError e) {
            // Report the miss as text so the model can try an alternative action
            return "Could not find clickable element with text: " + text;
        }
    }

    @Tool("Get interactive elements (buttons, links, inputs) on current page.")
    public String getInteractiveElements() {
        // Implementation to scan DOM and return list of interactive elements
        // Returns formatted string list
        return "...";
    }
    // ... other tools like fillInput, scroll, pressKey ...
}
5.4 Vision Tools
package com.example.aiqaagent.tools;

import com.example.aiqaagent.config.PlaywrightConfig;
import dev.langchain4j.agent.tool.Tool;
import dev.langchain4j.data.message.*;
import dev.langchain4j.model.chat.ChatLanguageModel;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.stereotype.Component;
// ... imports ...

@Component
public class VisionTools {
    private final PlaywrightConfig playwrightConfig;
    private final ChatLanguageModel visionModel;

    // Explicit constructor: Lombok's @RequiredArgsConstructor does not copy a field-level
    // @Qualifier onto the constructor parameter unless lombok.config is set up to do so.
    public VisionTools(PlaywrightConfig playwrightConfig,
                       @Qualifier("primaryChatModel") ChatLanguageModel visionModel) {
        this.playwrightConfig = playwrightConfig;
        this.visionModel = visionModel;
    }

    private static final String VISION_LOCATE_PROMPT = """
            You are a Vision Analysis Assistant.
            Task: Locate the element matching this description in the screenshot: %s
            Return CENTER coordinates:
            COORDINATES: x=123, y=456
            If not found:
            NOT_FOUND: Reason
            """;

    @Tool("Locate and click an element using Vision AI based on visual description.")
    public String clickByVision(String visualDescription) {
        String screenshotBase64 = playwrightConfig.captureScreenshotBase64();
        UserMessage msg = UserMessage.from(
                ImageContent.from(screenshotBase64, "image/png"),
                TextContent.from(String.format(VISION_LOCATE_PROMPT, visualDescription))
        );
        String response = visionModel.generate(msg).content().text();
        // Parse coordinates and click using Playwright
        // ... implementation ...
        return "Clicked via vision at " + response;
    }
}
5.5 Diagnostic & Data Tools
(Conceptual only, as in the Chinese version of this document: DiagnosticTools collects logs and screenshots, and DataTools manages the Testcontainers PostgreSQL instance.)
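One piece of DiagnosticTools that is easy to illustrate is the log cap set by `ai-qa.loops.diagnosis.max-log-lines` in Section 4.2. The sketch below is an assumption about how such a cap might work, not the project's code; `BoundedLogBuffer` is a hypothetical name.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Keeps only the most recent maxLines log lines, dropping the oldest first. */
class BoundedLogBuffer {
    private final int maxLines;
    private final Deque<String> lines = new ArrayDeque<>();

    BoundedLogBuffer(int maxLines) {
        if (maxLines <= 0) throw new IllegalArgumentException("maxLines must be positive");
        this.maxLines = maxLines;
    }

    /** Appends one line, evicting the oldest line when the buffer is full. */
    synchronized void append(String line) {
        if (lines.size() == maxLines) lines.removeFirst();
        lines.addLast(line);
    }

    /** Returns the retained lines, oldest first, joined by newlines. */
    synchronized String dump() {
        return String.join("\n", lines);
    }
}
```

A buffer created with `new BoundedLogBuffer(500)` could be fed from the browser console listener, and `dump()` would supply the console portion of the evidence sent to the model.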
6. The 3-Loop Verification Strategy: State Machine Design
6.1 Loop Logic
Stability Loop: Handles flaky tests.
- If an action fails with transient errors (Timeout, 503), retry N times.
- If success rate < 100% but > 0%, mark as “Flaky” but Passed.
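The stability rules above can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the project's implementation: `StabilityLoop`, `Verdict`, and `TransientException` are hypothetical names, and the constructor arguments mirror `max-retries` and `retry-delay-ms` from Section 4.2.

```java
/** Hypothetical marker for retryable failures such as timeouts or HTTP 503. */
class TransientException extends RuntimeException {
    TransientException(String message) { super(message); }
}

/** Outcome of one action after the stability loop has run. */
enum Verdict { PASSED, FLAKY_PASSED, FAILED }

class StabilityLoop {
    private final int maxRetries;
    private final long retryDelayMs;

    StabilityLoop(int maxRetries, long retryDelayMs) {
        this.maxRetries = maxRetries;
        this.retryDelayMs = retryDelayMs;
    }

    /** Runs the action once, then retries up to maxRetries times on transient errors only. */
    Verdict run(Runnable action) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            try {
                action.run();
                // Success on the first attempt is a clean pass; success after a retry is flaky.
                return attempt == 0 ? Verdict.PASSED : Verdict.FLAKY_PASSED;
            } catch (TransientException e) {
                // Wait before the next attempt, unless this was the last one.
                if (attempt < maxRetries) sleep(retryDelayMs);
            }
            // Any other exception propagates: hard failures belong to the Diagnosis Loop.
        }
        return Verdict.FAILED;
    }

    private static void sleep(long ms) {
        try {
            Thread.sleep(ms);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }
}
```

With `new StabilityLoop(3, 1000)`, an action that fails twice with a transient error and then succeeds yields `FLAKY_PASSED`, while an action that fails on all four attempts yields `FAILED` and would be handed to the Diagnosis Loop.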
Diagnosis Loop: Handles hard failures.
- Collects Evidence (Screenshot + Console + Backend Logs via TraceId).
- Asks AI to analyze Root Cause (Frontend vs. Backend vs. Data).
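A rough sketch of the Diagnosis Loop's data flow, using only the JDK: evidence is bundled, rendered into the Diagnosis Prompt from Section 7.2, and the model's reply is scanned for a root-cause category. The `Evidence` record, its fields, and `DiagnosisPrompt` are hypothetical, and the actual model call is left out.

```java
import java.util.Arrays;
import java.util.Optional;

/** Root-cause categories, taken from the Diagnosis Prompt in Section 7.2. */
enum RootCause { FRONTEND_BUG, BACKEND_BUG, ENVIRONMENT, TEST_SCRIPT, DATA_ISSUE }

/** Hypothetical evidence bundle; a screenshot would be sent separately as an image. */
record Evidence(String consoleLog, String backendLog, String traceId) {
    String asText() {
        return "TraceId: " + traceId + "\nConsole:\n" + consoleLog + "\nBackend:\n" + backendLog;
    }
}

class DiagnosisPrompt {
    private static final String TEMPLATE = """
            Analyze this test failure.
            Action: {action_description}
            Error: {error_message}
            Evidence: {evidence}
            """;

    /** Fills the three placeholders of the diagnosis template. */
    static String render(String action, String error, Evidence evidence) {
        return TEMPLATE
                .replace("{action_description}", action)
                .replace("{error_message}", error)
                .replace("{evidence}", evidence.asText());
    }

    /** Returns the first category, in enum order, whose name appears in the reply. */
    static Optional<RootCause> parseCategory(String modelReply) {
        return Arrays.stream(RootCause.values())
                .filter(c -> modelReply.contains(c.name()))
                .findFirst();
    }
}
```

An empty `Optional` from `parseCategory` is a useful signal in its own right: the reply did not follow the requested format and should probably go to a human.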
Exploration Loop: Handles path planning.
- Determines next action based on Goal and Page State.
- Uses an RL-like approach to maximize coverage of unexplored paths.
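The plan does not specify the RL-like approach. As one simple stand-in, the sketch below uses a count-based heuristic: on each page, prefer the action tried least often so far. `ExplorationPolicy` and its key format are illustrative assumptions, and a real policy would also weigh the test goal.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Picks the least-tried action on each page, a simple count-based exploration heuristic. */
class ExplorationPolicy {
    // Visit counts keyed by "pageId -> action"; the key format is an illustrative choice.
    private final Map<String, Integer> visits = new HashMap<>();

    /** Returns the candidate tried least often on this page; ties go to the earliest candidate. */
    String nextAction(String pageId, List<String> candidates) {
        if (candidates.isEmpty()) throw new IllegalArgumentException("no candidate actions");
        String best = candidates.get(0);
        for (String action : candidates) {
            if (count(pageId, action) < count(pageId, best)) best = action;
        }
        // Record the choice so the next call favors a different action.
        visits.merge(key(pageId, best), 1, Integer::sum);
        return best;
    }

    private int count(String pageId, String action) {
        return visits.getOrDefault(key(pageId, action), 0);
    }

    private static String key(String pageId, String action) {
        return pageId + " -> " + action;
    }
}
```

Called repeatedly with the same three candidates on one page, the policy cycles through all of them before repeating any, which is the coverage-first behavior the Exploration Loop is after.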
7. Prompt Engineering
7.1 System Prompt
You are a Senior QA Automation Engineer.
Capabilities:
1. Test Planning: Plan paths based on business goals.
2. Execution: Operate browser via tools.
3. Diagnosis: Analyze root causes upon failure.
4. Exploration: Proactively find edge cases.
Guidelines:
- Verify result after every step.
- If action fails, try alternatives before reporting failure.
- Collect evidence (screenshots) regularly.
- When diagnosing, distinguish between Frontend, Backend, and Environment issues.
7.2 Diagnosis Prompt
Analyze this test failure.
Action: {action_description}
Error: {error_message}
Evidence: {evidence}
Provide:
1. Root Cause Category (FRONTEND_BUG, BACKEND_BUG, ENVIRONMENT, TEST_SCRIPT, DATA_ISSUE)
2. Description
3. Technical Details
4. Suggested Fix
5. Confidence Level
8. Integration Layer Implementation
8.1 AutonomousTester Interface
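public interface AutonomousTester {
    // In LangChain4j AI Services, @SystemMessage goes on the method that calls the model,
    // and void return types are not supported. @Tool methods live on the tool classes
    // (e.g. BrowserTools, VisionTools) registered via AiServices.builder(...).tools(...),
    // so the atomic browser action is not declared on this interface.
    // {{goal}} is rendered with TestGoal.toString(); the default template does not
    // resolve nested properties such as goal.description.
    @SystemMessage("...")
    @UserMessage("Test Goal: {{goal}}")
    TestReport performTest(@MemoryId String testId, @V("goal") TestGoal goal);
}
8.2 TestOrchestratorService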
Orchestrates the LoopStateMachine and calls AutonomousTester. Manages the lifecycle of Playwright pages and Testcontainers.
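The transitions inside LoopStateMachine are not spelled out in this plan. The following sketch shows one plausible reading of Section 6, where exploration is the default state, transient errors hand off to the Stability Loop, and hard failures or exhausted retries hand off to the Diagnosis Loop. The `LoopState` and `StepResult` values are assumptions made for illustration.

```java
/** Hypothetical states; only the three loop names come from Section 6. */
enum LoopState { EXPLORATION, STABILITY, DIAGNOSIS, DONE }

/** Hypothetical outcome of the step that just ran. */
enum StepResult { ACTION_OK, TRANSIENT_ERROR, RETRIES_EXHAUSTED, HARD_FAILURE, GOAL_REACHED }

class LoopStateMachine {
    private LoopState state = LoopState.EXPLORATION;

    LoopState state() { return state; }

    /** Applies one step result and returns the new state. */
    LoopState advance(StepResult result) {
        state = switch (state) {
            case EXPLORATION -> switch (result) {
                case ACTION_OK -> LoopState.EXPLORATION;
                case TRANSIENT_ERROR -> LoopState.STABILITY;
                case RETRIES_EXHAUSTED, HARD_FAILURE -> LoopState.DIAGNOSIS;
                case GOAL_REACHED -> LoopState.DONE;
            };
            case STABILITY -> switch (result) {
                case ACTION_OK -> LoopState.EXPLORATION;      // retry succeeded: flaky but passed
                case TRANSIENT_ERROR -> LoopState.STABILITY;  // retry again
                case RETRIES_EXHAUSTED, HARD_FAILURE -> LoopState.DIAGNOSIS;
                case GOAL_REACHED -> LoopState.DONE;
            };
            // Diagnosis produces the report and ends the run; DONE is terminal.
            case DIAGNOSIS, DONE -> LoopState.DONE;
        };
        return state;
    }
}
```

Keeping the transitions in one exhaustive `switch` means the compiler flags any state or result that is added later without a matching transition.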
9. Evaluation Framework and Metrics
| Metric | Phase 1 (PoC) | Phase 2 (MVP) | Phase 3 (Prod) |
|---|---|---|---|
| Diagnosis Accuracy | > 80% | > 90% | > 95% |
| Script Survival Rate | N/A | > 95% | > 99% |
| New Bugs Found | N/A | N/A | > 10/month |
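The table gives targets but no formulas. The sketch below assumes the most direct definitions: accuracy is the share of diagnoses that match a human-confirmed root cause, and survival rate is the share of scripts that still pass after a UI change without manual fixes. `QaMetrics` is a hypothetical helper.

```java
/** Assumed metric definitions; the plan states targets but not formulas. */
class QaMetrics {
    /** Share of diagnoses whose category matched the human-confirmed root cause. */
    static double diagnosisAccuracy(int correct, int total) {
        return ratio(correct, total);
    }

    /** Share of scripts still passing, without manual fixes, after a UI change. */
    static double scriptSurvivalRate(int surviving, int total) {
        return ratio(surviving, total);
    }

    /** Targets in the table are strict ("> 80%"), so equality does not pass. */
    static boolean meetsTarget(double value, double target) {
        return value > target;
    }

    private static double ratio(int part, int total) {
        if (total <= 0) throw new IllegalArgumentException("total must be positive");
        return (double) part / total;
    }
}
```

For example, 41 correct diagnoses out of 50 gives 0.82, which clears the Phase 1 target of 0.80.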
10. Adoption Plan and Milestones
Phase 1: Diagnosis Assistant (8 Weeks)
- Goal: AI analyzes failure logs from existing CI pipelines.
- Deliverable: Automated Root Cause Analysis Report attached to Jenkins builds.
Phase 2: Visual & Self-Healing (12 Weeks)
- Goal: Implement VisionTools and Self-Healing locators.
- Deliverable: A test suite that survives major UI refactoring without manual fixes.
Phase 3: Autonomous Exploration (16 Weeks)
- Goal: Full “Nightly Build” exploration.
- Deliverable: Autonomous testing of core business flows with minimal human input.
11. Risk Assessment and Mitigation
| Risk | Assessment | Mitigation Strategy |
|---|---|---|
| High Token Cost | High | 1. Prioritize GPT-4o-mini or local LLMs for simple tasks. 2. Optimize Prompts to reduce token usage. 3. Implement strict cost monitoring and budget caps. 4. Use GPT-4o only for core validation and complex diagnosis. |
| AI Hallucination / Misjudgment | Medium | 1. Make System Prompts clearer and more specific. 2. Introduce human-in-the-loop review during the initial rollout and use the corrections to refine prompts. 3. Set confidence thresholds; low-confidence judgments require manual intervention. |
| Flaky Test Results | Medium | 1. The Stability Loop itself is a mitigation measure. 2. Optimize synchronization between Playwright screenshots and Vision analysis. 3. Provide full evidence (video, screenshots, logs) for human verification. |
| Data Privacy | Medium | 1. Strictly prohibit sending PII/sensitive production data to LLM APIs. 2. Anonymize test data. 3. Consider enterprise-grade solutions (e.g., Azure OpenAI Service) or local LLMs. |
| Long Implementation Time | Medium | 1. Adopt a phased approach (PoC, MVP, Production). 2. Set clear acceptance criteria for each milestone. 3. Ensure early involvement and feedback from developers and QA. |
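The confidence-threshold mitigation for AI misjudgment could look like the following sketch. The `Diagnosis` record, the `Routing` values, and any concrete threshold are illustrative assumptions, since the plan does not specify them.

```java
/** Hypothetical diagnosis result: a category label plus the model's self-reported confidence in [0, 1]. */
record Diagnosis(String category, double confidence) {}

/** Where a diagnosis goes next. */
enum Routing { AUTO_REPORT, MANUAL_REVIEW }

class ConfidenceGate {
    private final double threshold;

    ConfidenceGate(double threshold) {
        if (threshold < 0.0 || threshold > 1.0) {
            throw new IllegalArgumentException("threshold must be within [0, 1]");
        }
        this.threshold = threshold;
    }

    /** Diagnoses at or above the threshold are reported automatically; the rest go to a human. */
    Routing route(Diagnosis diagnosis) {
        return diagnosis.confidence() >= threshold ? Routing.AUTO_REPORT : Routing.MANUAL_REVIEW;
    }
}
```

Self-reported confidence from a language model is only a rough signal, so the threshold is best calibrated against the human-reviewed diagnoses collected during the PoC.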
12. Cost Estimation
12.1 OpenAI API Costs
Assumptions:
- GPT-4o Vision analysis per screenshot: ~$0.05
- GPT-4o complex reasoning (diagnosis, planning): ~$0.03
- GPT-4o-mini simple judgment: ~$0.001
| Scenario | Est. Runs / Day | Unit Cost | Daily Cost |
|---|---|---|---|
| Diagnosis Loop | 50 | $0.03 | $1.50 |
| Exploration Loop | 200 | $0.03 | $6.00 |
| Visual Location | 1000 | $0.05 | $50.00 |
| GPT-4o-mini | 5000 | $0.001 | $5.00 |
| Total | | | $62.50 |
Estimated Monthly Cost (Production): $62.50 * 22 working days = $1,375 USD
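The arithmetic behind the table can be reproduced with a few lines of Java, which also makes it easy to rerun the estimate when run counts or unit prices change. The figures are the table's own; `CostLine` and `CostEstimate` are hypothetical names, and the budget check refers to `budget-per-day-usd` from Section 4.2.

```java
import java.math.BigDecimal;
import java.util.List;

/** One row of the cost table: a scenario, its daily run count, and its unit cost in USD. */
record CostLine(String scenario, int runsPerDay, BigDecimal unitCostUsd) {
    BigDecimal dailyCost() {
        return unitCostUsd.multiply(BigDecimal.valueOf(runsPerDay));
    }
}

class CostEstimate {
    static final List<CostLine> LINES = List.of(
            new CostLine("Diagnosis Loop", 50, new BigDecimal("0.03")),
            new CostLine("Exploration Loop", 200, new BigDecimal("0.03")),
            new CostLine("Visual Location", 1000, new BigDecimal("0.05")),
            new CostLine("GPT-4o-mini", 5000, new BigDecimal("0.001")));

    /** Sum of all daily row costs. */
    static BigDecimal dailyTotal() {
        return LINES.stream().map(CostLine::dailyCost).reduce(BigDecimal.ZERO, BigDecimal::add);
    }

    /** Daily total multiplied by the number of working days in the month. */
    static BigDecimal monthlyTotal(int workingDays) {
        return dailyTotal().multiply(BigDecimal.valueOf(workingDays));
    }

    /** True when the daily total does not exceed the configured daily budget. */
    static boolean withinDailyBudget(BigDecimal budgetUsd) {
        return dailyTotal().compareTo(budgetUsd) <= 0;
    }
}
```

`BigDecimal` is used instead of `double` so that small unit prices such as $0.001 add up without rounding drift. Visual Location accounts for $50.00 of the $62.50 daily total, so it is the first place to look for savings.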
12.2 Human Resource Costs (PoC Phase – 8 Weeks)
| Role | Man-Months | Monthly Salary | Total |
|---|---|---|---|
| Senior Java Dev (AI Specialty) | 2 | $5,000 | $10,000 |
| Senior QA (Req. Definition) | 0.5 | $3,500 | $1,750 |
| Total | | | $11,750 USD |
12.3 Infrastructure Costs
Existing dev machines suffice for PoC. Production will require additional Docker Host or K8s node resources.
13. Decision Checkpoints
- After 8 Weeks (End of PoC): Is Diagnosis Accuracy > 80%? Is Diagnosis Time < 30s? If not, pivot or stop.
- After 20 Weeks (End of MVP): Is Script Survival Rate > 95%? Does the AI Co-pilot significantly boost QA efficiency? Is token cost within budget?
- After 36 Weeks (Production): Has the number of new bugs found significantly increased? Is coverage reaching 80%? Has QA focus successfully shifted from execution to strategy?
Resources should only be committed further if milestones are met at these checkpoints, gradually scaling the AI Autonomous Testing to more business modules.