Next-Gen QA: Implementing AI-Driven Multi-Turn Autonomous Acceptance Testing in Legacy Java Projects

Document Type: This is a technical implementation plan for evaluating the feasibility of introducing AI Autonomous Testing into legacy Java projects. The architectural design is based on current tool capabilities and is intended as a blueprint. It is recommended to perform a PoC to validate core assumptions before full-scale adoption.

Table of Contents

1. Project Objectives and Scope
2. Core Concept: From Automation to Autonomy
3. Technology Stack and Versions
4. Project Setup
5. Core Component Implementation
6. The 3-Loop Verification Strategy: State Machine Design
7. Prompt Engineering
8. Integration Layer Implementation
9. Evaluation Framework and Metrics
10. Adoption Plan and Milestones
11. Risk Assessment and Mitigation
12. Cost Estimation
13. Decision Checkpoints

1. Project Objectives and Scope

1.1 The Problem

In large-scale legacy Java projects, we face the following testing dilemmas:

Issue	Current Status	Impact
Insufficient Coverage	Unit Test coverage ~60%, E2E tests only cover Happy Paths	Frequent edge-case bugs in production
Fragile Scripts	Frontend DOM changes break 30% of Selenium tests	2-3 days spent fixing tests after every UI update
Inefficient Diagnosis	Avg. 2 hours to locate root cause after failure	Developer time wasted on debugging
Flaky Tests	~15% of tests fail intermittently	Low confidence in CI/CD; frequent manual re-runs

1.2 Project Goals

Phase 1 Goal (PoC, 8 Weeks):

Build an AI Diagnosis Assistant to automatically analyze root causes of test failures.
Target: Diagnosis accuracy > 80%, Avg. diagnosis time < 30s.

Phase 2 Goal (MVP, 12 Weeks):

Implement Visual Location capabilities to reduce test breakage from DOM changes.
Target: Increase test script survival rate from 70% to 95%.

Phase 3 Goal (Production, 16 Weeks):

Achieve Autonomous Exploratory Testing to discover edge cases that human-written tests do not cover.
Target: New bugs discovered > 10 per month.

1.3 Out of Scope

Load Testing / Performance Testing
Security Penetration Testing
Mobile App Testing (Web only)
Replacing existing Unit and Integration Tests

2. Core Concept: From Automation to Autonomy

2.1 Traditional Automation vs. AI Autonomy

Traditional Automation (Imperative):

Developer defines → Click #login-btn → Wait 2s → Assert URL contains /dashboard

Problems:
- Fixed paths; cannot handle unexpected situations
- Fragile element locators; breaks on DOM changes

2.2 The 3-Loop Verification Concept

The three loops are nested, from outermost to innermost:

Exploration Loop
└── Diagnosis Loop
    └── Stability Loop
        └── Execute Single Test Action

3. Technology Stack and Versions

3.1 Core Stack

Component	Tech Choice	Version	Rationale
LLM Orchestration	LangChain4j	0.35.0	Java-native, excellent Spring Boot integration, robust Tool Calling
LLM Model	GPT-4o	2024-08-06	Superior vision, stable reasoning, best Function Calling support
Backup Model	GPT-4o-mini	2024-07-18	Lower cost, used for simple judgments
Browser Automation	Playwright	1.48.0	More stable than Selenium, multi-browser support, official Java API
Test Containers	Testcontainers	1.20.3	Database isolation, consistent environment
Observability	Micrometer + OTLP	1.13.0	Spring Boot integration, TraceId propagation support

3.2 Dependency Compatibility Matrix

Spring Boot 3.3.x
├── Java 21 (required)
├── LangChain4j 0.35.0
│   ├── langchain4j-open-ai 0.35.0
│   └── langchain4j-spring-boot-starter 0.35.0
├── Playwright 1.48.0
│   └── playwright-java 1.48.0
└── Testcontainers 1.20.3
    └── postgresql 1.20.3

4. Project Setup

4.1 Maven `pom.xml`

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
         http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.example</groupId>
    <artifactId>ai-qa-agent</artifactId>
    <version>1.0.0-SNAPSHOT</version>
    <packaging>jar</packaging>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>3.3.5</version>
        <relativePath/>
    </parent>

    <properties>
        <java.version>21</java.version>
        <langchain4j.version>0.35.0</langchain4j.version>
        <playwright.version>1.48.0</playwright.version>
        <testcontainers.version>1.20.3</testcontainers.version>
    </properties>

    <dependencies>
        <!-- Spring Boot -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-actuator</artifactId>
        </dependency>

        <!-- LangChain4j -->
        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-spring-boot-starter</artifactId>
            <version>${langchain4j.version}</version>
        </dependency>
        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-open-ai</artifactId>
            <version>${langchain4j.version}</version>
        </dependency>

        <!-- Playwright -->
        <dependency>
            <groupId>com.microsoft.playwright</groupId>
            <artifactId>playwright</artifactId>
            <version>${playwright.version}</version>
        </dependency>

        <!-- Testcontainers -->
        <dependency>
            <groupId>org.testcontainers</groupId>
            <artifactId>testcontainers</artifactId>
            <version>${testcontainers.version}</version>
        </dependency>
        <dependency>
            <groupId>org.testcontainers</groupId>
            <artifactId>postgresql</artifactId>
            <version>${testcontainers.version}</version>
        </dependency>

        <!-- Observability -->
        <dependency>
            <groupId>io.micrometer</groupId>
            <artifactId>micrometer-tracing-bridge-otel</artifactId>
        </dependency>
        <dependency>
            <groupId>io.opentelemetry</groupId>
            <artifactId>opentelemetry-exporter-otlp</artifactId>
        </dependency>

        <!-- Utilities -->
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
        </dependency>

        <!-- Testing -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
            <!-- Install Playwright Browsers -->
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <version>3.1.0</version>
                <executions>
                    <execution>
                        <id>install-playwright-browsers</id>
                        <phase>generate-resources</phase>
                        <goals>
                            <goal>java</goal>
                        </goals>
                        <configuration>
                            <mainClass>com.microsoft.playwright.CLI</mainClass>
                            <arguments>
                                <argument>install</argument>
                                <argument>chromium</argument>
                            </arguments>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

4.2 `application.yml`

spring:
  application:
    name: ai-qa-agent

langchain4j:
  open-ai:
    chat-model:
      api-key: ${OPENAI_API_KEY}
      model-name: gpt-4o
      temperature: 0.1  # Stability is key for testing
      timeout: PT60S    # Vision analysis can be slow
      max-retries: 3
      log-requests: true
      log-responses: true

ai-qa:
  browser:
    headless: true
    viewport-width: 1280
    viewport-height: 720
    timeout-ms: 30000

  loops:
    stability:
      max-retries: 3
      retry-delay-ms: 1000
      flakiness-threshold: 0.8  # >80% success rate deemed flaky
    diagnosis:
      collect-screenshot: true
      collect-console-logs: true
      collect-network-logs: true
      max-log-lines: 500
    exploration:
      max-depth: 10
      max-actions-per-page: 20

  cost:
    budget-per-test-usd: 0.50
    budget-per-day-usd: 100.00

  reporting:
    output-dir: ./test-reports
    screenshot-format: png

# Target System Configuration
target:
  base-url: ${TARGET_BASE_URL:http://localhost:8080}
  api-base-url: ${TARGET_API_URL:http://localhost:8080/api}

# Actuator (For collecting backend logs in Diagnosis Loop)
management:
  endpoints:
    web:
      exposure:
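        # Spring Boot 3.x has no "trace" endpoint; recent HTTP exchanges are served by
        # "httpexchanges", which requires an HttpExchangeRepository bean.
        include: health,info,loggers,httpexchanges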
  tracing:
    sampling:
      probability: 1.0
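The `ai-qa.cost` limits in the configuration above need something to enforce them at runtime. The following is a minimal sketch of such a guard, assuming costs are reported per LLM call as decimal USD strings. `CostBudgetGuard` and `tryCharge` are hypothetical names, not part of any library, and the daily reset is left out.

```java
import java.math.BigDecimal;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical budget guard for the ai-qa.cost settings; names are illustrative. */
final class CostBudgetGuard {

    private final BigDecimal perTestLimit;
    private final BigDecimal perDayLimit;
    private final Map<String, BigDecimal> spentPerTest = new HashMap<>();
    private BigDecimal spentToday = BigDecimal.ZERO;

    CostBudgetGuard(String perTestUsd, String perDayUsd) {
        this.perTestLimit = new BigDecimal(perTestUsd);
        this.perDayLimit = new BigDecimal(perDayUsd);
    }

    /**
     * Records the cost and returns true if both budgets still hold afterwards.
     * Returns false, recording nothing, if either limit would be exceeded.
     */
    synchronized boolean tryCharge(String testId, String costUsd) {
        BigDecimal cost = new BigDecimal(costUsd);
        BigDecimal testTotal = spentPerTest.getOrDefault(testId, BigDecimal.ZERO).add(cost);
        BigDecimal dayTotal = spentToday.add(cost);
        if (testTotal.compareTo(perTestLimit) > 0 || dayTotal.compareTo(perDayLimit) > 0) {
            return false;
        }
        spentPerTest.put(testId, testTotal);
        spentToday = dayTotal;
        return true;
    }
}
```

With the configured limits (0.50 per test, 100.00 per day), a test could make ten vision calls at the assumed $0.05 each before the guard rejects further calls for that test.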

5. Core Component Implementation

5.1 OpenAI Configuration

package com.example.aiqaagent.config;

import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.openai.OpenAiChatModel;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import java.time.Duration;

@Configuration
public class OpenAiConfig {

    @Value("${langchain4j.open-ai.chat-model.api-key}")
    private String apiKey;

    /**
     * Primary Model: GPT-4o, for complex reasoning and vision.
     */
    @Bean
    public ChatLanguageModel primaryChatModel() {
        return OpenAiChatModel.builder()
                .apiKey(apiKey)
                .modelName("gpt-4o")
                .temperature(0.1)
                .timeout(Duration.ofSeconds(60))
                .maxRetries(3)
                .logRequests(true)
                .logResponses(true)
                .build();
    }

    /**
     * Lightweight Model: GPT-4o-mini, for cost-saving simple tasks.
     */
    @Bean
    public ChatLanguageModel lightweightChatModel() {
        return OpenAiChatModel.builder()
                .apiKey(apiKey)
                .modelName("gpt-4o-mini")
                .temperature(0.1)
                .timeout(Duration.ofSeconds(30))
                .maxRetries(3)
                .build();
    }
}

5.2 Playwright Config & Lifecycle

package com.example.aiqaagent.config;

import com.microsoft.playwright.*;
import jakarta.annotation.PreDestroy;
import lombok.Getter;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Configuration;

@Slf4j
@Configuration
public class PlaywrightConfig {

    @Value("${ai-qa.browser.headless:true}")
    private boolean headless;

    // ... viewport properties ...

    private Playwright playwright;
    private Browser browser;

    @Getter
    private volatile BrowserContext currentContext;

    @Getter
    private volatile Page currentPage;

    public synchronized void initialize() {
        if (playwright == null) {
            log.info("Initializing Playwright...");
            playwright = Playwright.create();
            browser = playwright.chromium().launch(
                new BrowserType.LaunchOptions()
                    .setHeadless(headless)
            );
            log.info("Playwright initialized successfully");
        }
    }

    public Page createNewPage() {
        initialize();
        if (currentContext != null) currentContext.close();

        currentContext = browser.newContext(
            new Browser.NewContextOptions().setViewportSize(1280, 720)
        );

        currentPage = currentContext.newPage();
        
        // Setup listeners
        currentPage.onConsoleMessage(msg ->
            log.debug("[Browser Console] {}: {}", msg.type(), msg.text())
        );

        return currentPage;
    }

    public String captureScreenshotBase64() {
        if (currentPage == null) throw new IllegalStateException("No active page");
        byte[] screenshot = currentPage.screenshot();
        return java.util.Base64.getEncoder().encodeToString(screenshot);
    }
    
    // ... cleanup methods ...
}

5.3 Browser Tools

package com.example.aiqaagent.tools;

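import com.example.aiqaagent.config.PlaywrightConfig;
import com.microsoft.playwright.Locator;
import com.microsoft.playwright.Page;
import com.microsoft.playwright.TimeoutError;
import com.microsoft.playwright.options.LoadState;
import dev.langchain4j.agent.tool.Tool;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Component;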

@Component
@RequiredArgsConstructor
public class BrowserTools {

    private final PlaywrightConfig playwrightConfig;

    @Tool("Open specified URL. Returns page title.")
    public String navigateTo(String url) {
        Page page = playwrightConfig.getCurrentPage();
        page.navigate(url);
        page.waitForLoadState(LoadState.NETWORKIDLE);
        return "Loaded page: " + page.title();
    }

    @Tool("Click button/link containing text. Matches exact or partial.")
    public String clickByText(String text) {
        Page page = playwrightConfig.getCurrentPage();
        try {
            Locator locator = page.getByText(text);
            locator.first().waitFor();
            locator.first().click();
            return "Clicked element containing: " + text;
        } catch (TimeoutError e) {
            return "Could not find clickable element with text: " + text;
        }
    }

    @Tool("Get interactive elements (buttons, links, inputs) on current page.")
    public String getInteractiveElements() {
        // Implementation to scan DOM and return list of interactive elements
        // Returns formatted string list
        return "..."; 
    }
    
    // ... other tools like fillInput, scroll, pressKey ...
}

5.4 Vision Tools

package com.example.aiqaagent.tools;

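import com.example.aiqaagent.config.PlaywrightConfig;
import dev.langchain4j.agent.tool.Tool;
import dev.langchain4j.data.message.ImageContent;
import dev.langchain4j.data.message.TextContent;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.model.chat.ChatLanguageModel;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.stereotype.Component;

@Component
public class VisionTools {

    private final PlaywrightConfig playwrightConfig;
    private final ChatLanguageModel visionModel;

    // Explicit constructor: by default Lombok's @RequiredArgsConstructor does not
    // copy a field-level @Qualifier onto the constructor parameter, so Spring
    // could not choose between the two ChatLanguageModel beans.
    public VisionTools(PlaywrightConfig playwrightConfig,
                       @Qualifier("primaryChatModel") ChatLanguageModel visionModel) {
        this.playwrightConfig = playwrightConfig;
        this.visionModel = visionModel;
    }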

    private static final String VISION_LOCATE_PROMPT = """
        You are a Vision Analysis Assistant.
        Task: Locate the element matching this description in the screenshot: %s
        
        Return CENTER coordinates:
        COORDINATES: x=123, y=456
        
        If not found:
        NOT_FOUND: Reason
        """;

    @Tool("Locate an element by its visual description using Vision AI, then click it.")
    public String clickByVision(String visualDescription) {
        String screenshotBase64 = playwrightConfig.captureScreenshotBase64();
        
        UserMessage msg = UserMessage.from(
            ImageContent.from(screenshotBase64, "image/png"),
            TextContent.from(String.format(VISION_LOCATE_PROMPT, visualDescription))
        );

        String response = visionModel.generate(msg).content().text();
        
        // Parse coordinates and click using Playwright
        // ... implementation ...
        
        return "Clicked via vision at " + response;
    }
}
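The coordinate-parsing step elided in `clickByVision` can be sketched as follows, assuming the model follows the `COORDINATES: x=..., y=...` / `NOT_FOUND: ...` reply format requested in the prompt. `VisionResponseParser` and `Point` are hypothetical names. Rejecting coordinates outside the viewport is an added safety assumption, since a hallucinated position should not be clicked.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical parser for the vision model's reply; names are illustrative. */
final class VisionResponseParser {

    /** Pixel position inside the screenshot. */
    record Point(int x, int y) {}

    // Accepts optional whitespace around "=" and ","; at most five digits per value.
    private static final Pattern COORDINATES =
            Pattern.compile("COORDINATES:\\s*x\\s*=\\s*(\\d{1,5})\\s*,\\s*y\\s*=\\s*(\\d{1,5})");

    /** Returns the point, or empty for NOT_FOUND, malformed, or out-of-viewport replies. */
    static Optional<Point> parse(String response, int viewportWidth, int viewportHeight) {
        if (response == null) {
            return Optional.empty();
        }
        Matcher matcher = COORDINATES.matcher(response);
        if (!matcher.find()) {
            return Optional.empty();
        }
        int x = Integer.parseInt(matcher.group(1));
        int y = Integer.parseInt(matcher.group(2));
        if (x >= viewportWidth || y >= viewportHeight) {
            return Optional.empty();
        }
        return Optional.of(new Point(x, y));
    }
}
```

A non-empty result could then be passed to Playwright's `page.mouse().click(x, y)`. This assumes a device scale factor of 1, so that screenshot pixels match viewport coordinates. An empty result should be reported to the agent as a failure rather than as a click.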

5.5 Diagnostic & Data Tools

Conceptual only: DiagnosticTools collects logs and screenshots as evidence, and DataTools manages the Testcontainers PostgreSQL instance.
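As one illustration of the evidence-collection side, the sketch below selects backend log lines for a failing request and caps them at the configured `max-log-lines`. It assumes each log line contains the trace ID as plain text. `EvidenceSelector` is a hypothetical name.

```java
import java.util.List;

/** Hypothetical helper for trimming backend logs before they are sent to the LLM. */
final class EvidenceSelector {

    /** Keeps the lines mentioning traceId, then only the last maxLines of those. */
    static List<String> select(List<String> logLines, String traceId, int maxLines) {
        List<String> matching = logLines.stream()
                .filter(line -> line.contains(traceId))
                .toList();
        // The most recent lines are closest to the failure, so older ones are dropped first.
        int from = Math.max(0, matching.size() - Math.max(0, maxLines));
        return matching.subList(from, matching.size());
    }
}
```

Capping the evidence this way also bounds the token cost of each diagnosis call.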

6. The 3-Loop Verification Strategy: State Machine Design

6.1 Loop Logic

Stability Loop: Handles flaky tests.
- If an action fails with transient errors (Timeout, 503), retry N times.
- If success rate < 100% but > 0%, mark as “Flaky” but Passed.
Diagnosis Loop: Handles hard failures.
- Collects Evidence (Screenshot + Console + Backend Logs via TraceId).
- Asks AI to analyze Root Cause (Frontend vs. Backend vs. Data).
Exploration Loop: Handles path planning.
- Determines next action based on Goal and Page State.
- Uses a reinforcement-learning-like (RL-like) approach to maximize coverage of unknown paths.
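The Stability Loop rules above can be sketched as a small retry routine. This is an illustration under stated assumptions, not the project's implementation. `StabilityLoop`, `Verdict`, and `Result` are hypothetical names, and the transient-error check (a `java.util.concurrent.TimeoutException` or a message containing "503") is a stand-in for whatever the real browser and HTTP layers throw.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.TimeoutException;

/** Hypothetical sketch of the Stability Loop; names are illustrative. */
final class StabilityLoop {

    enum Verdict { PASSED, FLAKY_PASSED, FAILED }

    /** Verdict, number of attempts used, and the last error seen (null if none). */
    record Result(Verdict verdict, int attempts, Exception lastError) {}

    /** Assumption: only timeouts and HTTP 503 responses count as transient. */
    static boolean isTransient(Exception e) {
        return e instanceof TimeoutException
                || (e.getMessage() != null && e.getMessage().contains("503"));
    }

    /** Runs the action once, then retries up to maxRetries times on transient errors. */
    static Result run(Callable<Void> action, int maxRetries, long retryDelayMs) {
        Exception lastError = null;
        int attempts = 0;
        while (attempts <= maxRetries) {
            attempts++;
            try {
                action.call();
                // Passing after an earlier failure means 0% < success rate < 100%.
                Verdict verdict = attempts == 1 ? Verdict.PASSED : Verdict.FLAKY_PASSED;
                return new Result(verdict, attempts, lastError);
            } catch (Exception e) {
                lastError = e;
                if (!isTransient(e)) {
                    break; // Hard failure: hand over to the Diagnosis Loop.
                }
                if (attempts <= maxRetries) {
                    try {
                        Thread.sleep(retryDelayMs);
                    } catch (InterruptedException interrupted) {
                        Thread.currentThread().interrupt();
                        break;
                    }
                }
            }
        }
        return new Result(Verdict.FAILED, attempts, lastError);
    }
}
```

A `FAILED` result is the point where the Diagnosis Loop would take over, using `lastError` as part of its evidence.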

7. Prompt Engineering

7.1 System Prompt

You are a Senior QA Automation Engineer.

Capabilities:
1. Test Planning: Plan paths based on business goals.
2. Execution: Operate browser via tools.
3. Diagnosis: Analyze root causes upon failure.
4. Exploration: Proactively find edge cases.

Guidelines:
- Verify result after every step.
- If action fails, try alternatives before reporting failure.
- Collect evidence (screenshots) regularly.
- When diagnosing, distinguish between Frontend, Backend, and Environment issues.

7.2 Diagnosis Prompt

Analyze this test failure.

Action: {action_description}
Error: {error_message}
Evidence: {evidence}

Provide:
1. Root Cause Category (FRONTEND_BUG, BACKEND_BUG, ENVIRONMENT, TEST_SCRIPT, DATA_ISSUE)
2. Description
3. Technical Details
4. Suggested Fix
5. Confidence Level
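One way to fill the placeholders of this prompt and read back the category is sketched below. `DiagnosisPrompt`, `render`, and `parseCategory` are hypothetical names, the template is abbreviated to its first part, and the parser assumes the model writes one of the five category names verbatim.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper around the diagnosis prompt; not part of any library. */
final class DiagnosisPrompt {

    enum RootCause { FRONTEND_BUG, BACKEND_BUG, ENVIRONMENT, TEST_SCRIPT, DATA_ISSUE, UNKNOWN }

    static final String TEMPLATE = """
            Analyze this test failure.

            Action: {action_description}
            Error: {error_message}
            Evidence: {evidence}
            """;

    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    /** Fills every {name} placeholder in a single pass; unknown names are left as-is. */
    static String render(Map<String, String> values) {
        return PLACEHOLDER.matcher(TEMPLATE).replaceAll(match ->
                Matcher.quoteReplacement(values.getOrDefault(match.group(1), match.group())));
    }

    /** Returns the category named earliest in the reply, or UNKNOWN if none appears. */
    static RootCause parseCategory(String reply) {
        RootCause found = RootCause.UNKNOWN;
        int earliest = Integer.MAX_VALUE;
        for (RootCause candidate : RootCause.values()) {
            int index = candidate == RootCause.UNKNOWN ? -1 : reply.indexOf(candidate.name());
            if (index >= 0 && index < earliest) {
                earliest = index;
                found = candidate;
            }
        }
        return found;
    }
}
```

The single-pass substitution matters because evidence text is untrusted. A log line that happens to contain `{error_message}` is inserted literally instead of being expanded a second time.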

8. Integration Layer Implementation

8.1 `AutonomousTester` Interface

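import dev.langchain4j.service.MemoryId;
import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
import dev.langchain4j.service.V;

public interface AutonomousTester {

    // Browser actions are exposed through the @Tool methods of BrowserTools and
    // VisionTools, which are registered when the AI Service is built; the
    // interface itself declares only the chat entry point. Using @MemoryId also
    // requires a ChatMemoryProvider on the builder.
    @SystemMessage("...")
    @UserMessage("Test Goal: {{goalDescription}}")
    TestReport performTest(@MemoryId String testId,
                           @V("goalDescription") String goalDescription);
}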

8.2 `TestOrchestratorService`

Orchestrates the LoopStateMachine and calls AutonomousTester. Manages the lifecycle of Playwright pages and Testcontainers.

9. Evaluation Framework and Metrics

Metric	Phase 1 (PoC)	Phase 2 (MVP)	Phase 3 (Prod)
Diagnosis Accuracy	> 80%	> 90%	> 95%
Script Survival Rate	N/A	> 95%	> 99%
New Bugs Found	N/A	N/A	> 10/month
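The document does not define how these ratios are measured, so the sketch below assumes simple definitions: diagnosis accuracy is correct diagnoses over reviewed diagnoses, and script survival rate is scripts still passing after a UI change over scripts run. `QaMetrics` is a hypothetical name.

```java
/** Hypothetical metric helpers; the ratio definitions are assumptions. */
final class QaMetrics {

    /** Plain ratio, e.g. correct diagnoses over reviewed diagnoses. */
    static double ratio(int numerator, int denominator) {
        if (denominator <= 0) {
            throw new IllegalArgumentException("denominator must be positive");
        }
        return (double) numerator / denominator;
    }

    /** The targets in the table are strict ("> 80%"), so equality does not pass. */
    static boolean meetsTarget(double value, double target) {
        return value > target;
    }
}
```

For example, 41 correct diagnoses out of 50 reviewed failures gives 0.82, which clears the Phase 1 target of 0.80, while exactly 40 out of 50 does not.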

10. Adoption Plan and Milestones

Phase 1: Diagnosis Assistant (8 Weeks)

Goal: AI analyzes failure logs from existing CI pipelines.
Deliverable: Automated Root Cause Analysis Report attached to Jenkins builds.

Phase 2: Visual & Self-Healing (12 Weeks)

Goal: Implement VisionTools and Self-Healing locators.
Deliverable: A test suite that survives major UI refactoring without manual fixes.

Phase 3: Autonomous Exploration (16 Weeks)

Goal: Full “Nightly Build” exploration.
Deliverable: Autonomous testing of core business flows with minimal human input.

11. Risk Assessment and Mitigation

Risk	Assessment	Mitigation Strategy
High Token Cost	High	1. Prioritize GPT-4o-mini or local LLMs for simple tasks. 2. Optimize Prompts to reduce token usage. 3. Implement strict cost monitoring and budget caps. 4. Use GPT-4o only for core validation and complex diagnosis.
AI Hallucination / Misjudgment	Medium	1. Make System Prompts clearer and more specific. 2. Introduce human-in-the-loop review of AI judgments during the initial rollout. 3. Set confidence thresholds; low-confidence judgments require manual intervention.
Flaky Test Results	Medium	1. The Stability Loop itself is a mitigation measure. 2. Optimize synchronization between Playwright screenshots and Vision analysis. 3. Provide full evidence (video, screenshots, logs) for human verification.
Data Privacy	Medium	1. Strictly prohibit sending PII/sensitive production data to LLM APIs. 2. Anonymize test data. 3. Consider enterprise-grade solutions (e.g., Azure OpenAI Service) or local LLMs.
Long Implementation Time	Medium	1. Adopt a phased approach (PoC, MVP, Production). 2. Set clear acceptance criteria for each milestone. 3. Ensure early involvement and feedback from developers and QA.

12. Cost Estimation

12.1 OpenAI API Costs

Assumptions:

GPT-4o Vision analysis per screenshot: ~$0.05
GPT-4o complex reasoning (diagnosis, planning): ~$0.03
GPT-4o-mini simple judgment: ~$0.001

Scenario	Est. Runs / Day	Unit Cost	Daily Cost
Diagnosis Loop	50	$0.03	$1.50
Exploration Loop	200	$0.03	$6.00
Visual Location	1000	$0.05	$50.00
GPT-4o-mini	5000	$0.001	$5.00
Total			$62.50

Estimated Monthly Cost (Production): $62.50 * 22 working days = $1,375 USD
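The arithmetic behind the table can be reproduced with a few lines of Java, which makes it easy to re-run the estimate when run counts or unit prices change. `ApiCostEstimate` and `Scenario` are hypothetical names. The figures are the assumptions stated above, not measured prices.

```java
import java.math.BigDecimal;
import java.util.List;

/** Hypothetical cost model reproducing the table in section 12.1. */
final class ApiCostEstimate {

    /** One table row: scenario name, estimated runs per day, and unit cost in USD. */
    record Scenario(String name, int runsPerDay, String unitCostUsd) {
        BigDecimal dailyCost() {
            return new BigDecimal(unitCostUsd).multiply(BigDecimal.valueOf(runsPerDay));
        }
    }

    /** Sum of the daily costs of all scenarios. */
    static BigDecimal dailyTotal(List<Scenario> scenarios) {
        return scenarios.stream()
                .map(Scenario::dailyCost)
                .reduce(BigDecimal.ZERO, BigDecimal::add);
    }

    /** Daily total multiplied by the number of working days in the month. */
    static BigDecimal monthlyTotal(List<Scenario> scenarios, int workingDays) {
        return dailyTotal(scenarios).multiply(BigDecimal.valueOf(workingDays));
    }
}
```

`BigDecimal` is used instead of `double` so that sums of values such as 0.03 and 0.001 stay exact.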

12.2 Human Resource Costs (PoC Phase – 8 Weeks)

Role	Person-Months	Monthly Salary	Total
Senior Java Dev (AI Specialty)	2	$5,000	$10,000
Senior QA (Req. Definition)	0.5	$3,500	$1,750
Total			$11,750 USD

12.3 Infrastructure Costs

Existing dev machines suffice for PoC. Production will require additional Docker Host or K8s node resources.

13. Decision Checkpoints

After 8 Weeks (End of PoC): Is Diagnosis Accuracy > 80%? Is Diagnosis Time < 30s? If not, pivot or stop.
After 20 Weeks (End of MVP): Is Script Survival Rate > 95%? Does the AI Co-pilot significantly boost QA efficiency? Is token cost within budget?
After 36 Weeks (Production): Has the number of new bugs found significantly increased? Is coverage reaching 80%? Has QA focus successfully shifted from execution to strategy?

Resources should only be committed further if milestones are met at these checkpoints, gradually scaling the AI Autonomous Testing to more business modules.
