Next-Gen QA: Implementing AI-Driven Multi-Turn Autonomous Acceptance Testing in Legacy Java Projects

Document Type: This is a technical implementation plan for evaluating the feasibility of introducing AI Autonomous Testing into legacy Java projects. The architectural design is based on current tool capabilities and is intended as a blueprint. Run a PoC to validate the core assumptions before full-scale adoption.


Table of Contents

  1. Project Objectives and Scope
  2. Core Concept: From Automation to Autonomy
  3. Technology Stack and Versions
  4. Project Setup
  5. Core Component Implementation
  6. The 3-Loop Verification Strategy: State Machine Design
  7. Prompt Engineering
  8. Integration Layer Implementation
  9. Evaluation Framework and Metrics
  10. Adoption Plan and Milestones
  11. Risk Assessment and Mitigation
  12. Cost Estimation
  13. Decision Checkpoints

1. Project Objectives and Scope

1.1 The Problem

In large-scale legacy Java projects, we face the following testing dilemmas:

| Issue | Current Status | Impact |
| --- | --- | --- |
| Insufficient Coverage | Unit Test coverage ~60%, E2E tests only cover Happy Paths | Frequent edge-case bugs in production |
| Fragile Scripts | Frontend DOM changes break 30% of Selenium tests | 2-3 days spent fixing tests after every UI update |
| Inefficient Diagnosis | Avg. 2 hours to locate root cause after failure | Developer time wasted on debugging |
| Flaky Tests | ~15% of tests fail intermittently | Low confidence in CI/CD; frequent manual re-runs |

1.2 Project Goals

Phase 1 Goal (PoC, 8 Weeks):

  • Build an AI Diagnosis Assistant to automatically analyze root causes of test failures.
  • Target: Diagnosis accuracy > 80%, Avg. diagnosis time < 30s.

Phase 2 Goal (MVP, 12 Weeks):

  • Implement Visual Location capabilities to reduce test breakage from DOM changes.
  • Target: Increase test script survival rate from 70% to 95%.

Phase 3 Goal (Production, 16 Weeks):

  • Achieve Autonomous Exploratory Testing to discover edge cases that human-written tests miss.
  • Target: New bugs discovered > 10 per month.

1.3 Out of Scope

  • Load Testing / Performance Testing
  • Security Penetration Testing
  • Mobile App Testing (Web only)
  • Replacing existing Unit and Integration Tests

2. Core Concept: From Automation to Autonomy

2.1 Traditional Automation vs. AI Autonomy

Traditional Automation (Imperative):
  Developer defines:
    - Click #login-btn
    - Wait 2s
    - Assert URL contains /dashboard
  Problems:
    - Fixed paths, cannot handle unexpected situations
    - Fragile element locators; breaks on DOM changes

2.2 The 3-Loop Verification Concept

Exploration Loop
└── Diagnosis Loop
    └── Stability Loop
        └── Execute Single Test Action
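
The nesting of the three loops can be read as a small state machine. The following is a minimal sketch, not the project's actual LoopStateMachine: the `Outcome` and `LoopState` enums, the method names, and the rule that a finished diagnosis ends the test are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical result of executing one test action. */
enum Outcome { SUCCESS, TRANSIENT_FAILURE, HARD_FAILURE }

/** Hypothetical states; each maps onto one of the three loops. */
enum LoopState {
    EXPLORING,   // Exploration Loop: choose the next action
    EXECUTING,   // Stability Loop: first attempt of the action
    RETRYING,    // Stability Loop: re-running after a transient failure
    DIAGNOSING,  // Diagnosis Loop: collecting evidence, asking the model
    DONE
}

final class LoopStateMachine {
    private final int maxRetries;
    private final List<LoopState> history = new ArrayList<>();
    private LoopState state = LoopState.EXPLORING;
    private int retriesUsed;

    LoopStateMachine(int maxRetries) {
        this.maxRetries = maxRetries;
        history.add(state);
    }

    /** The Exploration Loop has chosen an action; hand it to the Stability Loop. */
    void actionPlanned() {
        require(state == LoopState.EXPLORING);
        retriesUsed = 0;
        moveTo(LoopState.EXECUTING);
    }

    /** Feeds the result of the current attempt into the machine. */
    void onOutcome(Outcome outcome) {
        require(state == LoopState.EXECUTING || state == LoopState.RETRYING);
        switch (outcome) {
            case SUCCESS -> moveTo(LoopState.EXPLORING);
            case HARD_FAILURE -> moveTo(LoopState.DIAGNOSING);
            case TRANSIENT_FAILURE -> {
                if (retriesUsed < maxRetries) {
                    retriesUsed++;
                    moveTo(LoopState.RETRYING);
                } else {
                    moveTo(LoopState.DIAGNOSING); // retries exhausted
                }
            }
        }
    }

    /** The Diagnosis Loop has produced its report; this sketch then stops the test. */
    void diagnosisFinished() {
        require(state == LoopState.DIAGNOSING);
        moveTo(LoopState.DONE);
    }

    LoopState state() { return state; }

    List<LoopState> history() { return List.copyOf(history); }

    private void moveTo(LoopState next) {
        state = next;
        history.add(next);
    }

    private void require(boolean ok) {
        if (!ok) throw new IllegalStateException("Unexpected call in state " + state);
    }
}
```

Keeping the transitions in one class makes the retry budget and the hand-off to diagnosis explicit, which may help when the loops are later wired to real browser actions.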

3. Technology Stack and Versions

3.1 Core Stack

| Component | Tech Choice | Version | Rationale |
| --- | --- | --- | --- |
| LLM Orchestration | LangChain4j | 0.35.0 | Java-native, excellent Spring Boot integration, robust Tool Calling |
| LLM Model | GPT-4o | 2024-08-06 | Superior vision, stable reasoning, best Function Calling support |
| Backup Model | GPT-4o-mini | 2024-07-18 | Lower cost, used for simple judgments |
| Browser Automation | Playwright | 1.48.0 | More stable than Selenium, multi-browser support, official Java API |
| Test Containers | Testcontainers | 1.20.3 | Database isolation, consistent environment |
| Observability | Micrometer + OTLP | 1.13.0 | Spring Boot integration, TraceId propagation support |

3.2 Dependency Compatibility Matrix

Spring Boot 3.3.x
├── Java 21 (project baseline; Spring Boot 3.3 itself requires Java 17 or later)
├── LangChain4j 0.35.0
│   ├── langchain4j-open-ai 0.35.0
│   └── langchain4j-spring-boot-starter 0.35.0
├── Playwright 1.48.0
│   └── playwright-java 1.48.0
└── Testcontainers 1.20.3
    └── postgresql 1.20.3

4. Project Setup

4.1 Maven pom.xml

<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
         http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>com.example</groupId>
    <artifactId>ai-qa-agent</artifactId>
    <version>1.0.0-SNAPSHOT</version>
    <packaging>jar</packaging>

    <parent>
        <groupId>org.springframework.boot</groupId>
        <artifactId>spring-boot-starter-parent</artifactId>
        <version>3.3.5</version>
        <relativePath/>
    </parent>

    <properties>
        <java.version>21</java.version>
        <langchain4j.version>0.35.0</langchain4j.version>
        <playwright.version>1.48.0</playwright.version>
        <testcontainers.version>1.20.3</testcontainers.version>
    </properties>

    <dependencies>
        <!-- Spring Boot -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-web</artifactId>
        </dependency>
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-actuator</artifactId>
        </dependency>

        <!-- LangChain4j -->
        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-spring-boot-starter</artifactId>
            <version>${langchain4j.version}</version>
        </dependency>
        <dependency>
            <groupId>dev.langchain4j</groupId>
            <artifactId>langchain4j-open-ai</artifactId>
            <version>${langchain4j.version}</version>
        </dependency>

        <!-- Playwright -->
        <dependency>
            <groupId>com.microsoft.playwright</groupId>
            <artifactId>playwright</artifactId>
            <version>${playwright.version}</version>
        </dependency>

        <!-- Testcontainers -->
        <dependency>
            <groupId>org.testcontainers</groupId>
            <artifactId>testcontainers</artifactId>
            <version>${testcontainers.version}</version>
        </dependency>
        <dependency>
            <groupId>org.testcontainers</groupId>
            <artifactId>postgresql</artifactId>
            <version>${testcontainers.version}</version>
        </dependency>

        <!-- Observability -->
        <dependency>
            <groupId>io.micrometer</groupId>
            <artifactId>micrometer-tracing-bridge-otel</artifactId>
        </dependency>
        <dependency>
            <groupId>io.opentelemetry</groupId>
            <artifactId>opentelemetry-exporter-otlp</artifactId>
        </dependency>

        <!-- Utilities -->
        <dependency>
            <groupId>org.projectlombok</groupId>
            <artifactId>lombok</artifactId>
            <optional>true</optional>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
        </dependency>

        <!-- Testing -->
        <dependency>
            <groupId>org.springframework.boot</groupId>
            <artifactId>spring-boot-starter-test</artifactId>
            <scope>test</scope>
        </dependency>
    </dependencies>

    <build>
        <plugins>
            <plugin>
                <groupId>org.springframework.boot</groupId>
                <artifactId>spring-boot-maven-plugin</artifactId>
            </plugin>
            <!-- Install Playwright Browsers -->
            <plugin>
                <groupId>org.codehaus.mojo</groupId>
                <artifactId>exec-maven-plugin</artifactId>
                <version>3.1.0</version>
                <executions>
                    <execution>
                        <id>install-playwright-browsers</id>
                        <phase>generate-resources</phase>
                        <goals>
                            <goal>java</goal>
                        </goals>
                        <configuration>
                            <mainClass>com.microsoft.playwright.CLI</mainClass>
                            <arguments>
                                <argument>install</argument>
                                <argument>chromium</argument>
                            </arguments>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

4.2 application.yml

spring:
  application:
    name: ai-qa-agent

langchain4j:
  open-ai:
    chat-model:
      api-key: ${OPENAI_API_KEY}
      model-name: gpt-4o
      temperature: 0.1  # Stability is key for testing
      timeout: PT60S    # Vision analysis can be slow
      max-retries: 3
      log-requests: true
      log-responses: true

ai-qa:
  browser:
    headless: true
    viewport-width: 1280
    viewport-height: 720
    timeout-ms: 30000

  loops:
    stability:
      max-retries: 3
      retry-delay-ms: 1000
      flakiness-threshold: 0.8  # Success rate above 80% but below 100% across attempts: passed, flagged as flaky
    diagnosis:
      collect-screenshot: true
      collect-console-logs: true
      collect-network-logs: true
      max-log-lines: 500
    exploration:
      max-depth: 10
      max-actions-per-page: 20

  cost:
    budget-per-test-usd: 0.50
    budget-per-day-usd: 100.00

  reporting:
    output-dir: ./test-reports
    screenshot-format: png

# Target System Configuration
target:
  base-url: ${TARGET_BASE_URL:http://localhost:8080}
  api-base-url: ${TARGET_API_URL:http://localhost:8080/api}

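# Actuator (For collecting backend logs in Diagnosis Loop)
# Spring Boot 3.x has no "trace" endpoint; HTTP exchange history is exposed as
# "httpexchanges" and additionally needs an HttpExchangeRepository bean.
management:
  endpoints:
    web:
      exposure:
        include: health,info,loggers,httpexchanges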
  tracing:
    sampling:
      probability: 1.0

5. Core Component Implementation

5.1 OpenAI Configuration

package com.example.aiqaagent.config;

import dev.langchain4j.model.chat.ChatLanguageModel;
import dev.langchain4j.model.openai.OpenAiChatModel;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import java.time.Duration;

@Configuration
public class OpenAiConfig {

    @Value("${langchain4j.open-ai.chat-model.api-key}")
    private String apiKey;

    /**
     * Primary Model: GPT-4o, for complex reasoning and vision.
     */
    @Bean
    public ChatLanguageModel primaryChatModel() {
        return OpenAiChatModel.builder()
                .apiKey(apiKey)
                .modelName("gpt-4o")
                .temperature(0.1)
                .timeout(Duration.ofSeconds(60))
                .maxRetries(3)
                .logRequests(true)
                .logResponses(true)
                .build();
    }

    /**
     * Lightweight Model: GPT-4o-mini, for cost-saving simple tasks.
     */
    @Bean
    public ChatLanguageModel lightweightChatModel() {
        return OpenAiChatModel.builder()
                .apiKey(apiKey)
                .modelName("gpt-4o-mini")
                .temperature(0.1)
                .timeout(Duration.ofSeconds(30))
                .maxRetries(3)
                .build();
    }
}
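
With two model beans defined, callers need a rule for choosing between them. The following sketch shows one way to express that rule; the `TaskKind` enum and `ModelRouter` class are assumptions, and only the bean names `primaryChatModel` and `lightweightChatModel` come from the configuration above.

```java
/** Hypothetical task categories; the split follows the cost assumptions in this plan. */
enum TaskKind { VISION_LOCATE, DIAGNOSIS, PLANNING, SIMPLE_JUDGMENT }

final class ModelRouter {
    static final String PRIMARY_BEAN = "primaryChatModel";
    static final String LIGHTWEIGHT_BEAN = "lightweightChatModel";

    private ModelRouter() {}

    /** Returns the name of the Spring bean that should serve the given task. */
    static String beanNameFor(TaskKind kind) {
        return switch (kind) {
            // Vision and multi-step reasoning go to GPT-4o.
            case VISION_LOCATE, DIAGNOSIS, PLANNING -> PRIMARY_BEAN;
            // Cheap yes/no style checks go to GPT-4o-mini.
            case SIMPLE_JUDGMENT -> LIGHTWEIGHT_BEAN;
        };
    }
}
```

Because the switch expression is exhaustive over the enum, adding a new `TaskKind` without deciding on a model becomes a compile error rather than a silent default.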

5.2 Playwright Config & Lifecycle

package com.example.aiqaagent.config;

import com.microsoft.playwright.*;
import jakarta.annotation.PreDestroy;
import lombok.Getter;
import lombok.extern.slf4j.Slf4j;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.context.annotation.Configuration;

@Slf4j
@Configuration
public class PlaywrightConfig {

    @Value("${ai-qa.browser.headless:true}")
    private boolean headless;

    // ... viewport properties ...

    private Playwright playwright;
    private Browser browser;

    @Getter
    private volatile BrowserContext currentContext;

    @Getter
    private volatile Page currentPage;

    public synchronized void initialize() {
        if (playwright == null) {
            log.info("Initializing Playwright...");
            playwright = Playwright.create();
            browser = playwright.chromium().launch(
                new BrowserType.LaunchOptions()
                    .setHeadless(headless)
            );
            log.info("Playwright initialized successfully");
        }
    }

    public Page createNewPage() {
        initialize();
        if (currentContext != null) currentContext.close();

        currentContext = browser.newContext(
            new Browser.NewContextOptions().setViewportSize(1280, 720)
        );

        currentPage = currentContext.newPage();
        
        // Setup listeners
        currentPage.onConsoleMessage(msg ->
            log.debug("[Browser Console] {}: {}", msg.type(), msg.text())
        );

        return currentPage;
    }

    public String captureScreenshotBase64() {
        if (currentPage == null) throw new IllegalStateException("No active page");
        byte[] screenshot = currentPage.screenshot();
        return java.util.Base64.getEncoder().encodeToString(screenshot);
    }
    
    // ... cleanup methods ...
}

5.3 Browser Tools

package com.example.aiqaagent.tools;

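import com.example.aiqaagent.config.PlaywrightConfig;
import com.microsoft.playwright.Locator;
import com.microsoft.playwright.Page;
import com.microsoft.playwright.TimeoutError;
import com.microsoft.playwright.options.LoadState;
import dev.langchain4j.agent.tool.Tool;
import lombok.RequiredArgsConstructor;
import org.springframework.stereotype.Component;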

@Component
@RequiredArgsConstructor
public class BrowserTools {

    private final PlaywrightConfig playwrightConfig;

    @Tool("Open specified URL. Returns page title.")
    public String navigateTo(String url) {
        Page page = playwrightConfig.getCurrentPage();
        page.navigate(url);
        page.waitForLoadState(LoadState.NETWORKIDLE);
        return "Loaded page: " + page.title();
    }

    @Tool("Click button/link containing text. Matches exact or partial.")
    public String clickByText(String text) {
        Page page = playwrightConfig.getCurrentPage();
        try {
            Locator locator = page.getByText(text);
            locator.first().waitFor();
            locator.first().click();
            return "Clicked element containing: " + text;
        } catch (TimeoutError e) {
            return "Could not find clickable element with text: " + text;
        }
    }

    @Tool("Get interactive elements (buttons, links, inputs) on current page.")
    public String getInteractiveElements() {
        // Implementation to scan DOM and return list of interactive elements
        // Returns formatted string list
        return "..."; 
    }
    
    // ... other tools like fillInput, scroll, pressKey ...
}

5.4 Vision Tools

package com.example.aiqaagent.tools;

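import com.example.aiqaagent.config.PlaywrightConfig;
import dev.langchain4j.agent.tool.Tool;
import dev.langchain4j.data.message.ImageContent;
import dev.langchain4j.data.message.TextContent;
import dev.langchain4j.data.message.UserMessage;
import dev.langchain4j.model.chat.ChatLanguageModel;
import org.springframework.beans.factory.annotation.Qualifier;
import org.springframework.stereotype.Component;

@Component
public class VisionTools {

    private final PlaywrightConfig playwrightConfig;
    private final ChatLanguageModel visionModel;

    // Explicit constructor: by default Lombok does not copy a field-level @Qualifier
    // onto the generated constructor, so injection would be ambiguous with two model beans.
    public VisionTools(PlaywrightConfig playwrightConfig,
                       @Qualifier("primaryChatModel") ChatLanguageModel visionModel) {
        this.playwrightConfig = playwrightConfig;
        this.visionModel = visionModel;
    }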

    private static final String VISION_LOCATE_PROMPT = """
        You are a Vision Analysis Assistant.
        Task: Locate the element matching this description in the screenshot: %s
        
        Return CENTER coordinates:
        COORDINATES: x=123, y=456
        
        If not found:
        NOT_FOUND: Reason
        """;

    @Tool("Locate an element from a visual description using Vision AI and click its center.")
    public String clickByVision(String visualDescription) {
        String screenshotBase64 = playwrightConfig.captureScreenshotBase64();

        UserMessage msg = UserMessage.from(
            ImageContent.from(screenshotBase64, "image/png"),
            TextContent.from(String.format(VISION_LOCATE_PROMPT, visualDescription))
        );

        String response = visionModel.generate(msg).content().text();

        // Do not report a click when the model says it could not find the element.
        if (response == null || response.contains("NOT_FOUND")) {
            return "Vision could not locate element: " + response;
        }

        // Parse coordinates and click using Playwright
        // ... implementation ...

        return "Clicked via vision at " + response;
    }
}
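
The coordinate-parsing step is left open in the class above. As a minimal sketch under the assumption that the model follows the `COORDINATES: x=123, y=456` format from the prompt, the reply could be parsed as follows; `ClickPoint` and `VisionResponseParser` are hypothetical names.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical value type for a click target in viewport pixels. */
record ClickPoint(int x, int y) {
    /** True if the point lies inside a viewport of the given size. */
    boolean inside(int width, int height) {
        return x >= 0 && x < width && y >= 0 && y < height;
    }
}

final class VisionResponseParser {
    // Digits are capped at five so Integer.parseInt cannot overflow.
    private static final Pattern COORDS = Pattern.compile(
            "COORDINATES:\\s*x\\s*=\\s*(\\d{1,5})\\s*,\\s*y\\s*=\\s*(\\d{1,5})");

    private VisionResponseParser() {}

    /** Returns the coordinates if the reply contains them, otherwise empty. */
    static Optional<ClickPoint> parse(String response) {
        if (response == null) return Optional.empty();
        Matcher m = COORDS.matcher(response);
        if (!m.find()) return Optional.empty();
        return Optional.of(new ClickPoint(
                Integer.parseInt(m.group(1)),
                Integer.parseInt(m.group(2))));
    }
}
```

A caller would typically check `inside(1280, 720)` against the configured viewport before passing the point to Playwright's `page.mouse().click(x, y)`, since a model can return coordinates outside the screenshot.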

5.5 Diagnostic & Data Tools

DiagnosticTools and DataTools are described conceptually only: DiagnosticTools collects logs and screenshots, and DataTools manages the Testcontainers PostgreSQL instance.
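
To illustrate the evidence-collection side, the sketch below shows how logs might be capped before being sent to the model, in line with the `max-log-lines` setting. The `Evidence` record and `EvidenceCollector` class are hypothetical, and keeping the most recent lines (rather than the first ones) is an assumption, on the reasoning that lines closest to the failure tend to be the most relevant.

```java
import java.util.List;

/** Hypothetical bundle of evidence handed to the diagnosis prompt. */
record Evidence(String screenshotBase64, List<String> consoleLogs, List<String> backendLogs) {}

final class EvidenceCollector {
    private EvidenceCollector() {}

    /** Keeps only the last maxLines entries, preserving their order. */
    static List<String> tail(List<String> lines, int maxLines) {
        if (maxLines <= 0) return List.of();
        int from = Math.max(0, lines.size() - maxLines);
        return List.copyOf(lines.subList(from, lines.size()));
    }

    /** Builds an Evidence object with both log sources capped to maxLines. */
    static Evidence collect(String screenshotBase64,
                            List<String> consoleLogs,
                            List<String> backendLogs,
                            int maxLines) {
        return new Evidence(screenshotBase64,
                tail(consoleLogs, maxLines),
                tail(backendLogs, maxLines));
    }
}
```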


6. The 3-Loop Verification Strategy: State Machine Design

6.1 Loop Logic

  1. Stability Loop: Handles flaky tests.
    • If an action fails with a transient error (timeout, HTTP 503), retry up to N times.
    • If the success rate across attempts is below 100% but above 0%, mark the action as “Flaky” but Passed.
  2. Diagnosis Loop: Handles hard failures.
    • Collects evidence (screenshot, console output, and backend logs correlated via TraceId).
    • Asks the AI to analyze the root cause (Frontend vs. Backend vs. Data).
  3. Exploration Loop: Handles path planning.
    • Determines the next action based on the Goal and the Page State.
    • Uses an RL-like approach to maximize coverage of unknown paths.
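
The Stability Loop rules can be sketched as a small retry wrapper. This is an illustration under stated assumptions, not the project's implementation: the action reports a transient failure by returning `false`, hard failures are expected to surface as exceptions and bypass this loop, and the `Verdict`, `StabilityResult`, and `StabilityLoop` names are hypothetical.

```java
import java.util.function.BooleanSupplier;

enum Verdict { PASSED, FLAKY_PASSED, FAILED }

/** Hypothetical result: the verdict plus how many attempts were made. */
record StabilityResult(Verdict verdict, int attempts) {}

final class StabilityLoop {
    private final int maxRetries;
    private final long retryDelayMs;

    StabilityLoop(int maxRetries, long retryDelayMs) {
        this.maxRetries = maxRetries;
        this.retryDelayMs = retryDelayMs;
    }

    /**
     * Runs the action once plus up to maxRetries retries.
     * Success on the first attempt is PASSED; success after at least one
     * failed attempt is FLAKY_PASSED; no success at all is FAILED.
     */
    StabilityResult run(BooleanSupplier action) {
        int maxAttempts = 1 + maxRetries;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (action.getAsBoolean()) {
                Verdict verdict = (attempt == 1) ? Verdict.PASSED : Verdict.FLAKY_PASSED;
                return new StabilityResult(verdict, attempt);
            }
            // Wait before the next attempt; give up early if interrupted.
            if (attempt < maxAttempts && !pause()) {
                return new StabilityResult(Verdict.FAILED, attempt);
            }
        }
        return new StabilityResult(Verdict.FAILED, maxAttempts);
    }

    private boolean pause() {
        try {
            Thread.sleep(retryDelayMs);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }
}
```

With the configured values this would be constructed as `new StabilityLoop(3, 1000)`. A FAILED verdict is the natural point at which to hand control to the Diagnosis Loop.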

7. Prompt Engineering

7.1 System Prompt

You are a Senior QA Automation Engineer.

Capabilities:
1. Test Planning: Plan paths based on business goals.
2. Execution: Operate browser via tools.
3. Diagnosis: Analyze root causes upon failure.
4. Exploration: Proactively find edge cases.

Guidelines:
- Verify result after every step.
- If action fails, try alternatives before reporting failure.
- Collect evidence (screenshots) regularly.
- When diagnosing, distinguish between Frontend, Backend, and Environment issues.

7.2 Diagnosis Prompt

Analyze this test failure.

Action: {action_description}
Error: {error_message}
Evidence: {evidence}

Provide:
1. Root Cause Category (FRONTEND_BUG, BACKEND_BUG, ENVIRONMENT, TEST_SCRIPT, DATA_ISSUE)
2. Description
3. Technical Details
4. Suggested Fix
5. Confidence Level
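
Filling the placeholders and reading back the category can both be done with the standard library. The sketch below assumes the `{name}` placeholder syntax shown in the prompt and that the model echoes one of the five category names verbatim; `DiagnosisPrompt` and `RootCauseCategory` are hypothetical names.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

enum RootCauseCategory { FRONTEND_BUG, BACKEND_BUG, ENVIRONMENT, TEST_SCRIPT, DATA_ISSUE }

final class DiagnosisPrompt {
    private static final Pattern PLACEHOLDER = Pattern.compile("\\{(\\w+)\\}");

    private DiagnosisPrompt() {}

    /**
     * Replaces each {name} with its value in a single pass, so text inside a
     * value is never re-interpreted as a placeholder. Unknown names are left as-is.
     */
    static String render(String template, Map<String, String> values) {
        return PLACEHOLDER.matcher(template).replaceAll(match ->
                Matcher.quoteReplacement(
                        values.getOrDefault(match.group(1), match.group())));
    }

    /**
     * Returns the first category (in enum order) whose name appears in the reply,
     * ignoring case. Replies that mention several categories need stricter parsing.
     */
    static Optional<RootCauseCategory> extractCategory(String reply) {
        if (reply == null) return Optional.empty();
        String upper = reply.toUpperCase(Locale.ROOT);
        for (RootCauseCategory category : RootCauseCategory.values()) {
            if (upper.contains(category.name())) return Optional.of(category);
        }
        return Optional.empty();
    }
}
```

An empty result from `extractCategory` is a reasonable trigger for the manual-review path, since it means the model did not answer in the requested format.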

8. Integration Layer Implementation

8.1 AutonomousTester Interface

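import dev.langchain4j.service.MemoryId;
import dev.langchain4j.service.SystemMessage;
import dev.langchain4j.service.UserMessage;
import dev.langchain4j.service.V;

public interface AutonomousTester {

    // Browser actions are not declared here: the model reaches them through the
    // @Tool methods of BrowserTools and VisionTools registered on the AI service.
    // {{goal}} is filled from the @V("goal") argument, rendered as text.
    @SystemMessage("...")
    @UserMessage("Test Goal: {{goal}}")
    TestReport performTest(@MemoryId String testId, @V("goal") TestGoal goal);
}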

8.2 TestOrchestratorService

Orchestrates the LoopStateMachine and calls AutonomousTester. Manages the lifecycle of Playwright pages and Testcontainers.


9. Evaluation Framework and Metrics

| Metric | Phase 1 (PoC) | Phase 2 (MVP) | Phase 3 (Prod) |
| --- | --- | --- | --- |
| Diagnosis Accuracy | > 80% | > 90% | > 95% |
| Script Survival Rate | N/A | > 95% | > 99% |
| New Bugs Found | N/A | N/A | > 10/month |
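
How these metrics are computed is not spelled out, so the sketch below fixes one plausible set of definitions as assumptions: diagnosis accuracy is the share of human-reviewed diagnoses judged correct, and script survival rate is the share of scripts still passing after a UI change. The `QaMetrics` class is hypothetical; the Phase 1 thresholds come from the project goals.

```java
final class QaMetrics {
    private QaMetrics() {}

    /** Share of reviewed diagnoses that a human reviewer judged correct. */
    static double diagnosisAccuracy(int correct, int reviewed) {
        return ratio(correct, reviewed);
    }

    /** Share of scripts that still pass, unmodified, after a UI change. */
    static double scriptSurvivalRate(int stillPassing, int totalScripts) {
        return ratio(stillPassing, totalScripts);
    }

    /** Phase 1 gate: accuracy above 80% and average diagnosis time below 30 seconds. */
    static boolean passesPhase1Gate(double accuracy, double avgDiagnosisSeconds) {
        return accuracy > 0.80 && avgDiagnosisSeconds < 30.0;
    }

    private static double ratio(int part, int total) {
        if (total <= 0 || part < 0 || part > total) {
            throw new IllegalArgumentException("Invalid counts: " + part + " of " + total);
        }
        return (double) part / total;
    }
}
```

Note that the gate uses strict inequalities, so exactly 80% accuracy does not pass; whether that matches the intended reading of "> 80%" is worth confirming with stakeholders.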

10. Adoption Plan and Milestones

Phase 1: Diagnosis Assistant (8 Weeks)

  • Goal: AI analyzes failure logs from existing CI pipelines.
  • Deliverable: Automated Root Cause Analysis Report attached to Jenkins builds.

Phase 2: Visual & Self-Healing (12 Weeks)

  • Goal: Implement VisionTools and Self-Healing locators.
  • Deliverable: A test suite that survives major UI refactoring without manual fixes.

Phase 3: Autonomous Exploration (16 Weeks)

  • Goal: Full “Nightly Build” exploration.
  • Deliverable: Autonomous testing of core business flows with minimal human input.

11. Risk Assessment and Mitigation

| Risk | Assessment | Mitigation Strategy |
| --- | --- | --- |
| High Token Cost | High | 1. Prioritize GPT-4o-mini or local LLMs for simple tasks. 2. Optimize prompts to reduce token usage. 3. Implement strict cost monitoring and budget caps. 4. Use GPT-4o only for core validation and complex diagnosis. |
| AI Hallucination / Misjudgment | Medium | 1. Increase the clarity and precision of System Prompts. 2. Introduce human-in-the-loop review of AI judgments during the initial rollout. 3. Set confidence thresholds; low-confidence judgments require manual intervention. |
| Flaky Test Results | Medium | 1. The Stability Loop itself is a mitigation measure. 2. Optimize synchronization between Playwright screenshots and Vision analysis. 3. Provide full evidence (video, screenshots, logs) for human verification. |
| Data Privacy | Medium | 1. Strictly prohibit sending PII/sensitive production data to LLM APIs. 2. Anonymize test data. 3. Consider enterprise-grade solutions (e.g., Azure OpenAI Service) or local LLMs. |
| Long Implementation Time | Medium | 1. Adopt a phased approach (PoC, MVP, Production). 2. Set clear acceptance criteria for each milestone. 3. Ensure early involvement and feedback from developers and QA. |

12. Cost Estimation

12.1 OpenAI API Costs

Assumptions:

  • GPT-4o Vision analysis per screenshot: ~$0.05
  • GPT-4o complex reasoning (diagnosis, planning): ~$0.03
  • GPT-4o-mini simple judgment: ~$0.001

| Scenario | Est. Runs / Day | Unit Cost | Daily Cost |
| --- | --- | --- | --- |
| Diagnosis Loop | 50 | $0.03 | $1.50 |
| Exploration Loop | 200 | $0.03 | $6.00 |
| Visual Location | 1000 | $0.05 | $50.00 |
| GPT-4o-mini | 5000 | $0.001 | $5.00 |
| Total | | | $62.50 |

Estimated Monthly Cost (Production): $62.50 * 22 working days = $1,375 USD
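
The estimate can be reproduced in code so that run counts and unit costs are easy to vary. The figures below are the assumptions from this section; the `Scenario` record and `CostEstimate` class are hypothetical helpers.

```java
import java.math.BigDecimal;
import java.util.List;

/** One row of the cost table: runs per day times unit cost. */
record Scenario(String name, int runsPerDay, BigDecimal unitCostUsd) {
    BigDecimal dailyCost() {
        return unitCostUsd.multiply(BigDecimal.valueOf(runsPerDay));
    }
}

final class CostEstimate {
    // Values copied from the assumptions and table in this section.
    static final List<Scenario> SCENARIOS = List.of(
            new Scenario("Diagnosis Loop", 50, new BigDecimal("0.03")),
            new Scenario("Exploration Loop", 200, new BigDecimal("0.03")),
            new Scenario("Visual Location", 1000, new BigDecimal("0.05")),
            new Scenario("GPT-4o-mini", 5000, new BigDecimal("0.001")));

    private CostEstimate() {}

    /** Sum of all scenario costs for one day. */
    static BigDecimal dailyTotal() {
        return SCENARIOS.stream()
                .map(Scenario::dailyCost)
                .reduce(BigDecimal.ZERO, BigDecimal::add);
    }

    /** Daily total multiplied by the number of working days in the month. */
    static BigDecimal monthlyTotal(int workingDays) {
        return dailyTotal().multiply(BigDecimal.valueOf(workingDays));
    }
}
```

Visual Location accounts for $50.00 of the $62.50 daily total, so the number of vision calls per day is the parameter that most affects the monthly figure.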

12.2 Human Resource Costs (PoC Phase – 8 Weeks)

| Role | Man-Months | Monthly Salary | Total |
| --- | --- | --- | --- |
| Senior Java Dev (AI Specialty) | 2 | $5,000 | $10,000 |
| Senior QA (Req. Definition) | 0.5 | $3,500 | $1,750 |
| Total | | | $11,750 USD |

12.3 Infrastructure Costs

Existing dev machines suffice for PoC. Production will require additional Docker Host or K8s node resources.


13. Decision Checkpoints

  • After 8 Weeks (End of PoC): Is Diagnosis Accuracy > 80%? Is Diagnosis Time < 30s? If not, pivot or stop.
  • After 20 Weeks (End of MVP): Is Script Survival Rate > 95%? Does the AI Co-pilot significantly boost QA efficiency? Is token cost within budget?
  • After 36 Weeks (Production): Has the number of new bugs found significantly increased? Is coverage reaching 80%? Has QA focus successfully shifted from execution to strategy?

Resources should only be committed further if milestones are met at these checkpoints, gradually scaling the AI Autonomous Testing to more business modules.