🌏 Read the Chinese version
Your Team Started Using AI to Write Code
Last week, a PR was submitted in just two hours.
500 lines. Logic looked solid. No obvious issues.
During review, we found inconsistent code style, missing type definitions, and several functions that duplicated existing modules but with different implementations.
After fixing all of that, the total time saved fell short of expectations.
Another project tried using AI to refactor legacy code.
The AI cleaned it up. The logic was indeed clearer.
After deployment, three seemingly unrelated features broke.
This isn’t about AI being bad.
It’s that your codebase isn’t ready to let AI help.
⚡ 3-Second Summary
- Core issue: AI output quality ceiling = codebase maturity
- Framework: 5-level engineering foundation model (L1-L5)
- For: Teams adopting AI coding tools, managers evaluating AI ROI
- Not for: Those still learning how to operate AI tools (this is about “environment readiness,” not tool usage)
Part B: Why AI Isn’t Magic
The Ceiling of AI Tools Isn’t AI Itself
In 2025, 85% of developers are using AI tools to write code.
But according to Gartner, 43% of enterprises abandon AI projects due to “lack of technical maturity.”
S&P Global’s data is more direct: in 2025, 42% of companies abandoned most of their AI initiatives, more than doubling from 17% in 2024.
The key isn’t the tool itself.
The key is: how much automation can your codebase handle?
Think of it this way:
Even the best chef can’t make good food with bad ingredients.
Even the strongest AI can’t write good code with an inconsistent codebase.
The Real Problems Behind Those Two Cases
Back to the two scenarios from the opening:
| Symptom | Surface Problem | Real Problem |
|---|---|---|
| More PR rejections | AI writes bad code? | Codebase lacks unified style and type definitions |
| Refactor breaks three things | AI doesn’t understand business logic? | Too much tech debt, no test coverage |
AI just writes based on what it “sees.”
If what it sees is chaos, what it produces is chaos.
The Engineering Maturity Pyramid
This isn’t a new concept. Software engineering has long used “maturity models” to describe codebase governance levels.
Applied to AI adoption, it breaks down into 5 levels:
L5: Architecture Drift Correction
↑ AI can do system-level refactoring
L4: Tech Debt Cleanup
↑ AI changes won't cascade failures
L3: Dependency Updates & Security Patches
↑ AI output can safely go to production
L2: Types & Documentation
↑ AI can correctly infer intent
L1: Formatting & Import Organization
↑ AI output has consistent style
Each level is a prerequisite for the next.
Key Insight: The ceiling of AI tools isn’t the AI’s capability—it’s how much automation your codebase can handle.
Part A: 5-Level Adoption Roadmap
L1: Formatting & Import Organization
Goal: Make the codebase look like “one person wrote it”
How:
- Adopt Prettier / ESLint / gofmt / black
- CI enforces linting—PRs can’t merge without passing
- Remove unused imports
Completion Criteria:
- Lint errors = 0
- New PRs don’t get rejected for formatting issues
What AI Can Do:
- Produce style-consistent code
- No more “tabs here, spaces there” review comments
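Linters and formatters handle this step automatically, but the idea behind “remove unused imports” is simple enough to sketch with the Python standard library. This is a toy illustration of what tools like ESLint or flake8 do, not a replacement for them:

```python
import ast

def unused_imports(source: str) -> list[str]:
    """Return names imported in *source* that are never referenced."""
    tree = ast.parse(source)
    imported: set[str] = set()
    used: set[str] = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # "import a.b" binds the top-level name "a"
                imported.add(alias.asname or alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported.add(alias.asname or alias.name)
        elif isinstance(node, ast.Name):
            used.add(node.id)
    return sorted(imported - used)

print(unused_imports("import os\nimport sys\nprint(sys.path)"))  # → ['os']
```

In practice you would wire the real linter into CI and let the merge gate do this for you; the point is that the check is mechanical, which is exactly why it belongs at L1.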
L2: Types & Documentation
Goal: Let AI “understand” your code
How:
- TypeScript / Python typing
- Add docstrings to key functions
- Complete API documentation
Completion Criteria:
- Type coverage ≥ 80%
- Core modules have documentation
What AI Can Do:
- Correctly infer function inputs and outputs
- No more guessing types and producing runtime errors
This level is particularly important.
AI relies heavily on types and documentation for automated modifications. Without them, AI inference becomes very unstable.
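The difference this makes is concrete: given an untyped function, a model has to guess the contract; with types and a docstring, the contract is explicit. A minimal Python sketch (the names here are hypothetical, purely for illustration):

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Invoice:
    total_cents: int
    currency: str

def apply_discount(invoice: Invoice, rate: float) -> Invoice:
    """Return a new Invoice with *rate* (0.0–1.0) taken off the total.

    Raises ValueError if rate is outside [0, 1].
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError(f"rate must be in [0, 1], got {rate}")
    return replace(invoice, total_cents=round(invoice.total_cents * (1 - rate)))

print(apply_discount(Invoice(10_000, "USD"), 0.15).total_cents)  # → 8500
```

From the signature alone, a tool can infer the input shape, the return type, and the valid range of `rate`. Strip the annotations and docstring, and every one of those facts becomes a guess.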
L3: Dependency Updates & Security Patches
Goal: Reduce known vulnerability and version risks
How:
- Adopt Dependabot / Renovate
- CVE scanning, SBOM management
- Regular dependency updates
Completion Criteria:
- CVE high/critical = 0
- Dependencies within 2 major versions
What AI Can Do:
- Generated code won’t introduce known vulnerabilities
- Auto-upgrade PRs can safely merge
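The “CVE high/critical = 0” criterion can be enforced as a CI gate against the machine-readable audit output. The sketch below assumes the `metadata.vulnerabilities` shape that `npm audit --json` currently emits; verify the exact shape against your npm version:

```python
import json

def high_or_critical(audit_output: str) -> int:
    """Count high + critical vulnerabilities in `npm audit --json` output."""
    counts = json.loads(audit_output).get("metadata", {}).get("vulnerabilities", {})
    return counts.get("high", 0) + counts.get("critical", 0)

sample = '{"metadata": {"vulnerabilities": {"low": 3, "moderate": 1, "high": 2, "critical": 1}}}'
print(high_or_critical(sample))  # → 3
```

In CI, fail the build whenever this returns a nonzero count (npm also supports this directly via `npm audit --audit-level=high`, which exits nonzero when such vulnerabilities exist).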
L4: Tech Debt Cleanup
Goal: Prevent AI changes from “pulling one thread and unraveling three”
How:
- Build a refactoring roadmap
- Add tests (at least for core paths)
- Modular decomposition
Completion Criteria:
- Test coverage ≥ 60% (core paths ≥ 80%)
- Clear inter-module dependencies
What AI Can Do:
- Local refactoring won’t cause cascade failures
- Change scope is predictable and verifiable
This level is the threshold for AI to “truly boost productivity.”
The more tech debt there is, the more likely AI modifications are to fail, and the harder their impact scope is to predict.
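Before letting AI (or anyone) touch a core path, pin its current behavior with a characterization test. A minimal pytest-style sketch, using a hypothetical legacy function:

```python
def legacy_status_label(code: int) -> str:
    # Hypothetical legacy function slated for refactoring
    if code == 0:
        return "ok"
    if code in (1, 2):
        return "retry"
    return "failed"

def test_status_label_is_pinned():
    # Characterization test: lock in current behavior before refactoring,
    # so any AI-generated rewrite must reproduce it exactly.
    assert legacy_status_label(0) == "ok"
    assert legacy_status_label(1) == "retry"
    assert legacy_status_label(2) == "retry"
    assert legacy_status_label(99) == "failed"

test_status_label_is_pinned()
```

With tests like this on the core paths, a refactor that breaks behavior fails in CI instead of in production, which is what makes the change scope “predictable and verifiable.”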
L5: Architecture Drift Correction
Goal: Bring system architecture back to a maintainable state
How:
- Realign with architectural principles
- Re-draw domain boundaries
- System-level refactoring
Completion Criteria:
- Clear module boundaries
- Architecture docs match implementation
What AI Can Do:
- Help analyze dependency graphs
- Suggest module boundaries
- Generate system-level refactoring PRs (but needs strong guardrails)
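Tools like madge can export the module dependency graph; finding over-depended-on “God modules” is then a simple fan-in count. A sketch over a graph given as module → list of imports (the threshold of 10 echoes the warning level in the metrics section below):

```python
from collections import Counter

def god_modules(graph: dict[str, list[str]], threshold: int = 10) -> list[str]:
    """Return modules imported by more than *threshold* other modules."""
    fan_in = Counter(dep for deps in graph.values() for dep in set(deps))
    return sorted(m for m, n in fan_in.items() if n > threshold)

# 12 feature modules all importing the same utility module
graph = {f"feature_{i}": ["utils"] for i in range(12)}
graph["utils"] = []
print(god_modules(graph))  # → ['utils']
```

A module that everything depends on is exactly where an AI-generated change has the widest, least predictable blast radius, so these are the boundaries to redraw first.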
What Happens When You Skip Levels
| Skipped | Common Outcome |
|---|---|
| L1 | AI output has mixed styles, review time increases |
| L2 | AI guesses types wrong, runtime errors increase |
| L3 | AI introduces vulnerable dependencies, security incidents |
| L4 | AI changes one thing, breaks three others |
| L5 | AI makes things messier, system eventually becomes uncontrollable |
Key Insight: Each level is a prerequisite for the next. Skipping levels causes problems.
Part C: How Teams Should Divide the Work
Who Should Own Which Level?
| Role | Levels | Specific Tasks |
|---|---|---|
| Junior | L1-L2 | Lint setup, type additions, basic docs |
| Mid-level | L2-L3 | Complex types, dependency updates, security scans |
| Senior | L3-L4 | Security architecture, tech debt prioritization, test strategy |
| Architect | L4-L5 | Architecture governance, module boundaries, system refactoring |
This division has two benefits:
- Juniors have a clear growth path—moving from L1 to L2 builds foundational skills
- Seniors don’t waste time on formatting issues—CI should catch those at L1
What Metrics Should Managers Watch?
No need to understand technical details. Just watch these numbers:
L1: Lint Error Count
| Item | Description |
|---|---|
| Meaning | Number of code style inconsistencies |
| Healthy | = 0 |
| Warning | > 50 (team isn’t managing formatting) |
| Typical | Legacy projects often show 200-500+ when first adding linting |
| How to check | CI reports, or run npm run lint |
L2: Type Coverage
| Item | Description |
|---|---|
| Meaning | How much code has explicit type definitions (enables AI inference) |
| Healthy | ≥ 80% |
| Warning | < 50% (AI will guess types wrong) |
| Typical | JavaScript-to-TypeScript migrations start at 30-50% |
| How to check | npx type-coverage, or IDE built-in tools |
L3: CVE High/Critical Count
| Item | Description |
|---|---|
| Meaning | Number of known high-risk security vulnerabilities in dependencies |
| Healthy | = 0 |
| Warning | > 0 (known vulnerabilities unpatched) |
| Typical | Projects not updated for 6 months usually have 5-20 |
| How to check | npm audit, snyk test, GitHub Dependabot |
L4: Test Coverage
| Item | Description |
|---|---|
| Meaning | How much code is protected by automated tests |
| Healthy | ≥ 60% (core paths ≥ 80%) |
| Warning | < 30% (changing code is like defusing a bomb) |
| Typical | Projects without deliberate maintenance are around 10-30% |
| How to check | jest --coverage, SonarQube |
L5: Module Coupling
| Item | Description |
|---|---|
| Meaning | How complex the dependencies between modules are |
| Healthy | Project-specific (lower is better) |
| Warning | Single module depended on by > 10 other modules |
| Typical | Legacy projects often have “God modules” with 30+ dependents |
| How to check | madge --circular, SonarQube, dependency graph tools |
If your team says “AI tools aren’t working,” check these numbers first.
If those numbers aren’t healthy, the problem isn’t AI.
Gradual Adoption Recommendations
Don’t try to do L1-L5 all at once. Instead:
- Start with L1—simplest, quickest wins
- Stabilize L1, then do L2—type additions take time
- Make the codebase a little better with each PR—Boy Scout Rule
Time estimates (mid-sized project):
| Level | Estimated Timeline |
|---|---|
| L1 | 1-2 weeks |
| L2 | 1-3 months |
| L3 | Ongoing |
| L4 | 3-6 months |
| L5 | Depends on architecture complexity |
Key Insight: The higher the maturity, the more AI can evolve from “assistant” to “automation.”
Next Steps
5-Question Self-Assessment: What Level Is Your Team At?
- L1: Do PRs get rejected for “formatting issues”?
- L2: Are AI-generated types correct? Or do you constantly fix them manually?
- L3: When did you last update dependencies? Any known vulnerabilities?
- L4: Would you let AI do refactoring? Or are you afraid of cascade failures?
- L5: Does system architecture match documentation? Or have they diverged?
Checklist: Completion Criteria for Each Level
□ L1: Lint error = 0, CI enforcement enabled
□ L2: Type coverage ≥ 80%, core functions documented
□ L3: CVE high/critical = 0, dependencies regularly updated
□ L4: Test coverage ≥ 60%, core paths ≥ 80%
□ L5: Architecture docs match implementation
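The checklist above can be mechanized into a rough gate. The thresholds below mirror this article’s completion criteria; the input names are hypothetical metrics you would collect from your own CI:

```python
def maturity_level(lint_errors: int, type_coverage: float,
                   cve_high_critical: int, test_coverage: float,
                   architecture_matches_docs: bool) -> int:
    """Return the highest level (0–5) whose completion criteria are all met.

    Each level requires all previous levels, since skipping levels
    causes problems.
    """
    checks = [
        lint_errors == 0,              # L1: lint error = 0
        type_coverage >= 0.80,         # L2: type coverage ≥ 80%
        cve_high_critical == 0,        # L3: CVE high/critical = 0
        test_coverage >= 0.60,         # L4: test coverage ≥ 60%
        architecture_matches_docs,     # L5: docs match implementation
    ]
    level = 0
    for passed in checks:
        if not passed:
            break
        level += 1
    return level

print(maturity_level(0, 0.85, 0, 0.40, False))  # → 3
```

A team with clean linting, 85% type coverage, and zero high/critical CVEs, but only 40% test coverage, sits at L3: safe to merge AI-generated dependency bumps, not yet safe to let AI refactor.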
Sources
AI Project Failure Rates
- Gartner: Lack of AI-Ready Data Puts AI Projects at Risk (2025) | Archive
43% of enterprises abandon AI projects due to lack of technical maturity; Gartner predicts 60% of AI projects will be abandoned by 2026 due to lack of AI-ready data.
Root Causes of AI Project Failure
- RAND: The Root Causes of Failure for Artificial Intelligence Projects (2024) | Archive
80% of AI projects fail—twice the failure rate of non-AI projects. Main causes: data quality, lack of technical maturity, skill shortages.
AI Tool Adoption Statistics
- Jellyfish: 2025 AI Metrics in Review | Archive
90% of teams use AI tools (up from 61% in 2024); Cursor’s market share grew from 20% to 40%, catching up to Copilot.