Last month, something happened on the team.
A junior engineer built a new feature using Claude Code. It was efficient, and he was happy with it.
He opened a PR.
We ran an AI code review in CI — single model, one pass across all files.
The system returned a 40-page report.
120 flags.
A senior engineer spent 20 minutes on it.
80% were style.
15% were false positives.
Maybe 5 were real.
Then?
The other 115 were ignored.
That’s the cost.
When signal density drops too low, attention calibration breaks.
PR review is a high-frequency, high-noise decision point. It happens every day, and it’s never the same twice. If 95% of a report is noise, the next time someone opens it, that 5% gets scrolled past too.
The problem isn’t finding a better reviewer.
The structure of review needs to change.
When Tools Become Noise
To be clear: static analysis tools aren’t the problem.
For the past decade, SonarQube and ESLint have been a defensive moat for engineers. They keep baseline code quality in check.
But the context has shifted.
When AI starts generating code, the content of a PR changes. A PR is no longer “200 lines someone thought through over two days.” It’s “500 lines an AI wrote in two hours.”
The thing being reviewed has changed. The tools need to catch up.
AI-generated code tends to be syntactically correct, even pattern-compliant — but it can violate the implicit assumptions behind your business logic.
Worse: AI-generated code is often too clean.
So clean it loses the contextual cues humans rely on when reading.
At that point, whether you run a traditional linter or drop in an AI reviewer, you get the same result:
Notification fatigue.
Developers learn to ignore PR warnings.
Not because they’re not professional.
Because warning signal density is too low.
When a tool cries wolf 90% of the time, you stop believing it.
Until the wolf actually shows up.
Why PR Review, and Not Something Else?
You might be wondering:
“Isn’t multi-agent just running the same thing multiple times? What does that have to do with PR review specifically?”
Fair question. First, an honest acknowledgment: multi-agent isn’t a silver bullet. Architecture reviews, security audits, performance tuning — a single strong model handles those well. Running it multiple times just adds cost.
But PR review has two structural properties that make it uniquely suited to multi-agent review.
First: PR Review Is “One Diff, Multiple Simultaneous Valid Angles”
A 30-line change can be, at the same time:
- A compliance issue (violates the naming conventions in CLAUDE.md)
- A semantic issue (that null check is too optimistic)
- A historical issue (this field was slated for deprecation three years ago)
- A business logic issue (this discount logic doesn’t account for VIP customers)
Four angles. None of them outranks the others.
But one reviewer — human or AI — can only read through one lens at a time.
Ask it to “cover everything,” and it covers nothing well.
This isn’t a model capability problem.
Looking at one photo through four lenses simultaneously isn’t physically possible.
Second: PR Review Is “High Frequency × Low Noise Tolerance”
It happens every day.
A few hundred lines each time.
But the moment 95% is noise, nobody reads it next time.
Compare: an architecture review happens twice a year, takes two days, and 95% noise is tolerable. A security audit is quarterly — find one real risk and it pays for itself.
PR review doesn’t have that luxury.
Put those two properties together, and the multi-agent entry point becomes clear:
Have agents with different perspectives each read the same diff independently, then surface only what they agree on.
Two independent agents flagging the same compliance risk.
That’s what makes confidence credible.
When they disagree, the issue becomes a matter of taste or context — exactly the kind of thing humans should arbitrate.
The comments that end up on the PR have been cleared by at least two independent perspectives. The rest don’t make it.
The core value of multi-agent in PR review isn’t “smarter” — it’s “cross-verifiable.”
Cost vs. Benefit
We ran internal tests on 30 PRs, comparing a single powerful model (Claude 3.5 Sonnet, historical baseline) against 4 lightweight agents running in parallel:
| Approach | Token Cost (est.) | True Risk Detection | False Positive Rate (Noise) |
|---|---|---|---|
| Single Agent | 1.0x | 65% | 45% |
| Multi-Agent (4x) | 1.8x | 88% | 12% |
Token cost up ~80%. False positives down from 45% to 12%, a 73% drop in noise.
Does that math work?
In high-PR-volume teams where senior time is expensive, this is a direction worth considering.
And as developer context-switching costs between PRs compound, the marginal returns keep growing.
Confidence Scoring: Quantifying Uncertainty
This is the most counterintuitive part of the architecture.
Traditional PR tools tell you: “There’s a risk here.”
This architecture tells you: “There’s a risk here, and I’m 85% confident it’s real.”
Scoring Flow
Each agent reads the PR independently and produces findings. Then comes the scoring phase.
The logic is simple: each agent scores independently.
Both flag a risk? Confidence +20%.
Sounds clean.
But here’s the thing: what if both agents are wrong?
The odds of both being wrong are much lower than one being wrong — but not zero. So the threshold is set high.
Only one flags it? Discarded. Doesn’t make the PR.
The threshold is 80.
Below 80?
No PR comment is written.
Think about it:
A PR review report surfaces 3 risks. Each one carries ≥ 90% confidence.
What do you do?
You check them first and decide whether to fix them.
Because you know those 3 items passed a strict filter. They’re almost certainly real.
Security Override
There’s one exception.
If the finding involves security, data corruption, or compliance, it goes to the PR even if only one agent flagged it.
Severity outranks confidence.
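
Putting the scoring flow and the security override together, here is a minimal sketch in Python. The numbers mirror the ones above (a 20-point boost per additional agreeing agent, an 80-point threshold, an override for critical categories); everything else, including the Finding fields, the assumed base score of 60 for a single-agent finding, and the grouping key, is an illustrative assumption rather than a prescription.

```python
from dataclasses import dataclass

# Assumed scoring constants, mirroring the numbers in the text.
BASE_CONFIDENCE = 60        # assumed score for a finding flagged by a single agent
AGREEMENT_BONUS = 20        # each additional agreeing agent adds 20 points
CONFIDENCE_THRESHOLD = 80   # below this, no PR comment is written
CRITICAL_CATEGORIES = {"security", "data-corruption", "compliance"}  # severity override

@dataclass
class Finding:
    agent: str      # which perspective produced it (compliance, semantics, ...)
    location: str   # file and line range, used to match findings across agents
    category: str   # e.g. "naming", "null-handling", "security"
    message: str

def score_findings(findings: list[Finding]) -> list[dict]:
    """Group findings by location and category, score them, and keep the credible ones."""
    groups: dict[tuple[str, str], list[Finding]] = {}
    for f in findings:
        groups.setdefault((f.location, f.category), []).append(f)

    surfaced = []
    for (location, category), group in groups.items():
        agents = {f.agent for f in group}
        confidence = BASE_CONFIDENCE + AGREEMENT_BONUS * (len(agents) - 1)

        # Severity outranks confidence: critical categories surface even with one flag.
        if category in CRITICAL_CATEGORIES or confidence >= CONFIDENCE_THRESHOLD:
            surfaced.append({
                "location": location,
                "category": category,
                "confidence": min(confidence, 100),
                "messages": [f.message for f in group],
            })
    return surfaced
```

With these numbers, a lone non-critical finding scores 60 and is dropped, two agreeing agents push it to 80 and it surfaces, and anything in a critical category skips the math entirely.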
Taste vs. Correctness: Who Makes the Final Call?
There’s an important distinction here.
Correctness is objective.
Taste is subjective.
Agents can sweep verifiable conditions for you. But you still run tests. You still have a human review the PR.
But taste?
That’s the team’s collective memory. The last domain that belongs to humans.
Good style for a startup isn’t good style for a financial system.
A financial system may need more redundancy and comments — the compliance exposure is higher.
A startup may need leaner code — the priority is fast iteration.
CLAUDE.md captures the team’s taste.
But the final PR approve button is still pressed by a human.
High-confidence items tend to involve correctness.
Low-confidence items tend to be just taste.
But that’s not a hard rule.
A convention violation can score high confidence but still be a taste call.
That division of labor is what lets developers focus on decisions that actually matter.
When Not to Use Multi-Agent?
If your team is small, PR volume is low, and the codebase is stable, a single agent with a well-crafted prompt is probably enough.
If your use case is “architecture review twice a year,” don’t bother with multi-agent. One strong model running for two days is more cost-effective than four agents in parallel.
But if you’re facing:
- High-velocity codebases with frequent PRs
- Multi-developer collaboration
- Strict compliance requirements
Then the marginal returns of a multi-agent architecture on PR review are meaningful.
Multi-agent fits reviews that are high-frequency × multi-angle × low noise tolerance — and PR review is exactly that.
In other scenarios, the calculus may not hold.
Implementation: How to Plug Into Your PR Workflow?
If this architecture looks interesting, here’s where to start.
1. Define Your Conventions
First, you need a clear conventions document — something that tells the agents what good looks like for this repo.
That can be CLAUDE.md, a YAML file, or any other format you prefer.
The document should cover:
- Code style
- Naming conventions
- Error handling strategy
- Testing requirements
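
A minimal, hypothetical excerpt covering those four areas might look like this (the specifics here are invented; your real CLAUDE.md will be longer and tied to your codebase):

```markdown
## Code style
- TypeScript strict mode; no `any` without a justifying comment.

## Naming conventions
- Services end in `Service`; React hooks start with `use`.

## Error handling strategy
- Never swallow errors; wrap external calls and rethrow with context.

## Testing requirements
- Every bug fix ships with a regression test.
```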
2. Configure Agents and an Aggregator
Use Claude Code or any other tool that supports multi-agent workflows.
Configure each agent’s area of focus (compliance, semantics, historical context, business logic).
Make sure they run independently with no shared state.
Then you need a simple aggregator.
It receives the output of all agents on the same PR and runs the scoring logic.
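
A rough sketch of that wiring, under heavy assumptions: the perspective prompts are hypothetical, run_agent is a stand-in for whatever subagent or API mechanism your tool provides, and the final call reuses the Finding type and score_findings aggregator from the scoring sketch above.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical perspective prompts. Each agent sees only its own focus,
# the shared conventions document, and the diff. No shared state.
AGENT_PERSPECTIVES = {
    "compliance": "Flag only violations of the conventions in CLAUDE.md.",
    "semantics": "Flag only null handling, edge cases, and error-path issues.",
    "history": "Flag only changes that touch deprecated or legacy fields.",
    "business-logic": "Flag only changes that break documented business rules.",
}

def run_agent(name: str, prompt: str, conventions: str, diff: str) -> list:
    """Stand-in for one independent agent run (e.g. a Claude Code subagent).
    Replace the body with a call to whatever model or tool you actually use;
    it should return a list of Finding objects as in the earlier sketch."""
    return []

def review_pr(diff: str, conventions: str) -> list:
    """Run every perspective in parallel with no shared state, then aggregate."""
    with ThreadPoolExecutor(max_workers=len(AGENT_PERSPECTIVES)) as pool:
        futures = [
            pool.submit(run_agent, name, prompt, conventions, diff)
            for name, prompt in AGENT_PERSPECTIVES.items()
        ]
        findings = [finding for fut in futures for finding in fut.result()]
    # Hand the combined findings to the scoring aggregator from the previous section.
    return score_findings(findings)
```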
3. Set the Confidence Threshold
Start with an initial threshold — say, 80.
Watch a few weeks of PR review results.
If you’re missing real risks, lower it.
If the PR still has too many comments, raise it.
Use precision/recall or missed critical findings as your calibration signal.
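
One way to compute that signal, assuming you hand-label a few weeks of aggregator output with an is_real flag per finding (a sketch, not a prescribed tool):

```python
def calibrate(labeled: list[dict], threshold: int) -> tuple[float, float]:
    """labeled: aggregator findings, each with a human-reviewed 'is_real' flag.
    Returns (precision, recall) for the given confidence threshold."""
    surfaced = [f for f in labeled if f["confidence"] >= threshold]
    real = [f for f in labeled if f["is_real"]]
    hits = sum(1 for f in surfaced if f["is_real"])
    precision = hits / len(surfaced) if surfaced else 0.0
    recall = hits / len(real) if real else 0.0
    return precision, recall
```

Re-running this over a range of thresholds shows the precision/recall trade-off directly and tells you which way to move.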
4. Integrate into Your CI/CD Pipeline
Recommended automation flow:
- Developer Push: Developer opens a PR on GitHub.
- Multi-Agent Trigger: GitHub Actions spins up multiple agent instances, each reading the same diff.
- Parallel Analysis: Each agent independently evaluates the diff against CLAUDE.md and syntax rules.
- Scoring Engine: Runs the scoring logic, calculating confidence scores and evidence strength.
- PR Commenting: Auto-comments only on risks with confidence ≥ 80%.
- Human Review: Developer focuses on high-confidence risks and completes the final review.
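
The exact CI wiring depends on your setup, but the final commenting step can be as small as the sketch below. It assumes the findings are the aggregator output from earlier, that GITHUB_TOKEN is injected by the workflow, and it uses GitHub's standard REST endpoint for issue/PR comments.

```python
import os
import requests

def post_pr_comments(findings: list[dict], repo: str, pr_number: int) -> None:
    """Post one PR comment per surfaced finding (i.e. confidence already >= threshold)."""
    url = f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments"
    headers = {
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    }
    for finding in findings:
        body = (
            f"**[{finding['category']}]** confidence {finding['confidence']}%\n"
            f"`{finding['location']}`\n\n" + "\n".join(finding["messages"])
        )
        requests.post(url, headers=headers, json={"body": body}, timeout=30).raise_for_status()
```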
The confidence threshold isn’t static.
It should adjust dynamically based on team maturity and codebase stability.
Start higher early on. As the team builds confidence in the system, gradually bring it down.
Closing: Judgment First
Tools amplify whatever judgment process you already have.
In a multi-agent architecture, your value shows up in:
Defining the conventions.
Setting the threshold.
Making the final call on the PR.
You know what good code looks like.
You know which risks matter.
You know when to trust the AI and when to question it.
As AI gets stronger, our role on PRs increasingly looks like “reviewer.”
Not because we understand the technology better — because we understand what deserves to be ignored.
Your job shifts from “write it right” to “decide what on this PR is worth fixing.”
That’s what seniority actually looks like.
Not writing fast.
Judging accurately.
Next time you open a PR and see 120 flags — will you read all of them?