The AI Missed the Key Finding in a 200-Page Report
You ask ChatGPT to analyze a 200-page report.
It quickly gives you a summary: “The key points are A, B, C…”
You breathe a sigh of relief, ready to paste the conclusion into your slides.
But then!
A colleague asks: “What about the risk assessment on page 150?”
You check the report. That section was completely ignored.
⸻
Fitting into the context window doesn’t mean the LLM actually “read” everything.
So you think: maybe I should use an Agent? Let it process in chunks?
But the Agent runs for 20 minutes and keeps going in circles…
⸻
⚡ TL;DR
- Core problem: Single calls aren’t enough, but Agents are overkill—what’s in between?
- Key insight: It’s not binary. It’s a spectrum: Single Call → CoT → RAG → Manual Orchestration → Agent
- Decision factors: Data size, validation needs, control level, cost budget
- Common mistake: Jumping straight from single calls to Agents, skipping the middle options
It’s Not Binary—It’s a Spectrum
Many people think of LLM usage as “single call vs. Agent.”
That’s wrong.
It’s actually a spectrum with many options in between:
flowchart LR
A["<b>Single Call</b><br/>The basics"] --> B["<b>CoT</b><br/>Step-by-step thinking"]
B --> C["<b>RAG</b><br/>Retrieve then generate"]
C --> D["<b>Manual Orchestration</b><br/>You control the flow"]
D --> E["<b>Agent</b><br/>AI decides next step"]
E --> F["<b>Deep Decomposition</b><br/>Extreme cases"]
style A fill:#e8f5e9
style B fill:#e3f2fd
style C fill:#fff3e0
style D fill:#fce4ec
style E fill:#f3e5f5
style F fill:#ffebee
Simple → Complex | Cheap → Expensive | Fast → Slow
Each method solves different problems:
| Method | Solves What | Typical Use Case |
|---|---|---|
| Single Call | Simple tasks | Translation, summarization, Q&A |
| CoT Prompting | Poor reasoning quality | Math, logic, complex analysis |
| RAG | Data too long or needs external knowledge | Long documents, knowledge base queries |
| Manual Orchestration | Need intermediate validation | Fixed workflows, staged processing |
| Agent | Unknown number of steps, dynamic decisions | Exploratory tasks, tool chaining |
| Deep Decomposition (RLM, etc.) | Extreme length, deep recursion | Million-line logs, huge codebases |
Key Insight: As a rule of thumb, roughly 80% of tasks can be solved with the first three methods. Jumping straight to Agents is usually over-engineering.
Method 1: Single Call (Baseline)
The simplest approach: one prompt, one response.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Translate this to English: ..."}],
)
When it works:
- Problem is clear, no exploration needed
- Data fits in context window (and isn’t too long)
- Errors are acceptable or easy to catch manually
- Need fast response
Examples:
- ✅ Translating an email
- ✅ Summarizing a 3000-word article
- ✅ Answering specific technical questions (“How to read JSON in Python?”)
- ✅ Generating short code snippets (under 50 lines)
Cost estimate: 1 API call, ~1K-10K tokens
Method 2: Chain-of-Thought (Step-by-Step Thinking)
When single calls don’t reason well enough—the lightest upgrade.
Core concept: Not multiple API calls, but asking the LLM to think step by step within one response.
Before (Single Call)
Prompt: "What's wrong with this function?"
Response: "Looks fine to me." ← Might miss details
After (CoT Prompting)
Prompt: "What's wrong with this function? Please analyze step by step:
1. First check input validation
2. Then check edge cases
3. Finally check error handling"
Response:
"1. Input validation: No null check...
2. Edge cases: Crashes when array is empty...
3. Error handling: Missing try-catch..." ← More thorough
When to use:
- Complex reasoning (math, logic)
- Need more careful analysis
- Single calls often miss details
Cost estimate: Still 1 API call, but more output tokens (~1.5-2x)
Key Insight: CoT is a “free upgrade”—no architecture changes, just change your prompt. If single calls aren’t working, try CoT before anything else.
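A minimal sketch of that prompt change, assuming a hypothetical helper named `with_cot` (it is not part of any library). It only builds the prompt string; sending it is the same single call as in Method 1.

```python
def with_cot(question: str, steps: list[str]) -> str:
    """Append numbered step-by-step instructions to a question (hypothetical helper)."""
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return f"{question} Please analyze step by step:\n{numbered}"


prompt = with_cot(
    "What's wrong with this function?",
    [
        "First check input validation",
        "Then check edge cases",
        "Finally check error handling",
    ],
)
print(prompt)
```

Keeping the steps in a list makes it easy to reuse the same checklist across prompts and to compare results with and without it.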
Method 3: RAG (Retrieve Then Generate)
When data is too long, or you need external knowledge.
Core concept: Don’t stuff everything into context. Find relevant chunks first, then let the LLM process them.
flowchart TD
Q["❓ Your question<br/>What's the risk assessment conclusion?"]
Q --> S["🔍 Step 1: Vector search<br/>Find relevant sections from 200 pages"]
S --> R["📄 Step 2: Filter results<br/>Keep only the relevant 5 pages"]
R --> L["🤖 Step 3: LLM generates<br/>Answer based on those 5 pages"]
style Q fill:#e3f2fd
style S fill:#fff3e0
style R fill:#e8f5e9
style L fill:#f3e5f5
Why it works:
- Avoids “Lost in the Middle” problem (LLMs tend to ignore middle sections of long contexts)
- Reduces cost (only process relevant chunks)
- Can handle data exceeding context window limits
When to use:
- Long document analysis (100+ pages)
- Knowledge base Q&A
- Scenarios requiring source citations
Cost estimate: Vector search + 1 API call, usually cheaper than stuffing full context
Common tools: LangChain, LlamaIndex, Pinecone, Chroma
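The three steps in the flowchart can be sketched without any of these tools. This is a toy illustration only: word-count cosine similarity stands in for real embeddings and a vector database, and the names `bag_of_words`, `retrieve`, `build_prompt`, and the sample chunks are all assumptions made for the example.

```python
import math
import re
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Steps 1 and 2: score every chunk against the question, keep the best top_k."""
    q = bag_of_words(question)
    ranked = sorted(chunks, key=lambda c: cosine(q, bag_of_words(c)), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, context: list[str]) -> str:
    """Step 3: the LLM sees only the retrieved chunks, not the whole document."""
    joined = "\n---\n".join(context)
    return f"Answer using only this context:\n{joined}\n\nQuestion: {question}"


# Hypothetical chunks standing in for pages of a long report.
chunks = [
    "Page 12: Revenue grew in the third quarter.",
    "Page 150: Risk assessment conclusion: supplier concentration is the main risk.",
    "Page 180: Appendix with office locations.",
]
question = "What's the risk assessment conclusion?"
top = retrieve(question, chunks, top_k=1)
prompt = build_prompt(question, top)
print(prompt)
```

In a real pipeline the scoring function would be replaced by an embedding model plus a vector index; the shape of the flow (score, filter, then generate from the filtered context) stays the same.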
Method 4: Manual Orchestration
When you need intermediate validation, or the workflow is fixed.
Core concept: You decide the flow, calling LLM at each step.
Example: Analyzing 10 files for code architecture
# `llm` is a placeholder for your own model wrapper; `files` holds the inputs
# Step 1: Summarize each file
summaries = []
for file in files:
    summary = llm.analyze(f"Analyze main functions in: {file}")
    summaries.append(summary)
# Step 2: Synthesize all summaries
architecture = llm.synthesize(f"Based on these summaries, describe the architecture: {summaries}")
# Step 3: Generate final report
report = llm.generate(f"Based on the architecture analysis, create a technical report: {architecture}")
Manual Orchestration vs. Agent:
| | Manual Orchestration | Agent |
|---|---|---|
| Who decides next step? | You (the engineer) | AI |
| Is the flow fixed? | Yes | Dynamic |
| Predictability | High | Low |
| Best for | Known workflows | Unknown workflows |
When to use:
- Multi-step process, but each step is predetermined
- Need to add validation or human review between steps
- Want maximum control
Cost estimate: N API calls (N = number of steps)
Key Insight: Many people jump straight to Agents, but Manual Orchestration is more controllable, cheaper, and easier to debug. Ask yourself: can I hardcode this workflow?
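One way a hardcoded workflow with a validation gate might look, as a sketch. `run_pipeline` and `counting_llm` are hypothetical names; `llm` is assumed to be any function that maps a prompt to text, and the stub below stands in for a real model call.

```python
from typing import Callable


def run_pipeline(files: dict[str, str], llm: Callable[[str], str]) -> str:
    """Fixed three-step flow: summarize each file, synthesize, write the report."""
    summaries = []
    for name, source in files.items():
        summary = llm(f"Analyze main functions in {name}:\n{source}")
        # Validation gate between steps: stop early instead of passing bad output on.
        if not summary.strip():
            raise ValueError(f"Empty summary for {name}")
        summaries.append(f"{name}: {summary}")

    joined = "\n".join(summaries)
    architecture = llm(f"Based on these summaries, describe the architecture:\n{joined}")
    return llm(f"Based on the architecture analysis, create a technical report:\n{architecture}")


calls: list[str] = []


def counting_llm(prompt: str) -> str:
    """Stub model for demonstration only; records each prompt it receives."""
    calls.append(prompt)
    return f"[reply to {len(prompt)} chars]"


report = run_pipeline({"a.py": "def f(): ...", "b.py": "def g(): ..."}, counting_llm)
print(len(calls), report)
```

Because the flow is fixed, the number of calls is known in advance (one per file plus two), and the gate is an obvious place to add stricter checks or a human review step.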
Method 5: Agent (AI Decides Next Step)
When you don’t know how many steps are needed, or need dynamic decisions.
Core concept: Let the AI decide what to do next, until the task is complete.
flowchart TD
T["🎯 Task<br/>Find the root cause of this bug"]
T --> A1["🤔 Agent thinks<br/>I need to check the error log first"]
A1 -->|"calls read_file"| A2["🤔 Agent thinks<br/>Error points to db.py, let me check"]
A2 -->|"calls read_file"| A3["🤔 Agent thinks<br/>Found unclosed connection, let me verify..."]
A3 -->|"calls search"| R["✅ Conclusion<br/>Root cause: connection pool has no timeout"]
style T fill:#e3f2fd
style A1 fill:#fff3e0
style A2 fill:#fff3e0
style A3 fill:#fff3e0
style R fill:#e8f5e9
When to use:
- Exploratory tasks (don’t know where the answer is)
- Need to chain multiple tools
- Variable number of steps
Risks:
- May loop endlessly (retrying the same error)
- May over-decompose (turning simple problems into 10 steps)
- Unpredictable costs
Common frameworks: LangChain Agent, CrewAI, AutoGen, Claude Code
Cost estimate: Unpredictable—could be 5 calls, could be 50
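The loop behind an agent, and two guards against the risks listed above, can be sketched as follows. This is not how any of the named frameworks implement it: `run_agent`, `decide`, and `scripted_decide` are hypothetical, and the scripted policy stands in for the LLM that would normally choose the next action.

```python
from typing import Callable

Action = tuple[str, str]  # (tool name, argument); ("finish", answer) ends the run


def run_agent(
    task: str,
    decide: Callable[[str, list[str]], Action],
    tools: dict[str, Callable[[str], str]],
    max_steps: int = 10,
) -> str:
    """Run until `decide` finishes, an action repeats, or the step budget is spent."""
    history: list[str] = []
    seen: set[Action] = set()
    for _ in range(max_steps):
        action = decide(task, history)
        name, arg = action
        if name == "finish":
            return arg
        if action in seen:  # guard against retrying the same thing forever
            return "stopped: repeated action " + name
        seen.add(action)
        history.append(f"{name}({arg}) -> {tools[name](arg)}")
    return "stopped: step limit reached"


def scripted_decide(task: str, history: list[str]) -> Action:
    """Stand-in for the model: replays the bug-hunt from the flowchart."""
    script = [
        ("read_file", "error.log"),
        ("read_file", "db.py"),
        ("finish", "Root cause: connection pool has no timeout"),
    ]
    return script[len(history)]


tools = {"read_file": lambda path: f"<contents of {path}>"}
result = run_agent("Find the root cause of this bug", scripted_decide, tools)
print(result)
```

Treating any exact repeat as a loop is a deliberately blunt rule; some tasks legitimately re-read a file, so a real setup might allow a small number of repeats instead. The step cap is what bounds the cost.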
Method 6: Deep Decomposition (RLM, etc.)
When handling extremely long content or needing true recursion.
This is the newest, heaviest approach—for extreme cases.
RLM (Recursive Language Models) is a framework from MIT that enables LLMs to:
- Store long text as variables instead of stuffing it into prompts
- Dynamically partition and recursively search
- Maintain performance at million-token scale
When to use:
- Finding specific errors in 1 million lines of logs
- Analyzing massive codebases
- Extreme long-form content where other methods fail
Current recommendation: Unless your task is truly extreme, use the first 5 methods first.
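To give a feel for the idea, here is a rough sketch of recursive decomposition over a long log. It is not the RLM implementation: the list above describes dynamic partitioning, while this example simply halves the input until each slice fits a per-call budget. `find_in_logs` and `grep_errors` are hypothetical names, and `grep_errors` stands in for a sub-model call on a small slice.

```python
from typing import Callable


def find_in_logs(
    lines: list[str],
    sub_call: Callable[[list[str]], list[str]],
    max_lines: int,
) -> list[str]:
    """Split until a slice fits the per-call budget, then merge results upward."""
    if max_lines < 1:
        raise ValueError("max_lines must be at least 1")
    if len(lines) <= max_lines:
        return sub_call(lines)
    mid = len(lines) // 2
    return (
        find_in_logs(lines[:mid], sub_call, max_lines)
        + find_in_logs(lines[mid:], sub_call, max_lines)
    )


def grep_errors(chunk: list[str]) -> list[str]:
    """Stand-in for a model call that inspects one small slice."""
    return [line for line in chunk if "ERROR" in line]


# The full log lives in a variable; no single call ever sees all of it.
log = ["INFO ok"] * 1000
log[700] = "ERROR db timeout"
hits = find_in_logs(log, grep_errors, max_lines=100)
print(hits)
```

The point of the structure is that the long text is never placed in one prompt: each call works on a bounded slice, and only the small results travel back up.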
Decision Framework: 5 Questions to Choose
Not sure which method to use? Ask yourself these questions in order:
flowchart TD
Start["🤔 My task"] --> Q1{"1️⃣ Is single call<br/>good enough?"}
Q1 -->|"Yes"| A1["✅ Single Call"]
Q1 -->|"No, reasoning too shallow"| A2["✅ Try CoT"]
Q1 -->|"No, data too long"| Q2{"2️⃣ Need retrieval<br/>or full processing?"}
Q2 -->|"Need retrieval"| A3["✅ Use RAG"]
Q2 -->|"Need to see everything"| Q3{"3️⃣ Is the workflow<br/>fixed?"}
Q3 -->|"Yes, I know each step"| A4["✅ Manual Orchestration"]
Q3 -->|"No, need dynamic decisions"| Q4{"4️⃣ How much<br/>data?"}
Q4 -->|"< 100K tokens"| A5["✅ Agent"]
Q4 -->|"> 100K tokens"| A6["✅ Deep Decomposition (RLM)"]
style A1 fill:#e8f5e9
style A2 fill:#e8f5e9
style A3 fill:#e8f5e9
style A4 fill:#e8f5e9
style A5 fill:#e8f5e9
style A6 fill:#e8f5e9
Budget note: If time or budget is limited, start with simpler methods and upgrade gradually.
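The flowchart can also be written as a small function, which makes the order of the questions explicit. `choose_method` and its parameter names are made up for this sketch. The chart does not say what happens at exactly 100K tokens, so sending that case to the RLM branch is an assumption.

```python
def choose_method(
    single_call_ok: bool,
    reasoning_too_shallow: bool = False,
    retrieval_is_enough: bool = False,
    workflow_is_fixed: bool = False,
    data_tokens: int = 0,
) -> str:
    """Walk the decision flowchart top to bottom and return the first match."""
    if single_call_ok:
        return "Single Call"
    if reasoning_too_shallow:
        return "CoT"
    if retrieval_is_enough:
        return "RAG"
    if workflow_is_fixed:
        return "Manual Orchestration"
    # Assumption: exactly 100K tokens falls on the RLM side.
    return "Agent" if data_tokens < 100_000 else "Deep Decomposition (RLM)"


print(choose_method(single_call_ok=False, retrieval_is_enough=True))
print(choose_method(single_call_ok=False, data_tokens=2_000_000))
```

The early returns mirror the article's advice: the cheaper option wins whenever it is good enough, and the heavier ones are reached only after the simpler ones have been ruled out.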
Quick Reference Table
| Your Situation | Recommended Method |
|---|---|
| Simple task, need fast response | Single Call |
| Reasoning quality not enough | CoT Prompting |
| Document too long | RAG |
| Need staged validation | Manual Orchestration |
| Don’t know how many steps | Agent |
| Extreme length (million+ tokens) | RLM |
Cost Comparison
For the same task (“analyze 10 files”):
| Method | API Calls | Token Usage | Approx. Cost (GPT-4o) |
|---|---|---|---|
| Single call (stuff all) | 1 | 50K input + 2K output | ~$0.30 |
| CoT | 1 | 50K input + 5K output | ~$0.32 |
| RAG | 1 | 10K input + 2K output | ~$0.07 |
| Manual (10 steps) | 10 | 5K × 10 = 50K total | ~$0.30 |
| Agent (assume 15 steps) | 15 | Variable, ~80K total | ~$0.50+ |
Takeaway: RAG is usually cheapest because it only processes relevant chunks. Agents are most expensive and unpredictable.
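The arithmetic behind the first three rows can be reproduced roughly as below. The table does not state per-token prices, so the two rates here are assumptions chosen because they land close to the table's figures; check your provider's current pricing before relying on them. The Manual and Agent rows are left out because the table gives only their total tokens, not the input/output split.

```python
INPUT_PRICE_PER_M = 5.00    # assumed USD per 1M input tokens
OUTPUT_PRICE_PER_M = 15.00  # assumed USD per 1M output tokens


def cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Cost of one call given its input and output token counts."""
    total = input_tokens * INPUT_PRICE_PER_M + output_tokens * OUTPUT_PRICE_PER_M
    return total / 1_000_000


# (input tokens, output tokens) taken from the table rows above.
scenarios = {
    "Single call (stuff all)": (50_000, 2_000),
    "CoT": (50_000, 5_000),
    "RAG": (10_000, 2_000),
}
for name, (inp, out) in scenarios.items():
    print(f"{name}: ${cost_usd(inp, out):.3f}")
```

Under these assumed rates the input tokens dominate the bill, which is why cutting the context from 50K to 10K tokens matters far more than the extra output that CoT produces.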
Common Misconceptions
Misconception 1: “Just use an Agent—it’s easier”
Wrong. Agent problems:
- Unpredictable costs
- Tends to loop
- Hard to debug
Recommendation: First ask “Can I hardcode this workflow?” If yes, Manual Orchestration is better.
Misconception 2: “Bigger context window is always better”
Not quite. Research shows long contexts have issues:
- Middle content tends to be ignored
- Response quality may decrease
- Cost increases linearly
128K context ≠ 128K effectively processed
Misconception 3: “RAG is only for knowledge bases”
Wrong. RAG’s core is “retrieve then generate”—useful for:
- Long document analysis
- Historical conversation reference
- Any scenario where you don’t need everything, just relevant parts
Misconception 4: “CoT requires special models”
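Wrong. CoT is a prompting technique, not a model feature: just add “Please analyze step by step” to your prompt. It works with general-purpose LLMs, though the gains tend to be larger on bigger models.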
Next Steps
If you’re currently only using single calls
- Try CoT first: Add “Please analyze step by step” to your prompt
- Observe results: Did reasoning quality improve?
- Then consider RAG: If data is too long
If you’re considering Agents
- First ask: Can I really not hardcode this workflow?
- Try Manual Orchestration first: Control each step yourself
- Set limits: If using Agents, set max steps and timeouts
If you’re handling very long documents
- Try RAG first: Works for most cases
- Evaluate needs: Do you really need to “see everything”?
- Then consider RLM: Only for extreme scenarios
Related Reading
Want to dive deeper into AI engineering practices? These articles might help:
Not Every Company Should Build RAG—Here’s How to Decide
A framework for evaluating RAG readiness and applicability
Using AI to Tackle Legacy Code—What About That 4000-Line Function?
Agent in practice: incremental refactoring of large codebases with AI
The 10-Minute Value: What Does “Senior” Mean in the AI Era?
AI is redefining seniority—judgment matters more than output
Your Team Started Using AI to Code. Why Isn’t Productivity Improving?
AI engineering maturity model: from individual efficiency to team collaboration
A realistic look at AI-assisted coding: effects and limitations
Sources
Chain-of-Thought Prompting
- Chain-of-Thought Prompting Elicits Reasoning (2022)
Google research: step-by-step prompting improves reasoning, especially for math and logic.
Lost in the Middle Phenomenon
- Lost in the Middle (2023)
LLMs have lower recall for the middle sections of long texts. Later models have improved, but the issue persists.
Recursive Language Models
- RLM: Recursive Language Models (2025)
MIT framework (Zhang, Kraska, Khattab) for processing million-token contexts recursively.
Agent Design Patterns
- Building Effective Agents – Anthropic (2024)
Anthropic guide: “workflows” (fixed) are usually more reliable than “agents” (dynamic).