The AI Missed the Key Finding in a 200-Page Report
You ask ChatGPT to analyze a 200-page report.
It quickly gives you a summary: “The key points are A, B, C…”
You breathe a sigh of relief, ready to paste the conclusion into your slides.
But then!
A colleague asks: “What about the risk assessment on page 150?”
You check the report. That section was completely ignored.
⸻
Fitting into the context window doesn’t mean the LLM actually “read” everything.
So you think: maybe I should use an Agent? Let it process in chunks?
But the Agent runs for 20 minutes and keeps going in circles…
⸻
⚡ TL;DR
- Core problem: Single calls aren’t enough, but Agents are overkill—what’s in between?
- Key insight: It’s not binary. It’s a spectrum: Single Call → CoT → RAG → Manual Orchestration → Agent
- Decision factors: Data size, validation needs, control level, cost budget
- Common mistake: Jumping straight from single calls to Agents, skipping the middle options
It’s Not Binary—It’s a Spectrum
Many people think of LLM usage as “single call vs. Agent.”
That’s wrong.
It’s actually a spectrum with many options in between:
```mermaid
flowchart LR
A["Single Call
The basics"] --> B["CoT
Step-by-step thinking"]
B --> C["RAG
Retrieve then generate"]
C --> D["Manual Orchestration
You control the flow"]
D --> E["Agent
AI decides next step"]
E --> F["Deep Decomposition
Extreme cases"]
style A fill:#e8f5e9
style B fill:#e3f2fd
style C fill:#fff3e0
style D fill:#fce4ec
style E fill:#f3e5f5
style F fill:#ffebee
```
Simple → Complex | Cheap → Expensive | Fast → Slow
Each method solves different problems:
| Method | Solves What | Typical Use Case |
|---|---|---|
| Single Call | Simple tasks | Translation, summarization, Q&A |
| CoT Prompting | Poor reasoning quality | Math, logic, complex analysis |
| RAG | Data too long or needs external knowledge | Long documents, knowledge base queries |
| Manual Orchestration | Need intermediate validation | Fixed workflows, staged processing |
| Agent | Unknown number of steps, dynamic decisions | Exploratory tasks, tool chaining |
| Deep Decomposition (RLM, etc.) | Extreme length, deep recursion | Million-line logs, huge codebases |
Key Insight: 80% of tasks can be solved with the first three methods. Jumping straight to Agents is usually over-engineering.
Method 1: Single Call (Baseline)
The simplest approach: one prompt, one response.
```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Translate this to English: ..."}],
)
```
When it works:
- Problem is clear, no exploration needed
- Data fits in context window (and isn’t too long)
- Errors are acceptable or easy to catch manually
- Need fast response
Examples:
- ✅ Translating an email
- ✅ Summarizing a 3000-word article
- ✅ Answering specific technical questions (“How to read JSON in Python?”)
- ✅ Generating short code snippets (under 50 lines)
Cost estimate: 1 API call, ~1K-10K tokens
Method 2: Chain-of-Thought (Step-by-Step Thinking)
When single calls don’t reason well enough—the lightest upgrade.
Core concept: Not multiple API calls, but asking the LLM to think step by step within one response.
Before (Single Call)

```
Prompt: "What's wrong with this function?"
Response: "Looks fine to me." ← Might miss details
```

After (CoT Prompting)

```
Prompt: "What's wrong with this function? Please analyze step by step:
1. First check input validation
2. Then check edge cases
3. Finally check error handling"

Response:
"1. Input validation: No null check...
2. Edge cases: Crashes when array is empty...
3. Error handling: Missing try-catch..." ← More thorough
```
When to use:
- Complex reasoning (math, logic)
- Need more careful analysis
- Single calls often miss details
Cost estimate: Still 1 API call, but more output tokens (~1.5-2x)
Key Insight: CoT is a “free upgrade”—no architecture changes, just change your prompt. If single calls aren’t working, try CoT before anything else.
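Because the upgrade lives entirely in the prompt, it can be automated with a few lines of string handling. The helper below is my own illustration (not a library API): it bolts a numbered step list onto any question before you send it to the model.

```python
def with_cot(question: str, steps: list[str]) -> str:
    """Wrap a plain question in a step-by-step (CoT) prompt.

    Pure string manipulation: the whole upgrade is in the prompt,
    not in the API or the model.
    """
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, 1))
    return f"{question} Please analyze step by step:\n{numbered}"

prompt = with_cot(
    "What's wrong with this function?",
    [
        "First check input validation",
        "Then check edge cases",
        "Finally check error handling",
    ],
)
print(prompt)
```

The resulting string is what you would pass as the `content` of your user message in an otherwise unchanged single call.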
Method 3: RAG (Retrieve Then Generate)
When data is too long, or you need external knowledge.
Core concept: Don’t stuff everything into context. Find relevant chunks first, then let the LLM process them.
```mermaid
flowchart TD
Q["❓ Your question
What's the risk assessment conclusion?"]
Q --> S["🔍 Step 1: Vector search
Find relevant sections from 200 pages"]
S --> R["📄 Step 2: Filter results
Keep only the relevant 5 pages"]
R --> L["🤖 Step 3: LLM generates
Answer based on those 5 pages"]
style Q fill:#e3f2fd
style S fill:#fff3e0
style R fill:#e8f5e9
style L fill:#f3e5f5
```
Why it works:
- Avoids “Lost in the Middle” problem (LLMs tend to ignore middle sections of long contexts)
- Reduces cost (only process relevant chunks)
- Can handle data exceeding context window limits
When to use:
- Long document analysis (100+ pages)
- Knowledge base Q&A
- Scenarios requiring source citations
Cost estimate: Vector search + 1 API call, usually cheaper than stuffing full context
Common tools: LangChain, LlamaIndex, Pinecone, Chroma
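The retrieve-then-generate loop is simpler than it sounds. In production the relevance score comes from vector embeddings (via the tools above); the sketch below substitutes a toy word-overlap score so it runs with no dependencies, but the shape of the pipeline is the same.

```python
def score(query: str, chunk: str) -> int:
    """Toy relevance score: shared-word count (stand-in for vector similarity)."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Steps 1-2: rank chunks by relevance and keep only the top k."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:k]

chunks = [
    "Chapter 3 covers the marketing budget for 2024.",
    "Page 150: the risk assessment concludes supply-chain exposure is high.",
    "Appendix B lists all regional offices.",
]
top = retrieve("what is the risk assessment conclusion", chunks, k=1)
# Step 3 (not run here): send only `top` to the LLM instead of all 200 pages
print(top[0])
```

The key design choice: the LLM never sees the irrelevant 195 pages, which is where both the cost savings and the "Lost in the Middle" protection come from.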
Method 4: Manual Orchestration
When you need intermediate validation, or the workflow is fixed.
Core concept: You decide the flow, calling LLM at each step.
Example: Analyzing 10 files for code architecture
```python
# Step 1: Summarize each file individually
summaries = []
for file in files:
    summary = llm.analyze(f"Analyze main functions in: {file}")
    summaries.append(summary)

# Step 2: Synthesize all summaries
architecture = llm.synthesize(f"Based on these summaries, describe the architecture: {summaries}")

# Step 3: Generate final report
report = llm.generate(f"Based on the architecture analysis, create a technical report: {architecture}")
```
Manual Orchestration vs. Agent:
| | Manual Orchestration | Agent |
|---|---|---|
| Who decides next step? | You (the engineer) | AI |
| Is the flow fixed? | Yes | Dynamic |
| Predictability | High | Low |
| Best for | Known workflows | Unknown workflows |
When to use:
- Multi-step process, but each step is predetermined
- Need to add validation or human review between steps
- Want maximum control
Cost estimate: N API calls (N = number of steps)
Key Insight: Many people jump straight to Agents, but Manual Orchestration is more controllable, cheaper, and easier to debug. Ask yourself: can I hardcode this workflow?
Method 5: Agent (AI Decides Next Step)
When you don’t know how many steps are needed, or need dynamic decisions.
Core concept: Let the AI decide what to do next, until the task is complete.
```mermaid
flowchart TD
T["🎯 Task
Find the root cause of this bug"]
T --> A1["🤔 Agent thinks
I need to check the error log first"]
A1 -->|"calls read_file"| A2["🤔 Agent thinks
Error points to db.py, let me check"]
A2 -->|"calls read_file"| A3["🤔 Agent thinks
Found unclosed connection, let me verify..."]
A3 -->|"calls search"| R["✅ Conclusion
Root cause: connection pool has no timeout"]
style T fill:#e3f2fd
style A1 fill:#fff3e0
style A2 fill:#fff3e0
style A3 fill:#fff3e0
style R fill:#e8f5e9
```
When to use:
- Exploratory tasks (don’t know where the answer is)
- Need to chain multiple tools
- Variable number of steps
Risks:
- May loop endlessly (retrying the same error)
- May over-decompose (turning simple problems into 10 steps)
- Unpredictable costs
Common frameworks: LangChain Agent, CrewAI, AutoGen, Claude Code
Cost estimate: Unpredictable—could be 5 calls, could be 50
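The "may loop endlessly" risk has a cheap mitigation: a hard step cap. The skeleton below is a minimal sketch of that pattern; `decide` is a stub standing in for a real LLM tool-choice call, and the tool names are hypothetical.

```python
def run_agent(task: str, decide, tools: dict, max_steps: int = 10) -> str:
    """Minimal agent loop: the model picks the next tool until it answers,
    or until the step cap trips (guards against endless retry loops)."""
    history = [task]
    for _ in range(max_steps):
        action, arg = decide(history)        # LLM chooses the next step (stubbed below)
        if action == "finish":
            return arg                       # final answer
        observation = tools[action](arg)     # execute the chosen tool
        history.append(f"{action}({arg}) -> {observation}")
    return "Stopped: max_steps reached without a conclusion"

# Stub policy for illustration: read the log once, then conclude.
# A real agent replaces this with a model call that sees `history`.
def decide(history):
    if len(history) == 1:
        return "read_file", "error.log"
    return "finish", "Root cause: connection pool has no timeout"

tools = {"read_file": lambda path: f"contents of {path}"}
print(run_agent("Find the root cause of this bug", decide, tools))
```

Whatever framework you use, make sure something plays the role of `max_steps`; the unpredictable-cost risk comes precisely from loops without such a bound.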
Method 6: Deep Decomposition (RLM, etc.)
When handling extremely long content or needing true recursion.
This is the newest, heaviest approach—for extreme cases.
RLM (Recursive Language Models) is a framework from MIT that enables LLMs to:
- Store long text as variables instead of stuffing into prompts
- Dynamically partition and recursively search
- Maintain performance at million-token scale
When to use:
- Finding specific errors in 1 million lines of logs
- Analyzing massive codebases
- Extreme long-form content where other methods fail
Current recommendation: Unless your task is truly extreme, use the first 5 methods first.
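To make "dynamically partition and recursively search" concrete, here is a toy split-and-summarize recursion. This is my own illustration of the general deep-decomposition idea, not the actual RLM algorithm; the summarizer is a stub where a real system would call an LLM.

```python
def recursive_summarize(text: str, summarize, max_chars: int = 2000) -> str:
    """Recursively split text until each piece fits the budget, summarize
    the pieces, then summarize the concatenated summaries."""
    if len(text) <= max_chars:
        return summarize(text)
    mid = len(text) // 2
    left = recursive_summarize(text[:mid], summarize, max_chars)
    right = recursive_summarize(text[mid:], summarize, max_chars)
    return summarize(left + "\n" + right)

# Stub summarizer: keep the first 100 characters (a real one would call an LLM)
short = recursive_summarize("log line\n" * 10_000, lambda t: t[:100])
print(len(short))
```

The point of the recursion is that no single call ever sees more than `max_chars` of input, so total length is bounded only by your patience and budget, not by the context window.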
Decision Framework: 5 Questions to Choose
Not sure which method to use? Ask yourself these questions in order:
```mermaid
flowchart TD
Start["🤔 My task"] --> Q1{"1️⃣ Is single call
good enough?"}
Q1 -->|"Yes"| A1["✅ Single Call"]
Q1 -->|"No, reasoning too shallow"| A2["✅ Try CoT"]
Q1 -->|"No, data too long"| Q2{"2️⃣ Need retrieval
or full processing?"}
Q2 -->|"Need retrieval"| A3["✅ Use RAG"]
Q2 -->|"Need to see everything"| Q3{"3️⃣ Is the workflow
fixed?"}
Q3 -->|"Yes, I know each step"| A4["✅ Manual Orchestration"]
Q3 -->|"No, need dynamic decisions"| Q4{"4️⃣ How much
data?"}
Q4 -->|"< 100K tokens"| A5["✅ Agent"]
Q4 -->|"> 100K tokens"| A6["✅ Deep Decomposition (RLM)"]
style A1 fill:#e8f5e9
style A2 fill:#e8f5e9
style A3 fill:#e8f5e9
style A4 fill:#e8f5e9
style A5 fill:#e8f5e9
style A6 fill:#e8f5e9
```
Budget note: If time or budget is limited, start with simpler methods and upgrade gradually.
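The same decision flow can be written down as a plain function, which is handy if you want the choice documented in code. The question names are mine, chosen to mirror the flowchart.

```python
def choose_method(single_call_ok: bool, reasoning_too_shallow: bool,
                  retrieval_enough: bool, workflow_fixed: bool,
                  tokens: int) -> str:
    """The decision flowchart as code: answer the questions in order."""
    if single_call_ok:
        return "Single Call"
    if reasoning_too_shallow:
        return "CoT"                         # data fits; reasoning is the bottleneck
    if retrieval_enough:
        return "RAG"                         # data too long, but only parts matter
    if workflow_fixed:
        return "Manual Orchestration"        # must see everything, steps are known
    return "Agent" if tokens < 100_000 else "Deep Decomposition (RLM)"

# The 200-page report from the intro: too long, but retrieval is enough
print(choose_method(False, False, True, False, 300_000))
```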
Quick Reference Table
| Your Situation | Recommended Method |
|---|---|
| Simple task, need fast response | Single Call |
| Reasoning quality not enough | CoT Prompting |
| Document too long | RAG |
| Need staged validation | Manual Orchestration |
| Don’t know how many steps | Agent |
| Extreme length (million+ tokens) | RLM |
Cost Comparison
For the same task ("analyze 10 files"):
| Method | API Calls | Token Usage | Approx. Cost (GPT-4o) |
|---|---|---|---|
| Single call (stuff all) | 1 | 50K input + 2K output | ~$0.30 |
| CoT | 1 | 50K input + 5K output | ~$0.32 |
| RAG | 1 | 10K input + 2K output | ~$0.07 |
| Manual (10 steps) | 10 | 5K × 10 = 50K total | ~$0.30 |
| Agent (assume 15 steps) | 15 | Variable, ~80K total | ~$0.50+ |
Takeaway: RAG is usually cheapest because it only processes relevant chunks. Agents are most expensive and unpredictable.
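You can sanity-check a table like this yourself with a one-line cost formula. The rates below ($5 input / $15 output per million tokens) are illustrative assumptions to match the table's ballpark figures, not current prices; check your provider's pricing page.

```python
def call_cost(input_tokens: int, output_tokens: int,
              in_rate: float = 5.0, out_rate: float = 15.0) -> float:
    """Cost in USD; rates are per million tokens (illustrative, not current)."""
    return input_tokens * in_rate / 1e6 + output_tokens * out_rate / 1e6

# "Single call (stuff all)" row: 50K input + 2K output
print(round(call_cost(50_000, 2_000), 2))  # 0.28, i.e. the ~$0.30 ballpark
```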
Common Misconceptions
Misconception 1: “Just use an Agent—it’s easier”
Wrong. Agent problems:
- Unpredictable costs
- Tends to loop
- Hard to debug
Recommendation: First ask “Can I hardcode this workflow?” If yes, Manual Orchestration is better.
Misconception 2: “Bigger context window is always better”
Not quite. Research shows long contexts have issues:
- Middle content tends to be ignored
- Response quality may decrease
- Cost increases linearly
128K context ≠ 128K effectively processed
Misconception 3: “RAG is only for knowledge bases”
Wrong. RAG’s core is “retrieve then generate”—useful for:
- Long document analysis
- Historical conversation reference
- Any scenario where you don’t need everything, just relevant parts
Misconception 4: “CoT requires special models”
Wrong. Any LLM supports CoT—just add “Please analyze step by step” to your prompt.
Next Steps
If you’re currently only using single calls
- Try CoT first: Add “Please analyze step by step” to your prompt
- Observe results: Did reasoning quality improve?
- Then consider RAG: If data is too long
If you’re considering Agents
- First ask: Can I really not hardcode this workflow?
- Try Manual Orchestration first: Control each step yourself
- Set limits: If using Agents, set max steps and timeouts
If you’re handling very long documents
- Try RAG first: Works for most cases
- Evaluate needs: Do you really need to “see everything”?
- Then consider RLM: Only for extreme scenarios
Related Reading
Want to dive deeper into AI engineering practices? These articles might help:
- Not Every Company Should Build RAG—Here's How to Decide (a framework for evaluating RAG readiness and applicability)
- Using AI to Tackle Legacy Code—What About That 4000-Line Function? (Agent in practice: incremental refactoring of large codebases with AI)
- The 10-Minute Value: What Does "Senior" Mean in the AI Era? (AI is redefining seniority—judgment matters more than output)
- Your Team Started Using AI to Code. Why Isn't Productivity Improving? (AI engineering maturity model: from individual efficiency to team collaboration)
- A realistic look at AI-assisted coding: effects and limitations
Sources
Chain-of-Thought Prompting
- Chain-of-Thought Prompting Elicits Reasoning (2022) | Archive
Google research: step-by-step prompting improves reasoning, especially for math and logic.
Lost in the Middle Phenomenon
- Lost in the Middle (2023) | Archive
LLMs have lower recall for middle sections of long texts. Later models improved but issue persists.
Recursive Language Models
- RLM: Recursive Language Models (2025) | Archive
MIT framework (Zhang, Kraska, Khattab) for processing million-token contexts recursively.
Agent Design Patterns
- Building Effective Agents – Anthropic (2024) | Archive
Anthropic guide: “workflows” (fixed) are usually more reliable than “agents” (dynamic).