Single Call vs Agent: A Spectrum of LLM Strategies



The AI Missed the Key Finding in a 200-Page Report

You ask ChatGPT to analyze a 200-page report.

It quickly gives you a summary: “The key points are A, B, C…”

You breathe a sigh of relief, ready to paste the conclusion into your slides.

But then!

A colleague asks: “What about the risk assessment on page 150?”

You check the report. That section was completely ignored.

Fitting into the context window doesn’t mean the LLM actually “read” everything.

So you think: maybe I should use an Agent? Let it process in chunks?

But the Agent runs for 20 minutes and keeps going in circles…

⚡ TL;DR

  • Core problem: Single calls aren’t enough, but Agents are overkill—what’s in between?
  • Key insight: It’s not binary. It’s a spectrum: Single Call → CoT → RAG → Manual Orchestration → Agent
  • Decision factors: Data size, validation needs, control level, cost budget
  • Common mistake: Jumping straight from single calls to Agents, skipping the middle options

It’s Not Binary—It’s a Spectrum

Many people think of LLM usage as “single call vs. Agent.”

That’s wrong.

It’s actually a spectrum with many options in between:

flowchart LR
    A["Single Call
The basics"] --> B["CoT
Step-by-step thinking"]
    B --> C["RAG
Retrieve then generate"]
    C --> D["Manual Orchestration
You control the flow"]
    D --> E["Agent
AI decides next step"]
    E --> F["Deep Decomposition
Extreme cases"]
    style A fill:#e8f5e9
    style B fill:#e3f2fd
    style C fill:#fff3e0
    style D fill:#fce4ec
    style E fill:#f3e5f5
    style F fill:#ffebee

Simple → Complex | Cheap → Expensive | Fast → Slow

Each method solves different problems:

| Method | Solves What | Typical Use Case |
| --- | --- | --- |
| Single Call | Simple tasks | Translation, summarization, Q&A |
| CoT Prompting | Poor reasoning quality | Math, logic, complex analysis |
| RAG | Data too long or needs external knowledge | Long documents, knowledge base queries |
| Manual Orchestration | Need intermediate validation | Fixed workflows, staged processing |
| Agent | Unknown number of steps, dynamic decisions | Exploratory tasks, tool chaining |
| Deep Decomposition (RLM, etc.) | Extreme length, deep recursion | Million-line logs, huge codebases |

Key Insight: 80% of tasks can be solved with the first three methods. Jumping straight to Agents is usually over-engineering.


Method 1: Single Call (Baseline)

The simplest approach: one prompt, one response.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Translate this to English: ..."}]
)
print(response.choices[0].message.content)

When it works:

  • Problem is clear, no exploration needed
  • Data fits in context window (and isn’t too long)
  • Errors are acceptable or easy to catch manually
  • Need fast response

Examples:

  • ✅ Translating an email
  • ✅ Summarizing a 3000-word article
  • ✅ Answering specific technical questions (“How to read JSON in Python?”)
  • ✅ Generating short code snippets (under 50 lines)

Cost estimate: 1 API call, ~1K-10K tokens


Method 2: Chain-of-Thought (Step-by-Step Thinking)

When single calls don’t reason well enough—the lightest upgrade.

Core concept: Not multiple API calls, but asking the LLM to think step by step within one response.

Before (Single Call)

Prompt: "What's wrong with this function?"

Response: "Looks fine to me."  ← Might miss details

After (CoT Prompting)

Prompt: "What's wrong with this function? Please analyze step by step:
1. First check input validation
2. Then check edge cases
3. Finally check error handling"

Response:
"1. Input validation: No null check...
 2. Edge cases: Crashes when array is empty...
 3. Error handling: Missing try-catch..."  ← More thorough

When to use:

  • Complex reasoning (math, logic)
  • Need more careful analysis
  • Single calls often miss details

Cost estimate: Still 1 API call, but more output tokens (~1.5-2x)

Key Insight: CoT is a “free upgrade”—no architecture changes, just change your prompt. If single calls aren’t working, try CoT before anything else.
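In code, the upgrade really is just a prompt change. A minimal sketch (the helper name and step list are illustrative; any chat client slots in where the commented call sits):

```python
# CoT needs no new architecture: same single call, richer prompt.

def build_cot_prompt(question: str, steps: list[str]) -> str:
    """Turn a plain question into a step-by-step (CoT) prompt."""
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(steps, 1))
    return f"{question}\nPlease analyze step by step:\n{numbered}"

prompt = build_cot_prompt(
    "What's wrong with this function?",
    ["First check input validation",
     "Then check edge cases",
     "Finally check error handling"],
)

# Then send `prompt` exactly like any single call, e.g.:
# response = client.chat.completions.create(
#     model="gpt-4o",
#     messages=[{"role": "user", "content": prompt}],
# )
```

The only cost difference from a plain single call is the longer output the step-by-step answer produces.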


Method 3: RAG (Retrieve Then Generate)

When data is too long, or you need external knowledge.

Core concept: Don’t stuff everything into context. Find relevant chunks first, then let the LLM process them.

flowchart TD
    Q["❓ Your question
What's the risk assessment conclusion?"]
    Q --> S["🔍 Step 1: Vector search
Find relevant sections from 200 pages"]
    S --> R["📄 Step 2: Filter results
Keep only the relevant 5 pages"]
    R --> L["🤖 Step 3: LLM generates
Answer based on those 5 pages"]
    style Q fill:#e3f2fd
    style S fill:#fff3e0
    style R fill:#e8f5e9
    style L fill:#f3e5f5

Why it works:

  • Avoids “Lost in the Middle” problem (LLMs tend to ignore middle sections of long contexts)
  • Reduces cost (only process relevant chunks)
  • Can handle data exceeding context window limits

When to use:

  • Long document analysis (100+ pages)
  • Knowledge base Q&A
  • Scenarios requiring source citations

Cost estimate: Vector search + 1 API call, usually cheaper than stuffing full context

Common tools: LangChain, LlamaIndex, Pinecone, Chroma
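The three steps above can be sketched end to end. This is a toy: real systems use an embedding model and a vector store (Chroma, Pinecone, etc.), so the word-overlap `score` below is just a stand-in for vector similarity:

```python
# Toy RAG: retrieve relevant chunks first, then build a small prompt.

def score(query: str, chunk: str) -> float:
    """Crude relevance: fraction of query words present in the chunk.
    Stand-in for cosine similarity over embeddings."""
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / len(q)

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Steps 1-2: search, then keep only the top-k relevant chunks."""
    return sorted(chunks, key=lambda ch: score(query, ch), reverse=True)[:k]

chunks = [
    "Chapter 3: revenue grew 12% year over year.",
    "Chapter 9: risk assessment finds the vendor dependency critical.",
    "Appendix: list of office locations.",
]
top = retrieve("what does the risk assessment conclude", chunks, k=1)

# Step 3: only `top` (not all 200 pages) goes into the LLM prompt.
prompt = f"Answer using these excerpts:\n{top}\n\nQuestion: ..."
```

The cost win falls out directly: the final call sees a few retrieved chunks instead of the whole document.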


Method 4: Manual Orchestration

When you need intermediate validation, or the workflow is fixed.

Core concept: You decide the flow, calling LLM at each step.

Example: Analyzing 10 files for code architecture

# Step 1: Summarize each file individually
summaries = []
for file in files:
    summary = llm.analyze(f"Analyze main functions in: {file}")
    summaries.append(summary)

# Step 2: Synthesize all summaries
architecture = llm.synthesize(f"Based on these summaries, describe the architecture: {summaries}")

# Step 3: Generate final report
report = llm.generate(f"Based on the architecture analysis, create a technical report: {architecture}")

Manual Orchestration vs. Agent:

| | Manual Orchestration | Agent |
| --- | --- | --- |
| Who decides next step? | You (the engineer) | AI |
| Is the flow fixed? | Yes | Dynamic |
| Predictability | High | Low |
| Best for | Known workflows | Unknown workflows |

When to use:

  • Multi-step process, but each step is predetermined
  • Need to add validation or human review between steps
  • Want maximum control

Cost estimate: N API calls (N = number of steps)

Key Insight: Many people jump straight to Agents, but Manual Orchestration is more controllable, cheaper, and easier to debug. Ask yourself: can I hardcode this workflow?
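Validation between steps is exactly what the hardcoded loop buys you. A sketch with a stubbed `call_llm` (any real client slots in); the `validate` check is illustrative, yours might check JSON shape, length, or required fields:

```python
# Manual orchestration with a validation gate between steps.
# `call_llm` is a stub standing in for a real API call.

def call_llm(prompt: str) -> str:
    return f"analysis of: {prompt[:30]}"  # stub response

def validate(result: str) -> bool:
    """Cheap sanity check you control; here, just non-empty output."""
    return len(result.strip()) > 0

files = ["auth.py", "db.py", "api.py"]
summaries = []
for f in files:
    summary = call_llm(f"Analyze main functions in: {f}")
    if not validate(summary):
        # Fail fast: stop before wasting the remaining calls.
        raise ValueError(f"Step failed for {f}")
    summaries.append(summary)  # only validated results move forward

report = call_llm(f"Describe the architecture based on: {summaries}")
```

An agent could do the same task, but here every intermediate result is inspected by code you wrote, so a bad step is caught immediately instead of silently feeding the next one.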


Method 5: Agent (AI Decides Next Step)

When you don’t know how many steps are needed, or need dynamic decisions.

Core concept: Let the AI decide what to do next, until the task is complete.

flowchart TD
    T["🎯 Task
Find the root cause of this bug"]
    T --> A1["🤔 Agent thinks
I need to check the error log first"]
    A1 -->|"calls read_file"| A2["🤔 Agent thinks
Error points to db.py, let me check"]
    A2 -->|"calls read_file"| A3["🤔 Agent thinks
Found unclosed connection, let me verify..."]
    A3 -->|"calls search"| R["✅ Conclusion
Root cause: connection pool has no timeout"]
    style T fill:#e3f2fd
    style A1 fill:#fff3e0
    style A2 fill:#fff3e0
    style A3 fill:#fff3e0
    style R fill:#e8f5e9

When to use:

  • Exploratory tasks (don’t know where the answer is)
  • Need to chain multiple tools
  • Variable number of steps

Risks:

  • May loop endlessly (retrying the same error)
  • May over-decompose (turning simple problems into 10 steps)
  • Unpredictable costs

Common frameworks: LangChain Agent, CrewAI, AutoGen, Claude Code

Cost estimate: Unpredictable—could be 5 calls, could be 50
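Those risks are why any agent loop should be bounded. Stripped of framework machinery, the core is a loop like this; `decide` stubs the LLM's tool-choice step, and the `max_steps` ceiling is the point:

```python
# A bounded agent loop: the model picks the next action, but a hard
# step limit keeps cost from running away.

def decide(history: list[str]) -> str:
    """Stub policy: read the log, then the file it points to, then stop.
    In a real agent this is an LLM call that returns a tool invocation."""
    plan = ["read_file error.log", "read_file db.py", "done"]
    return plan[min(len(history), len(plan) - 1)]

def run_agent(task: str, max_steps: int = 10) -> list[str]:
    history: list[str] = []
    for _ in range(max_steps):          # hard ceiling on calls/cost
        action = decide(history)
        if action == "done":
            return history
        history.append(action)          # in reality: execute the tool here
    raise RuntimeError(f"Agent hit max_steps={max_steps} without finishing")

steps = run_agent("Find the root cause of this bug")
```

Frameworks like LangChain or AutoGen wrap this same loop with tool schemas and memory; the step limit (and ideally a timeout) is what turns "unpredictable cost" into "bounded cost".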


Method 6: Deep Decomposition (RLM, etc.)

When handling extremely long content or needing true recursion.

This is the newest, heaviest approach—for extreme cases.

RLM (Recursive Language Models) is a framework from MIT that enables LLMs to:

  • Store long text as variables instead of stuffing into prompts
  • Dynamically partition and recursively search
  • Maintain performance at million-token scale

When to use:

  • Finding specific errors in 1 million lines of logs
  • Analyzing massive codebases
  • Extreme long-form content where other methods fail

Current recommendation: Unless your task is truly extreme, use the first 5 methods first.
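This is not RLM itself, just a toy showing the recursive idea: instead of feeding a million lines to one call, split the input, ask a cheap question per chunk ("is the error in here?"), and recurse only into the chunks that matter:

```python
# Recursive decomposition sketch. `contains_error` stands in for a
# cheap LLM scan of one chunk; the recursion only descends into
# chunks that look relevant.

def contains_error(chunk: list[str]) -> bool:
    """Stand-in for a cheap per-chunk LLM check."""
    return any("ERROR" in line for line in chunk)

def find_error(lines: list[str], chunk_size: int = 1000) -> list[str]:
    if len(lines) <= chunk_size:
        # Base case: small enough to "read" directly.
        return [line for line in lines if "ERROR" in line]
    mid = len(lines) // 2
    hits: list[str] = []
    for half in (lines[:mid], lines[mid:]):
        if contains_error(half):        # prune irrelevant halves early
            hits.extend(find_error(half, chunk_size))
    return hits

logs = ["INFO ok"] * 5000 + ["ERROR connection pool timeout"] + ["INFO ok"] * 5000
hits = find_error(logs)
```

No single call ever sees more than one chunk, which is how this style of decomposition keeps working past the context window.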


Decision Framework: 5 Questions to Choose

Not sure which method to use? Ask yourself these questions in order:

flowchart TD
    Start["🤔 My task"] --> Q1{"1️⃣ Is single call
good enough?"}
    Q1 -->|"Yes"| A1["✅ Single Call"]
    Q1 -->|"No, reasoning too shallow"| A2["✅ Try CoT"]
    Q1 -->|"No, data too long"| Q2{"2️⃣ Need retrieval
or full processing?"}
    Q2 -->|"Need retrieval"| A3["✅ Use RAG"]
    Q2 -->|"Need to see everything"| Q3{"3️⃣ Is the workflow
fixed?"}
    Q3 -->|"Yes, I know each step"| A4["✅ Manual Orchestration"]
    Q3 -->|"No, need dynamic decisions"| Q4{"4️⃣ How much
data?"}
    Q4 -->|"< 100K tokens"| A5["✅ Agent"]
    Q4 -->|"> 100K tokens"| A6["✅ Deep Decomposition (RLM)"]
    style A1 fill:#e8f5e9
    style A2 fill:#e8f5e9
    style A3 fill:#e8f5e9
    style A4 fill:#e8f5e9
    style A5 fill:#e8f5e9
    style A6 fill:#e8f5e9

Budget note: If time or budget is limited, start with simpler methods and upgrade gradually.

Quick Reference Table

| Your Situation | Recommended Method |
| --- | --- |
| Simple task, need fast response | Single Call |
| Reasoning quality not enough | CoT Prompting |
| Document too long | RAG |
| Need staged validation | Manual Orchestration |
| Don’t know how many steps | Agent |
| Extreme length (million+ tokens) | RLM |

Cost Comparison

For the same task—”analyze 10 files”:

| Method | API Calls | Token Usage | Approx. Cost (GPT-4o) |
| --- | --- | --- | --- |
| Single call (stuff all) | 1 | 50K input + 2K output | ~$0.30 |
| CoT | 1 | 50K input + 5K output | ~$0.32 |
| RAG | 1 | 10K input + 2K output | ~$0.07 |
| Manual (10 steps) | 10 | 5K × 10 = 50K total | ~$0.30 |
| Agent (assume 15 steps) | 15 | Variable, ~80K total | ~$0.50+ |

Takeaway: RAG is usually cheapest because it only processes relevant chunks. Agents are most expensive and unpredictable.
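The table's arithmetic is easy to reproduce. The per-million-token prices below are assumptions in the same ballpark as the table (check current GPT-4o pricing before relying on them):

```python
# Rough cost math behind the table. Assumed prices: $5 / 1M input
# tokens, $15 / 1M output tokens -- verify against current rates.

def cost_usd(input_tokens: int, output_tokens: int,
             in_price: float = 5.0, out_price: float = 15.0) -> float:
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

single = cost_usd(50_000, 2_000)   # "stuff everything" single call
rag    = cost_usd(10_000, 2_000)   # RAG: only the relevant chunks
```

Whatever the exact prices, the ordering holds: RAG's smaller input dominates the bill, which is why it undercuts the full-context call by roughly 4x here.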


Common Misconceptions

Misconception 1: “Just use an Agent—it’s easier”

Wrong. Agent problems:

  • Unpredictable costs
  • Tends to loop
  • Hard to debug

Recommendation: First ask “Can I hardcode this workflow?” If yes, Manual Orchestration is better.

Misconception 2: “Bigger context window is always better”

Not quite. Research shows long contexts have issues:

  • Middle content tends to be ignored
  • Response quality may decrease
  • Cost increases linearly

128K context ≠ 128K effectively processed

Misconception 3: “RAG is only for knowledge bases”

Wrong. RAG’s core is “retrieve then generate”—useful for:

  • Long document analysis
  • Historical conversation reference
  • Any scenario where you don’t need everything, just relevant parts

Misconception 4: “CoT requires special models”

Wrong. Any LLM supports CoT—just add “Please analyze step by step” to your prompt.


Next Steps

If you’re currently only using single calls

  1. Try CoT first: Add “Please analyze step by step” to your prompt
  2. Observe results: Did reasoning quality improve?
  3. Then consider RAG: If data is too long

If you’re considering Agents

  1. First ask: Can I really not hardcode this workflow?
  2. Try Manual Orchestration first: Control each step yourself
  3. Set limits: If using Agents, set max steps and timeouts

If you’re handling very long documents

  1. Try RAG first: Works for most cases
  2. Evaluate needs: Do you really need to “see everything”?
  3. Then consider RLM: Only for extreme scenarios



Sources

  • Chain-of-Thought Prompting
  • Lost in the Middle Phenomenon
  • Recursive Language Models
  • Agent Design Patterns
