Single Call vs Agent: A Spectrum of LLM Strategies



The AI Missed the Key Finding in a 200-Page Report

You ask ChatGPT to analyze a 200-page report.

It quickly gives you a summary: “The key points are A, B, C…”

You breathe a sigh of relief, ready to paste the conclusion into your slides.

But then!

A colleague asks: “What about the risk assessment on page 150?”

You check the report. That section was completely ignored.

Fitting into the context window doesn’t mean the LLM actually “read” everything.

So you think: maybe I should use an Agent? Let it process in chunks?

But the Agent runs for 20 minutes and keeps going in circles…

⚡ TL;DR

  • Core problem: Single calls aren’t enough, but Agents are overkill—what’s in between?
  • Key insight: It’s not binary. It’s a spectrum: Single Call → CoT → RAG → Manual Orchestration → Agent
  • Decision factors: Data size, validation needs, control level, cost budget
  • Common mistake: Jumping straight from single calls to Agents, skipping the middle options

It’s Not Binary—It’s a Spectrum

Many people think of LLM usage as “single call vs. Agent.”

That’s wrong.

It’s actually a spectrum with many options in between:

flowchart LR
    A["Single Call
The basics"] --> B["CoT
Step-by-step thinking"]
    B --> C["RAG
Retrieve then generate"]
    C --> D["Manual Orchestration
You control the flow"]
    D --> E["Agent
AI decides next step"]
    E --> F["Deep Decomposition
Extreme cases"]
    style A fill:#e8f5e9
    style B fill:#e3f2fd
    style C fill:#fff3e0
    style D fill:#fce4ec
    style E fill:#f3e5f5
    style F fill:#ffebee

Simple → Complex | Cheap → Expensive | Fast → Slow

Each method solves different problems:

| Method | Solves What | Typical Use Case |
| --- | --- | --- |
| Single Call | Simple tasks | Translation, summarization, Q&A |
| CoT Prompting | Poor reasoning quality | Math, logic, complex analysis |
| RAG | Data too long or needs external knowledge | Long documents, knowledge base queries |
| Manual Orchestration | Need intermediate validation | Fixed workflows, staged processing |
| Agent | Unknown number of steps, dynamic decisions | Exploratory tasks, tool chaining |
| Deep Decomposition (RLM, etc.) | Extreme length, deep recursion | Million-line logs, huge codebases |

Key Insight: 80% of tasks can be solved with the first three methods. Jumping straight to Agents is usually over-engineering.


Method 1: Single Call (Baseline)

The simplest approach: one prompt, one response.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Translate this to English: ..."}]
)
print(response.choices[0].message.content)

When it works:

  • Problem is clear, no exploration needed
  • Data fits in context window (and isn’t too long)
  • Errors are acceptable or easy to catch manually
  • Need fast response

Examples:

  • ✅ Translating an email
  • ✅ Summarizing a 3000-word article
  • ✅ Answering specific technical questions (“How to read JSON in Python?”)
  • ✅ Generating short code snippets (under 50 lines)

Cost estimate: 1 API call, ~1K-10K tokens


Method 2: Chain-of-Thought (Step-by-Step Thinking)

When single calls don’t reason well enough—the lightest upgrade.

Core concept: Not multiple API calls, but asking the LLM to think step by step within one response.

Before (Single Call)

Prompt: "What's wrong with this function?"

Response: "Looks fine to me."  ← Might miss details

After (CoT Prompting)

Prompt: "What's wrong with this function? Please analyze step by step:
1. First check input validation
2. Then check edge cases
3. Finally check error handling"

Response:
"1. Input validation: No null check...
 2. Edge cases: Crashes when array is empty...
 3. Error handling: Missing try-catch..."  ← More thorough

When to use:

  • Complex reasoning (math, logic)
  • Need more careful analysis
  • Single calls often miss details

Cost estimate: Still 1 API call, but more output tokens (~1.5-2x)

Key Insight: CoT is a “free upgrade”—no architecture changes, just change your prompt. If single calls aren’t working, try CoT before anything else.
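In code, the upgrade really is just a prompt change. A minimal sketch (the helper name and step list are illustrative; any chat client slots in where the commented call sits):

```python
# CoT needs no new architecture: same single call, richer prompt.

def build_cot_prompt(question: str, steps: list[str]) -> str:
    """Turn a plain question into a step-by-step (CoT) prompt."""
    numbered = "\n".join(f"{i}. {s}" for i, s in enumerate(steps, 1))
    return f"{question}\nPlease analyze step by step:\n{numbered}"

prompt = build_cot_prompt(
    "What's wrong with this function?",
    ["First check input validation",
     "Then check edge cases",
     "Finally check error handling"],
)

# Then send `prompt` exactly like any single call, e.g.:
# response = client.chat.completions.create(
#     model="gpt-4o",
#     messages=[{"role": "user", "content": prompt}],
# )
```

The only cost difference from a plain single call is the longer output the step-by-step answer produces.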


Method 3: RAG (Retrieve Then Generate)

When data is too long, or you need external knowledge.

Core concept: Don’t stuff everything into context. Find relevant chunks first, then let the LLM process them.

flowchart TD
    Q["❓ Your question
What's the risk assessment conclusion?"]
    Q --> S["🔍 Step 1: Vector search
Find relevant sections from 200 pages"]
    S --> R["📄 Step 2: Filter results
Keep only the relevant 5 pages"]
    R --> L["🤖 Step 3: LLM generates
Answer based on those 5 pages"]
    style Q fill:#e3f2fd
    style S fill:#fff3e0
    style R fill:#e8f5e9
    style L fill:#f3e5f5

Why it works:

  • Avoids “Lost in the Middle” problem (LLMs tend to ignore middle sections of long contexts)
  • Reduces cost (only process relevant chunks)
  • Can handle data exceeding context window limits

When to use:

  • Long document analysis (100+ pages)
  • Knowledge base Q&A
  • Scenarios requiring source citations

Cost estimate: Vector search + 1 API call, usually cheaper than stuffing full context

Common tools: LangChain, LlamaIndex, Pinecone, Chroma
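The three steps above can be sketched end to end. This is a toy: real systems use an embedding model and a vector store (Chroma, Pinecone, etc.), so the word-overlap `score` below is just a stand-in for vector similarity:

```python
# Toy RAG: retrieve relevant chunks first, then build a small prompt.

def score(query: str, chunk: str) -> float:
    """Crude relevance: fraction of query words present in the chunk.
    Stand-in for cosine similarity over embeddings."""
    q = set(query.lower().split())
    return len(q & set(chunk.lower().split())) / len(q)

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Steps 1-2: search, then keep only the top-k relevant chunks."""
    return sorted(chunks, key=lambda ch: score(query, ch), reverse=True)[:k]

chunks = [
    "Chapter 3: revenue grew 12% year over year.",
    "Chapter 9: risk assessment finds the vendor dependency critical.",
    "Appendix: list of office locations.",
]
top = retrieve("what does the risk assessment conclude", chunks, k=1)

# Step 3: only `top` (not all 200 pages) goes into the LLM prompt.
prompt = f"Answer using these excerpts:\n{top}\n\nQuestion: ..."
```

The cost win falls out directly: the final call sees a few retrieved chunks instead of the whole document.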


Method 4: Manual Orchestration

When you need intermediate validation, or the workflow is fixed.

Core concept: You decide the flow, calling LLM at each step.

Example: Analyzing 10 files for code architecture

# Step 1: Summarize each file individually
summaries = []
for file in files:
    summary = llm.analyze(f"Analyze main functions in: {file}")
    summaries.append(summary)

# Step 2: Synthesize all summaries
architecture = llm.synthesize(f"Based on these summaries, describe the architecture: {summaries}")

# Step 3: Generate final report
report = llm.generate(f"Based on the architecture analysis, create a technical report: {architecture}")

Manual Orchestration vs. Agent:

| | Manual Orchestration | Agent |
| --- | --- | --- |
| Who decides next step? | You (the engineer) | AI |
| Is the flow fixed? | Yes | Dynamic |
| Predictability | High | Low |
| Best for | Known workflows | Unknown workflows |

When to use:

  • Multi-step process, but each step is predetermined
  • Need to add validation or human review between steps
  • Want maximum control

Cost estimate: N API calls (N = number of steps)

Key Insight: Many people jump straight to Agents, but Manual Orchestration is more controllable, cheaper, and easier to debug. Ask yourself: can I hardcode this workflow?
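Validation between steps is exactly what the hardcoded loop buys you. A sketch with a stubbed `call_llm` (any real client slots in); the `validate` check is illustrative, yours might check JSON shape, length, or required fields:

```python
# Manual orchestration with a validation gate between steps.
# `call_llm` is a stub standing in for a real API call.

def call_llm(prompt: str) -> str:
    return f"analysis of: {prompt[:30]}"  # stub response

def validate(result: str) -> bool:
    """Cheap sanity check you control; here, just non-empty output."""
    return len(result.strip()) > 0

files = ["auth.py", "db.py", "api.py"]
summaries = []
for f in files:
    summary = call_llm(f"Analyze main functions in: {f}")
    if not validate(summary):
        # Fail fast: stop before wasting the remaining calls.
        raise ValueError(f"Step failed for {f}")
    summaries.append(summary)  # only validated results move forward

report = call_llm(f"Describe the architecture based on: {summaries}")
```

An agent could do the same task, but here every intermediate result is inspected by code you wrote, so a bad step is caught immediately instead of silently feeding the next one.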


Method 5: Agent (AI Decides Next Step)

When you don’t know how many steps are needed, or need dynamic decisions.

Core concept: Let the AI decide what to do next, until the task is complete.

flowchart TD
    T["🎯 Task
Find the root cause of this bug"]
    T --> A1["🤔 Agent thinks
I need to check the error log first"]
    A1 -->|"calls read_file"| A2["🤔 Agent thinks
Error points to db.py, let me check"]
    A2 -->|"calls read_file"| A3["🤔 Agent thinks
Found unclosed connection, let me verify..."]
    A3 -->|"calls search"| R["✅ Conclusion
Root cause: connection pool has no timeout"]
    style T fill:#e3f2fd
    style A1 fill:#fff3e0
    style A2 fill:#fff3e0
    style A3 fill:#fff3e0
    style R fill:#e8f5e9

When to use:

  • Exploratory tasks (don’t know where the answer is)
  • Need to chain multiple tools
  • Variable number of steps

Risks:

  • May loop endlessly (retrying the same error)
  • May over-decompose (turning simple problems into 10 steps)
  • Unpredictable costs

Common frameworks: LangChain Agent, CrewAI, AutoGen, Claude Code

Cost estimate: Unpredictable—could be 5 calls, could be 50
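Those risks are why any agent loop should be bounded. Stripped of framework machinery, the core is a loop like this; `decide` stubs the LLM's tool-choice step, and the `max_steps` ceiling is the point:

```python
# A bounded agent loop: the model picks the next action, but a hard
# step limit keeps cost from running away.

def decide(history: list[str]) -> str:
    """Stub policy: read the log, then the file it points to, then stop.
    In a real agent this is an LLM call that returns a tool invocation."""
    plan = ["read_file error.log", "read_file db.py", "done"]
    return plan[min(len(history), len(plan) - 1)]

def run_agent(task: str, max_steps: int = 10) -> list[str]:
    history: list[str] = []
    for _ in range(max_steps):          # hard ceiling on calls/cost
        action = decide(history)
        if action == "done":
            return history
        history.append(action)          # in reality: execute the tool here
    raise RuntimeError(f"Agent hit max_steps={max_steps} without finishing")

steps = run_agent("Find the root cause of this bug")
```

Frameworks like LangChain or AutoGen wrap this same loop with tool schemas and memory; the step limit (and ideally a timeout) is what turns "unpredictable cost" into "bounded cost".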


Method 6: Deep Decomposition (RLM, etc.)

When handling extremely long content or needing true recursion.

This is the newest, heaviest approach—for extreme cases.

RLM (Recursive Language Models) is a framework from MIT that enables LLMs to:

  • Store long text as variables instead of stuffing into prompts
  • Dynamically partition and recursively search
  • Maintain performance at million-token scale

When to use:

  • Finding specific errors in 1 million lines of logs
  • Analyzing massive codebases
  • Extreme long-form content where other methods fail

Current recommendation: Unless your task is truly extreme, use the first 5 methods first.
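This is not RLM itself, just a toy showing the recursive idea: instead of feeding a million lines to one call, split the input, ask a cheap question per chunk ("is the error in here?"), and recurse only into the chunks that matter:

```python
# Recursive decomposition sketch. `contains_error` stands in for a
# cheap LLM scan of one chunk; the recursion only descends into
# chunks that look relevant.

def contains_error(chunk: list[str]) -> bool:
    """Stand-in for a cheap per-chunk LLM check."""
    return any("ERROR" in line for line in chunk)

def find_error(lines: list[str], chunk_size: int = 1000) -> list[str]:
    if len(lines) <= chunk_size:
        # Base case: small enough to "read" directly.
        return [line for line in lines if "ERROR" in line]
    mid = len(lines) // 2
    hits: list[str] = []
    for half in (lines[:mid], lines[mid:]):
        if contains_error(half):        # prune irrelevant halves early
            hits.extend(find_error(half, chunk_size))
    return hits

logs = ["INFO ok"] * 5000 + ["ERROR connection pool timeout"] + ["INFO ok"] * 5000
hits = find_error(logs)
```

No single call ever sees more than one chunk, which is how this style of decomposition keeps working past the context window.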


Decision Framework: 5 Questions to Choose

Not sure which method to use? Ask yourself these questions in order:

flowchart TD
    Start["🤔 My task"] --> Q1{"1️⃣ Is single call
good enough?"}
    Q1 -->|"Yes"| A1["✅ Single Call"]
    Q1 -->|"No, reasoning too shallow"| A2["✅ Try CoT"]
    Q1 -->|"No, data too long"| Q2{"2️⃣ Need retrieval
or full processing?"}
    Q2 -->|"Need retrieval"| A3["✅ Use RAG"]
    Q2 -->|"Need to see everything"| Q3{"3️⃣ Is the workflow
fixed?"}
    Q3 -->|"Yes, I know each step"| A4["✅ Manual Orchestration"]
    Q3 -->|"No, need dynamic decisions"| Q4{"4️⃣ How much
data?"}
    Q4 -->|"< 100K tokens"| A5["✅ Agent"]
    Q4 -->|"> 100K tokens"| A6["✅ Deep Decomposition (RLM)"]
    style A1 fill:#e8f5e9
    style A2 fill:#e8f5e9
    style A3 fill:#e8f5e9
    style A4 fill:#e8f5e9
    style A5 fill:#e8f5e9
    style A6 fill:#e8f5e9

Budget note: If time or budget is limited, start with simpler methods and upgrade gradually.

Quick Reference Table

| Your Situation | Recommended Method |
| --- | --- |
| Simple task, need fast response | Single Call |
| Reasoning quality not enough | CoT Prompting |
| Document too long | RAG |
| Need staged validation | Manual Orchestration |
| Don’t know how many steps | Agent |
| Extreme length (million+ tokens) | RLM |

Cost Comparison

For the same task—”analyze 10 files”:

| Method | API Calls | Token Usage | Approx. Cost (GPT-4o) |
| --- | --- | --- | --- |
| Single call (stuff all) | 1 | 50K input + 2K output | ~$0.30 |
| CoT | 1 | 50K input + 5K output | ~$0.32 |
| RAG | 1 | 10K input + 2K output | ~$0.07 |
| Manual (10 steps) | 10 | 5K × 10 = 50K total | ~$0.30 |
| Agent (assume 15 steps) | 15 | Variable, ~80K total | ~$0.50+ |

Takeaway: RAG is usually cheapest because it only processes relevant chunks. Agents are most expensive and unpredictable.
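The table's arithmetic is easy to reproduce. The per-million-token prices below are assumptions in the same ballpark as the table (check current GPT-4o pricing before relying on them):

```python
# Rough cost math behind the table. Assumed prices: $5 / 1M input
# tokens, $15 / 1M output tokens -- verify against current rates.

def cost_usd(input_tokens: int, output_tokens: int,
             in_price: float = 5.0, out_price: float = 15.0) -> float:
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000

single = cost_usd(50_000, 2_000)   # "stuff everything" single call
rag    = cost_usd(10_000, 2_000)   # RAG: only the relevant chunks
```

Whatever the exact prices, the ordering holds: RAG's smaller input dominates the bill, which is why it undercuts the full-context call by roughly 4x here.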


Common Misconceptions

Misconception 1: “Just use an Agent—it’s easier”

Wrong. Agent problems:

  • Unpredictable costs
  • Tends to loop
  • Hard to debug

Recommendation: First ask “Can I hardcode this workflow?” If yes, Manual Orchestration is better.

Misconception 2: “Bigger context window is always better”

Not quite. Research shows long contexts have issues:

  • Middle content tends to be ignored
  • Response quality may decrease
  • Cost increases linearly

128K context ≠ 128K effectively processed

Misconception 3: “RAG is only for knowledge bases”

Wrong. RAG’s core is “retrieve then generate”—useful for:

  • Long document analysis
  • Historical conversation reference
  • Any scenario where you don’t need everything, just relevant parts

Misconception 4: “CoT requires special models”

Wrong. Any LLM supports CoT—just add “Please analyze step by step” to your prompt.


Next Steps

If you’re currently only using single calls

  1. Try CoT first: Add “Please analyze step by step” to your prompt
  2. Observe results: Did reasoning quality improve?
  3. Then consider RAG: If data is too long

If you’re considering Agents

  1. First ask: Can I really not hardcode this workflow?
  2. Try Manual Orchestration first: Control each step yourself
  3. Set limits: If using Agents, set max steps and timeouts

If you’re handling very long documents

  1. Try RAG first: Works for most cases
  2. Evaluate needs: Do you really need to “see everything”?
  3. Then consider RLM: Only for extreme scenarios



Sources

  • Chain-of-Thought Prompting
  • Lost in the Middle Phenomenon
  • Recursive Language Models
  • Agent Design Patterns
