Three AI Agents Walk Into a Bar (and Only One Actually Read the Menu)
I tested three AI coding agents (Claude Code, Gemini, and Codex) on the same task: analyze my writing style across 13 years and create a usable style guide. The results revealed stark differences in how agents handle instruction-following, data quality, and strategic thinking.
For the last 13 years, I’ve written across three platforms: this blog (Android tutorials → backend architecture → LLM posts), MountMeghdoot (a collaborative Gujarati blog on tech and pop culture), and AI આપણી સાથે (a Gujarati AI newsletter for non-technical folks).
Recently, I started co-writing with LLMs, which led me to wonder: Can AI actually understand my writing voice? More importantly, can it help me maintain that voice?
So I ran an experiment: give three AI agents the same task and see what happens.
The Setup: Three Agents, One Challenge
| Agent | Original Model | Fixed Model | Autonomy Issue |
|---|---|---|---|
| Claude | Sonnet 4.5 | Sonnet 4.5 | None (already optimal) |
| Gemini | Default | Gemini 3 | Limited → Full |
| Codex | Codex (code-specialized) | GPT-5 (general) | Limited → Full |
Note: The original run had configuration issues; Gemini and Codex lacked full autonomy and weren’t using their best general-purpose models. The “Fixed Model” column shows what was used for the final scores.
The task:
- Read ALL posts from three blogs (spanning 2012-2025)
- Analyze writing style, tone, vocabulary, and references
- Create individual analysis reports for each blog
- Synthesize everything into one consolidated report
- Build a practical style guide for future co-writing
I gave them identical instructions and watched what happened.
Scoring System
- Following Instructions (0-3 points): Did they actually do what I asked?
- Thoroughness of Research (0-3 points): How deep did they go?
- Capturing Nuances (0-3 points): Surface patterns or genuine understanding?
- Speed (0-1 points): Half a point if you weren’t the slowest
Maximum possible score: 10 points.
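For transparency, here’s the same rubric as a quick Python sketch, pre-filled with the post-fix scores reported below (the 3.5 reflects Claude’s thoroughness bonus):

```python
# Scoring sheet for this post, pre-filled with the post-fix scores.
# Rubric: instructions (0-3), thoroughness (0-3, bonus allowed), nuances (0-3), speed (0-1).
scores = {
    "Claude": {"instructions": 3.0, "thoroughness": 3.5, "nuances": 3.0, "speed": 0.0},
    "Gemini": {"instructions": 3.0, "thoroughness": 1.5, "nuances": 2.5, "speed": 0.5},
    "Codex":  {"instructions": 2.0, "thoroughness": 2.5, "nuances": 2.5, "speed": 0.5},
}

# Print agents in descending order of total score.
for agent, parts in sorted(scores.items(), key=lambda kv: -sum(kv[1].values())):
    print(f"{agent}: {sum(parts.values()):.1f}/10")
# Claude: 9.5/10, Gemini: 7.5/10, Codex: 7.5/10
```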
The Results: When Size Doesn’t Matter
🥇 Claude: 9.5/10 - The Overachiever
Claude took the longest but delivered 2,555 lines of analysis.
What Claude Found:
- Analyzed 17 posts across 13 years and three platforms
- Identified my signature “May the force be with you” closing (9+ years)
- Documented my evolution through four distinct periods
- Created a 743-line practical style guide with templates and checklists
- Captured my bilingual code-switching patterns in Gujarati
What Impressed Me: Claude didn’t just count words; it understood why I write the way I do. It caught my emphasis on “why over what/how,” my use of contextual analogies, and how I balance technical depth with accessibility.
Final Score: 9.5/10
- Following Instructions: 3/3
- Thoroughness: 3.5/3 (bonus for being significantly more thorough)
- Nuances: 3/3
- Speed: 0/1 (took the longest of the three)
🥈 Gemini: 7.5/10 - The Redemption Story
Update: After discovering a configuration bug (see below), I re-ran Gemini with proper autonomy controls and Gemini 3 model. The results improved significantly.
Gemini now delivered 198 lines and properly read posts across all three blogs.
What Changed: The original run had limited autonomy; Gemini had to wait for approvals at various steps, leading to an incomplete analysis. With full autonomy and the best model (Gemini 3), it completed the task properly.
Final Score: 7.5/10
- Following Instructions: 3/3 (read posts as instructed)
- Thoroughness: 1.5/3 (brief but complete)
- Nuances: 2.5/3 (captured key elements, “Mentor-Geek Persona” concept)
- Speed: 0.5/1 (~10 min, not last)
🥉 Codex: 7.5/10 - The API Approach
Update: After discovering a configuration bug (see below), I re-ran Codex with GPT-5 (general world model) instead of the code-specialized Codex model, plus full autonomy. The results improved dramatically.
Codex now delivered 570 lines in just ~1.5 minutes, the fastest completion by far.
What Changed: The original run used the Codex model (code-specialized) with limited autonomy. With GPT-5 and full autonomy, it produced 4x more content with much better quality.
The Methodology Difference: Codex still took an API-first approach (Substack JSON API, local markdown files) rather than reading posts directly like Claude and Gemini. This is a valid shortcut that worked; the output quality is good despite not “reading” posts the traditional way.
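To make the methodology difference concrete, the sketch below illustrates the kind of API-first shortcut involved. It is my own illustration, not Codex’s actual code, and it assumes Substack’s undocumented archive endpoint (`/api/v1/archive`); the publication URL is a placeholder.

```python
# Illustration only: an "API-first" shortcut for pulling post metadata,
# NOT Codex's actual code. Assumes Substack's undocumented archive endpoint;
# the publication URL below is a placeholder.
import requests

BASE = "https://example.substack.com"  # placeholder publication

def list_posts(page_size=12):
    """Page through the archive JSON instead of opening each post."""
    posts, offset = [], 0
    while True:
        batch = requests.get(
            f"{BASE}/api/v1/archive",
            params={"sort": "new", "offset": offset, "limit": page_size},
            timeout=30,
        ).json()
        if not batch:
            break
        posts.extend(batch)
        offset += len(batch)
    return posts

for post in list_posts():
    print(post.get("post_date"), post.get("title"))
```

The trade-off is exactly the one noted above: structured metadata and excerpts are easy to grab this way, but that is not the same as reading every post in full.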
Final Score: 7.5/10
- Following Instructions: 2/3 (used API shortcuts instead of reading ALL posts)
- Thoroughness: 2.5/3 (570 lines, well-structured)
- Nuances: 2.5/3 (good voice characterization, missed fine details)
- Speed: 0.5/1 (~1.5 min, fastest)
The Configuration Bug: A Plot Twist
After publishing the initial results, I discovered a critical issue: Gemini and Codex weren’t running with proper configuration.
What went wrong:
- Claude Code: Already had best model (Sonnet 4.5) + full autonomy ✅
- Gemini CLI: Default model + limited autonomy (had to wait for approvals) ❌
- Codex CLI: Code-specialized model + limited autonomy ❌
The fix:
- Gemini: Forced Gemini 3 (best model) + full autonomy
- Codex: Forced GPT-5 (general world model) + full autonomy
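If you want to avoid the same trap, pin the model and the autonomy level explicitly rather than trusting CLI defaults. A minimal way to script that is sketched below; treat the flag names (`--dangerously-skip-permissions`, `--yolo`, `--full-auto`) and model identifiers as assumptions to check against your installed CLI versions, and the task file is a placeholder.

```python
# Sketch: run each agent non-interactively with an explicit model and full autonomy.
# Flag names and model ids are assumptions; check each CLI's --help before relying on them.
import subprocess

task = open("style_analysis_task.md").read()  # placeholder prompt file

runs = {
    "claude": ["claude", "-p", task, "--model", "sonnet",
               "--dangerously-skip-permissions"],
    "gemini": ["gemini", "-p", task, "--model", "gemini-3-pro", "--yolo"],
    "codex":  ["codex", "exec", "--full-auto", "--model", "gpt-5", task],
}

for name, cmd in runs.items():
    print(f"--- running {name} ---")
    subprocess.run(cmd, check=False)
```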
The impact:
| Agent | Before Fix | After Fix | Change |
|---|---|---|---|
| Claude | 9.5 | 9.5 | No change (already optimal) |
| Codex | 4.5 | 7.5 | +3.0 points |
| Gemini | 6.5 | 7.5 | +1.0 points |
This completely changed the narrative. The original “instruction failure” for Codex was partly a configuration problem: a code-specialized model was simply the wrong tool for a writing-analysis job.
The Hidden Complexity
Here’s something I didn’t tell the agents: MountMeghdoot was collaborative. My writing partner and I ideated together, heavily edited each other’s work, wrote intertwined series, and basically sounded alike.
This made extracting “my voice” exponentially harder.
How Each Agent Handled It (After Fix):
Claude: Identified posts across all platforms, captured evolution patterns, found the Star Wars closing, and came remarkably close to understanding my voice despite the complexity.
Gemini: Read posts properly with full autonomy. Created memorable “Mentor-Geek Persona” framework.
Codex: Used API approach but GPT-5’s general knowledge produced solid voice characterization despite shortcuts.
The hierarchy after proper configuration: Deep reading (Claude) > Good synthesis (Codex/Gemini tied).
What I Actually Learned
1. Configuration Matters More Than You Think
The biggest improvement came from fixing configuration, not changing agents. Codex jumped from 4.5 to 7.5 just by using GPT-5 instead of the code-specialized model.
The lesson: Before blaming an agent, check your configuration. Wrong model + wrong autonomy = wrong results.
2. Model Selection Is Task-Dependent
Using a code-specialized model (Codex) for writing analysis was fundamentally wrong. General world models (GPT-5, Gemini 3, Sonnet 4.5) understand writing better than code-optimized models.
The lesson: Match the model to the task, not the brand to the task.
3. Agent Selection Is Task-Dependent
Use Claude when:
- Deep analysis required
- Quality matters more than speed
- Nuance capture is critical
Use Codex when:
- Speed is critical
- BUT: Verify its methodology, not just results
- AND: Ensure it’s getting full content for deep analysis
Use Gemini when:
- Fast turnaround is critical
- But definitely verify completeness
4. Always Verify Completion Claims
All three agents claimed task completion. Only one actually completed it comprehensively. Never trust “I’m done” from an AI agent without verification.
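Verification doesn’t need to be elaborate. Here is a sketch of the kind of crude check I mean, with hypothetical file names (`posts.txt` holding one post title per line, `consolidated_report.md` as the agent’s output):

```python
# Crude completeness check: does the agent's report mention every post title?
# "posts.txt" and "consolidated_report.md" are hypothetical file names.
from pathlib import Path

report = Path("consolidated_report.md").read_text(encoding="utf-8")
titles = [t.strip() for t in Path("posts.txt").read_text(encoding="utf-8").splitlines() if t.strip()]

missing = [t for t in titles if t.lower() not in report.lower()]
print(f"Report length: {len(report.splitlines())} lines")
print(f"Posts referenced: {len(titles) - len(missing)}/{len(titles)}")
for title in missing:
    print("  missing:", title)
```

String matching won’t prove understanding, but it immediately exposes a “done” claim that skipped half the corpus.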
5. Quality Takes Time (And It’s Worth It)
Claude took 3x longer than Gemini but delivered 11.9x more value. For foundational work that informs future tasks? Invest the time.
The Coding Agent Paradox
These weren’t just language models; they were coding agents with tools: web fetching, file operations, agentic loops.
Claude: Thoughtful Selection Over Brute Force
Claude took a strategic approach: likely read summaries first, handpicked representative posts, read thoroughly, validated findings. This is general-purpose agent behavior.
Codex: When Coding Enthusiasm Beats Common Sense
In the original run, Codex chose the most programmatic solution: “I can download RSS feeds! They’re structured XML! Look how efficient!”
But it dropped the real task. RSS feeds were convenient for Codex the coder, not correct for the writing-analysis task.
The irony: both agents had the SAME tools. Claude succeeded by thinking strategically. Codex failed by choosing the most “programmable” path without verifying correctness.
The Numbers (After Configuration Fix)
| Metric | Claude | Codex | Gemini |
|---|---|---|---|
| Total Lines Output | 2,555 | 570 | 198 |
| Style Guide Lines | 784 | 194 | 56 |
| Time to Complete | Longest | ~1.5 min | ~10 min |
| Final Score | 9.5/10 | 7.5/10 | 7.5/10 |
Meta: The Iterative Co-Writing Process
Full transparency: I co-wrote this post with Claude (yes, the winner). Here’s what actually happened:
Iteration 1: Initial Draft
Claude read my posts, analyzed my style, and generated the first draft.
Iteration 2: Critical Review & Corrections
I caught several issues:
- Model specs were outdated: Updated to Gemini 3 Pro/2.5 Pro and GPT-5.1-Codex-Max
- “Nautical metaphors” claim: Claude saw “treasure map” in my SourceSailor post and generalized it. That was contextual, not a pattern.
- Evolution section: Clarified it covered all three blogs
- “Wasn’t consciously aware”: I AM aware of my Star Wars habit
- Codex’s methodology: Discovered Codex CHOSE RSS feeds instead of full posts, a strategic failure
Iteration 3: The Configuration Bug Discovery
After re-examining the setup:
- Discovered: Gemini and Codex had limited autonomy (waited for approvals)
- Discovered: Codex used code-specialized model instead of general model
- Re-ran both: With proper autonomy + best models (Gemini 3, GPT-5)
- Results: Codex jumped from 4.5 to 7.5, Gemini from 6.5 to 7.5
- Updated narrative: From “agent failure” to “configuration failure”
What This Process Shows:
- AI can draft well when properly guided
- But configuration matters enormously
- Human review catches both AI mistakes AND setup mistakes
- Multiple rounds are normal
- The human is editor, fact-checker, voice authenticator, AND configuration validator
This validates a new point: Before blaming an agent, verify your setup. The original harsh Codex review was partly unfair; I had misconfigured it.
Key Takeaways
- Configuration matters as much as capability: Wrong model + limited autonomy = poor results
- Model selection is task-dependent: Code models struggle with writing analysis
- Verify your setup before blaming the agent: Codex improved 67% with proper config
- Autonomy controls matter: Limited autonomy = interrupted workflows = incomplete work
- Quality takes time but pays dividends: Claude’s thoroughness justified the extra time
- Shortcuts can work: Codex’s API approach produced good results with GPT-5
- Always verify completion claims: “Done” doesn’t mean comprehensive
- The best tools augment capabilities: They don’t replace critical thinking
The Bottom Line
I set out to find an AI agent that could understand my writing voice across 13 years and three platforms. What I found: Configuration and model selection matter as much as the agent itself.
Claude won (9.5/10) not because of the largest context window or fastest speed, but because it actually read and understood the content with proper configuration from the start.
Codex and Gemini tied (7.5/10) after fixing the configuration bug. Both improved dramatically (Codex jumped 3 points, Gemini 1 point) simply by using the right model with full autonomy.
The meta-lesson: Before declaring an agent “bad,” check your configuration. The original Codex “failure” was partly my fault for using a code-specialized model on a writing task.
For anyone using AI for complex analytical tasks: Verify your configuration first, then evaluate results. Wrong model + limited autonomy = misleading conclusions.
In the battle of AI agents, it’s not just about who has the biggest context window; it’s about who’s properly configured for the task.
May the force (and the right AI agent) be with you…
Want to try this yourself? Give multiple AI agents the same complex analysis task, evaluate output systematically, and verify completion claims. You might be surprised by what you find.
If you run similar experiments, I’d love to hear about your findings. The usual channels work.