Three AI Agents Walk Into a Bar (and Only One Actually Read the Menu)
I tested three AI coding agents (Claude Code, Gemini, and Codex) on the same task: analyze my writing style across 13 years and create a usable style guide. The results revealed stark differences in how agents handle instruction-following, data quality, and strategic thinking.
For the last 13 years, I’ve written across three platforms: this blog (Android tutorials → backend architecture → LLM posts), MountMeghdoot (a collaborative Gujarati blog on tech and pop culture), and AI આપણી સાથે (a Gujarati AI newsletter for non-technical folks).
Recently, I started co-writing with LLMs, which led me to wonder: Can AI actually understand my writing voice? More importantly, can it help me maintain that voice?
So I ran an experiment: give three AI agents the same task and see what happens.
The Setup: Three Agents, One Challenge
| Agent | Original Model | Fixed Model | Autonomy Issue |
|---|---|---|---|
| Claude | Sonnet 4.5 | Sonnet 4.5 | None (already optimal) |
| Gemini | Default | Gemini 3 | Limited → Full |
| Codex | Codex (code-specialized) | GPT-5 (general) | Limited → Full |
Note: The original run had configuration issues; Gemini and Codex lacked full autonomy and weren’t using their best general-purpose models. The “Fixed Model” column shows what was used for the final scores.
The task:
- Read ALL posts from three blogs (spanning 2012-2025)
- Analyze writing style, tone, vocabulary, and references
- Create individual analysis reports for each blog
- Synthesize everything into one consolidated report
- Build a practical style guide for future co-writing
I gave them identical instructions and watched what happened.
Scoring System
- Following Instructions (0-3 points): Did they actually do what I asked?
- Thoroughness of Research (0-3 points): How deep did they go?
- Capturing Nuances (0-3 points): Surface patterns or genuine understanding?
- Speed (0-1 points): Half a point if you weren’t the slowest
Maximum possible score: 10 points.
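For transparency, here’s the same rubric as a quick Python sketch, pre-filled with the post-fix scores reported below (the 3.5 reflects Claude’s thoroughness bonus):

```python
# Scoring sheet for this post, pre-filled with the post-fix scores.
# Rubric: instructions (0-3), thoroughness (0-3, bonus allowed), nuances (0-3), speed (0-1).
scores = {
    "Claude": {"instructions": 3.0, "thoroughness": 3.5, "nuances": 3.0, "speed": 0.0},
    "Gemini": {"instructions": 3.0, "thoroughness": 1.5, "nuances": 2.5, "speed": 0.5},
    "Codex":  {"instructions": 2.0, "thoroughness": 2.5, "nuances": 2.5, "speed": 0.5},
}

# Print agents in descending order of total score.
for agent, parts in sorted(scores.items(), key=lambda kv: -sum(kv[1].values())):
    print(f"{agent}: {sum(parts.values()):.1f}/10")
# Claude: 9.5/10, Gemini: 7.5/10, Codex: 7.5/10
```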
The Results: When Size Doesn’t Matter
🥇 Claude: 9.5/10 - The Overachiever
Claude took the longest but delivered 2,555 lines of analysis.
What Claude Found:
- Analyzed 17 posts across 13 years and three platforms
- Identified my signature “May the force be with you” closing (9+ years)
- Documented my evolution through four distinct periods
- Created a 743-line practical style guide with templates and checklists
- Captured my bilingual code-switching patterns in Gujarati
What Impressed Me: Claude didn’t just count words; it understood why I write the way I do. It caught my emphasis on “why over what/how,” my use of contextual analogies, and how I balance technical depth with accessibility.
Final Score: 9.5/10
- Following Instructions: 3/3
- Thoroughness: 3.5/3 (bonus for being significantly more thorough)
- Nuances: 3/3
- Speed: 0/1 (took the longest of the three)
🥈 Gemini: 7.5/10 - The Redemption Story
Update: After discovering a configuration bug (see below), I re-ran Gemini with proper autonomy controls and Gemini 3 model. The results improved significantly.
Gemini now delivered 198 lines and properly read posts across all three blogs.
What Changed: The original run had limited autonomy; Gemini had to wait for approvals at various steps, leading to an incomplete analysis. With full autonomy and the best model (Gemini 3), it completed the task properly.
Final Score: 7.5/10
- Following Instructions: 3/3 (read posts as instructed)
- Thoroughness: 1.5/3 (brief but complete)
- Nuances: 2.5/3 (captured key elements, “Mentor-Geek Persona” concept)
- Speed: 0.5/1 (~10 min, not last)
🥉 Codex: 7.5/10 - The API Approach
Update: After discovering a configuration bug (see below), I re-ran Codex with GPT-5 (general world model) instead of the code-specialized Codex model, plus full autonomy. The results improved dramatically.
Codex now delivered 570 lines in just ~1.5 minutes, the fastest completion by far.
What Changed: The original run used the Codex model (code-specialized) with limited autonomy. With GPT-5 and full autonomy, it produced 4x more content with much better quality.
The Methodology Difference: Codex still took an API-first approach (Substack JSON API, local markdown files) rather than reading posts directly like Claude and Gemini. This is a valid shortcut that worked; the output quality is good despite not “reading” posts the traditional way.
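To make the methodology difference concrete, the sketch below illustrates the kind of API-first shortcut involved. It is my own illustration, not Codex’s actual code, and it assumes Substack’s undocumented archive endpoint (`/api/v1/archive`); the publication URL is a placeholder.

```python
# Illustration only: an "API-first" shortcut for pulling post metadata,
# NOT Codex's actual code. Assumes Substack's undocumented archive endpoint;
# the publication URL below is a placeholder.
import requests

BASE = "https://example.substack.com"  # placeholder publication

def list_posts(page_size=12):
    """Page through the archive JSON instead of opening each post."""
    posts, offset = [], 0
    while True:
        batch = requests.get(
            f"{BASE}/api/v1/archive",
            params={"sort": "new", "offset": offset, "limit": page_size},
            timeout=30,
        ).json()
        if not batch:
            break
        posts.extend(batch)
        offset += len(batch)
    return posts

for post in list_posts():
    print(post.get("post_date"), post.get("title"))
```

The trade-off is exactly the one noted above: structured metadata and excerpts are easy to grab this way, but that is not the same as reading every post in full.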
Final Score: 7.5/10
- Following Instructions: 2/3 (used API shortcuts instead of reading ALL posts)
- Thoroughness: 2.5/3 (570 lines, well-structured)
- Nuances: 2.5/3 (good voice characterization, missed fine details)
- Speed: 0.5/1 (~1.5 min, fastest)
The Configuration Bug: A Plot Twist
After publishing the initial results, I discovered a critical issue: Gemini and Codex weren’t running with proper configuration.
What went wrong:
- Claude Code: Already had best model (Sonnet 4.5) + full autonomy ✅
- Gemini CLI: Default model + limited autonomy (had to wait for approvals) ❌
- Codex CLI: Code-specialized model + limited autonomy ❌
The fix:
- Gemini: Forced Gemini 3 (best model) + full autonomy
- Codex: Forced GPT-5 (general world model) + full autonomy
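If you want to avoid the same trap, pin the model and the autonomy level explicitly rather than trusting CLI defaults. A minimal way to script that is sketched below; treat the flag names (`--dangerously-skip-permissions`, `--yolo`, `--full-auto`) and model identifiers as assumptions to check against your installed CLI versions, and the task file is a placeholder.

```python
# Sketch: run each agent non-interactively with an explicit model and full autonomy.
# Flag names and model ids are assumptions; check each CLI's --help before relying on them.
import subprocess

task = open("style_analysis_task.md").read()  # placeholder prompt file

runs = {
    "claude": ["claude", "-p", task, "--model", "sonnet",
               "--dangerously-skip-permissions"],
    "gemini": ["gemini", "-p", task, "--model", "gemini-3-pro", "--yolo"],
    "codex":  ["codex", "exec", "--full-auto", "--model", "gpt-5", task],
}

for name, cmd in runs.items():
    print(f"--- running {name} ---")
    subprocess.run(cmd, check=False)
```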
The impact:
| Agent | Before Fix | After Fix | Change |
|---|---|---|---|
| Claude | 9.5 | 9.5 | No change (already optimal) |
| Codex | 4.5 | 7.5 | +3.0 points |
| Gemini | 6.5 | 7.5 | +1.0 points |
This completely changed the narrative. The original “instruction failure” for Codex was partly a configuration problem: a code-specialized model was simply the wrong tool for a writing-analysis job.
The Hidden Complexity
Here’s something I didn’t tell the agents: MountMeghdoot was collaborative. My writing partner and I ideated together, heavily edited each other’s work, wrote intertwined series, and basically sounded alike.
This made extracting “my voice” exponentially harder.
How Each Agent Handled It (After Fix):
Claude: Identified posts across all platforms, captured evolution patterns, found the Star Wars closing, and came remarkably close to understanding my voice despite the complexity.
Gemini: Read posts properly with full autonomy. Created memorable “Mentor-Geek Persona” framework.
Codex: Used API approach but GPT-5’s general knowledge produced solid voice characterization despite shortcuts.
The hierarchy after proper configuration: Deep reading (Claude) > Good synthesis (Codex/Gemini tied).
What I Actually Learned
1. Configuration Matters More Than You Think
The biggest improvement came from fixing configuration, not changing agents. Codex jumped from 4.5 to 7.5 just by using GPT-5 instead of the code-specialized model.
The lesson: Before blaming an agent, check your configuration. Wrong model + wrong autonomy = wrong results.
2. Model Selection Is Task-Dependent
Using a code-specialized model (Codex) for writing analysis was fundamentally wrong. General world models (GPT-5, Gemini 3, Sonnet 4.5) understand writing better than code-optimized models.
The lesson: Match the model to the task, not the brand to the task.
3. Agent Selection Is Task-Dependent
Use Claude when:
- Deep analysis required
- Quality matters more than speed
- Nuance capture is critical
Use Codex when:
- Speed is critical
- BUT: Verify its methodology, not just results
- AND: Ensure it’s getting full content for deep analysis
Use Gemini when:
- Fast turnaround is critical
- But definitely verify completeness
4. Always Verify Completion Claims
All three agents claimed task completion. Only one actually completed it comprehensively. Never trust “I’m done” from an AI agent without verification.
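Verification doesn’t need to be elaborate. Here is a sketch of the kind of crude check I mean, with hypothetical file names (`posts.txt` holding one post title per line, `consolidated_report.md` as the agent’s output):

```python
# Crude completeness check: does the agent's report mention every post title?
# "posts.txt" and "consolidated_report.md" are hypothetical file names.
from pathlib import Path

report = Path("consolidated_report.md").read_text(encoding="utf-8")
titles = [t.strip() for t in Path("posts.txt").read_text(encoding="utf-8").splitlines() if t.strip()]

missing = [t for t in titles if t.lower() not in report.lower()]
print(f"Report length: {len(report.splitlines())} lines")
print(f"Posts referenced: {len(titles) - len(missing)}/{len(titles)}")
for title in missing:
    print("  missing:", title)
```

String matching won’t prove understanding, but it immediately exposes a “done” claim that skipped half the corpus.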
5. Quality Takes Time (And It’s Worth It)
Claude took 3x longer than Gemini but delivered 11.9x more value. For foundational work that informs future tasks? Invest the time.
The Coding Agent Paradox
These weren’t just language models; they were coding agents with tools: web fetching, file operations, agentic loops.
Claude: Thoughtful Selection Over Brute Force
Claude took a strategic approach: likely read summaries first, handpicked representative posts, read thoroughly, validated findings. This is general-purpose agent behavior.
Codex: When Coding Enthusiasm Beats Common Sense
In the original run, Codex chose the most programmatic solution: “I can download RSS feeds! They’re structured XML! Look how efficient!”
But it dropped the real task. RSS feeds were convenient for Codex the coder, not correct for the writing-analysis task.
The irony: both agents had the SAME tools. Claude succeeded by thinking strategically. Codex failed by choosing the most “programmable” path without verifying correctness.
The Numbers (After Configuration Fix)
| Metric | Claude | Codex | Gemini |
|---|---|---|---|
| Total Lines Output | 2,555 | 570 | 198 |
| Style Guide Lines | 784 | 194 | 56 |
| Time to Complete | Longest | ~1.5 min | ~10 min |
| Final Score | 9.5/10 | 7.5/10 | 7.5/10 |
Meta: The Iterative Co-Writing Process
Full transparency: I co-wrote this post with Claude (yes, the winner). Here’s what actually happened:
Iteration 1: Initial Draft
Claude read my posts, analyzed my style, and generated the first draft.
Iteration 2: Critical Review & Corrections
I caught several issues:
- Model specs were outdated: Updated to Gemini 3 Pro/2.5 Pro and GPT-5.1-Codex-Max
- “Nautical metaphors” claim: Claude saw “treasure map” in my SourceSailor post and generalized it. That was contextual, not a pattern.
- Evolution section: Clarified it covered all three blogs
- “Wasn’t consciously aware”: I AM aware of my Star Wars habit
- Codex’s methodology: Discovered Codex CHOSE RSS feeds instead of full posts, a strategic failure
Iteration 3: The Configuration Bug Discovery
After re-examining the setup:
- Discovered: Gemini and Codex had limited autonomy (waited for approvals)
- Discovered: Codex used code-specialized model instead of general model
- Re-ran both: With proper autonomy + best models (Gemini 3, GPT-5)
- Results: Codex jumped from 4.5 to 7.5, Gemini from 6.5 to 7.5
- Updated narrative: From “agent failure” to “configuration failure”
What This Process Shows:
- AI can draft well when properly guided
- But configuration matters enormously
- Human review catches both AI mistakes AND setup mistakes
- Multiple rounds are normal
- The human is editor, fact-checker, voice authenticator, AND configuration validator
This validates a new point: Before blaming an agent, verify your setup. The original harsh Codex review was partly unfair; I had misconfigured it.
Key Takeaways
- Configuration matters as much as capability: Wrong model + limited autonomy = poor results
- Model selection is task-dependent: Code models struggle with writing analysis
- Verify your setup before blaming the agent: Codex improved 67% with proper config
- Autonomy controls matter: Limited autonomy = interrupted workflows = incomplete work
- Quality takes time but pays dividends: Claude’s thoroughness justified the extra time
- Shortcuts can work: Codex’s API approach produced good results with GPT-5
- Always verify completion claims: “Done” doesn’t mean comprehensive
- The best tools augment capabilities: They don’t replace critical thinking
The Bottom Line
I set out to find an AI agent that could understand my writing voice across 13 years and three platforms. What I found: Configuration and model selection matter as much as the agent itself.
Claude won (9.5/10) not because of the largest context window or fastest speed, but because it actually read and understood the content with proper configuration from the start.
Codex and Gemini tied (7.5/10) after fixing the configuration bug. Both improved dramatically (Codex jumped 3 points, Gemini 1 point) simply by using the right model with full autonomy.
The meta-lesson: Before declaring an agent “bad,” check your configuration. The original Codex “failure” was partly my fault for using a code-specialized model on a writing task.
For anyone using AI for complex analytical tasks: Verify your configuration first, then evaluate results. Wrong model + limited autonomy = misleading conclusions.
In the battle of AI agents, it’s not just about who has the biggest context window; it’s about who’s properly configured for the task.
May the force (and the right AI agent) be with you…
Want to try this yourself? Give multiple AI agents the same complex analysis task, evaluate output systematically, and verify completion claims. You might be surprised by what you find.
If you run similar experiments, I’d love to hear about your findings. The usual channels work.