AI Coding8 min read · 2026

12 things AI coding agents reliably get wrong (and how to spot them)

A field guide to the failure modes that show up across Claude, GPT, Gemini, and the rest in 2026.

AI coding agents are extraordinary at certain things and reliably bad at others. The bad list is shorter than you might think, and once you can spot the failure modes, the agent gets much more useful. Here are the twelve that show up most across Claude, GPT, Gemini, and the rest as of 2026.

1. Inventing functions that look real but do not exist

Especially in less-popular libraries. The agent writes client.fetch_all_pages() because every other pagination library has something like it, and yours does not. The signature is plausible. The function is not real. Spot it by running the code; any "undefined function" error in agent-written code is almost always this.

2. Confidently wrong about version-specific behavior

APIs change. Next.js 14 did not have stable Server Actions. Next.js 15 changed how cookies work. Tailwind v4 removed several v3 features. An agent trained on a corpus that mixes versions writes code that compiles but does the wrong thing. Always check the agent's assumptions about your version.

3. Refactoring more than you asked for

Ask for a one-line fix, get a 200-line refactor. The agent sees adjacent code that could be "cleaner" and rewrites it. Even when the new code is technically fine, it is not what you asked for, and it makes diffs unreviewable. Scope the request explicitly: "change only line 42."

4. Hallucinating error handling for cases that cannot happen

Try/except wrapping a synchronous local function call. Null checks on inputs that come from a typed schema. Retries on operations that are not network calls. The agent over-defends because it is optimizing for "robust-looking code," which is not the same as robust code. Watch for defensive code on functions that operate in trusted territory.

5. Using outdated security practices

Bcrypt with cost factor 10 (should be 12+ in 2026). MD5 for non-crypto purposes that should be blake2b for speed. JWT with HS256 when RS256 would let you rotate without invalidating tokens. Sanitizing HTML with regex. These patterns ship in agent-written code because they were the dominant patterns five years ago. Specifically prompt for "current best practices" when security is involved.

6. Reinventing what the framework already does

Writing custom auth middleware when the framework has it. Building a state manager when React's context plus useReducer would cover it. Custom date parsing when date-fns or Temporal would do. The agent will happily build the thing because it does not know what is already in your project. Show it your imports first.

7. Wrong about file paths and project structure

Importing from ../utils/helpers when your project uses absolute imports from @/lib/helpers. Putting components in components/ when your project uses app/components/. Naming files helpers.tsx when your project uses helpers.ts with named exports. Agents default to common conventions, not your conventions. Show them a similar file before asking them to create a new one.

8. Inventing the shape of payloads and schemas

"What does the Stripe webhook payload look like?" The agent will gladly tell you, with plausible field names that are mostly right and one or two that are wrong. The data you are processing is real and structured; ask the agent to fetch a single example first, or paste the real payload structure into context.

9. Bad at really long files

Past around 2000 lines, attention degrades. The agent forgets that you defined a helper at the top, redefines it at the bottom. Imports things that are already imported. Variable names start drifting. The fix: ask for surgical edits with line numbers, or extract the relevant section to a smaller file before working on it.

10. Bad at really long conversations

After 40-plus turns, especially after a compaction, the agent loses track of decisions made earlier. "Do not add comments" gets forgotten. "We use Tailwind v4 syntax" gets forgotten. The fix: maintain a project memory file (CLAUDE.md, .cursorrules, AGENTS.md depending on the tool) that the agent reads every turn. Memory in the prompt beats memory in conversation.

11. Mocking the database in tests by default

Default behavior is to mock the database so tests "pass cleanly." This is exactly the worst pattern for catching real bugs. Mocked tests pass while the production migration fails. Tell the agent explicitly to use integration tests against a real (or clean test) database for anything involving SQL.

12. Believing the README

If a project's README says "this uses Postgres" but the actual code talks to MongoDB (because someone migrated and did not update the README), the agent will believe the README. Read the actual code path before trusting the agent's understanding of an unfamiliar project.

The meta-mistake

The single most expensive pattern is trusting that the agent's output is right because it looks right. Modern agents are extraordinary writers: confident, well-structured, fluent. That fluency is uncorrelated with correctness. Run the code, check the output, verify the claim. Every time, even when you are tired.

Common questions

Does Claude make different mistakes than GPT or Gemini?

The rankings shift quarter to quarter, but the categories are the same. Claude tends to be slightly better at not over-refactoring and at avoiding hallucinated functions. GPT tends to be slightly better at recent web framework specifics. Gemini tends to be slightly better at long context. None of them is meaningfully immune to any of the twelve.

Does giving the agent more context fix this?

It helps with several: file paths, schemas, version assumptions. It does not help with the structural ones: over-refactoring, defensive code, mocked tests. Those need explicit instruction.

Are the new agentic coding tools (Cursor agent mode, Claude Code, etc.) better at this?

They are better at noticing the mistakes by running the code and reading the errors. That does not mean they make fewer mistakes; it means they correct themselves faster. The net result: noticeably more productive, same underlying failure modes.

Researched with Jester →