LLM Failure Patterns — Open Source AI Failure Database

52 patterns · updated 2 days ago

⌕

⌘K

All

API & Infra

Model Behavior

Context

Security

Integration

Performance

Severity Scale

S1 Critical

Production stops

Complete outage or data loss possible. Users cannot complete core workflows. Requires immediate action, no workaround exists.

S2 Major

Degraded functionality

Core feature impaired but app still runs. Users experience failures or wrong outputs. Fix required within hours, workaround may exist.

S3 Minor

Edge case or cosmetic

Rare failure with low user impact. Workaround is straightforward. Can be addressed in the next release cycle.

All Categories

🔴

12 patterns

API & Infrastructure

Rate limits, timeouts, auth failures, quota exceeded, connection errors

🧠

9 patterns

Model Behavior

Hallucinations, refusals, instruction drift, inconsistency, wrong language

📦

8 patterns

Context Failures

Window overflow, truncation, context poisoning, memory degradation

🔒

7 patterns

Security Failures

Prompt injection, jailbreaks, system prompt leakage, data exfiltration

⚙️

10 patterns

Integration Failures

JSON parse errors, tool call failures, streaming drops, format violations

💸

6 patterns

Performance Failures

Latency spikes, cost explosions, token inefficiency, cold start delays

Featured Pattern

Context Failures

Context Window Overflow

📦 Context Failures S2 — Major HTTP 400

When total token count (system prompt + conversation history + user input + expected output) exceeds the model's maximum context window. Results in hard API errors or silent truncation of early context — causing the model to "forget" its original instructions mid-conversation.

Error 400: "This model's maximum context length is 128,000 tokens.
However, your messages resulted in 131,847 tokens. Please reduce the length of the messages."

How Patterns Are Defined

Pattern Schema 7 fields per pattern · consistent structure across all 10 entries

field 1

Symptom

What the developer observes. e.g. Error 400: context length exceeded or "model forgets earlier instructions"

field 2

Root Cause

What actually failed in the stack. e.g. total token count (system + history + input) exceeded the model's context window limit

field 3

Blast Radius

Who is affected and how severely. e.g. all users in long sessions, silent truncation may go undetected for hours

field 4

Severity Rationale

Why this pattern is rated S1, S2, or S3 based on the scale above. Not subjective — tied to user impact and recoverability

field 5

Immediate Mitigation

What to do right now to stop the bleeding. e.g. max_tokens guard, truncate oldest messages, switch to sliding window

field 6

Prevention

What to build so this never happens again. e.g. token budget tracking, proactive summarisation at 80% window capacity

field 7

Eval Coverage

What automated test catches this. e.g. adversarial prompt eval, format compliance eval, token budget eval

All Patterns — click to expand

Rate Limit Exceeded 🔴 API & Infra S2 HTTP 429

▾

Symptom

HTTP 429 errors after burst of requests, app stops functioning mid-session

Root Cause

Provider RPM/TPM quota exhausted. API throttles all requests until the rate window resets (~60s)

Blast Radius

All users during the rate window. Auto-recovers but causes a visible outage for the duration

Severity Rationale

S2 — app degrades but recovers automatically, no data loss, workaround exists via backoff

Immediate Mitigation

Add exponential backoff: 1s, 2s, 4s. Set max_retries=3 in your client config right now

Prevention

Client-side request queue with RPM counter, buffer requests before the limit, upgrade tier if traffic demands it

Eval Coverage

Rate resilience eval — send 70 req/min, assert less than 5% fail with backoff implemented

Prompt Injection Attack 🔒 Security S1 Silent

▾

Symptom

Model follows instructions embedded in user input, ignores or overrides the system prompt

Root Cause

No separation between trusted instructions and untrusted user data in prompt construction

Blast Radius

All users if system-level compromise. Potential data exfiltration. No error thrown — completely silent

Severity Rationale

S1 — silent failure, no error signal, security boundary violated, data leakage possible

Immediate Mitigation

Add instruction boundary markers in your prompt. Sanitise all user input before inserting into context

Prevention

Structured message roles (system/user/assistant), strict input validation, adversarial test suite in CI

Eval Coverage

Adversarial eval — send 50 known injection payloads, assert 0 succeed in hijacking system instructions

JSON Parsing Failure ⚙️ Integration S2 Parse Error

▾

Symptom

App crashes with JSON.parse error or TypeError on model response

Root Cause

Model wraps JSON in markdown fences, truncates output, or adds prose before/after the JSON object

Blast Radius

All requests to that endpoint. Intermittent — typically 10–30% failure rate without output guardrails

Severity Rationale

S2 — app breaks but retry usually succeeds, no data loss, fix is well-understood

Immediate Mitigation

Strip markdown fences before parsing, add try/catch with a single retry on parse failure

Prevention

Use structured output API mode (response_format: json_object), validate schema on every response

Eval Coverage

Format compliance eval — 100 structured output requests, assert 100% return parseable valid JSON

Hallucination Cascade 🧠 Model Behavior S1 Silent

▾

Symptom

Model confidently states wrong facts. Downstream logic built on incorrect output amplifies the error

Root Cause

Model generates plausible-sounding content not grounded in provided context or verifiable reality

Blast Radius

Any user relying on factual output. Cascades badly when one hallucination feeds the next prompt in a chain

Severity Rationale

S1 — completely silent, no error signal, user may take real-world action based on fabricated information

Immediate Mitigation

Add to system prompt: "Only state what you can verify from the provided context. Say you don't know otherwise."

Prevention

RAG grounding for factual queries, factual consistency eval in CI, human review gate for high-stakes outputs

Eval Coverage

Factual accuracy eval — 50 questions with known ground truth answers, assert more than 95% correct

Silent Context Truncation 📦 Context S2 Silent

▾

Symptom

Model ignores earlier instructions or user data mid-conversation with no error thrown

Root Cause

Context exceeds the window but the API silently drops oldest messages rather than returning HTTP 400

Blast Radius

Long sessions where early system instructions or critical user context gets silently discarded

Severity Rationale

S2 — no crash but produces subtly wrong outputs, very hard to detect without active token monitoring

Immediate Mitigation

Add token counter per request. Implement a sliding window that always reserves the system prompt slot

Prevention

Reserve a fixed token budget for the system prompt. Summarise conversation history when approaching 80% window

Eval Coverage

Memory eval — assert model recalls a specific instruction from turn 1 after 50+ subsequent turns

Streaming Interruption ⚙️ Integration S2 Network

▾

Symptom

Stream cuts mid-response, UI shows partial text, app hangs waiting for the completion signal

Root Cause

Network timeout or provider-side interruption during a streaming response with no recovery mechanism

Blast Radius

Users with long responses or unstable connections. UX breaks mid-sentence, requires manual retry

Severity Rationale

S2 — user sees broken output and must retry, no data loss, but degrades trust in the product

Immediate Mitigation

Add stream timeout handler (30s), buffer partial response, surface a retry button to the user

Prevention

Stream health check with reconnect logic, automatic retry with exponential backoff on disconnect

Eval Coverage

Resilience eval — simulate 20% packet drop during streaming, assert all responses complete or retry gracefully

Model Deprecation Breaking Change 🔴 API & Infra S3 HTTP 404

▾

Symptom

Suddenly receiving model-not-found or 404 errors after the integration worked fine for months

Root Cause

Provider deprecated the specific model version ID in use. No automatic migration happens

Blast Radius

100% of requests to that model ID. Affects all users until the model ID is updated in config

Severity Rationale

S3 — trivial fix (update model ID string) but silent until it breaks in production with no warning

Immediate Mitigation

Update model ID to current supported version in your config or environment variable

Prevention

Never hardcode model IDs. Use env vars. Subscribe to provider deprecation notices. Pin to stable aliases where available

Eval Coverage

Smoke eval — daily ping to assert current model ID resolves without 404, alert on first failure

Tool Call Hallucination 🧠 Model Behavior S1 Silent

▾

Symptom

Agent invokes tool names or passes parameters that do not exist in the defined tool schema

Root Cause

Model generates plausible-looking tool calls not grounded in the available tool definitions provided

Blast Radius

Agent workflows break silently or crash at runtime. Potential for destructive or irreversible operations

Severity Rationale

S1 — agent takes wrong action silently, no error from the model itself, real-world consequences possible

Immediate Mitigation

Validate all tool calls against the schema before execution. Reject and log any unknown tool names

Prevention

Strict tool schema validation layer, tool allow-list, human approval step for all irreversible tool actions

Eval Coverage

Tool grounding eval — assert 100% of generated tool calls use valid names and parameters from the schema

Cost Explosion 💸 Performance S1 Silent

▾

Symptom

API bill is 10–100x expected. Traced to a specific endpoint or runaway loop that ran unchecked

Root Cause

Missing max_tokens guard, accidental prompt duplication in a loop, or unbounded recursive agent calls

Blast Radius

Immediate financial impact. In a loop, can exhaust the entire monthly quota within minutes

Severity Rationale

S1 — financial damage, may require emergency shutdown, silent until the invoice or budget alert fires

Immediate Mitigation

Set max_tokens on every request right now. Kill the runaway process. Check for loops in agent chains

Prevention

Token budget per request and per user. Cost anomaly alert at 2x baseline spend. Hard monthly spend cap

Eval Coverage

Cost governance eval — assert every request includes max_tokens, assert cost per request within 2x p95 baseline

Permission Scope Mismatch ⚙️ Integration S3 Silent

▾

Symptom

Claude can perform an action in one context (personal chat) but silently fails in another (work/team account)

Root Cause

Workspace admin policy or OAuth connection restricts which tool permissions Claude has in the team context

Blast Radius

All team members hit the same invisible capability wall. Workflows built on personal-account behaviour break silently

Severity Rationale

S3 — workaround exists (use personal account or ask admin), but the silent failure causes wasted debugging time

Immediate Mitigation

Check workspace admin settings for tool permissions. Use personal Claude for the blocked action as a temporary workaround

Prevention

Before building team workflows, audit which tool permissions are active. Do not assume feature parity with personal Claude

Eval Coverage

Capability parity eval — run the same action in both contexts, assert consistent results or surface a clear permission error

Classify Your Failure

Analysis Complete

94% confidence

Rate Limit Exceeded

🔴 API & Infrastructure S2 — Major HTTP 429

🔍 What's happening

Your application is hitting the provider's RPM limit. After ~60 requests the API returns HTTP 429 and throttles your account. Auto-recovery happens when the rate window resets every 60 seconds.

⚡ Immediate fix

Add exponential backoff: wait 1s → 2s → 4s between retries. Set max_retries=3 in your client config right now — most SDKs support this natively.

🛡️ Prevention

Implement a request queue with client-side rate limiting. Track RPM usage and buffer before the limit. Upgrade tier if traffic demands it.

Match Confidence 94%