Your AI feature shipped green. Six months later, a user complains that the chatbot confidently recommended a product you discontinued three months ago. You didn’t change anything. Time did.
LLM drift (also known as model drift or behavioral drift) is the gradual degradation or shift in a Large Language Model’s performance, accuracy, and behavior over time — even when not a single line of code has changed. [CONFIRMED] In production, this is often called the “six-month cliff.” A customer support chatbot that initially handled 70% of inquiries without escalation can silently drop to under 50% by month three — while infrastructure dashboards stay green the entire time. [SOURCE: Nebula]
The Three Forces of Silent Degradation
1. Silent Base-Model Updates
Model providers continuously update their models behind the scenes. [CONFIRMED] While these updates aim to improve the model, they can silently alter how it reasons, formats structured data, or handles refusals. One commonly reported incident: a silent model update doubled CI failure rates over three days as agent behavior shifted without a single line of code changing. [SOURCE: Nebula]
2. Eval Distribution Drift
When you launch an AI feature, you build an evaluation set reflecting the queries you can imagine. [CONFIRMED] But production traffic evolves. Users adopt new terminology, ask about features you’ve added since launch, and frame questions in ways your eval set never anticipated. Your eval suite stays green while real-world accuracy quietly erodes. [SOURCE: Nebula]
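One rough way to see this gap is to measure how much production vocabulary never appears in the eval set. The sketch below is an illustration under assumptions, not a standard metric: the function name `unseen_term_rate`, the simple regex tokenizer, and the sample queries are all hypothetical.

```python
import re


def tokenize(text: str) -> list[str]:
    # Lowercase alphanumeric tokens; a deliberately crude tokenizer for illustration.
    return re.findall(r"[a-z0-9]+", text.lower())


def unseen_term_rate(eval_queries: list[str], prod_queries: list[str]) -> float:
    """Fraction of production tokens that never occur anywhere in the eval set."""
    eval_vocab = {tok for q in eval_queries for tok in tokenize(q)}
    prod_tokens = [tok for q in prod_queries for tok in tokenize(q)]
    if not prod_tokens:
        return 0.0
    unseen = sum(1 for tok in prod_tokens if tok not in eval_vocab)
    return unseen / len(prod_tokens)


# Hypothetical sample data: the eval set predates "passkey" and the "pro plan".
eval_set = ["how do i reset my password", "what is the refund policy"]
prod_sample = [
    "how do i reset my passkey",
    "what is the refund policy for the pro plan",
]
rate = unseen_term_rate(eval_set, prod_sample)
print(f"unseen term rate: {rate:.2f}")
```

A rising rate over successive weeks suggests the eval set no longer covers what users actually ask; embedding-based comparisons would likely be more robust than token overlap.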
3. Knowledge Base Rot
In RAG systems, the model relies on external documents. If your documentation, pricing pages, or compliance rules change but the vector database isn’t updated to match, the LLM will confidently generate answers based on obsolete facts. [CONFIRMED] RAG systems can lose roughly a third of their effective accuracy within 90 days purely due to knowledge staleness. [SOURCE: Nebula]
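A staleness check can be as simple as comparing each document's last verification date against a cutoff. The following is a minimal sketch, assuming each indexed document carries a `last_verified` timestamp; the `Doc` class, the `stale_docs` helper, and the 90-day default are illustrative assumptions rather than any vector database's API.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class Doc:
    doc_id: str
    last_verified: datetime  # assumed metadata field; not provided by every store


def stale_docs(docs: list[Doc], max_age_days: int = 90,
               now: Optional[datetime] = None) -> list[str]:
    """Return IDs of documents not verified within the last max_age_days."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=max_age_days)
    return [d.doc_id for d in docs if d.last_verified < cutoff]


# Hypothetical corpus with a fixed "now" so the example is reproducible.
now = datetime(2025, 6, 1, tzinfo=timezone.utc)
docs = [
    Doc("pricing", datetime(2025, 1, 15, tzinfo=timezone.utc)),
    Doc("returns", datetime(2025, 5, 20, tzinfo=timezone.utc)),
]
stale = stale_docs(docs, max_age_days=90, now=now)
freshness = 1 - len(stale) / len(docs)  # share of documents still within the window
print(stale, freshness)
```

Reporting `freshness` alongside accuracy metrics makes staleness visible before it reaches users.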
How Behavioral Change Manifests
Drift doesn’t look like a traditional software bug. It shows up as subtle shifts in behavior:
| Manifestation | What You See | Example |
|---|---|---|
| Verbosity shift | Responses become longer or shorter | GPT-4 exhibited 23% variance in response length over time |
| Instruction adherence | Model loses ability to follow complex instructions | Mixtral showed 31% inconsistency in instruction adherence |
| Refusal style | Safety guardrails drift — overly permissive or restrictive | Sudden spike in refusals for previously allowed queries |
| Hallucination increase | Model fills gaps with plausible but wrong information | Factuality scores drop without obvious cause |
[SOURCE: AIBIM]
Why Traditional Monitoring Fails
Traditional observability tracks CPU spikes, network latency, and 500 errors. [CONFIRMED] But LLMs fail silently. A model suffering from drift will still return a perfectly formatted “200 OK” response that’s factually wrong, semantically altered, or dangerously biased. [SOURCE: Nebula]
Only 62% of organizations running AI agents in production can inspect what their agents actually do at each step, despite 89% claiming to have observability in place. [SOURCE: Nebula]
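Because a drifted response still returns successfully, the check has to look at content rather than status codes. Below is a minimal sketch of a golden-set test under that assumption: `GoldenCase`, `run_golden_set`, the substring matching, and the stand-in `fake_model` are hypothetical, and a real suite would call your actual model client and likely use richer scoring.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class GoldenCase:
    prompt: str
    must_include: list[str] = field(default_factory=list)  # phrases a good answer contains
    must_exclude: list[str] = field(default_factory=list)  # phrases that signal a bad answer


def run_golden_set(ask: Callable[[str], str], cases: list[GoldenCase]) -> list[str]:
    """Return the prompts whose answers violate their include/exclude expectations."""
    failures = []
    for case in cases:
        answer = ask(case.prompt).lower()
        missing = [s for s in case.must_include if s.lower() not in answer]
        forbidden = [s for s in case.must_exclude if s.lower() in answer]
        if missing or forbidden:
            failures.append(case.prompt)
    return failures


def fake_model(prompt: str) -> str:
    # Stand-in for a real model call; it "succeeds" but recommends a
    # (hypothetical) discontinued product.
    return "We recommend the Widget 2000, our best seller."


cases = [
    GoldenCase(
        prompt="Which widget should I buy?",
        must_include=["widget"],
        must_exclude=["widget 2000"],  # hypothetical discontinued product
    ),
]
failures = run_golden_set(fake_model, cases)
print(failures)
```

Substring checks are crude but cheap, which makes them easy to run on every deployment; cases that need judgment can be routed to a secondary scoring model instead.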
The Monitoring Playbook
| Strategy | What It Does | How Often |
|---|---|---|
| Golden set testing | Run 50-500 representative prompts with expected outcomes | Continuously; every deployment; after provider updates |
| Behavioral fingerprinting | Capture a baseline “fingerprint” of response length, tone, and refusal style, then compare against it | Weekly |
| LLM-as-judge | A secondary model scores a 1% sample of production traffic for correctness | Continuously |
| Statistical signal tracking | KL divergence on response length distributions; embedding drift | Daily |
| Knowledge base audit | Track document freshness as a first-class metric | Every 60-90 days |
[SOURCE: Nebula, AIBIM]
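The statistical signal tracking row can be sketched as a KL divergence between binned response-length distributions. This is an illustration under assumptions: the bin edges, the add-one smoothing, the sample lengths, and the function names are all hypothetical choices, and any alerting threshold would need to be calibrated on your own traffic.

```python
import math


def length_histogram(lengths: list[int], bin_edges: list[int]) -> list[float]:
    """Bin response lengths and return a smoothed probability distribution."""
    counts = [0] * (len(bin_edges) + 1)
    for n in lengths:
        # Index of the bin = number of edges the value meets or exceeds.
        idx = sum(1 for edge in bin_edges if n >= edge)
        counts[idx] += 1
    # Add-one smoothing so no bin has zero probability (keeps the log finite).
    total = len(lengths) + len(counts)
    return [(c + 1) / total for c in counts]


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) in nats for two distributions over the same bins."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical token counts per response: launch week vs. this week.
edges = [50, 100, 200]
baseline_lengths = [40, 60, 80, 90, 120, 70, 55, 95]
current_lengths = [150, 220, 300, 180, 250, 210, 190, 260]

baseline = length_histogram(baseline_lengths, edges)
current = length_histogram(current_lengths, edges)
drift = kl_divergence(current, baseline)
no_drift = kl_divergence(baseline, baseline)
print(f"drift={drift:.3f} no_drift={no_drift:.3f}")
```

The same pattern applies to other fingerprint dimensions, such as refusal rate per day, as long as the baseline and current windows use identical bins.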
The Refresh Cadence
| Task | Cadence | Trigger |
|---|---|---|
| Prompt re-evaluation | Every 30 days, or 48 hours after provider updates | Model provider announcements |
| Knowledge base audit | Every 60-90 days | Document expiration dates |
| Golden set expansion | Every quarter | Production failures and edge cases |
| Model version audit | Continuous | Provider deprecation timelines |
[SOURCE: Nebula]
The Solo Implementer Angle
If you’re one person managing AI, drift is invisible until a user complains. [OBSERVED] In one reported case, a law firm killed a $150K project after the model’s legal reasoning drifted over 8 weeks. No one noticed because no one was measuring. [SOURCE: Boundev]
Related
- RAG — Where knowledge base rot lives
- Operations & Maintenance — Where drift is monitored
- Silent Agent Failure — When drift produces wrong answers silently
- Knowledge Base Decay — When data staleness causes drift