Your AI feature shipped green. Six months later, a user complains that the chatbot confidently recommended a product you discontinued three months ago. You didn’t change anything. Time did.

LLM drift (also known as model drift or behavioral drift) is the gradual degradation or shift in a Large Language Model’s performance, accuracy, and behavior over time — even when not a single line of code has changed. [CONFIRMED] In production, this is often called the “six-month cliff.” A customer support chatbot that initially handled 70% of inquiries without escalation can silently drop to under 50% by month three — while infrastructure dashboards stay green the entire time. [SOURCE: Nebula]

The Three Forces of Silent Degradation

1. Silent Base-Model Updates

Model providers continuously update their models behind the scenes. [CONFIRMED] While these updates aim to improve the model, they can silently alter how it reasons, formats structured data, or handles refusals. One commonly reported incident: a silent model update doubled CI failure rates over three days as agent behavior shifted without a single line of code changing. [SOURCE: Nebula]

2. Eval Distribution Drift

When you launch an AI feature, you build an evaluation set reflecting the queries you can imagine. [CONFIRMED] But production traffic evolves. Users adopt new terminology, ask about features you’ve added since launch, and frame questions in ways your eval set never anticipated. Your eval suite stays green while real-world accuracy quietly erodes. [SOURCE: Nebula]

3. Knowledge Base Rot

In RAG systems, the model relies on external documents. If your documentation, pricing pages, or compliance rules change but aren’t actively updated in the vector database, the LLM will confidently generate answers based on obsolete facts. [CONFIRMED] RAG systems can lose roughly a third of their effective accuracy within 90 days purely due to knowledge staleness. [SOURCE: Nebula]
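One way to make staleness visible is to track how much of the knowledge base has not been touched recently. The sketch below is a minimal illustration, not a prescribed method: the `stale_fraction` helper, the 90-day threshold, and the document names are all assumptions made for the example, and it presumes you can read a last-updated timestamp for each indexed document.

```python
from datetime import datetime, timedelta, timezone


def stale_fraction(last_updated, now, max_age_days=90):
    """Return the fraction of documents not updated within max_age_days.

    last_updated maps a document id to its last-modified timestamp.
    The 90-day default is an illustrative threshold, not a standard.
    """
    if not last_updated:
        return 0.0
    cutoff = now - timedelta(days=max_age_days)
    stale = sum(1 for ts in last_updated.values() if ts < cutoff)
    return stale / len(last_updated)


# Hypothetical documents and timestamps, for illustration only.
now = datetime(2025, 6, 1, tzinfo=timezone.utc)
docs = {
    "pricing.md": datetime(2025, 5, 20, tzinfo=timezone.utc),
    "returns-policy.md": datetime(2025, 1, 10, tzinfo=timezone.utc),
    "product-catalog.md": datetime(2024, 11, 3, tzinfo=timezone.utc),
    "faq.md": datetime(2025, 4, 2, tzinfo=timezone.utc),
}
print(stale_fraction(docs, now))  # 0.5: two of the four documents are older than 90 days
```

Plotting this number on the same dashboard as latency and error rate gives knowledge rot a chance of being noticed before a user reports it.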

How Behavioral Change Manifests

Drift doesn’t look like a traditional software bug. It shows up as subtle shifts in how the model responds:

| Manifestation | What You See | Example |
| --- | --- | --- |
| Verbosity shift | Responses become longer or shorter | GPT-4 exhibited 23% variance in response length over time |
| Instruction adherence | Model loses ability to follow complex instructions | Mixtral showed 31% inconsistency in instruction adherence |
| Refusal style | Safety guardrails drift, becoming overly permissive or restrictive | Sudden spike in refusals for previously allowed queries |
| Hallucination increase | Model fills gaps with plausible but wrong information | Factuality scores drop without obvious cause |

[SOURCE: AIBIM]
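Some of these shifts can be reduced to numbers and compared against a baseline. The following is a rough sketch under stated assumptions: the refusal markers, the `fingerprint` and `drifted_metrics` helpers, the 20% tolerance, and the sample responses are all hypothetical, and a real fingerprint would likely cover more signals (tone, formatting, and so on).

```python
from statistics import mean

# Assumed refusal phrases; tune these to how your model actually refuses.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i am unable")


def fingerprint(responses):
    """Summarize a batch of responses as mean word count and refusal rate."""
    if not responses:
        raise ValueError("need at least one response")
    lengths = [len(r.split()) for r in responses]
    refusals = sum(any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses)
    return {"mean_words": mean(lengths), "refusal_rate": refusals / len(responses)}


def drifted_metrics(baseline, current, tolerance=0.2):
    """Return the metrics whose relative change from baseline exceeds tolerance."""
    flagged = []
    for name, base in baseline.items():
        cur = current[name]
        if base == 0:
            changed = cur != 0  # any movement away from zero counts as a change
        else:
            changed = abs(cur - base) / abs(base) > tolerance
        if changed:
            flagged.append(name)
    return flagged


# Hypothetical responses to the same prompts, captured at two points in time.
baseline = fingerprint([
    "Your order ships in two days.",
    "You can reset it from settings.",
])
current = fingerprint([
    "I cannot help with that request.",
    "Your order will ship in approximately two business days from today.",
])
print(drifted_metrics(baseline, current))  # ['mean_words', 'refusal_rate']
```

Here both metrics are flagged: the mean length moved from 6 to 8.5 words, and the refusal rate moved from 0 to 0.5.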

Why Traditional Monitoring Fails

Traditional observability tracks CPU spikes, network latency, and 500 errors. [CONFIRMED] But LLMs fail silently. A model suffering from drift will still return a perfectly formatted “200 OK” response that’s factually wrong, semantically altered, or dangerously biased. [SOURCE: Nebula]
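Catching this kind of failure means asserting on what the response says, not on its status code. Below is a minimal sketch of a content-level check; `GoldenCase`, `check_response`, `pass_rate`, and `fake_model` are hypothetical names invented for illustration, and the substring matching is deliberately crude compared with what a production evaluation would need.

```python
from dataclasses import dataclass, field


@dataclass
class GoldenCase:
    """A prompt plus simple expectations about the response text."""
    prompt: str
    must_contain: list = field(default_factory=list)
    must_not_contain: list = field(default_factory=list)


def check_response(case, response):
    """Pass only if every required phrase is present and no forbidden phrase appears."""
    text = response.lower()
    missing = [s for s in case.must_contain if s.lower() not in text]
    forbidden = [s for s in case.must_not_contain if s.lower() in text]
    return not missing and not forbidden


def pass_rate(cases, ask_model):
    """Run every case through ask_model (any callable prompt -> str) and score it."""
    if not cases:
        return 1.0
    passed = sum(check_response(c, ask_model(c.prompt)) for c in cases)
    return passed / len(cases)


def fake_model(prompt):
    # Stand-in for a real model call; every response would arrive as "200 OK".
    canned = {
        "Which plan includes SSO?": "SSO is included in the Enterprise plan.",
        "What should I buy for home backup?": "We recommend the BackupBox 2.",
    }
    return canned[prompt]


cases = [
    GoldenCase("Which plan includes SSO?", must_contain=["Enterprise"]),
    # Assume "BackupBox 2" is a discontinued product in this example.
    GoldenCase("What should I buy for home backup?", must_not_contain=["BackupBox 2"]),
]
print(pass_rate(cases, fake_model))  # 0.5: the second answer recommends a discontinued product
```

An infrastructure dashboard would show both calls as healthy; only the content check notices that one answer is wrong.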

Only 62% of organizations running AI agents in production can inspect what their agents actually do at each step, despite 89% claiming to have observability in place. [SOURCE: Nebula]

The Monitoring Playbook

| Strategy | What It Does | How Often |
| --- | --- | --- |
| Golden set testing | Run 50-500 representative prompts with expected outcomes | Continuously; every deployment; after provider updates |
| Behavioral fingerprinting | Capture baseline “fingerprint” of response length, tone, refusal style | Weekly |
| LLM-as-judge | Secondary model scores 1% sample of production traffic for correctness | Continuously |
| Statistical signal tracking | KL divergence on response length distributions; embedding drift | Daily |
| Knowledge base audit | Document freshness as a first-class metric | Every 60-90 days |

[SOURCE: Nebula, AIBIM]
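The statistical signal in the table, KL divergence on response length distributions, can be computed with nothing beyond the standard library. This is a sketch under assumptions rather than a reference implementation: the bin width, the number of bins, the additive smoothing, the direction of the divergence (current against baseline), and the sample lengths are all illustrative choices.

```python
import math
from collections import Counter


def length_histogram(lengths, bin_width, num_bins):
    """Count response lengths per fixed-width bin; the last bin absorbs overflow."""
    counts = Counter(min(n // bin_width, num_bins - 1) for n in lengths)
    return [counts.get(i, 0) for i in range(num_bins)]


def kl_divergence(p_counts, q_counts, smoothing=1.0):
    """KL(P || Q) = sum_i p_i * log(p_i / q_i), in nats.

    Additive smoothing keeps every bin nonzero so the logarithm is always
    defined. Both histograms must use the same bins.
    """
    if len(p_counts) != len(q_counts):
        raise ValueError("histograms must have the same number of bins")
    k = len(p_counts)
    p_total = sum(p_counts) + smoothing * k
    q_total = sum(q_counts) + smoothing * k
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + smoothing) / p_total
        q = (qc + smoothing) / q_total
        kl += p * math.log(p / q)
    return kl


# Hypothetical word counts from a baseline week and the current week.
baseline_lengths = [40, 55, 60, 48, 52, 70, 65, 58]
current_lengths = [120, 140, 95, 110, 130, 150, 105, 125]

base_hist = length_histogram(baseline_lengths, bin_width=25, num_bins=8)
cur_hist = length_histogram(current_lengths, bin_width=25, num_bins=8)

print(kl_divergence(base_hist, base_hist))  # identical distributions give 0.0
print(kl_divergence(cur_hist, base_hist))   # grows as current lengths move away from baseline
```

The absolute value matters less than its trend: a daily job can record the number and alert when it rises above whatever range the system showed while it was known to be healthy.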

The Refresh Cadence

| Task | Cadence | Trigger |
| --- | --- | --- |
| Prompt re-evaluation | Every 30 days, or 48 hours after provider updates | Model provider announcements |
| Knowledge base audit | Every 60-90 days | Document expiration dates |
| Golden set expansion | Every quarter | Production failures and edge cases |
| Model version audit | Continuous | Provider deprecation timelines |

[SOURCE: Nebula]

The Solo Implementer Angle

If you’re one person managing AI, drift is invisible until a user complains. [OBSERVED] The law firm that killed a $150K project? The model had drifted in its legal reasoning over 8 weeks. No one noticed because no one was measuring. [SOURCE: Boundev]