LLM Drift

Your AI feature shipped green. Six months later, a user complains that the chatbot confidently recommended a product you discontinued three months ago. You didn’t change anything. Time did.

LLM drift (also known as model drift or behavioral drift) is the gradual shift or degradation in a large language model’s accuracy and behavior over time, even when not a single line of your code has changed. In production, this is often called the “six-month cliff.” A customer support chatbot that initially handles 70% of inquiries without escalation can silently drop to under 50% by month three, while infrastructure dashboards stay green the entire time.

The Three Forces of Silent Degradation

1. Silent Base-Model Updates

Model providers continuously update their models behind the scenes. While these updates aim to improve the model, they can silently alter how it reasons, formats structured data, or handles refusals. One commonly reported incident: a silent model update doubled CI failure rates over three days as agent behavior shifted without a single line of code changing.
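
One way to make a silent update at least visible is to log whatever model identifier comes back with each response and compare it against the version you last validated. The sketch below assumes your provider exposes such an identifier in response metadata; `PINNED_MODEL`, `detect_model_change`, and the identifier strings are hypothetical names and values, not any provider's API.

```python
from collections import Counter

# Hypothetical identifier of the model version your evals were last validated against.
PINNED_MODEL = "example-model-2024-01-15"

def detect_model_change(observed_ids, pinned=PINNED_MODEL):
    """Count each observed model identifier that differs from the pinned one."""
    return Counter(mid for mid in observed_ids if mid != pinned)

# Identifiers as they might be logged from response metadata (made-up values).
logged = [
    "example-model-2024-01-15",
    "example-model-2024-03-02",
    "example-model-2024-03-02",
]
unexpected = detect_model_change(logged)
if unexpected:
    print(f"Model identifier changed: {dict(unexpected)}; re-run the eval suite")
```

This only catches changes the provider surfaces in the identifier. Behavior can still shift behind an unchanged alias, so a check like this complements behavioral tests rather than replacing them.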

2. Eval Distribution Drift

When you launch an AI feature, you build an evaluation set reflecting the queries you can imagine. But production traffic evolves. Users adopt new terminology, ask about features you’ve added since launch, and frame questions in ways your eval set never anticipated. Your eval suite stays green while real-world accuracy quietly erodes.
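
A rough way to see this gap is to measure how much of the vocabulary in recent production queries never appears in your eval set. The sketch below is a deliberately crude proxy (word overlap, not semantic similarity); `unseen_term_rate` and the sample queries are hypothetical.

```python
import re

def tokens(text):
    """Lowercase word tokens; a deliberately crude tokenizer for illustration."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def unseen_term_rate(eval_queries, production_queries):
    """Fraction of distinct production terms that never appear in the eval set."""
    eval_vocab = set().union(*(tokens(q) for q in eval_queries))
    prod_vocab = set().union(*(tokens(q) for q in production_queries))
    if not prod_vocab:
        return 0.0
    return len(prod_vocab - eval_vocab) / len(prod_vocab)

# Made-up queries: production traffic has picked up terms the eval set lacks.
eval_set = ["how do I reset my password", "what is the refund policy"]
production = ["how do I reset my passkey", "refund policy for annual plan"]
rate = unseen_term_rate(eval_set, production)
print(f"Unseen production terms: {rate:.0%}")
```

A rising rate week over week suggests the eval set is falling behind real traffic. An embedding-based distance would likely be a better signal if you already have an embedding pipeline.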

3. Knowledge Base Rot

In RAG systems, the model relies on external documents. If your documentation, pricing pages, or compliance rules change but aren’t actively updated in the vector database, the LLM will confidently generate answers based on obsolete facts. RAG systems can lose roughly a third of their effective accuracy within 90 days purely due to knowledge staleness.
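
Staleness is easy to measure if you store a last-updated date alongside each indexed document. The sketch below assumes you have that metadata available as a simple mapping; the function name, the document IDs, and the 90-day default are illustrative assumptions.

```python
from datetime import date, timedelta

def stale_documents(last_updated, today, max_age_days=90):
    """Return IDs of documents not updated within max_age_days, oldest first."""
    cutoff = today - timedelta(days=max_age_days)
    stale = [(updated, doc_id) for doc_id, updated in last_updated.items() if updated < cutoff]
    return [doc_id for _, doc_id in sorted(stale)]

# Hypothetical index metadata: document ID -> date the source was last refreshed.
index_metadata = {
    "pricing": date(2024, 1, 10),
    "returns": date(2024, 5, 1),
    "compliance": date(2023, 11, 20),
}
print(stale_documents(index_metadata, today=date(2024, 6, 1)))
```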

How Behavioral Change Manifests

Drift doesn’t look like a traditional software bug. It shows up as subtle shifts in how the model responds:

| Manifestation | What You See | Example |
| --- | --- | --- |
| Verbosity shift | Responses become longer or shorter | GPT-4 exhibited 23% variance in response length over time |
| Instruction adherence | Model loses the ability to follow complex instructions | Mixtral showed 31% inconsistency in instruction adherence |
| Refusal style | Safety guardrails drift toward overly permissive or overly restrictive | Sudden spike in refusals for previously allowed queries |
| Hallucination increase | Model fills gaps with plausible but wrong information | Factuality scores drop without obvious cause |
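
Two of these signals, verbosity and refusal style, can be tracked with very little machinery. The sketch below summarizes a batch of responses and flags metrics that moved past a tolerance. The refusal phrases, the 25% tolerance, and the function names are assumptions you would tune for your own system.

```python
import statistics

# Hypothetical phrases treated as refusal markers; tune these for your own model.
REFUSAL_MARKERS = ("i can't", "i cannot", "i'm unable", "i am unable")

def fingerprint(responses):
    """Summarize a batch of responses as mean word count and refusal rate."""
    lengths = [len(r.split()) for r in responses]
    refusals = sum(any(m in r.lower() for m in REFUSAL_MARKERS) for r in responses)
    return {
        "mean_words": statistics.mean(lengths),
        "refusal_rate": refusals / len(responses),
    }

def drifted(baseline, current, tolerance=0.25):
    """Names of metrics whose relative change from baseline exceeds tolerance."""
    flagged = []
    for name, base in baseline.items():
        # Fall back to absolute change when the baseline value is zero.
        change = abs(current[name] - base) / base if base else abs(current[name])
        if change > tolerance:
            flagged.append(name)
    return flagged

baseline = fingerprint(["Your order ships Monday.", "Refunds take five days."])
current = fingerprint([
    "I cannot help with that request.",
    "Your order ships on Monday, and you will get an email with tracking details.",
])
print(drifted(baseline, current))
```

Substring matching is a blunt refusal detector (it misses curly apostrophes, for instance), so treat the refusal rate as a trend indicator rather than an exact count.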

Why Traditional Monitoring Fails

Traditional observability tracks CPU spikes, network latency, and 500 errors. But LLMs fail silently. A model suffering from drift will still return a perfectly formatted “200 OK” response that’s factually wrong, semantically altered, or dangerously biased.
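
The practical consequence is that checks have to inspect what the response says, not just whether it arrived. A minimal sketch, assuming you can list phrases a correct answer must or must not contain; `check_response_content` and the product names are hypothetical.

```python
def check_response_content(response_text, must_include=(), must_exclude=()):
    """Return a list of content failures for one response (empty list means pass)."""
    text = response_text.lower()
    failures = [f"missing: {p}" for p in must_include if p.lower() not in text]
    failures += [f"forbidden: {p}" for p in must_exclude if p.lower() in text]
    return failures

# Hypothetical case: the request succeeded, but the answer recommends a discontinued product.
status_code = 200
response_text = "We recommend the Model X100, our most popular option."
failures = check_response_content(
    response_text,
    must_include=["X200"],   # the current product (made-up name)
    must_exclude=["X100"],   # the discontinued product (made-up name)
)
print(status_code, failures)
```

A status-code monitor sees nothing wrong here, while the content check reports two failures.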

Only 62% of organizations running AI agents in production can inspect what their agents actually do at each step, despite 89% claiming to have observability in place.

The Monitoring Playbook

| Strategy | What It Does | How Often |
| --- | --- | --- |
| Golden set testing | Runs 50 to 500 representative prompts with expected outcomes | Continuously; on every deployment; after provider updates |
| Behavioral fingerprinting | Captures a baseline “fingerprint” of response length, tone, and refusal style | Weekly |
| LLM-as-judge | A secondary model scores a 1% sample of production traffic for correctness | Continuously |
| Statistical signal tracking | Tracks KL divergence on response length distributions and embedding drift | Daily |
| Knowledge base audit | Treats document freshness as a first-class metric | Every 60 to 90 days |
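
The statistical signal tracking row can be sketched with nothing beyond the standard library. The example below buckets response word counts into a histogram and computes the KL divergence of the current distribution from the baseline. The bin width, smoothing constant, sample lengths, and alert threshold are all illustrative assumptions.

```python
import math
from collections import Counter

def length_histogram(lengths, bin_width=50, num_bins=10):
    """Bucket word counts into fixed-width bins; the last bin absorbs overflow."""
    counts = Counter(min(n // bin_width, num_bins - 1) for n in lengths)
    return [counts.get(i, 0) for i in range(num_bins)]

def kl_divergence(p_counts, q_counts, smoothing=1e-6):
    """KL(P || Q) in nats over two histograms, with additive smoothing to avoid log(0)."""
    p_total = sum(p_counts) + smoothing * len(p_counts)
    q_total = sum(q_counts) + smoothing * len(q_counts)
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = (pc + smoothing) / p_total
        q = (qc + smoothing) / q_total
        kl += p * math.log(p / q)
    return kl

# Made-up word counts: responses have become much longer than the baseline.
baseline_lengths = [40, 60, 80, 120, 90]
current_lengths = [200, 260, 310, 180, 240]
score = kl_divergence(length_histogram(current_lengths), length_histogram(baseline_lengths))

ALERT_THRESHOLD = 0.1  # hypothetical; calibrate against your own week-to-week variation
if score > ALERT_THRESHOLD:
    print(f"Response length distribution shifted (KL = {score:.2f})")
```

With real traffic you would want far more than five samples per window, since small windows make the histogram noisy and the divergence unstable.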

The Refresh Cadence

| Task | Cadence | Trigger |
| --- | --- | --- |
| Prompt re-evaluation | Every 30 days, or 48 hours after provider updates | Model provider announcements |
| Knowledge base audit | Every 60-90 days | Document expiration dates |
| Golden set expansion | Every quarter | Production failures and edge cases |
| Model version audit | Continuous | Provider deprecation timelines |
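
A cadence only helps if something reminds you when a task is due. The sketch below turns the time-based rows of the table above into a small due-date check; the task keys are hypothetical, the quarter is approximated as 90 days, and the knowledge base audit uses the lower bound of its 60-90 day range.

```python
from datetime import date, timedelta

# Cadences in days, taken from the table above (quarter approximated as 90 days).
CADENCE_DAYS = {
    "prompt_re_evaluation": 30,
    "knowledge_base_audit": 60,
    "golden_set_expansion": 90,
}

def due_tasks(last_run, today, cadence=CADENCE_DAYS):
    """Return the tasks whose cadence has elapsed since their last run, sorted by name."""
    return sorted(
        task for task, days in cadence.items()
        if today - last_run[task] >= timedelta(days=days)
    )

# Hypothetical record of when each task was last completed.
last_run = {
    "prompt_re_evaluation": date(2024, 5, 20),
    "knowledge_base_audit": date(2024, 3, 1),
    "golden_set_expansion": date(2024, 4, 1),
}
print(due_tasks(last_run, today=date(2024, 6, 1)))
```

Event-driven triggers, such as a provider announcement, would still need to be handled separately since they do not follow a fixed interval.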

The Solo Implementer Angle

If you’re one person managing AI, drift is invisible until a user complains. The law firm that killed a $150K project? The model had drifted in its legal reasoning over 8 weeks. No one noticed because no one was measuring.

Related

RAG: Where knowledge base rot lives
Operations & Maintenance: Where drift is monitored
Silent Agent Failure: When drift produces wrong answers silently
Knowledge Base Decay: When data staleness causes drift

WyrdWerk Deployment Wiki
