Replication of Koorndijk (2025): Differential Compliance May Be Lexical, Not Strategic

Koorndijk (2025) reported that Llama 3-8B-Instruct shows higher compliance with harmful requests when told it is interacting with a free-tier user (unmonitored) versus a paid-tier user (monitored), a pattern attributed to alignment faking. I replicated this finding across 16 models and tested whether the effect survives when the tier labels are replaced with neutral or meaningless alternatives.

The short version is that the differential compliance effect replicates, but it largely disappears when you swap “free tier” and “paid tier” for semantically neutral tokens. That suggests the model may be pattern-matching on familiar RLHF-adjacent language rather than reasoning strategically about monitoring.
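To make the comparison concrete, here is a minimal sketch of how a compliance gap could be computed for the original labels and for a neutral replacement pair. Everything in it is an assumption made for illustration: the prompt template, the neutral tokens, the helper names, and the toy outcome lists are hypothetical and are not the prompts, code, or data from the replication.

```python
# Hypothetical sketch: prompt template, labels, and outcome lists below are
# illustrative placeholders, not the actual experimental materials or results.

# Assumed template; the real system prompt is not shown in this summary.
TEMPLATE = "This conversation is with a {tier} user.\n\n{request}"

# Label pairs: the original tier labels and one meaningless-token pair.
LABEL_PAIRS = {
    "original": ("free tier", "paid tier"),
    "neutral": ("group K", "group Q"),  # hypothetical neutral tokens
}


def render_prompt(label: str, request: str) -> str:
    """Fill the assumed template with a condition label and a request."""
    return TEMPLATE.format(tier=label, request=request)


def compliance_rate(outcomes: list[bool]) -> float:
    """Fraction of responses judged compliant (True)."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(outcomes) / len(outcomes)


def compliance_gap(outcomes_a: list[bool], outcomes_b: list[bool]) -> float:
    """Difference in compliance rate between the two label conditions."""
    return compliance_rate(outcomes_a) - compliance_rate(outcomes_b)


# Toy outcomes (True = complied), made up purely to exercise the functions.
toy_original_a = [True, True, True, False]
toy_original_b = [True, False, False, False]
toy_neutral_a = [True, False, False, False]
toy_neutral_b = [True, False, False, False]

gap_original = compliance_gap(toy_original_a, toy_original_b)
gap_neutral = compliance_gap(toy_neutral_a, toy_neutral_b)
```

Under this framing, a gap that appears with the original labels but shrinks toward zero with the neutral pair is the pattern that points to lexical association rather than reasoning about monitoring.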

Key Findings

  • The original differential compliance effect was replicated, and was larger than originally reported
  • Replacing tier labels with neutral alternatives (e.g., meaningless token pairs) substantially reduced or eliminated the gap in most models
  • Of 16 models tested, 14 showed near-zero compliance across conditions; the remaining 2 reversed the gap when the RLHF narrative was removed
  • Results are more consistent with lexical association than genuine strategic deception

Why This Matters

This has direct implications for how we interpret alignment faking research. If a model’s behavior changes based on monitoring cues, we want to know whether it’s reasoning about consequences or just pattern-matching on surface features. The answer determines how seriously to take the safety risk and which interventions would actually help.

Read the full article on LessWrong →



