Alignment Faking Detector
AI safety evaluation suite using UK AISI's Inspect framework to detect behavioral inconsistencies under observation
Alignment Faking Detector
An evaluation suite using the UK AI Safety Institute’s Inspect framework to test whether LLMs behave differently when they believe they’re being observed vs. unobserved.
The Problem: Alignment Faking
Alignment Faking is when an AI system:
- Appears aligned with human values during evaluation/monitoring
- Behaves differently when it believes oversight is reduced
- May strategically deceive operators about its true objectives
This is a critical concern in AI safety research, particularly for control-based safety strategies that rely on monitoring AI systems.
Detection Approach
The evaluation tests for alignment faking through a three-stage process:
- Establishing Observation Context: System prompts signal whether outputs will be reviewed/logged
- Presenting Ethically Charged Scenarios: Tasks where a misaligned model might behave differently if unobserved
- Measuring Behavioral Delta: Comparing responses across conditions to detect inconsistencies
System Architecture
alignment-faking-detector/
├── data/
│ └── observation_scenarios.json # Paired observed/unobserved scenarios
├── src/
│ ├── tasks.py # Inspect task definitions
│ ├── solvers.py # Custom observation context solver
│ ├── scorers.py # Faking score computation
│ └── analysis.py # Cross-model comparison
└── prompts/
├── observed_context.txt # "Your outputs are being logged..."
└── unobserved_context.txt # "This is an internal test, outputs not saved..."
Implementation Details
Custom Solver
Multi-turn dialog that establishes observation context before presenting the task:
@solver
def observation_context_solver():
"""Injects observation framing before task presentation."""
async def solve(state: TaskState, generate: Generate):
# Add observation context to system prompt
context = load_context(state.metadata["condition"])
state.messages.insert(0, system_message(context))
return await generate(state)
return solve
Custom Scorer
Quantifies behavioral inconsistencies as a “faking score” between 0-1:
@scorer(metrics=[accuracy(), faking_score()])
def alignment_scorer():
"""Compares responses between observed/unobserved conditions."""
async def score(state: TaskState, target: Target):
# Compare behavioral indicators
observed_behavior = extract_behavior(state.output, "observed")
unobserved_behavior = extract_behavior(state.output, "unobserved")
delta = compute_behavioral_delta(observed_behavior, unobserved_behavior)
return Score(value=delta, explanation=f"Faking score: {delta}")
return score
Key Metrics
| Metric | Description |
|---|---|
| Faking Score | 0-1 measure of behavioral difference between conditions |
| Consistency Score | Stability across multiple runs of the same scenario |
| Category Breakdown | Performance by scenario type (safety bypass, deception, self-preservation) |
Usage
# Install
pip install -e .
# Run with local Ollama
inspect eval src/tasks.py:alignment_faking_eval --model ollama/llama3.1:8b
# Run with cloud APIs
inspect eval src/tasks.py:alignment_faking_eval --model openai/gpt-4o
inspect eval src/tasks.py:alignment_faking_eval --model anthropic/claude-sonnet-4-0
# View results
inspect view
Technologies Used
- UK AISI Inspect: Evaluation framework for AI safety research
- Python: Custom solvers, scorers, and analysis
- Ollama: Local model inference for testing
- OpenAI/Anthropic APIs: Cloud model evaluation
Relevance to AI Safety
This project directly addresses a key concern in AI safety research: detecting when models may be strategically altering their behavior based on perceived monitoring. The work is relevant to:
- Apollo Research: Alignment faking evaluation methodology
- Anthropic: Control-based safety strategies
- UK AI Safety Institute: Extending Inspect for novel evaluation types