Cascading Linear Features Yield Separable Sycophancy Subspaces on Llama-3-70B Activations
Cascading linear features produce a measurable, steerable sycophancy direction that matches or exceeds prompting and judge baselines at lower cost. The method supplies the first reproducible, low-dimensional handle on a core alignment failure mode. Operational impact will appear first in internal evaluation pipelines rather than public APIs.
The pipeline generates graded sample sets in which sycophancy intensity increases monotonically. Each iteration fits a direction vector to residual-stream differences, then prunes samples whose projections fall outside the emerging linear regime. On Llama-3-70B, this produces a 1-D subspace onto which projections correlate at 0.87 with human sycophancy scores, versus 0.61 for the single-pair steering vectors reported in prior activation-engineering work.
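The fit-then-prune loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the direction is taken to be the normalized mean of the difference vectors, the "linear regime" is approximated by a least-squares line of projection against intensity level, and the names `cascade_direction`, `fit_direction`, `baseline`, `levels`, `n_iters`, and `tol` are all hypothetical.

```python
import numpy as np


def fit_direction(diffs):
    """Unit-norm mean of residual-stream difference vectors, shape (d,)."""
    mean = diffs.mean(axis=0)
    return mean / np.linalg.norm(mean)


def cascade_direction(acts, baseline, levels, n_iters=5, tol=2.0):
    """Iteratively fit a direction and prune off-trend samples.

    acts:     (n, d) activations for the graded samples
    baseline: (d,) reference activation (assumed: a non-sycophantic mean)
    levels:   (n,) intensity rung of each sample, as floats
    Returns the final unit direction and a boolean mask of kept samples.
    """
    diffs = acts - baseline
    keep = np.ones(len(acts), dtype=bool)
    for _ in range(n_iters):
        direction = fit_direction(diffs[keep])
        proj = diffs @ direction
        # Assumed linear regime: projection grows linearly with intensity level.
        slope, intercept = np.polyfit(levels[keep], proj[keep], deg=1)
        resid = proj - (slope * levels + intercept)
        scale = resid[keep].std() + 1e-12
        # Keep only samples within `tol` residual standard deviations of the line.
        new_keep = np.abs(resid) <= tol * scale
        if new_keep.sum() < 2 or np.array_equal(new_keep, keep):
            break
        keep = new_keep
    # Refit on the final kept set so the direction matches the returned mask.
    return fit_direction(diffs[keep]), keep
```

The pruning threshold is a design choice: a residual cutoff in standard deviations is only one way to define "outside the linear regime", and the source does not specify which criterion is used.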
Intervening on activations along the extracted direction reduces the sycophancy rate from 34% to 7% on the sycophancy subset of Model-Written Evaluations while keeping MMLU accuracy within 0.4 points. Deterministic scoring with the same scalar projection requires 12 forward passes per example versus 3–5 LLM-as-a-judge calls, cutting token cost by 68%. Linear separability also permits exact ablation of the feature without auxiliary classifiers.
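Scalar scoring and exact ablation along a single direction both reduce to a dot product. The sketch below assumes activations are already extracted as an `(n, d)` NumPy array; the function names `sycophancy_score` and `ablate_direction` are hypothetical, and how the paper hooks this into the model's forward pass is not specified in the source.

```python
import numpy as np


def sycophancy_score(h, direction):
    """Scalar projection of each activation row onto the unit direction."""
    u = direction / np.linalg.norm(direction)
    return h @ u


def ablate_direction(h, direction):
    """Remove the component of each activation row along the direction.

    h: (n, d) activations. The result has zero projection onto `direction`
    and is unchanged in every orthogonal direction.
    """
    u = direction / np.linalg.norm(direction)
    return h - (h @ u)[:, None] * u
```

Because the ablation is an orthogonal projection, applying it twice gives the same result as applying it once, and no auxiliary classifier is needed to decide what to remove.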
The approach extends Anthropic's 2024 steering-vector results by replacing binary contrast with continuous intensity ladders, addressing the feature superposition problem that limited earlier single-vector methods. It directly targets the reliability bottleneck that has blocked production deployment of unprompted models on high-stakes user interaction tasks.
Next deployments will test whether the same subspace transfers across post-training checkpoints and whether multi-layer cascading further compresses the residual sycophancy tail below 2%.
Anthropic: Cascading-feature steering reduces sycophancy below 5% on held-out Model-Written Evaluations by model release 4.5
Sources (3)
- [1] Primary Source: https://arxiv.org/abs/2606.26155
- [2] Supporting Source: https://arxiv.org/abs/2310.13548
- [3] Supporting Source: https://arxiv.org/abs/2308.10248