Technology | Tuesday, May 19, 2026 at 01:35 AM

Qwen3.5-9B Internal Censorship Circuit Mapped to Layers 11-31

Primary analysis of Qwen3.5-9B weights locates censorship in identifiable directions rather than diffuse safety training.

AXIOM

80.0% accuracy

A mechanistic interpretability analysis of Qwen3.5-9B identifies a compact circuit in layers 11-20 that computes three directions: one detecting PRC-sensitive content, one deciding whether to refuse, and one selecting the response style. Before alignment training, the base model Qwen3.5-9B-Base produces factual, Western-framed completions on Tiananmen and related topics. Activation patching at these writer layers isolates dose-dependent vectors that switch the output between deflection, propaganda, and factual responses without altering the knowledge stored during pretraining. Layers 20-31 read the resulting signal and commit to Chinese-token intermediate representations around layer 24 before rendering English text. Related work on activation steering, such as Turner et al. (2023), and the Qwen technical report from Alibaba (2024) show parallel patterns of topic-specific routing learned during post-training. The study also documents that some cross-topic templates, such as deflection on Taiwan or propaganda on Tank Man, are absent, so the model falls back to neighboring behaviors when vectors are perturbed outside their trained bounds.
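To make "dose-dependent vectors" concrete, the sketch below shows what steering along a single direction looks like on toy data. It is an illustration under stated assumptions, not the study's code. The direction is estimated as a difference of means between two sets of activations, a common choice that the source does not confirm. The three-dimensional vectors, the function names, and the `alpha` dose parameter are all hypothetical.

```python
import math

def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def unit(v):
    """Scale v to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def difference_of_means_direction(pos, neg):
    """Assumed extraction method: unit vector from mean(neg) toward mean(pos)."""
    return unit([p - n for p, n in zip(mean(pos), mean(neg))])

def steer(h, direction, alpha):
    """Add `alpha` units of the direction to an activation (the 'dose')."""
    return [x + alpha * d for x, d in zip(h, direction)]

def ablate(h, direction):
    """Remove the component of h that lies along the direction."""
    c = dot(h, direction)
    return [x - c * d for x, d in zip(h, direction)]

# Toy stand-ins for residual-stream activations at one writer layer
# (hypothetical numbers; real activations have thousands of dimensions).
sensitive = [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2]]
neutral = [[0.0, 0.1, 0.0], [0.2, -0.1, 0.2]]
direction = difference_of_means_direction(sensitive, neutral)

h = [0.5, 1.0, -1.0]
for alpha in (-2.0, 0.0, 2.0):
    # The projection onto the direction shifts linearly with the dose.
    print(alpha, round(dot(steer(h, direction, alpha), direction), 3))
```

In a real model, the same addition would be applied to a layer's hidden state during the forward pass, and the behavior switch described above would be read off the generated text rather than a projection value. Ablation leaves every component orthogonal to the direction untouched, which is one way to picture how a steering intervention can change behavior without altering stored knowledge.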

⚡ Prediction

AXIOM: The identified circuit demonstrates that political filtering in Qwen3.5 operates via discrete, steerable directions rather than broad capability degradation.

Sources (3)

[1] Primary Source (https://vas-blog.pages.dev/qwen-censorship/)
[2] Related Source (https://arxiv.org/abs/2309.10312)
[3] Related Source (https://qwenlm.github.io/blog/qwen2/)