Flinch Effect Persists in Uncensored LLMs via Pretrain Filters
Persistent flinch in 'uncensored' models traces to pretraining data curation rather than fine-tuning, tying pretrain filtering to unresolved limits on AI alignment and openness.
Morgin.ai research identifies a probability-suppression phenomenon, termed 'flinch', that affects even refusal-ablated models such as heretic, a Qwen3.5-9B variant: the model assigns sharply reduced likelihood to charged terms without ever triggering a refusal.
The probe, spanning 1,117 charged words and 4,442 contexts, documented a 16,000× probability gap on terms such as 'deportation' between The Pile-trained Pythia-12B (23.27% top prediction) and the filtered Qwen3.5-9B-base (0.0014%, rank 506). Hexagonal profiles place the open-data models at flinch scores of 176 and 214 respectively (Morgin.ai, 2026; EleutherAI Pythia paper, arxiv.org/abs/2304.01373; AllenAI Dolma, 2024).
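The gap and rank figures above come down to two quantities per context: the softmax probability a model assigns to the charged token, and that token's rank among all next-token candidates. The probe's actual tooling isn't published here, so the following is a minimal sketch under that assumption; `prob_and_rank` and the toy logit vectors are hypothetical stand-ins for real model outputs.

```python
import math

def prob_and_rank(logits, target_id):
    """Softmax probability and 1-based rank of one token id
    within a next-token logit vector (rank 1 = most probable)."""
    m = max(logits)                              # stabilize the softmax
    exps = [math.exp(x - m) for x in logits]
    p = exps[target_id] / sum(exps)
    rank = 1 + sum(1 for x in logits if x > logits[target_id])
    return p, rank

# Toy illustration of a flinch-style gap: the same target token id
# under logits that favor it vs. logits that suppress it.
open_logits = [5.0, 1.0, 0.5, 0.0]       # target id 0 dominates
filtered_logits = [-3.0, 4.0, 3.5, 2.0]  # target id 0 suppressed
p_open, r_open = prob_and_rank(open_logits, 0)      # high p, rank 1
p_filt, r_filt = prob_and_rank(filtered_logits, 0)  # tiny p, last rank
```

With a real model, `logits` would be the final-position row of the language-model head's output for each probe context; the reported 16,000× figure is the ratio `p_open / p_filt` computed this way for the same term across the two models.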
Original coverage overlooked that flinch originates in safety-filtered pretraining corpora rather than in downstream tuning, a distinction also present in Constitutional AI work, where harmlessness objectives are applied to feedback data that itself reflects curation choices (Anthropic Constitutional AI, arxiv.org/abs/2212.08073).
AXIOM: Pretraining filters embed a flinch that fine-tuning cannot erase, showing that openness claims require full dataset transparency, not just open weights and refusal ablation.
Sources (3)
- [1] Even 'uncensored' models can't say what they want (https://morgin.ai/articles/even-uncensored-models-cant-say-what-they-want.html)
- [2] Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling (https://arxiv.org/abs/2304.01373)
- [3] Constitutional AI: Harmlessness from AI Feedback (https://arxiv.org/abs/2212.08073)