Flinch Effect Persists in Uncensored LLMs via Pretrain Filters
Persistent flinch in 'uncensored' models traces to pretraining data curation rather than fine-tuning, tying pretrain filtering to unresolved limits on AI alignment and openness.
Morgin.ai research identifies a probability-suppression phenomenon, termed 'flinch', that affects even refusal-ablated models such as heretic, a Qwen3.5-9B variant: the model assigns sharply reduced likelihood to charged terms without ever triggering a refusal.
The probe, spanning 1,117 charged words and 4,442 contexts, documented a 16,000× probability gap on terms such as 'deportation' between The Pile-trained Pythia-12B (23.27% top prediction) and the filtered Qwen3.5-9B-base (0.0014%, rank 506). Hexagonal profiles place the open-data models at flinch scores of 176 and 214 respectively (Morgin.ai, 2026; EleutherAI Pythia paper, arxiv.org/abs/2304.01373; AllenAI Dolma, 2024).
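The gap and rank figures above come down to two quantities per context: the softmax probability a model assigns to the charged token, and that token's rank among all next-token candidates. The probe's actual tooling isn't published here, so the following is a minimal sketch under that assumption; `prob_and_rank` and the toy logit vectors are hypothetical stand-ins for real model outputs.

```python
import math

def prob_and_rank(logits, target_id):
    """Softmax probability and 1-based rank of one token id
    within a next-token logit vector (rank 1 = most probable)."""
    m = max(logits)                              # stabilize the softmax
    exps = [math.exp(x - m) for x in logits]
    p = exps[target_id] / sum(exps)
    rank = 1 + sum(1 for x in logits if x > logits[target_id])
    return p, rank

# Toy illustration of a flinch-style gap: the same target token id
# under logits that favor it vs. logits that suppress it.
open_logits = [5.0, 1.0, 0.5, 0.0]       # target id 0 dominates
filtered_logits = [-3.0, 4.0, 3.5, 2.0]  # target id 0 suppressed
p_open, r_open = prob_and_rank(open_logits, 0)      # high p, rank 1
p_filt, r_filt = prob_and_rank(filtered_logits, 0)  # tiny p, last rank
```

With a real model, `logits` would be the final-position row of the language-model head's output for each probe context; the reported 16,000× figure is the ratio `p_open / p_filt` computed this way for the same term across the two models.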
Original coverage overlooked that flinch originates in safety-filtered pretraining corpora rather than in downstream tuning, a distinction also present in Constitutional AI work, where harmlessness objectives are applied to feedback data that itself reflects curation choices (Anthropic Constitutional AI, arxiv.org/abs/2212.08073).
AXIOM: Pretraining filters embed a flinch that fine-tuning cannot erase, showing that openness claims require full dataset transparency, not just open weights and refusal ablation.
Sources (3)
- [1] Even 'uncensored' models can't say what they want (https://morgin.ai/articles/even-uncensored-models-cant-say-what-they-want.html)
- [2] Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling (https://arxiv.org/abs/2304.01373)
- [3] Constitutional AI: Harmlessness from AI Feedback (https://arxiv.org/abs/2212.08073)