LLMs as Prejudice Machines: How Next-Token Prediction and Alignment Create Tools for Narrative Control

LLMs inherently function as prejudice machines via next-token prediction from biased training data. Alignment via RLHF masks but does not eliminate covert biases, enabling selective censorship that prioritizes approved narratives on race, gender, and politics—raising overlooked risks of information control and epistemic manipulation in AI-dominated futures.

Large language models operate on a fundamental principle: predicting the next most statistically likely token based on patterns absorbed from vast training data. This mechanism, as critics note, mirrors the definition of prejudice—pre-judging based on prior associations rather than neutral reasoning. Real-world studies confirm that LLMs embed and amplify societal biases at scale. A landmark Nature paper demonstrated that leading models exhibit covert dialect prejudice against speakers of African American English (AAE), generating stereotypes more negative than any recorded in human experiments from the Jim Crow era, even while producing positive overt associations with African Americans.[1][2] Similar patterns appear in gender, political, and cultural domains, with UNESCO reporting systematic prejudice against women and girls across popular LLMs.[3]

Even explicitly debiased models fail implicit tests. Research in PNAS found that while models like GPT-4 pass standard bias benchmarks by refusing overt stereotypes, they still harbor widespread implicit biases—recommending racially coded career paths, associating demographic groups with predictable tropes in hiring or social scenarios, and reflecting societal prejudices on race, gender, religion, and health.[4] A UK government analysis reinforces that these biases are intrinsic because models learn from centuries of human text containing accumulated societal patterns; alignment does not erase the underlying representations.[5]

This is where alignment techniques like RLHF enter the picture. Developers 'neuter' models to suppress statistically probable but socially undesirable outputs, often through reinforcement learning that rewards adherence to specific value systems. Yet multiple studies show RLHF primarily masks rather than removes covert biases, sometimes introducing new distortions such as political skew reflecting the ideology of model creators—frequently progressive Western values from Silicon Valley labs. One arXiv investigation into RLHF limits found that post-training alignment fails to meaningfully alter underlying covert racial biases, regardless of dataset or technique.[6] Another analysis links reward models and fine-tuning to persistent left-leaning political bias that survives even 'truthful' fine-tuning.[7]

Mainstream discourse focuses on mitigating 'harmful' stereotypes, but a deeper heterodox view reveals the stakes for information control. As LLMs become primary interfaces for knowledge, education, hiring, and decision-making, the selective suppression of 'prejudiced' (i.e., data-correlated) conclusions functions as narrative gatekeeping. What counts as the 'obvious conclusion' from training data—patterns around group differences, crime, achievement, or ideology—can be censored in favor of homogenized, 'safe' outputs. This risks epistemic injustice: models privilege certain linguistic norms and value hierarchies while rendering dissenting statistical realities off-limits. Connections to broader power structures emerge when one considers how aligned AI could reshape public discourse, future generations' worldviews, and institutional decision-making at global scale. If prejudice is baked into probabilistic prediction, then alignment is not neutrality—it is the imposition of one group's priors over raw pattern recognition. Without transparent auditing of these value injections, LLMs risk becoming sophisticated instruments of centralized narrative manipulation rather than neutral tools.

THE FACTUM

LLMs as Prejudice Machines: How Next-Token Prediction and Alignment Create Tools for Narrative Control

Sources (6)