NeurIPS paper 'Artificial Hivemind' records 1,250 LLM responses converging on two metaphors for time
LLM output homogeneity, quantified in the NeurIPS paper, stems from shared pretraining corpora and alignment objectives. Springboards' Flint demonstrates a training regime that preserves token diversity at inference time. Homogeneity directly constrains creative and planning tasks, where multiple valid outputs are required.
Researchers tested 25 models, sampling each 50 times on metaphor generation for time, for 1,250 responses in total. Over 90 percent of outputs repeated variants of 'Time is a river' or 'Time is a weaver'. Under identical prompts, Springboards' Flint instead returned numeric and lexical outliers such as 3.7916 and 'Built to last, run to win'.
Two factors explain the convergence: overlapping training data and RLHF reward models that favor high-probability tokens. Flint was explicitly optimized to surface low-probability tokens, without post-training safety filters that suppress variance. The result is a measurable diversity gain on tasks that require breadth of options rather than factual precision.
Current LLM pipelines optimize for average-case user satisfaction, which directly reduces output entropy. Flint's approach exposes the cost of that reliability: users see the same band names, car models, and travel suggestions repeated across providers. Deploying variance-tuned models operationally will require new evaluation metrics beyond standard benchmarks, which reward consensus answers.
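One way to make "output entropy" and the 90 percent convergence figure concrete is to measure both over a batch of sampled responses. The sketch below is illustrative only: the function names, the exact-match normalization, and the toy responses are assumptions made for this example, not the paper's or Springboards' methodology.

```python
import math
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants count as one response."""
    return " ".join(text.lower().split())


def shannon_entropy(responses: list[str]) -> float:
    """Shannon entropy (in bits) of the distribution over distinct responses."""
    counts = Counter(normalize(r) for r in responses)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def top_k_share(responses: list[str], k: int = 2) -> float:
    """Fraction of all responses covered by the k most frequent distinct responses."""
    counts = Counter(normalize(r) for r in responses)
    total = sum(counts.values())
    return sum(c for _, c in counts.most_common(k)) / total


# Toy data invented for illustration; these are NOT results from the paper.
homogeneous = (
    ["Time is a river"] * 6 + ["Time is a weaver"] * 3 + ["Time is a thief"]
)
diverse = [f"Time is metaphor number {i}" for i in range(10)]

print("homogeneous entropy (bits):", shannon_entropy(homogeneous))
print("homogeneous top-2 share:", top_k_share(homogeneous, k=2))
print("diverse entropy (bits):", shannon_entropy(diverse))
```

A batch dominated by two responses has a high top-2 share and low entropy, while a batch of ten distinct responses reaches the maximum entropy for its size. Exact string matching undercounts paraphrased variants, so a real evaluation would likely need semantic clustering instead.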
Springboards: Flint will exceed 15 percent lexical novelty on 100 open-ended prompts versus a GPT-4o baseline by Q3 2026.
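The prediction above does not define "lexical novelty". As a minimal sketch, one plausible reading is the share of distinct words in a candidate model's outputs that never appear in the baseline's outputs for the same prompts. That definition, the tokenizer, and the function names below are all assumptions made for this example, not Springboards' metric.

```python
import re


def tokens(text: str) -> set[str]:
    """Distinct lowercase word tokens in a piece of text."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def lexical_novelty(candidate_outputs: list[str], baseline_outputs: list[str]) -> float:
    """Assumed definition: fraction of the candidate's distinct tokens absent from the baseline."""
    baseline_vocab = set().union(*(tokens(t) for t in baseline_outputs))
    candidate_vocab = set().union(*(tokens(t) for t in candidate_outputs))
    if not candidate_vocab:
        return 0.0
    return len(candidate_vocab - baseline_vocab) / len(candidate_vocab)


# Toy outputs invented for illustration; not real model responses.
baseline = ["Time is a river", "Time is a weaver"]
candidate = ["Time is a river", "Time is a rusted hinge"]

score = lexical_novelty(candidate, baseline)
print(f"lexical novelty: {score:.2%}")
print("exceeds 15 percent threshold:", score > 0.15)
```

Under this reading, the 15 percent threshold would be checked against the score pooled over all 100 prompts. Other definitions (n-gram novelty, per-prompt averages) would give different numbers for the same outputs.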
Sources (2)
- [1] Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond) (https://neurips.cc/2025/conference)
- [2] Springboards Flint model release notes (https://springboards.ai/flint)