New Preprint Argues LLMs Lack True Intelligence, Reflecting Human Knowledge Rather Than Building It
A preprint on arXiv argues that LLMs do not reason independently but instead mirror existing human-written discourse, citing the Monty Hall problem and tacit scientific knowledge as evidence. The work has not been peer-reviewed.
A new preprint posted to arXiv challenges widespread assumptions about the cognitive capabilities of Large Language Models (LLMs), arguing that what appears to be artificial intelligence is better understood as a reflection of existing human-written discourse rather than as independent reasoning or knowledge-building.
The paper, titled 'Large Language Models and Scientific Discourse: Where's the Intelligence?' (arXiv:2603.23543), draws a structural comparison between how scientists generate knowledge and how LLMs process information. The authors contend that a critical gap exists: scientific knowledge in its early stages is shaped largely by tacit, spoken knowledge exchanged within closed expert communities, a form of discourse that LLMs cannot access.
To illustrate this point, the authors reference a 2014 study examining how scientists in the field of gravitational wave physics decided to dismiss a 'fringe science' paper. According to the preprint, those decisions were driven primarily by unwritten, socially embedded expert judgment, exactly the kind of knowledge that training on written literature cannot capture.
The paper also revisits a well-known AI benchmark: the Monty Hall problem. The authors cite Colin Fraser's 'Dumb Monty Hall problem,' a variant that ChatGPT failed in 2023 but that LLMs were passing roughly a year later. Rather than attributing the improvement to enhanced reasoning ability, the authors argue that it reflects either an expansion of the written human discourse on the topic that LLMs can now draw upon or direct manual corrections inserted by developers.
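For readers unfamiliar with the underlying puzzle, a minimal Monte Carlo sketch (ours, not the paper's) shows why 'switch' is the standard answer that dominates the written discourse: switching wins roughly two-thirds of the time.

```python
import random

def play_monty_hall(switch: bool) -> bool:
    """One round of the classic Monty Hall game; True if the player wins the car."""
    doors = [0, 1, 2]
    car = random.choice(doors)
    pick = random.choice(doors)
    # The host opens a door that hides a goat and is not the player's pick.
    opened = random.choice([d for d in doors if d != pick and d != car])
    if switch:
        # Switch to the one remaining closed door.
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

trials = 100_000
for switch in (True, False):
    wins = sum(play_monty_hall(switch) for _ in range(trials))
    print(f"switch={switch}: win rate ~ {wins / trials:.3f}")
# Expected: switch=True ~ 0.667, switch=False ~ 0.333
```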
To test this hypothesis further, the researchers devised a novel variation of the Monty Hall prompt and administered it to both a panel of LLMs and a panel of human participants; the responses were described as 'starkly different.' The authors predict that LLMs will eventually converge toward human-like answers on the new prompt as well, not because of improved reasoning but because written human discourse on the topic will accumulate and become available for training.
The study also introduces the concept of 'overshadowing,' describing a failure mode in which a dominant body of written discourse causes LLMs to miss small but meaningful variations in prompts that would render standard answers incorrect or nonsensical.
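The paper's own overshadowing examples are not reproduced in the abstract, but the failure mode is easy to illustrate with a hypothetical variant of our own (not necessarily the one used in the paper or in Fraser's test): suppose the doors are transparent, so the player simply starts by picking the door with the visible car. The dominant written discourse still says 'always switch,' yet in this variant switching is guaranteed to lose. A short simulation, under the same assumptions as the sketch above, makes the reversal concrete:

```python
import random

def play_transparent(switch: bool) -> bool:
    """A degenerate variant: the doors are see-through, so the player
    starts by picking the door that visibly hides the car."""
    doors = [0, 1, 2]
    car = random.choice(doors)
    pick = car  # The player can see the car and picks it directly.
    opened = random.choice([d for d in doors if d != pick and d != car])
    if switch:
        pick = next(d for d in doors if d != pick and d != opened)
    return pick == car

trials = 100_000
for switch in (True, False):
    wins = sum(play_transparent(switch) for _ in range(trials))
    print(f"switch={switch}: win rate = {wins / trials:.3f}")
# Here the standard 'switch' advice loses every round (0.000 vs. 1.000):
# a small change to the premise inverts the correct answer, which is the
# kind of variation the overshadowing concept says LLMs tend to miss.
```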
The authors conclude that the intelligence observed in LLM outputs resides in the humans who produced the underlying discourse, not in the models themselves.
Important limitations apply to this work. The paper is a preprint hosted on arXiv (https://arxiv.org/abs/2603.23543) and has not yet undergone peer review. The study relies on qualitative comparison and illustrative examples rather than large-scale quantitative methodology. The panel sizes for both LLM and human participants are not specified in the abstract, and the generalizability of the findings remains to be established through further research.
HELIX: This suggests that for everyday people, AI tools like chatbots will stay useful as clever mirrors of what we already know but won't become independent thinkers or inventors anytime soon. We might end up relying on them for quick answers while still depending on human minds for real breakthroughs and fresh ideas.
Sources (1)
- [1] Large Language Models and Scientific Discourse: Where's the Intelligence? (https://arxiv.org/abs/2603.23543)