Copy-Paste Errors in Scientific Datasets Expose Research Integrity Crisis Affecting AI Training
Widespread copy-paste errors in scientific datasets expose a deep, under-reported crisis in research integrity that affects reproducibility and potentially contaminates AI training data.
Automated detection software that scanned the first 600 open-access datasets from repositories such as Dryad identified 18 cases of duplicated values serious enough to warrant concern. One is the 2016 Cell paper "Gut Microbiota Regulate Motor Deficits and Neuroinflammation in a Model of Parkinson’s Disease" (Sampson et al., Cell, 2016, https://www.cell.com/cell/fulltext/S0092-8674(16)31590-2), which has received over 3,000 citations and whose motor function measurements contain sequences of identical numbers. The primary source (https://www.sciencedetective.org/scientific-datasets-are-riddled-with-copy-paste-errors/) documents two blocks of five identical sequential values in adhesive removal times between the SPF and Ex-GF mouse groups, plus three-value duplicates in pole-descent times, affecting 50% of SPF samples. The errors went undetected for eight years even though the data were publicly available.
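To make the idea concrete, here is a minimal Python sketch of how a check for shared runs of identical sequential values between two groups might work. It is an illustration only, not the detection software used in the scan: the function name `shared_runs`, the `min_len` threshold, and the sample numbers are all hypothetical, and the values are invented rather than taken from the Cell dataset.

```python
from typing import List, Sequence, Tuple


def shared_runs(
    a: Sequence[float], b: Sequence[float], min_len: int = 3
) -> List[Tuple[int, int, int]]:
    """Find maximal runs of consecutive values that appear in both sequences.

    Returns (start_in_a, start_in_b, length) for every run of at least
    `min_len` values. Values are compared exactly, as recorded.
    """
    n, m = len(a), len(b)
    # table[i][j] = length of the common run ending at a[i-1] and b[j-1]
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1

    runs = []
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            length = table[i][j]
            # Skip runs that continue into the next pair of values,
            # so only the longest form of each run is reported.
            extended = i < n and j < m and a[i] == b[j]
            if length >= min_len and not extended:
                runs.append((i - length, j - length, length))
    return runs


# Invented example values (NOT from the published dataset).
spf = [12.1, 8.4, 15.0, 9.7, 11.2, 30.5, 7.7]
ex_gf = [20.3, 8.4, 15.0, 9.7, 11.2, 30.5, 18.9]

print(shared_runs(spf, ex_gf))  # → [(1, 1, 5)]
```

A real screening tool would need extra safeguards that this sketch leaves out. For example, measurements capped at a maximum time can legitimately repeat, so a run of identical ceiling values is not by itself evidence of a copy-paste error.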
The original coverage does not explicitly link these errors to AI training contamination. Similar copy-paste blocks have appeared in datasets from Nobel laureate Thomas Südhof's lab (Nature News, 2022, https://www.nature.com/articles/d41586-022-00283-2) and in Jonathan Pruitt's spider behavioral ecology work (The Transmitter, 2022), and these cases parallel broader replication issues documented in biomedicine. The Pile dataset paper (Gao et al., arXiv:2101.00027, 2020, https://arxiv.org/abs/2101.00027) confirms that PubMed Central, arXiv, and related scientific repositories are included in the 800GB corpus used to train multiple large language models, creating a potential pathway for erroneous numerical sequences to enter model weights.
The primary source's analysis stops at the impact on individual papers and misses the systemic scale: duplicated data compromises meta-analyses, reproducibility studies that cite the Parkinson's dataset, and downstream AI systems that surface synthesized "research" outputs. As of the latest scan, no author response had been posted on the PubPeer thread for the Cell case.
AXIOM: Public scientific repositories containing undetected copy-paste errors feed directly into training corpora such as The Pile, allowing flawed numerical data and conclusions to embed within AI models that later generate or summarize research.
Sources (3)
- [1] Scientific datasets are riddled with copy-paste errors (https://www.sciencedetective.org/scientific-datasets-are-riddled-with-copy-paste-errors/)
- [2] Gut Microbiota Regulate Motor Deficits and Neuroinflammation in a Model of Parkinson’s Disease (https://www.cell.com/cell/fulltext/S0092-8674(16)31590-2)
- [3] The Pile: An 800GB Dataset of Diverse Text for Language Modeling (https://arxiv.org/abs/2101.00027)