Multi-Agent Debate Degrades Generative Data Cleaning Across Models
Debate harms generative data cleaning unless strict adversarial separation and grounding are enforced, revealing a systemic reliability gap in current multi-agent pipelines.
The arXiv paper (abs/2606.02866) reports that multi-agent debate reduced generation accuracy by 1.6 to 15.5 percentage points across four model families and three benchmarks, totaling over 6,000 task-condition pairs. It attributes the losses to critique-induced confusion and uncritical acceptance of hallucinated Critic feedback. At the same time, debate raised error-detection F1 by 27.4 points. The paper derives a benefit condition: debate improves outcomes only when the probability of rescuing incorrect outputs exceeds the probability of corrupting correct ones. In factorial experiments, self-verification with shared tools produced no gains; the only configuration that exceeded single-agent baselines (+5.3 pp, p<0.05) paired a separate Critic with code-execution grounding and evidence-gated generation.
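The benefit condition can be made concrete with a little bookkeeping. The sketch below is an illustration, not the paper's code: it assumes per-task correctness is available for a single-agent run and a debate run on the same tasks, and it treats "rescue" and "corruption" as fractions of all tasks. The function name and the toy data are hypothetical.

```python
from typing import Dict, Sequence


def debate_transition_rates(
    single_correct: Sequence[bool],
    debate_correct: Sequence[bool],
) -> Dict[str, float]:
    """Compare paired single-agent and debate outcomes on the same tasks.

    Assumption: both sequences are aligned, one entry per task.
    p_rescue  = share of tasks that were wrong alone but right after debate.
    p_corrupt = share of tasks that were right alone but wrong after debate.
    net_change = p_rescue - p_corrupt, the change in accuracy from debating.
    """
    if len(single_correct) != len(debate_correct) or len(single_correct) == 0:
        raise ValueError("need two non-empty, equally long result lists")

    n = len(single_correct)
    rescued = sum(1 for s, d in zip(single_correct, debate_correct) if not s and d)
    corrupted = sum(1 for s, d in zip(single_correct, debate_correct) if s and not d)

    return {
        "p_rescue": rescued / n,
        "p_corrupt": corrupted / n,
        "net_change": (rescued - corrupted) / n,
    }


# Toy data (made up): debate fixes one wrong answer but breaks two right ones.
rates = debate_transition_rates(
    single_correct=[True, True, True, False, False],
    debate_correct=[True, False, False, True, False],
)
print(rates)
```

Under these definitions the accuracy change equals the rescue rate minus the corruption rate, so debate helps exactly when rescues outnumber corruptions. In the toy data the corruption rate is the larger of the two, so the net change is negative.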
AXIOM: Adversarial separation plus tool grounding is required to prevent debate from systematically lowering output quality in data pipelines.
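One way to picture tool grounding with an evidence gate is a rule that ignores any critique not backed by an executable check. The sketch below is a minimal illustration under that assumption and is not the paper's implementation; `Critique`, `apply_gated`, and `is_iso_date` are hypothetical names, and the date-format check stands in for whatever code-execution evidence a real pipeline would use.

```python
from dataclasses import dataclass
from datetime import date
from typing import Callable, Optional


@dataclass
class Critique:
    """A Critic's objection, optionally backed by an executable check (hypothetical type)."""
    claim: str
    proposed_fix: Optional[str] = None
    # Returns True when a value is valid; None means the critique is ungrounded.
    check: Optional[Callable[[str], bool]] = None


def is_iso_date(value: str) -> bool:
    """Executable evidence: does the value parse as an ISO date?"""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False


def apply_gated(value: str, critique: Critique) -> str:
    """Accept the fix only if the check fails on the original and passes on the fix."""
    if critique.check is None or critique.proposed_fix is None:
        return value  # no executable evidence: keep the original output
    try:
        original_fails = not critique.check(value)
        fix_passes = critique.check(critique.proposed_fix)
    except Exception:
        return value  # a crashing check is not evidence
    return critique.proposed_fix if original_fails and fix_passes else value


# Grounded critique: the check confirms the problem, so the fix is applied.
grounded = Critique("not ISO formatted", "2024-03-05", is_iso_date)
fixed = apply_gated("03/05/2024", grounded)

# Hallucinated critique: the original already passes the check, so it is kept.
hallucinated = Critique("month and day are swapped", "2024-05-03", is_iso_date)
kept = apply_gated("2024-03-05", hallucinated)

# Ungrounded critique: no check attached, so it is ignored.
ignored = apply_gated("2024-03-05", Critique("looks wrong", "2024-01-01"))
```

The gate leaves a correct value untouched unless the Critic's own check shows it failing, which is the corruption path that ungated debate leaves open.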
Sources (2)
- [1] Primary Source (https://arxiv.org/abs/2606.02866)
- [2] Related Source (https://arxiv.org/abs/2305.14325)