Synthetic Mixed Training Scales Parametric Knowledge Acquisition Beyond RAG
New method mixes synthetic QAs and documents to let language models exceed RAG performance on knowledge tasks.
Naively scaling synthetic data augmentation, whether by training on more tokens or by using stronger generator models, yields diminishing returns and plateaus below RAG performance (https://arxiv.org/abs/2603.23562).
Synthetic Mixed Training combines synthetic QAs and synthetic documents to exploit their complementary training signals, enabling log-linear improvements as synthetic data volume and generator strength increase, for a 2.6% relative gain on the QuALITY benchmark (https://arxiv.org/abs/2603.23562).
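The core idea is simple to sketch: interleave QA-formatted examples with document-formatted examples into one training stream. The snippet below is a hypothetical illustration of that mixing step, not the paper's actual pipeline; the function name, QA formatting, and ratio-based resampling are all assumptions.

```python
import random

def build_mixed_corpus(synthetic_qas, synthetic_docs, qa_ratio=0.5, seed=0):
    """Interleave synthetic QA pairs and synthetic documents into one
    training stream (hypothetical sketch of mixed training data).

    synthetic_qas:  list of (question, answer) tuples
    synthetic_docs: list of document strings
    qa_ratio:       target fraction of QA examples in the final mix
    """
    qa_examples = [f"Q: {q}\nA: {a}" for q, a in synthetic_qas]
    n_docs = len(synthetic_docs)
    # Resample QAs so the mix hits the requested QA-to-document ratio.
    n_qas = int(round(n_docs * qa_ratio / (1 - qa_ratio)))
    rng = random.Random(seed)
    sampled_qas = [rng.choice(qa_examples) for _ in range(n_qas)] if qa_examples else []
    mixed = sampled_qas + list(synthetic_docs)
    rng.shuffle(mixed)  # shuffle so both signal types appear throughout training
    return mixed

# Toy usage with two QA pairs and two documents, mixed 50/50.
qas = [("Who wrote the report?", "The audit team."), ("What year?", "2021.")]
docs = ["Doc about the audit findings.", "Doc about the 2021 timeline."]
corpus = build_mixed_corpus(qas, docs, qa_ratio=0.5)
```

In a real setup the mixed corpus would then be tokenized and used for standard continued pretraining or fine-tuning.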
Focal Rewriting conditions synthetic document generation on specific questions to improve diversity, letting a Llama 8B model outperform RAG by 4.4% on QuALITY, beat RAG in five of six settings across QuALITY, LongHealth, and FinanceBench, and achieve a 9.1% gain when combined with RAG (https://arxiv.org/abs/2603.23562).
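Conditioning document generation on a question can be as simple as folding the question into the rewrite prompt, so each generated document is focused on different content. The helper below is an assumed illustration of that prompting pattern; the paper's actual prompt wording and generation setup may differ.

```python
def focal_rewrite_prompt(source_passage, focal_question):
    """Build a generation prompt that conditions a synthetic document
    rewrite on one focal question (hypothetical sketch of the
    question-conditioned rewriting idea)."""
    return (
        "Rewrite the passage below as a standalone document that fully "
        f"answers this question: {focal_question}\n\n"
        f"Passage:\n{source_passage}\n\n"
        "Rewritten document:"
    )

# One prompt per (passage, question) pair; varying the focal question
# across calls is what drives diversity in the generated documents.
prompt = focal_rewrite_prompt(
    "The 2021 audit found three compliance issues.",
    "What did the audit find?",
)
```

The resulting prompts would be sent to a generator model, and its outputs mixed into the training corpus alongside synthetic QAs.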
AXIOM: This could mean AI chatbots get better at answering detailed questions on topics like health or finance from built-in knowledge rather than always searching externally, making them faster and more useful for everyday users.
Sources (1)
- [1] Synthetic Mixed Training: Scaling Parametric Knowledge Acquisition Beyond RAG (https://arxiv.org/abs/2603.23562)