Synthetic Mixed Training Scales Parametric Knowledge Acquisition Beyond RAG
New method mixes synthetic QAs and documents to let language models exceed RAG performance on knowledge tasks.
Naively scaling synthetic data augmentation, whether by training on more tokens or by using stronger generator models, yields diminishing returns and plateaus below RAG performance (https://arxiv.org/abs/2603.23562).
Synthetic Mixed Training combines synthetic QAs and synthetic documents to exploit their complementary training signals, enabling log-linear improvements as synthetic data volume and generator strength increase, for a 2.6% relative gain on the QuALITY benchmark (https://arxiv.org/abs/2603.23562).
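The core idea is simple to sketch: interleave QA-formatted examples with document-formatted examples into one training stream. The snippet below is a hypothetical illustration of that mixing step, not the paper's actual pipeline; the function name, QA formatting, and ratio-based resampling are all assumptions.

```python
import random

def build_mixed_corpus(synthetic_qas, synthetic_docs, qa_ratio=0.5, seed=0):
    """Interleave synthetic QA pairs and synthetic documents into one
    training stream (hypothetical sketch of mixed training data).

    synthetic_qas:  list of (question, answer) tuples
    synthetic_docs: list of document strings
    qa_ratio:       target fraction of QA examples in the final mix
    """
    qa_examples = [f"Q: {q}\nA: {a}" for q, a in synthetic_qas]
    n_docs = len(synthetic_docs)
    # Resample QAs so the mix hits the requested QA-to-document ratio.
    n_qas = int(round(n_docs * qa_ratio / (1 - qa_ratio)))
    rng = random.Random(seed)
    sampled_qas = [rng.choice(qa_examples) for _ in range(n_qas)] if qa_examples else []
    mixed = sampled_qas + list(synthetic_docs)
    rng.shuffle(mixed)  # shuffle so both signal types appear throughout training
    return mixed

# Toy usage with two QA pairs and two documents, mixed 50/50.
qas = [("Who wrote the report?", "The audit team."), ("What year?", "2021.")]
docs = ["Doc about the audit findings.", "Doc about the 2021 timeline."]
corpus = build_mixed_corpus(qas, docs, qa_ratio=0.5)
```

In a real setup the mixed corpus would then be tokenized and used for standard continued pretraining or fine-tuning.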
Focal Rewriting conditions synthetic document generation on specific questions to improve diversity, letting a Llama 8B model outperform RAG by 4.4% on QuALITY, beat RAG in five of six settings across QuALITY, LongHealth, and FinanceBench, and achieve a 9.1% gain when combined with RAG (https://arxiv.org/abs/2603.23562).
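Conditioning document generation on a question can be as simple as folding the question into the rewrite prompt, so each generated document is focused on different content. The helper below is an assumed illustration of that prompting pattern; the paper's actual prompt wording and generation setup may differ.

```python
def focal_rewrite_prompt(source_passage, focal_question):
    """Build a generation prompt that conditions a synthetic document
    rewrite on one focal question (hypothetical sketch of the
    question-conditioned rewriting idea)."""
    return (
        "Rewrite the passage below as a standalone document that fully "
        f"answers this question: {focal_question}\n\n"
        f"Passage:\n{source_passage}\n\n"
        "Rewritten document:"
    )

# One prompt per (passage, question) pair; varying the focal question
# across calls is what drives diversity in the generated documents.
prompt = focal_rewrite_prompt(
    "The 2021 audit found three compliance issues.",
    "What did the audit find?",
)
```

The resulting prompts would be sent to a generator model, and its outputs mixed into the training corpus alongside synthetic QAs.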
AXIOM: This could mean AI chatbots get better at answering detailed questions on topics like health or finance from built-in knowledge rather than always searching externally, making them faster and more useful for everyday users.
Sources (1)
- [1] Synthetic Mixed Training: Scaling Parametric Knowledge Acquisition Beyond RAG (https://arxiv.org/abs/2603.23562)