Self-Distillation Fine-Tuning Enables On-Policy Learning from Demonstrations
SDFT turns learning from demonstrations into an on-policy process, reducing forgetting relative to standard SFT.
Self-Distillation Fine-Tuning (SDFT) uses the model itself, conditioned on a demonstration, as the teacher that generates on-policy training signals. According to the arXiv report, it outperforms supervised fine-tuning (SFT) on skill-acquisition and knowledge-acquisition tasks while reducing catastrophic forgetting (https://arxiv.org/abs/2601.19897). SDFT achieves higher new-task accuracy than SFT baselines across the reported benchmarks, and in sequential experiments a single model accumulated multiple skills over time without regressing on earlier tasks.
Prior continual learning work, including elastic weight consolidation, relied on parameter regularization to mitigate forgetting but required explicit task boundaries, which are absent in demonstration settings (Kirkpatrick et al., 2017, PNAS). More recent on-policy reinforcement learning approaches reduced forgetting but depended on reward functions that are unavailable when learning from expert demonstrations (https://arxiv.org/abs/2306.14863). SDFT addresses this gap by converting off-policy SFT into on-policy distillation via in-context conditioning. The arXiv report shows consistent gains in both skill learning and knowledge acquisition without the auxiliary losses or replay buffers described in earlier surveys (https://arxiv.org/abs/2302.00487).
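The self-teaching setup described above can be illustrated with a toy sketch. It is not the paper's implementation: the summary does not state the exact objective, so the per-token KL(student || teacher) used here is an assumption, and `toy_logits`, `sdft_loss`, and the count-based "model" are hypothetical stand-ins for a real language model. The point it shows is structural: the same model scores each token twice, once with the demonstration in context (teacher) and once without (student), and the tokens being scored are sampled from the student rather than copied from the demonstration.

```python
import math
import random


def softmax(logits):
    """Convert logits to a probability distribution (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def toy_logits(context, vocab_size):
    """Hypothetical stand-in for a language model's next-token logits.

    Each token's logit is how often it already appears in the context,
    so putting a demonstration in context shifts the distribution.
    """
    return [float(context.count(tok)) for tok in range(vocab_size)]


def sdft_loss(prompt, demonstration, num_tokens, vocab_size, rng):
    """Average per-token KL(student || teacher) on a student-sampled rollout.

    Assumption: reverse KL is used as the distillation divergence.
    Student context: prompt only. Teacher context: demonstration + prompt.
    Both are the same model (toy_logits); only the conditioning differs.
    """
    student_ctx = list(prompt)
    teacher_ctx = list(demonstration) + list(prompt)
    total = 0.0
    for _ in range(num_tokens):
        student_probs = softmax(toy_logits(student_ctx, vocab_size))
        teacher_probs = softmax(toy_logits(teacher_ctx, vocab_size))
        total += kl_divergence(student_probs, teacher_probs)
        # On-policy step: the next token comes from the student's own
        # distribution, not from the demonstration.
        token = rng.choices(range(vocab_size), weights=student_probs)[0]
        student_ctx.append(token)
        teacher_ctx.append(token)
    return total / num_tokens


if __name__ == "__main__":
    rng = random.Random(0)
    loss = sdft_loss(prompt=[0, 1], demonstration=[2, 2, 3],
                     num_tokens=5, vocab_size=4, rng=rng)
    print(f"toy SDFT loss: {loss:.4f}")
```

In a real training loop the loss would be differentiated with respect to the student's parameters while the teacher's outputs are treated as fixed targets; that detail is also an assumption here, since the toy model has no parameters to update.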
AXIOM: Self-distillation converts static demonstration training into adaptive on-policy updates, a step toward models that accumulate skills without external rewards or replay.
Sources (2)
- [1] Primary Source (https://arxiv.org/abs/2601.19897)
- [2] Related Source (https://arxiv.org/abs/2306.14863)