Weight Patching Achieves Source-Level Localization of LLM Behaviors
A weight-patching method localizes LLM behaviors at the parameter level, revealing hierarchical circuits and improving model merging beyond activation-based techniques.
Mechanistic interpretability advances with weight patching, a parameter-space method that localizes behaviors in paired LLMs to specific weights rather than to activations (Sun et al., arXiv:2604.13694). The technique transplants selected module weights from a behavior-specialized model into a base model under fixed inputs, using a vector-anchor interface to detect recovery of task-relevant control states during open-ended generation. In instruction-following tasks, this reveals a consistent hierarchy running from shallow source-side carriers through aggregation and routing modules to downstream execution circuits.
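The patching loop described above can be sketched minimally. This is an illustrative toy, not the paper's implementation: module weights are plain dictionary entries, `run_model` is a hypothetical forward function, and simple output agreement stands in for the paper's vector-anchor recovery check.

```python
import copy

def patch_module(base_weights, expert_weights, module_name):
    """Return a copy of the base model's weights with one module's
    parameters replaced by the expert's (weight patching)."""
    patched = copy.deepcopy(base_weights)
    patched[module_name] = copy.deepcopy(expert_weights[module_name])
    return patched

def recovery_score(run_model, patched_weights, expert_weights, fixed_inputs):
    """Fraction of fixed inputs on which the patched base model matches
    the behavior-specialized expert (a toy stand-in for detecting
    recovery of task-relevant control states)."""
    matches = sum(
        run_model(patched_weights, x) == run_model(expert_weights, x)
        for x in fixed_inputs
    )
    return matches / len(fixed_inputs)

def localize(run_model, base_weights, expert_weights, fixed_inputs):
    """Score every module by how much patching it alone recovers the
    expert behavior; high scores flag parameter-level carriers."""
    return {
        name: recovery_score(
            run_model,
            patch_module(base_weights, expert_weights, name),
            expert_weights,
            fixed_inputs,
        )
        for name in base_weights
    }
```

With a two-module toy model, a module that actually carries the behavioral difference scores near 1.0 while irrelevant modules score near 0, which is the signal the hierarchy analysis ranks.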
Building on causal tracing and activation patching from Meng et al. (arXiv:2202.05262), weight patching addresses a key limitation: modules that look important in activation space may merely amplify upstream signals rather than encode capabilities in their parameters. Wang et al. (arXiv:2211.00593) mapped a circuit for indirect object identification primarily via activation interventions; the new source-oriented approach identifies genuine parameter-level carriers that those methods miss, moving beyond prior work that emphasized correlational localization over causal parameter edits.
Recovered component scores from weight patching also enable mechanism-aware model merging, yielding improved selective fusion across expert combinations and externally validating the identified hierarchy. These findings fit an established pattern in mechanistic interpretability: precise localization supports model editing, oversight, and control, areas that mainstream reporting on LLM capabilities has routinely under-explored.
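Score-driven selective fusion can be illustrated with a short sketch. The threshold rule below is a hypothetical simplification of "mechanism-aware merging": weights live in a dict of module name to parameters, scores lie in [0, 1], and the paper's actual fusion rule may interpolate or weight rather than hard-select.

```python
def mechanism_aware_merge(base_weights, expert_weights, scores, threshold=0.5):
    """Selective fusion sketch: take a module's parameters from the
    expert only when its recovery score marks it as a behavior carrier;
    keep the base parameters everywhere else."""
    return {
        name: expert_weights[name]
        if scores.get(name, 0.0) >= threshold
        else base_weights[name]
        for name in base_weights
    }
```

The design point is that merging is gated by causal evidence (which modules carry the behavior) rather than by uniform averaging across all parameters.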
AXIOM: Weight patching shifts mechanistic interpretability from activations to parameters, exposing the exact sources of behaviors like instruction following and enabling safer, more precise model merging.
Sources (3)
- [1] Weight Patching: Toward Source-Level Mechanistic Localization in LLMs (https://arxiv.org/abs/2604.13694)
- [2] Locating and Editing Factual Associations in GPT (https://arxiv.org/abs/2202.05262)
- [3] Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 Small (https://arxiv.org/abs/2211.00593)