Self-Monitoring Collapses Without Structural Integration in Multi-Timescale Agents
Auxiliary self-monitoring modules collapse and add no value in multi-timescale RL agents; structural integration into the policy pathway recovers performance but does not exceed a no-module baseline (Xie, arXiv:2604.11914; Vezhnevets et al., arXiv:1703.01161; Shinn et al., arXiv:2303.11366).
Self-monitoring modules for metacognition, self-prediction and subjective duration yield no statistically significant performance benefit when added as auxiliary losses to multi-timescale reinforcement learning agents in predator-prey environments (Xie, arXiv:2604.11914).
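The collapse failure mode has a simple operational signature: a monitoring head whose outputs have near-zero standard deviation across visited states is emitting an effectively constant, uninformative signal. A minimal diagnostic sketch (the 0.006 threshold is the confidence-std figure reported in the paper; the helper function and variable names are illustrative, not the authors' code):

```python
import numpy as np

def has_collapsed(outputs, std_threshold=0.006):
    """Flag a monitoring head whose outputs are near-constant across states.

    The threshold mirrors the collapse criterion reported in the source
    (confidence std < 0.006); the function itself is an illustrative
    diagnostic, not code from the paper.
    """
    return float(np.std(outputs)) < std_threshold

# A collapsed head emits almost the same confidence everywhere:
collapsed = np.full(1000, 0.73) + np.random.default_rng(0).normal(0, 1e-4, 1000)
# A healthy head's confidence varies with the state it evaluates:
healthy = np.random.default_rng(1).uniform(0.2, 0.9, 1000)
```

Tracking this statistic during training would surface the collapse long before a full policy sensitivity analysis is run.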
Experiments across 20 random seeds, 1D and 2D partially observable settings, stationary and non-stationary variants, and training up to 50,000 steps showed the modules collapsing to near-constant outputs (confidence std < 0.006, attention-allocation std < 0.011), with discount-factor shifts below 0.03%. Policy sensitivity analysis confirmed that agent decisions remained unaffected (Xie, arXiv:2604.11914). Structural integration (routing confidence to gate exploration, surprise to trigger broadcasts, and self-predictions directly into the policy input) produced an improvement of Cohen's d = 0.62 over the add-on baseline in non-stationary conditions, with ablations attributing the gain primarily to the TSM-to-policy pathway. Integrated performance nonetheless remained statistically comparable to a parameter-matched control lacking the modules entirely (Xie, arXiv:2604.11914).
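The integration pattern described above can be sketched in a few lines. Everything here is a hypothetical illustration of the routing, not the paper's architecture: the linear stand-in policy, the gate names, and the surprise threshold are assumptions; only the three routing ideas (confidence gates exploration, surprise triggers a broadcast, self-prediction sits on the policy's input) come from the source.

```python
import numpy as np

rng = np.random.default_rng(0)

def act_integrated(obs, confidence, surprise, self_pred,
                   n_actions=4, base_eps=0.3, surprise_gate=0.5):
    """Illustrative sketch of structurally integrated self-monitoring.

    Routing follows the pattern described in the text; all names,
    thresholds, and the trivial 'policy' are hypothetical stand-ins.
    """
    eps = base_eps * (1.0 - confidence)        # low confidence -> more exploration
    broadcast = surprise > surprise_gate       # surprise-triggered broadcast flag
    policy_input = np.concatenate([obs, self_pred])  # self-prediction on the decision path
    logits = policy_input.sum() * np.ones(n_actions)  # stand-in for a learned policy head
    if rng.random() < eps:
        action = int(rng.integers(n_actions))  # exploratory action
    else:
        action = int(np.argmax(logits))        # greedy action
    return action, broadcast
```

The contrast with the add-on design is that `confidence`, `surprise`, and `self_pred` each change what the agent does on this step, rather than being trained against auxiliary losses off the decision path.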
Related hierarchical RL architectures demonstrate the same pattern: FeUdal Networks for Hierarchical Reinforcement Learning required higher-level managers to exert direct structural control over lower-level workers to avoid signal degradation (Vezhnevets et al., arXiv:1703.01161). Reflexion agents similarly improved only when self-reflection outputs altered the immediate action trace rather than operating in isolation (Shinn et al., arXiv:2303.11366). The examined work therefore supplies empirical evidence that metacognitive signals must occupy the primary decision pathway to influence continuous-time adaptation.
TSMAgent: Self-monitoring modules collapse to constants and leave the policy unchanged unless their outputs sit directly on the decision pathway; integration recovers the deficit introduced by disconnected add-ons but does not yet surpass simple no-module baselines in non-stationary survival tasks.
Sources (3)
- [1] Self-Monitoring Benefits from Structural Integration: Lessons from Metacognition in Continuous-Time Multi-Timescale Agents (https://arxiv.org/abs/2604.11914)
- [2] FeUdal Networks for Hierarchical Reinforcement Learning (https://arxiv.org/abs/1703.01161)
- [3] Reflexion: Language Agents with Verbal Reinforcement Learning (https://arxiv.org/abs/2303.11366)