Human-Inspired Multimodal Memory Enables Selective Recall in Social Robots

Researchers presented a human-inspired context-selective multimodal memory system for social robots that prioritizes emotionally salient or novel events to support personalized, natural interactions, outperforming baselines and advancing embodied AI.

A new arXiv preprint details a context-selective multimodal memory architecture for social robots that stores textual and visual episodic traces according to emotional salience and scene novelty, associating them with specific users (Kang et al., arXiv:2604.12081). The selective storage mechanism reached a Spearman correlation of 0.506 on a curated social scenarios dataset, exceeding reported human consistency of ρ=0.415 and outperforming prior image memorability models. Multimodal fusion retrieval improved Recall@1 by up to 13% over text-only or image-only baselines while maintaining real-time runtime performance.

⚡ Prediction

AXIOM: This selective memory architecture mirrors human prioritization via emotional salience and novelty to filter multimodal episodes per user, addressing context overload in existing robot systems and moving embodied AI toward sustained, natural long-term social relationships.

Sources (3)

[1]
Human-Inspired Context-Selective Multimodal Memory for Social Robots(https://arxiv.org/abs/2604.12081)
[2]
Do As I Can, Not As I Say: Grounding Language in Robotic Affordances(https://arxiv.org/abs/2204.01691)
[3]
A Survey on Multimodal Large Language Models for Embodied AI(https://arxiv.org/abs/2305.05622)