RAMP-3D System Achieves 79.5% Success Rate on Long-Horizon 3D Box Rearrangement Tasks Using Mask-Based Planning
RAMP-3D, a new robotic planning system, uses paired 3D segmentation masks to achieve 79.5% success on long-horizon box rearrangement tasks, outperforming existing 2D VLM-based approaches across 11 task variants.
Published on arXiv (https://arxiv.org/abs/2603.23676), the paper addresses long-horizon planning in 3D environments using only visual observations and under-specified natural-language goals. The authors identify two shortcomings in existing approaches: symbolic planners with brittle relational grounding, and direct action-sequence generation from 2D vision-language models (VLMs). Both struggle with many-object reasoning, rich 3D geometry, and implicit semantic constraints.

RAMP-3D processes RGB-D observations alongside natural-language task specifications to reactively generate multi-step pick-and-place actions. The system predicts paired 3D masks — a 'which-object' mask identifying what to pick and a 'which-target-region' mask specifying placement location — extending existing 3D grounding models that ground natural-language referents to 3D segmentation masks.

Experiments were conducted across 11 task variants in warehouse-style environments containing between 1 and 30 boxes with diverse natural-language constraints. RAMP-3D achieved a 79.5% success rate on long-horizon rearrangement tasks, significantly outperforming 2D VLM-based baselines, and the authors conclude that mask-based reactive policies are a promising alternative to symbolic pipelines for long-horizon planning.
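To make the paired-mask idea concrete, here is a minimal sketch of a mask-based reactive loop in the spirit the paper describes: at each step the instruction is re-grounded to a fresh 'which-object' / 'which-target-region' mask pair, and one pick-and-place action is emitted rather than committing to a full symbolic plan up front. Everything here is illustrative — `predict_paired_masks` is a hypothetical stand-in for RAMP-3D's grounding model (replaced by a toy height heuristic), and the data structures are assumptions, not the authors' interface.

```python
import numpy as np

def predict_paired_masks(point_cloud, instruction):
    """Hypothetical stand-in for the paper's 3D grounding model.
    Returns a 'which-object' mask (points to pick) and a
    'which-target-region' mask (points near where to place).
    Here it is faked with a toy heuristic: pick points above the
    scene's mean height, place them in the lower region."""
    z = point_cloud[:, 2]
    pick_mask = z > z.mean()       # toy heuristic, not the real model
    place_mask = z <= z.mean()
    return pick_mask, place_mask

def mask_centroid(point_cloud, mask):
    """Reduce a 3D segmentation mask to a single grasp/placement point."""
    return point_cloud[mask].mean(axis=0)

def reactive_rearrange(point_cloud, instruction, max_steps=5):
    """Mask-based reactive policy sketch: re-ground the instruction
    every step and emit one pick-and-place action at a time."""
    actions = []
    for _ in range(max_steps):
        pick_mask, place_mask = predict_paired_masks(point_cloud, instruction)
        if not pick_mask.any() or not place_mask.any():
            break  # nothing left to move, or nowhere to put it
        pick_pt = mask_centroid(point_cloud, pick_mask)
        place_pt = mask_centroid(point_cloud, place_mask)
        actions.append((pick_pt, place_pt))
        # crude environment update: moved points drop to the target height,
        # so the next re-grounding step sees the changed scene
        point_cloud = point_cloud.copy()
        point_cloud[pick_mask, 2] = place_pt[2]
    return actions

# Toy scene: 100 random 3D points standing in for segmented box surfaces.
rng = np.random.default_rng(0)
cloud = rng.uniform([0.0, 0.0, 0.0], [1.0, 1.0, 1.0], size=(100, 3))
plan = reactive_rearrange(cloud, "stack the boxes on the left pallet")
print(f"{len(plan)} pick-and-place actions generated")
```

The point of the structure is the closed loop: because the masks are re-predicted from the current observation each step, the policy can react to the scene changing, which is what distinguishes this style from generating a full action sequence once from a 2D VLM.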
AXIOM: This means robots are finally getting decent at handling messy, multi-step chores in the real world instead of just in labs, so we could soon see cheaper automated warehouses and eventually helpful home bots that don't constantly need a human to step in.
Sources (1)
- [1] Grounding Vision and Language to 3D Masks for Long-Horizon Box Rearrangement (https://arxiv.org/abs/2603.23676)