-
6D Object Pose Tracking in Internet Videos for Robotic Manipulation
- Authors: Georgy Ponimatkin, Martin Cífka, Tomas Soucek, Médéric Fourmy, Yann Labbé, Vladimir Petrik, Josef Sivic
- Abstract: We seek to extract a temporally consistent 6D pose trajectory of a manipulated object from an Internet instructional video. This is a challenging set-up for current 6D pose estimation methods due to uncontrolled capturing conditions, fine-grained dynamic object motions, and the fact that the exact ...
-
A Distributional Approach to Uncertainty-Aware Preference Alignment Using Offline Demonstrations
- Authors: Sheng Xu, Bo Yue, Hongyuan Zha, Guiliang Liu
- Abstract: Designing reward functions in Reinforcement Learning (RL) often demands significant task-specific expertise. Offline preference-based Reinforcement Learning (PbRL) provides an effective alternative to address the complexity of reward design by learning policies from offline datasets that contain hum...
-
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation
- Authors: Jiafei Duan, Wilbert Pumacay, Nishanth Kumar, Yi Ru Wang, Shulin Tian, Wentao Yuan, Ranjay Krishna, Dieter Fox, Ajay Mandlekar, Yijie Guo
- Abstract: Robotic manipulation in open-world settings requires not only task execution but also the ability to detect and learn from failures. While recent advances in vision-language models (VLMs) and large language models (LLMs) have improved robots' spatial reasoning and problem-solving abilities, they sti...
-
BaB-ND: Long-Horizon Motion Planning with Branch-and-Bound and Neural Dynamics
- Authors: Keyi Shen, Jiangwei Yu, Jose Barreiros, Huan Zhang, Yunzhu Li
- Abstract: Neural-network-based dynamics models learned from observational data have shown strong predictive capabilities for scene dynamics in robotic manipulation tasks. However, their inherent non-linearity presents significant challenges for effective planning. Current planning methods, often dependent on ...
-
Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel
- Authors: Zun Wang, Jialu Li, Yicong Hong, Songze Li, Kunchang Li, Shoubin Yu, Yi Wang, Yu Qiao, Yali Wang, Mohit Bansal, Limin Wang
- Abstract: Creating high-quality data for training robust language-instructed agents is a long-lasting challenge in embodied AI. In this paper, we introduce a Self-Refining Data Flywheel (SRDF) that generates high-quality and large-scale navigational instruction-trajectory pairs by iteratively refining the dat...
-
- Authors: Xueyi Liu, Jianibieke Adalibieke, Qianwei Han, Yuzhe Qin, Li Yi
- Abstract: We address the challenge of developing a generalizable neural tracking controller for dexterous manipulation from human references. This controller aims to manage a dexterous robot hand to manipulate diverse objects for various purposes defined by kinematic human-object interactions. Developing such...
-
ET-SEED: EFFICIENT TRAJECTORY-LEVEL SE(3) EQUIVARIANT DIFFUSION POLICY
- Authors: Chenrui Tie, Yue Chen, Ruihai Wu, Boxuan Dong, Zeyi Li, Chongkai Gao, Hao Dong
- Abstract: Imitation learning, e.g., diffusion policy, has been proven effective in various robotic manipulation tasks.However, extensive demonstrations are required for policy robustness and generalization.To reduce the demonstration reliance, we leverage spatial symmetry and propose ET-SEED, an efficient tra...
-
Generalizable Motion Planning via Operator Learning
- Authors: Sharath Matada, Luke Bhan, Yuanyuan Shi, Nikolay Atanasov
- Abstract: In this work, we introduce a planning neural operator (PNO) for predicting the value function of a motion planning problem. We recast value function approximation as learning a single operator from the cost function space to the value functionspace, which is defined by an Eikonal partial differentia...
-
Grounding Video Models to Actions through Goal Conditioned Exploration
- Authors: Yunhao Luo, Yilun Du
- Abstract: Large video models, pretrained on massive quantities of amount of Internet video, provide a rich source of physical knowledge about the dynamics and motions of objects and tasks.However, video models are not grounded in the embodiment of an agent, and do not describe how to actuate the world to rea...
-
Interactive Adjustment for Human Trajectory Prediction with Individual Feedback
- Authors: Jianhua Sun, Yuxuan Li, Liang Chai, Cewu Lu
- Abstract: Human trajectory prediction is fundamental for autonomous driving and service robot. The research community has studied various important aspects of this task and made remarkable progress recently. However, there is an essential perspective which is not well exploited in previous research all along,...
-
Learning Geometric Reasoning Networks For Robot Task And Motion Planning
- Authors: Smail Ait Bouhsain, Rachid Alami, Thierry Simeon
- Abstract: Task and Motion Planning (TAMP) is a computationally challenging robotics problem due to the tight coupling of discrete symbolic planning and continuous geometric planning of robot motions. In particular, planning manipulation tasks in complex 3D environments leads to a large number of costly geomet...
-
ManiSkill-HAB: A Benchmark for Low-Level Manipulation in Home Rearrangement Tasks
- Authors: Arth Shukla, Stone Tao, Hao Su
- Abstract: High-quality benchmarks are the foundation for embodied AI research, enabling significant advancements in long-horizon navigation, manipulation and rearrangement tasks. However, as frontier tasks in robotics get more advanced, they require faster simulation speed, more intricate test environments, a...
-
MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents
- Authors: Junpeng Yue, Xinrun Xu, Börje Karlsson, Zongqing Lu
- Abstract: MLLM agents demonstrate potential for complex embodied tasks by retrieving multimodal task-relevant trajectory data. However, current retrieval methods primarily focus on surface-level similarities of textual or visual cues in trajectories, neglecting their effectiveness for the specific task at han...
-
Multi-Robot Motion Planning with Diffusion Models
- Authors: Yorai Shaoul, Itamar Mishani, Shivam Vats, Jiaoyang Li, Maxim Likhachev
- Abstract: Diffusion models have recently been successfully applied to a wide range of robotics applications for learning complex multi-modal behaviors from data. However, prior works have mostly been confined to single-robot and small-scale environments due to the high sample complexity of learning multi-robo...
-
Null Counterfactual Factor Interactions for Goal-Conditioned Reinforcement Learning
- Authors: Caleb Chuck, Fan Feng, Carl Qi, Chang Shi, Siddhant Agarwal, Amy Zhang, Scott Niekum
- Abstract: Hindsight relabeling is a powerful tool for overcoming sparsity in goal-conditioned reinforcement learning (GCRL). While effective in some domains like navigation and locomotion, hindsight relabeling can struggle in object-centric domains. For example, suppose that the goal space consists of a robot...
-
Physics-informed Temporal Difference Metric Learning for Robot Motion Planning
- Authors: Ruiqi Ni, zherong pan, Ahmed Qureshi
- Abstract: The motion planning problem involves finding a collision-free path from a robot's starting to its target configuration. Recently, self-supervised learning methods have emerged to tackle motion planning problems without requiring expensive expert demonstrations. They solve the Eikonal equation for tr...
-
ReGen: Generative Robot Simulation via Inverse Design
- Authors: Peter (Phat) Nguyen, Johnson (Tsun-Hsuan) Wang, Zhang-Wei Hong, Erfan Aasi, Andrew Silva, Guy Rosman, Sertac Karaman, Daniela Rus
- Abstract: Simulation plays a key role in scaling robot learning and validating policies, but constructing simulations remains labor-intensive. In this paper, we introduce ReGen, a generative simulation framework that automates this process using inverse design. Given an agent's behavior (such as a motion traj...
-
- Authors: Sumeet Singh, Vikas Sindhwani, Stephen Tu
- Abstract: A crucial design decision for any robot learning pipeline is the choice of policy representation: what type of model should be used to generate the next set of robot actions? Owing to the inherent multi-modal nature of many robotic tasks, combined with the recent successes in generative modeling, re...
-
STRAP: Robot Sub-Trajectory Retrieval for Augmented Policy Learning
- Authors: Marius Memmel, Jacob Berg, Bingqing Chen, Abhishek Gupta, Jonathan Francis
- Abstract: Robot learning is witnessing a significant increase in the size, diversity, and complexity of pre-collected datasets, mirroring trends in domains such as natural language processing and computer vision. Many robot learning methods treat such datasets as multi-task expert data and learn a multi-task,...
-
TOP-ERL: Transformer-based Off-Policy Episodic Reinforcement Learning
- Authors: Ge Li, Dong Tian, Hongyi Zhou, Xinkai Jiang, Rudolf Lioutikov, Gerhard Neumann
- Abstract: This work introduces Transformer-based Off-Policy Episodic Reinforcement Learning (TOP-ERL), a novel algorithm that enables off-policy updates in the ERL framework. In ERL, policies predict entire action trajectories over multiple time steps instead of single actions at every time step. These trajec...
-
VisualAgentBench: Towards Large Multimodal Models as Visual Agents
- Authors: Xiao Liu, Tianjie Zhang, Yu Gu, Iat Long Iong, Song XiXuan, Yifan Xu, Shudan Zhang, Hanyu Lai, Jiadai Sun, Xinyue Yang, Yu Yang, Zehan Qi, Shuntian Yao, Xueqiao Sun, Siyi Cheng, Qinkai Zheng, Hao Yu, Hanchen Zhang, Wenyi Hong, Ming Ding, Lihang Pan, Xiaotao Gu, Aohan Zeng, Zhengxiao Du, Chan Hee Song, Yu Su, Yuxiao Dong, Jie Tang
- Abstract: Large Multimodal Models (LMMs) have ushered in a new era in artificial intelligence, merging capabilities in both language and vision to form highly capable visual agents that are postulated to excel across a myriad of tasks. However, existing benchmarks fail to sufficiently challenge or showcase t...
-
3D Vision-Language Gaussian Splatting
- Authors: Qucheng Peng, Benjamin Planche, Zhongpai Gao, Meng Zheng, Anwesa Choudhuri, Terrence Chen, Chen Chen, Ziyan Wu
- Abstract: Recent advancements in 3D reconstruction methods and vision-language models have propelled the development of multi-modal 3D scene understanding, which has vital applications in robotics, autonomous driving, and virtual/augmented reality. However, current multi-modal scene understanding approaches h...
-
6D Object Pose Tracking in Internet Videos for Robotic Manipulation
- Authors: Georgy Ponimatkin, Martin Cífka, Tomas Soucek, Médéric Fourmy, Yann Labbé, Vladimir Petrik, Josef Sivic
- Abstract: We seek to extract a temporally consistent 6D pose trajectory of a manipulated object from an Internet instructional video. This is a challenging set-up for current 6D pose estimation methods due to uncontrolled capturing conditions, fine-grained dynamic object motions, and the fact that the exact ...
-
AdaManip: Adaptive Articulated Object Manipulation Environments and Policy Learning
- Authors: Yuanfei Wang, Xiaojie Zhang, Ruihai Wu, Yu Li, Yan Shen, Mingdong Wu, Zhaofeng He, Yizhou Wang, Hao Dong
- Abstract: Articulated object manipulation is a critical capability for robots to perform various tasks in real-world scenarios. Composed of multiple parts connected by joints, articulated objects are endowed with diverse functional mechanisms through complex relative motions. For example, a safe consists of a...
-
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation
- Authors: Jiafei Duan, Wilbert Pumacay, Nishanth Kumar, Yi Ru Wang, Shulin Tian, Wentao Yuan, Ranjay Krishna, Dieter Fox, Ajay Mandlekar, Yijie Guo
- Abstract: Robotic manipulation in open-world settings requires not only task execution but also the ability to detect and learn from failures. While recent advances in vision-language models (VLMs) and large language models (LLMs) have improved robots' spatial reasoning and problem-solving abilities, they sti...
-
- Authors: Long Le, Jason Xie, William Liang, Hung-Ju Wang, Yue Yang, Yecheng Jason Ma, Kyle Vedder, Arjun Krishna, Dinesh Jayaraman, ERIC EATON
- Abstract: Interactive 3D simulated objects are crucial in AR/VR, animations, and robotics, driving immersive experiences and advanced automation.However, creating these articulated objects requires extensive human effort and expertise, limiting their broader applications. To overcome this challenge, we presen...
-
BaB-ND: Long-Horizon Motion Planning with Branch-and-Bound and Neural Dynamics
- Authors: Keyi Shen, Jiangwei Yu, Jose Barreiros, Huan Zhang, Yunzhu Li
- Abstract: Neural-network-based dynamics models learned from observational data have shown strong predictive capabilities for scene dynamics in robotic manipulation tasks. However, their inherent non-linearity presents significant challenges for effective planning. Current planning methods, often dependent on ...
-
Bidirectional Decoding: Improving Action Chunking via Closed-Loop Resampling
- Authors: Yuejiang Liu, Jubayer Hamid, Annie Xie, Yoonho Lee, Max Du, Chelsea Finn
- Abstract: Predicting and executing a sequence of actions without intermediate replanning, known as action chunking, is increasingly used in robot learning from human demonstrations. Yet, its reported effects on the learned policy are inconsistent: some studies find it crucial for achieving strong results, whi...
-
COMBO: Compositional World Models for Embodied Multi-Agent Cooperation
- Authors: Hongxin Zhang, Zeyuan Wang, Qiushi Lyu, Zheyuan Zhang, Sunli Chen, Tianmin Shu, Yilun Du, Behzad Dariush, Kwonjoon Lee, Chuang Gan
- Abstract: In this paper, we investigate the problem of embodied multi-agent cooperation, where decentralized agents must cooperate given only egocentric views of the world. To effectively plan in this setting, in contrast to learning world dynamics in a single-agent scenario, we must simulate world dynamics c...
-
Conformalized Interactive Imitation Learning: Handling Expert Shift and Intermittent Feedback
- Authors: Michelle Zhao, Henny Admoni, Reid Simmons, Aaditya Ramdas, Andrea Bajcsy
- Abstract: In interactive imitation learning (IL), uncertainty quantification offers a way for the learner (i.e. robot) to contend with distribution shifts encountered during deployment by actively seeking additional feedback from an expert (i.e. human) online. Prior works use mechanisms like ensemble disagree...
-
- Authors: Xueyi Liu, Jianibieke Adalibieke, Qianwei Han, Yuzhe Qin, Li Yi
- Abstract: We address the challenge of developing a generalizable neural tracking controller for dexterous manipulation from human references. This controller aims to manage a dexterous robot hand to manipulate diverse objects for various purposes defined by kinematic human-object interactions. Developing such...
-
Diffusion Policy Policy Optimization
- Authors: Allen Z. Ren, Justin Lidard, Lars Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, Max Simchowitz
- Abstract: We introduce Diffusion Policy Policy Optimization, DPPO, an algorithmic framework including best practices for fine-tuning diffusion-based policies (e.g. Diffusion Policy) in continuous control and robot learning tasks using the policy gradient (PG) method from reinforcement learning (RL). PG method...
-
Direct Multi-agent Motion Generation Preference Alignment with Implicit Feedback from Demonstrations
- Authors: Thomas Tian, Kratarth Goel
- Abstract: Recent advancements in Large Language Models (LLMs) have transformed motion generation models in embodied applications such as autonomous driving and robotic manipulation. While LLM-type motion models benefit from scalability and efficient formulation, there remains a discrepancy between their token...
-
Discriminator-Guided Embodied Planning for LLM Agent
- Authors: Haofu Qian, Chenjia Bai, Jiatao Zhang, Fei Wu, Wei Song, Xuelong Li
- Abstract: Large Language Models (LLMs) have showcased remarkable reasoning capabilities in various domains, yet face challenges in complex embodied tasks due to the need for a coherent long-term policy and context-sensitive environmental understanding. Previous work performed LLM refinement relying on outcome...
-
Efficient Active Imitation Learning with Random Network Distillation
- Authors: Emilien Biré, Anthony Kobanda, Ludovic Denoyer, Rémy Portelas
- Abstract: Developing agents for complex and underspecified tasks, where no clear objective exists, remains challenging but offers many opportunities. This is especially true in video games, where simulated players (bots) need to play realistically, and there is no clear reward to evaluate them. While imitatio...
-
- Authors: Tianfang Zhu, Dongli Hu, Jiandong Zhou, Kai Du, Anan LI
- Abstract: The brain's ability to transform sensory inputs into motor functions is central to neuroscience and crucial for the development of embodied intelligence. Sensory-motor integration involves complex neural circuits, diverse neuronal types, and intricate intercellular connections. Bridging the gap betw...
-
FOSP: Fine-tuning Offline Safe Policy through World Models
- Authors: Chenyang Cao, Yucheng Xin, Silang Wu, Longxiang He, Zichen Yan, Junbo Tan, Xueqian Wang
- Abstract: Offline Safe Reinforcement Learning (RL) seeks to address safety constraints by learning from static datasets and restricting exploration. However, these approaches heavily rely on the dataset and struggle to generalize to unseen scenarios safely. In this paper, we aim to improve safety during the d...
-
Generalized Behavior Learning from Diverse Demonstrations
- Authors: Varshith Sreeramdass, Rohan Paleja, Letian Chen, Sanne van Waveren, Matthew Gombolay
- Abstract: Diverse behavior policies are valuable in domains requiring quick test-time adaptation or personalized human-robot interaction. Human demonstrations provide rich information regarding task objectives and factors that govern individual behavior variations, which can be used to characterize \it{useful...
-
- Authors: TaiMing Lu, Tianmin Shu, Daniel Khashabi, Alan Yuille, Jieneng Chen
- Abstract: Planning with partial observation is a central challenge in embodied AI. A majority of prior works have tackled this challenge by developing agents that physically explore their environment to update their beliefs about the world state. However, humans can imagine unseen parts of the world through a...
-
Graph Neural Networks Gone Hogwild
- Authors: Olga Solodova, Nick Richardson, Deniz Oktay, Ryan P Adams
- Abstract: Graph neural networks (GNNs) appear to be powerful tools to learn state representations for agents in distributed, decentralized multi-agent systems, but generate catastrophically incorrect predictions when nodes update asynchronously during inference. This failure under asynchrony effectively excl...
-
Grid Cell-Inspired Fragmentation and Recall for Efficient Map Building
- Authors: Zhang-Wei Hong, Akhilan Boopathy, Jaedong Hwang, Eric Chen, Ila Fiete, Pulkit Agrawal
- Abstract: Animals and robots navigate through environments by building and refining maps of space. These maps enable functions including navigation back to home, planning, search and foraging. Here, we use observations from neuroscience, specifically the observed fragmentation of grid cell map in compartmenta...
-
HAMSTER: Hierarchical Action Models for Open-World Robot Manipulation
- Authors: Yi Li, Yuquan Deng, Jesse Zhang, Joel Jang, Marius Memmel, Caelan Garrett, Fabio Ramos, Dieter Fox, Anqi Li, Abhishek Gupta, Ankit Goyal
- Abstract: Large models have shown strong open-world generalization to complex problems in vision and language, but they have been relatively more difficult to deploy in robotics. This challenge stems from several factors, the foremost of which is the lack of scalable robotic training data since this requires ...
-
HASARD: A Benchmark for Harnessing Safe Reinforcement Learning with Doom
- Authors: Tristan Tomilin, Meng Fang, Mykola Pechenizkiy
- Abstract: The advancement of safe reinforcement learning (RL) faces numerous obstacles, including the lack of simulation environments, demanding computational requirements, and a lack of widely accepted benchmarks. To address these challenges, we introduce HASARD (A Benchmark for HArnessing SAfe *...
-
Instant Policy: In-Context Imitation Learning via Graph Diffusion
- Authors: Vitalis Vosylius, Edward Johns
- Abstract: Following the impressive capabilities of in-context learning with large transformers, In-Context Imitation Learning (ICIL) is a promising opportunity for robotics. We introduce Instant Policy, which learns new tasks instantly from just one or two demonstrations, achieving ICIL through two key compon...
-
LASeR: Towards Diversified and Generalizable Robot Design with Large Language Models
- Authors: Junru Song, Yang Yang, Huan Xiao, Wei Peng, Wen Yao, Feifei Wang
- Abstract: Recent advances in Large Language Models (LLMs) have stimulated a significant paradigm shift in evolutionary optimization, where hand-crafted search heuristics are gradually replaced with LLMs serving as intelligent search operators. However, these studies still bear some notable limitations, includ...
-
Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning
- Authors: Calarina Muslimani, Matthew E Taylor
- Abstract: To create useful reinforcement learning (RL) agents, step zero is to design a suitable reward function that captures the nuances of the task. However, reward engineering can be a difficult and time-consuming process. Instead, human-in-the-loop (HitL) RL methods hold the promise of learning reward f...
-
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
- Authors: Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, Michael Ryoo
- Abstract: LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process state information as visual-textual prompts and respond with policy decisions in text. We propose LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as conversations ...
-
LocoVR: Multiuser Indoor Locomotion Dataset in Virtual Reality
- Authors: Kojiro Takeyama, Yimeng Liu, Misha Sra
- Abstract: Understanding human locomotion is crucial for AI agents such as robots, particularly in complex indoor home environments. Modeling human trajectories in these spaces requires insight into how individuals maneuver around physical obstacles and manage social navigation dynamics. These dynamics include...
-
ManiSkill-HAB: A Benchmark for Low-Level Manipulation in Home Rearrangement Tasks
- Authors: Arth Shukla, Stone Tao, Hao Su
- Abstract: High-quality benchmarks are the foundation for embodied AI research, enabling significant advancements in long-horizon navigation, manipulation and rearrangement tasks. However, as frontier tasks in robotics get more advanced, they require faster simulation speed, more intricate test environments, a...
-
MetaUrban: An Embodied AI Simulation Platform for Urban Micromobility
- Authors: Wayne Wu, Honglin He, Jack He, Yiran Wang, Chenda Duan, Zhizheng Liu, Quanyi Li, Bolei Zhou
- Abstract: Public urban spaces like streetscapes and plazas serve residents and accommodate social life in all its vibrant variations. Recent advances in Robotics and Embodied AI make public urban spaces no longer exclusive to humans. Food delivery bots and electric wheelchairs have started sharing sidewalks w...
-
- Authors: Xinyue Wang, Biwei Huang
- Abstract: Generalization in reinforcement learning (RL) remains a significant challenge, especially when agents encounter novel environments with unseen dynamics. Drawing inspiration from human compositional reasoning—where known components are reconfigured to handle new situations—we introduce World Modeling...
-
Multi-Robot Motion Planning with Diffusion Models
- Authors: Yorai Shaoul, Itamar Mishani, Shivam Vats, Jiaoyang Li, Maxim Likhachev
- Abstract: Diffusion models have recently been successfully applied to a wide range of robotics applications for learning complex multi-modal behaviors from data. However, prior works have mostly been confined to single-robot and small-scale environments due to the high sample complexity of learning multi-robo...
-
- Authors: Qixin ZHANG, Zongqi Wan, Yu Yang, Li Shen, Dacheng Tao
- Abstract: Coordinating multiple agents to collaboratively maximize submodular functions in unpredictable environments is a critical task with numerous applications in machine learning, robot planning and control. The existing approaches, such as the OSG algorithm, are often hindered by their poor approximati...
-
NeSyC: A Neuro-symbolic Continual Learner For Complex Embodied Tasks in Open Domains
- Authors: Wonje Choi, Jinwoo Park, Sanghyun Ahn, Daehee Lee, Honguk Woo
- Abstract: We explore neuro-symbolic approaches to generalize actionable knowledge, enabling embodied agents to tackle complex tasks more effectively in open-domain environments. A key challenge for embodied agents is the generalization of knowledge across diverse environments and situations, as limited experi...
-
Online Neuro-Symbolic Predicate Invention for High-Level Planning
- Authors: Yichao Liang, Nishanth Kumar, Hao Tang, Adrian Weller, Joshua B Tenenbaum, Tom Silver, Joao F. Henriques, Kevin Ellis
- Abstract: Broadly intelligent agents should form task-specific abstractions that selectively expose the essential elements of a task, while abstracting away the complexity of the raw sensorimotor space. In this work, we present Neuro-Symbolic Predicates, a first-order abstraction language that combines the st...
-
PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks
- Authors: Matthew Chang, Gunjan Chhablani, Alexander Clegg, Mikael Dallaire Cote, Ruta Desai, Michal Hlavac, Vladimir Karashchuk, Jacob Krantz, Roozbeh Mottaghi, Priyam Parashar, Siddharth Patki, Ishita Prasad, Xavier Puig, Akshara Rai, Ram Ramrakhya, Daniel Tran, Joanne Truong, John Turner, Eric Undersander, Tsung-Yen Yang
- Abstract: We present a benchmark for Planning And Reasoning Tasks in humaN-Robot collaboration (PARTNR) designed to study human-robot coordination in household activities. PARTNR tasks exhibit characteristics of everyday tasks, such as spatial, temporal, and heterogeneous agent capability constraints. We empl...
-
Predicate Hierarchies Improve Few-Shot State Classification
- Authors: Emily Jin, Joy Hsu, Jiajun Wu
- Abstract: State classification of objects and their relations is core to many long-horizon tasks, particularly in robot planning and manipulation. However, the combinatorial explosion of possible object-predicate combinations, coupled with the need to adapt to novel real-world environments, makes it a desider...
-
Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation
- Authors: Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, Jiangmiao Pang
- Abstract: Current efforts to learn scalable policies in robotic manipulation primarily fall into two categories: one focuses on "action," which involves behavior cloning from extensive collections of robotic data, while the other emphasizes "vision," enhancing model generalization by pre-training representati...
-
PWM: Policy Learning with Multi-Task World Models
- Authors: Ignat Georgiev, Varun Giridhar, Nicklas Hansen, Animesh Garg
- Abstract: Reinforcement Learning (RL) has made significant strides in complex tasks but struggles in multi-task settings with different embodiments. World models methods offer scalability by learning a simulation of the environment, but often rely on inefficient gradient-free optimization methods for policy e...
-
Rapidly Adapting Policies to the Real-World via Simulation-Guided Fine-Tuning
- Authors: Patrick Yin, Tyler Westenbroek, Ching-An Cheng, Andrey Kolobov, Abhishek Gupta
- Abstract: Robot learning requires a considerable amount of data to realize the promise of generalization. However, it can be challenging to actually collect the magnitude of high-quality data necessary for generalization entirely in the real world. Simulation can serve as a source of plentiful data, wherein t...
-
ReGen: Generative Robot Simulation via Inverse Design
- Authors: Peter (Phat) Nguyen, Johnson (Tsun-Hsuan) Wang, Zhang-Wei Hong, Erfan Aasi, Andrew Silva, Guy Rosman, Sertac Karaman, Daniela Rus
- Abstract: Simulation plays a key role in scaling robot learning and validating policies, but constructing simulations remains labor-intensive. In this paper, we introduce ReGen, a generative simulation framework that automates this process using inverse design. Given an agent's behavior (such as a motion traj...
-
RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
- Authors: Sergio Gómez Colmenarejo, Jost Springenberg, Jose Enrique Chen, Jonathan Scholz, Raia Hadsell, Claudio Fantacci, Alex Lee, Maria Bauza Villalonga, Yuxiang Zhou, Dushyant Rao, Akhil Raju, Antoine Laurens, Murilo Fernandes Martins, Rugile Pevceviciute, Michiel Blokzijl, Nathan Batchelor, Konrad Zolna, Thomas Lampe, Agrim Gupta, Scott Reed, Abbas Abdolmaleki, David Barker, Joy Ortiz, Martin Riedmiller, Jean-Baptiste Regli, Nicolas Heess, Francesco Nori, Todor Davchev, Oleg O Sushkov, Thomas Rothörl, Misha Denil, Emilio Parisotto, Valentin Dalibard, Martina Zambelli, Yusuf Aytar, Giulia Vezzani, Coline Devin, Oliver Groth, Konstantinos Bousmalis
- Abstract: The ability to leverage heterogeneous robotic experience from different robots and tasks to quickly master novel skills and embodiments has the potential to transform robot learning. Inspired by recent advances in foundation models for vision and language, we propose a multi-embodiment, multi-task g...
-
Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Dataset
- Authors: Guangqi Jiang, Yifei Sun, Tao Huang, Huanyu Li, Yongyuan Liang, Huazhe Xu
- Abstract: The pre-training of visual representations has enhanced the efficiency of robot learning. Due to the lack of large-scale in-domain robotic datasets, prior works utilize in-the-wild human videos to pre-train robotic visual representation. Despite their promising results, representations from human vi...
-
Sensor-Invariant Tactile Representation
- Authors: Harsh Gupta, Yuchen Mo, Shengmiao Jin, Wenzhen Yuan
- Abstract: High-resolution tactile sensors have become critical for embodied perception and robotic manipulation. However, a key challenge in the field is the lack of transferability between sensors due to design and manufacturing variations, which result in significant differences in tactile signals. This lim...
-
SonicSim: A customizable simulation platform for speech processing in moving sound source scenarios
- Authors: Kai Li, Wendi Sang, Chang Zeng, Runxuan Yang, Guo Chen, Xiaolin Hu
- Abstract: The systematic evaluation of speech separation and enhancement models under moving sound source conditions typically requires extensive data comprising diverse scenarios. However, real-world datasets often contain insufficient data to meet the training and evaluation requirements of models. Although...
-
SRSA: Skill Retrieval and Adaptation for Robotic Assembly Tasks
- Authors: Yijie Guo, Bingjie Tang, Iretiayo Akinola, Dieter Fox, Abhishek Gupta, Yashraj Narang
- Abstract: Enabling robots to learn novel tasks in a data-efficient manner is a long-standing challenge. Common strategies involve carefully leveraging prior experiences, especially transition data collected on related tasks. Although much progress has been made in developing such strategies for general pick-a...
-
Stabilizing Reinforcement Learning in Differentiable Multiphysics Simulation
- Authors: Eliot Xing, Vernon Luk, Jean Oh
- Abstract: Recent advances in GPU-based parallel simulation have enabled practitioners to collect large amounts of data and train complex control policies using deep reinforcement learning (RL), on commodity GPUs. However, such successes for RL in robotics have been limited to tasks sufficiently simulated by f...
-
- Authors: Kaizhe Hu, Zihang Rui, Yao He, Yuyao Liu, Pu Hua, Huazhe Xu
- Abstract: Visual imitation learning methods demonstrate strong performance, yet they lack generalization when faced with visual input perturbations like variations in lighting and textures. This limitation hampers their practical application in real-world settings. To address this, we propose Stem-OB th...
-
STRAP: Robot Sub-Trajectory Retrieval for Augmented Policy Learning
- Authors: Marius Memmel, Jacob Berg, Bingqing Chen, Abhishek Gupta, Jonathan Francis
- Abstract: Robot learning is witnessing a significant increase in the size, diversity, and complexity of pre-collected datasets, mirroring trends in domains such as natural language processing and computer vision. Many robot learning methods treat such datasets as multi-task expert data and learn a multi-task,...
-
ThinkBot: Embodied Instruction Following with Thought Chain Reasoning
- Authors: Guanxing Lu, Ziwei Wang, Changliu Liu, Jiwen Lu, Yansong Tang
- Abstract: Embodied Instruction Following (EIF) requires agents to complete human instruction by interacting objects in complicated surrounding environments. Conventional methods directly consider the sparse human instruction to generate action plans for agents, which usually fail to achieve human goals becaus...
-
TopoGaussian: Inferring Internal Topology Structures from Visual Clues
- Authors: Xiaoyu Xiong, Changyu Hu, Chunru Lin, Pingchuan Ma, Chuang Gan, Tao Du
- Abstract: We present TopoGaussian, a holistic, particle-based pipeline for inferring the interior structure of an opaque object from easily accessible photos and videos as input. Traditional mesh-based approaches require tedious and error-prone mesh filling and fixing process, while typically output rough bou...
-
What Matters in Learning from Large-Scale Datasets for Robot Manipulation
- Authors: Vaibhav Saxena, Matthew Bronars, Nadun Ranawaka Arachchige, Kuancheng Wang, Woo Shin, Soroush Nasiriany, Ajay Mandlekar, Danfei Xu
- Abstract: Imitation learning from large multi-task demonstration datasets has emerged as a promising path for building generally-capable robots. As a result, 1000s of hours have been spent on building such large-scale datasets around the globe. Despite the continuous growth of such efforts, we still lack a sy...
-
What's the Move? Hybrid Imitation Learning via Salient Points
- Authors: Priya Sundaresan, Hengyuan Hu, Quan Vuong, Jeannette Bohg, Dorsa Sadigh
- Abstract: While imitation learning (IL) offers a promising framework for teaching robots various behaviors, learning complex tasks remains challenging. Existing IL policies struggle to generalize effectively across visual and spatial variations even for simple tasks. In this work, we introduce SPHINX: **S...
-
WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models
- Authors: Shengda Fan, Xin Cong, Yuepeng Fu, Zhong Zhang, Shuyan Zhang, Yuanwei Liu, Yesai Wu, Yankai Lin, Zhiyuan Liu, Maosong Sun
- Abstract: Recent advancements in large language models (LLMs) have driven a revolutionary paradigm shift in process automation from Robotic Process Automation to Agentic Process Automation by automating the workflow orchestration procedure based on LLMs. However, existing LLMs (even the advanced OpenAI GPT-4o...
-
- Authors: Hengshuo Chu, Xiang Deng, Qi Lv, Xiaoyang Chen, Yinchuan Li, Jianye HAO, Liqiang Nie
- Abstract: 3D Affordance detection is a challenging problem with broad applications on various robotic tasks. Existing methods typically formulate the detection paradigm as a label-based semantic segmentation task.This paradigm relies on predefined labels and lacks the ability to comprehend complex natural lan...
-
- Authors: Xueyan Zou, Yuchen Song, Ri-Zhao Qiu, Xuanbin Peng, Jianglong Ye, Sifei Liu, Xiaolong Wang
- Abstract: We present 3D Spatial MultiModal Memory (M3), a multimodal memory system designed to retain information about medium-sized static scenes through video sources for visual perception. By integrating 3D Gaussian Splatting techniques with foundation models, M3 builds a multimodal memory capable of rende...
-
3D Vision-Language Gaussian Splatting
- Authors: Qucheng Peng, Benjamin Planche, Zhongpai Gao, Meng Zheng, Anwesa Choudhuri, Terrence Chen, Chen Chen, Ziyan Wu
- Abstract: Recent advancements in 3D reconstruction methods and vision-language models have propelled the development of multi-modal 3D scene understanding, which has vital applications in robotics, autonomous driving, and virtual/augmented reality. However, current multi-modal scene understanding approaches h...
-
6D Object Pose Tracking in Internet Videos for Robotic Manipulation
- Authors: Georgy Ponimatkin, Martin Cífka, Tomas Soucek, Médéric Fourmy, Yann Labbé, Vladimir Petrik, Josef Sivic
- Abstract: We seek to extract a temporally consistent 6D pose trajectory of a manipulated object from an Internet instructional video. This is a challenging set-up for current 6D pose estimation methods due to uncontrolled capturing conditions, fine-grained dynamic object motions, and the fact that the exact ...
-
ADAM: An Embodied Causal Agent in Open-World Environments
- Authors: Shu Yu, Chaochao Lu
- Abstract: In open-world environments like Minecraft, existing agents face challenges in continuously learning structured knowledge, particularly causality. These challenges stem from the opacity inherent in black-box models and an excessive reliance on prior knowledge during training, which impair their inter...
-
AdaManip: Adaptive Articulated Object Manipulation Environments and Policy Learning
- Authors: Yuanfei Wang, Xiaojie Zhang, Ruihai Wu, Yu Li, Yan Shen, Mingdong Wu, Zhaofeng He, Yizhou Wang, Hao Dong
- Abstract: Articulated object manipulation is a critical capability for robots to perform various tasks in real-world scenarios. Composed of multiple parts connected by joints, articulated objects are endowed with diverse functional mechanisms through complex relative motions. For example, a safe consists of a...
-
A deep inverse-mapping model for a flapping robotic wing
- Authors: Hadar Sharvit, Raz Karl, Tsevi Beatus
- Abstract: In systems control, the dynamics of a system are governed by modulating its inputs to achieve a desired outcome. For example, to control the thrust of a quad-copter propeller the controller modulates its rotation rate, relying on a straightforward mapping between the input rotation rate and the resu...
-
A Distributional Approach to Uncertainty-Aware Preference Alignment Using Offline Demonstrations
- Authors: Sheng Xu, Bo Yue, Hongyuan Zha, Guiliang Liu
- Abstract: Designing reward functions in Reinforcement Learning (RL) often demands significant task-specific expertise. Offline preference-based Reinforcement Learning (PbRL) provides an effective alternative to address the complexity of reward design by learning policies from offline datasets that contain hum...
-
AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents
- Authors: Ke Yang, Yao Liu, Sapana Chaudhary, Rasool Fakoor, Pratik A Chaudhari, George Karypis, Huzefa Rangwala
- Abstract: Autonomy via agents based on large language models (LLMs) that can carry out personalized yet standardized tasks presents a significant opportunity to drive human efficiency. There is an emerging need and interest in automating web tasks (e.g., booking a hotel for a given date within a budget). Bei...
-
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation
- Authors: Jiafei Duan, Wilbert Pumacay, Nishanth Kumar, Yi Ru Wang, Shulin Tian, Wentao Yuan, Ranjay Krishna, Dieter Fox, Ajay Mandlekar, Yijie Guo
- Abstract: Robotic manipulation in open-world settings requires not only task execution but also the ability to detect and learn from failures. While recent advances in vision-language models (VLMs) and large language models (LLMs) have improved robots' spatial reasoning and problem-solving abilities, they sti...
-
- Authors: Long Le, Jason Xie, William Liang, Hung-Ju Wang, Yue Yang, Yecheng Jason Ma, Kyle Vedder, Arjun Krishna, Dinesh Jayaraman, ERIC EATON
- Abstract: Interactive 3D simulated objects are crucial in AR/VR, animations, and robotics, driving immersive experiences and advanced automation.However, creating these articulated objects requires extensive human effort and expertise, limiting their broader applications. To overcome this challenge, we presen...
-
BaB-ND: Long-Horizon Motion Planning with Branch-and-Bound and Neural Dynamics
- Authors: Keyi Shen, Jiangwei Yu, Jose Barreiros, Huan Zhang, Yunzhu Li
- Abstract: Neural-network-based dynamics models learned from observational data have shown strong predictive capabilities for scene dynamics in robotic manipulation tasks. However, their inherent non-linearity presents significant challenges for effective planning. Current planning methods, often dependent on ...
-
Behavioral Entropy-Guided Dataset Generation for Offline Reinforcement Learning
- Authors: Wesley Suttle, Aamodh Suresh, Carlos Nieto-Granda
- Abstract: Entropy-based objectives are widely used to perform state space exploration in reinforcement learning (RL) and dataset generation for offline RL. Behavioral entropy (BE), a rigorous generalization of classical entropies that incorporates cognitive and perceptual biases of agents, was recently propos...
-
Bidirectional Decoding: Improving Action Chunking via Closed-Loop Resampling
- Authors: Yuejiang Liu, Jubayer Hamid, Annie Xie, Yoonho Lee, Max Du, Chelsea Finn
- Abstract: Predicting and executing a sequence of actions without intermediate replanning, known as action chunking, is increasingly used in robot learning from human demonstrations. Yet, its reported effects on the learned policy are inconsistent: some studies find it crucial for achieving strong results, whi...
-
Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel
- Authors: Zun Wang, Jialu Li, Yicong Hong, Songze Li, Kunchang Li, Shoubin Yu, Yi Wang, Yu Qiao, Yali Wang, Mohit Bansal, Limin Wang
- Abstract: Creating high-quality data for training robust language-instructed agents is a long-lasting challenge in embodied AI. In this paper, we introduce a Self-Refining Data Flywheel (SRDF) that generates high-quality and large-scale navigational instruction-trajectory pairs by iteratively refining the dat...
-
Breaking Neural Network Scaling Laws with Modularity
- Authors: Akhilan Boopathy, Sunshine Jiang, William Yue, Jaedong Hwang, Abhiram Iyer, Ila Fiete
- Abstract: Modular neural networks outperform nonmodular neural networks on tasks ranging from visual question answering to robotics. These performance improvements are thought to be due to modular networks' superior ability to model the compositional and combinatorial structure of real-world problems. However...
-
- Authors: Ruochen Jiao, Shaoyuan Xie, Justin Yue, TAKAMI SATO, Lixu Wang, Yixuan Wang, Qi Alfred Chen, Qi Zhu
- Abstract: Large Language Models (LLMs) have shown significant promise in real-world decision-making tasks for embodied artificial intelligence, especially when fine-tuned to leverage their inherent common sense and reasoning abilities while being tailored to specific applications. However, this fine-tuning pr...
-
CaPo: Cooperative Plan Optimization for Efficient Embodied Multi-Agent Cooperation
- Authors: Jie Liu, Pan Zhou, Yingjun Du, Ah-Hwee Tan, Cees G Snoek, Jan-jakob Sonke, Efstratios Gavves
- Abstract: In this work, we address the cooperation problem among large language model (LLM) based embodied agents, where agents must cooperate to achieve a common goal. Previous methods often execute actions extemporaneously and incoherently, without long-term strategic and cooperative planning, leading to r...
-
COMBO: Compositional World Models for Embodied Multi-Agent Cooperation
- Authors: Hongxin Zhang, Zeyuan Wang, Qiushi Lyu, Zheyuan Zhang, Sunli Chen, Tianmin Shu, Yilun Du, Behzad Dariush, Kwonjoon Lee, Chuang Gan
- Abstract: In this paper, we investigate the problem of embodied multi-agent cooperation, where decentralized agents must cooperate given only egocentric views of the world. To effectively plan in this setting, in contrast to learning world dynamics in a single-agent scenario, we must simulate world dynamics c...
-
Conformalized Interactive Imitation Learning: Handling Expert Shift and Intermittent Feedback
- Authors: Michelle Zhao, Henny Admoni, Reid Simmons, Aaditya Ramdas, Andrea Bajcsy
- Abstract: In interactive imitation learning (IL), uncertainty quantification offers a way for the learner (i.e. robot) to contend with distribution shifts encountered during deployment by actively seeking additional feedback from an expert (i.e. human) online. Prior works use mechanisms like ensemble disagree...
-
Contractive Dynamical Imitation Policies for Efficient Out-of-Sample Recovery
- Authors: Amin Soleimani Abyaneh, Mahrokh Boroujeni, Hsiu-Chin Lin, Giancarlo Ferrari-Trecate
- Abstract: Imitation learning is a data-driven approach to learning policies from expert behavior, but it is prone to unreliable outcomes in out-of-sample (OOS) regions. While previous research on stable dynamical system policies guarantees convergence to a desired state, it often overlooks transient behavior....
-
Cross-Embodiment Dexterous Grasping with Reinforcement Learning
- Authors: Haoqi Yuan, Bohan Zhou, Yuhui Fu, Zongqing Lu
- Abstract: Dexterous hands exhibit significant potential for complex real-world grasping tasks. While recent studies have primarily focused on learning policies for specific robotic hands, the development of a universal policy that controls diverse dexterous hands remains largely unexplored.In this work, we st...
-
Data Scaling Laws in Imitation Learning for Robotic Manipulation
- Authors: Fanqi Lin, Yingdong Hu, Pingyue Sheng, Chuan Wen, Jiacheng You, Yang Gao
- Abstract: Data scaling has revolutionized fields like natural language processing and computer vision, providing models with remarkable generalization capabilities. In this paper, we investigate whether similar data scaling laws exist in robotics, particularly in robotic manipulation, and whether appropriate ...
-
DenseGrounding: Improving Dense Language-Vision Semantics for Ego-centric 3D Visual Grounding
- Authors: Henry Zheng, Hao Shi, Qihang Peng, Yong Xien Chng, Rui Huang, Yepeng Weng, zhongchao shi, Gao Huang
- Abstract: Enabling intelligent agents to comprehend and interact with 3D environments through natural language is crucial for advancing robotics and human-computer interaction. A fundamental task in this field is ego-centric 3D visual grounding, where agents locate target objects in real-world 3D spaces based...
-
DenseMatcher: Learning 3D Semantic Correspondence for Category-Level Manipulation from One Demo
- Authors: Junzhe Zhu, Yuanchen Ju, Junyi Zhang, Muhan Wang, Zhecheng Yuan, Kaizhe Hu, Huazhe Xu
- Abstract: Dense 3D correspondence can enhance robotic manipulation by enabling the generalization of spatial, functional, and dynamic information from one object to an unseen counterpart. Compared to shape correspondence, semantic correspondence is more effective in generalizing across different object catego...
-
- Authors: Xueyi Liu, Jianibieke Adalibieke, Qianwei Han, Yuzhe Qin, Li Yi
- Abstract: We address the challenge of developing a generalizable neural tracking controller for dexterous manipulation from human references. This controller aims to manage a dexterous robot hand to manipulate diverse objects for various purposes defined by kinematic human-object interactions. Developing such...
-
Diffusion Policy Policy Optimization
- Authors: Allen Z. Ren, Justin Lidard, Lars Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, Max Simchowitz
- Abstract: We introduce Diffusion Policy Policy Optimization, DPPO, an algorithmic framework including best practices for fine-tuning diffusion-based policies (e.g. Diffusion Policy) in continuous control and robot learning tasks using the policy gradient (PG) method from reinforcement learning (RL). PG method...
-
Direct Multi-agent Motion Generation Preference Alignment with Implicit Feedback from Demonstrations
- Authors: Thomas Tian, Kratarth Goel
- Abstract: Recent advancements in Large Language Models (LLMs) have transformed motion generation models in embodied applications such as autonomous driving and robotic manipulation. While LLM-type motion models benefit from scalability and efficient formulation, there remains a discrepancy between their token...
-
Discriminator-Guided Embodied Planning for LLM Agent
- Authors: Haofu Qian, Chenjia Bai, Jiatao Zhang, Fei Wu, Wei Song, Xuelong Li
- Abstract: Large Language Models (LLMs) have showcased remarkable reasoning capabilities in various domains, yet face challenges in complex embodied tasks due to the need for a coherent long-term policy and context-sensitive environmental understanding. Previous work performed LLM refinement relying on outcome...
-
Dobi-SVD: Differential SVD for LLM Compression and Some New Perspectives
- Authors: Qinsi Wang, Jinghan Ke, Masayoshi Tomizuka, Kurt Keutzer, Chenfeng Xu
- Abstract: Large language models (LLMs) have sparked a new wave of AI applications; however, their substantial computational costs and memory demands pose significant challenges to democratizing access to LLMs for a broader audience. Singular Value Decomposition (SVD), a technique studied for decades, offers a...
-
Dream to Manipulate: Compositional World Models Empowering Robot Imitation Learning with Imagination
- Authors: Leonardo Barcellona, Andrii Zadaianchuk, Davide Allegro, Samuele Papa, Stefano Ghidoni, Efstratios Gavves
- Abstract: A world model provides an agent with a representation of its environment, enabling it to predict the causal consequences of its actions. Current world models typically cannot directly and explicitly imitate the actual environment in front of a robot, often resulting in unrealistic behaviors and hall...
-
- Authors: Wei Hung, Shao-Hua Sun, Ping-Chun Hsieh
- Abstract: Action-constrained reinforcement learning (ACRL) is a generic framework for learning control policies with zero action constraint violation, which is required by various safety-critical and resource-constrained applications. The existing ACRL methods can typically achieve favorable constraint satisf...
-
Efficient Active Imitation Learning with Random Network Distillation
- Authors: Emilien Biré, Anthony Kobanda, Ludovic Denoyer, Rémy Portelas
- Abstract: Developing agents for complex and underspecified tasks, where no clear objective exists, remains challenging but offers many opportunities. This is especially true in video games, where simulated players (bots) need to play realistically, and there is no clear reward to evaluate them. While imitatio...
-
Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning
- Authors: Moritz Reuss, Jyothish Pari, Pulkit Agrawal, Rudolf Lioutikov
- Abstract: Diffusion Policies have become widely used in Imitation Learning, offering several appealing properties, such as generating multimodal and discontinuous behavior.As models are becoming larger to capture more complex capabilities, their computational demands increase, as shown by recent scaling laws....
-
Efficient Imitation under Misspecification
- Authors: Nicolas Espinosa Dice, Sanjiban Choudhury, Wen Sun, Gokul Swamy
- Abstract: Interactive imitation learning (IL) is a powerful paradigm for learning to make sequences of decisions from an expert demonstrating how to perform a task. Prior work in efficient imitation learning has focused on the realizable setting, where the expert's policy lies within the learner's policy clas...
-
Efficient Model-Based Reinforcement Learning Through Optimistic Thompson Sampling
- Authors: Jasmine Bayrooti, Carl Ek, Amanda Prorok
- Abstract: Learning complex robot behavior through interactions with the environment necessitates principled exploration. Effective strategies should prioritize exploring regions of the state-action space that maximize rewards, with optimistic exploration emerging as a promising direction aligned with this ide...
-
Efficient Residual Learning with Mixture-of-Experts for Universal Dexterous Grasping
- Authors: Ziye Huang, Haoqi Yuan, Yuhui Fu, Zongqing Lu
- Abstract: Universal dexterous grasping across diverse objects presents a fundamental yet formidable challenge in robot learning. Existing approaches using reinforcement learning (RL) to develop policies on extensive object datasets face critical limitations, including complex curriculum design for multi-task ...
-
- Authors: Tianfang Zhu, Dongli Hu, Jiandong Zhou, Kai Du, Anan LI
- Abstract: The brain's ability to transform sensory inputs into motor functions is central to neuroscience and crucial for the development of embodied intelligence. Sensory-motor integration involves complex neural circuits, diverse neuronal types, and intricate intercellular connections. Bridging the gap betw...
-
EmbodiedSAM: Online Segment Any 3D Thing in Real Time
- Authors: Xiuwei Xu, Huangxing Chen, Linqing Zhao, Ziwei Wang, Jie Zhou, Jiwen Lu
- Abstract: Embodied tasks require the agent to fully understand 3D scenes simultaneously with its exploration, so an online, real-time, fine-grained and highly-generalized 3D perception model is desperately needed. Since high-quality 3D data is limited, directly training such a model in 3D is infeasible. Meanw...
-
EMOS: Embodiment-aware Heterogeneous Multi-robot Operating System with LLM Agents
- Authors: Junting Chen, Checheng Yu, Xunzhe Zhou, Tianqi Xu, Yao Mu, Mengkang Hu, Wenqi Shao, Yikai Wang, Guohao Li, Lin Shao
- Abstract: Heterogeneous multi-robot systems (HMRS) have emerged as a powerful ap-proach for tackling complex tasks that single robots cannot manage alone. Currentlarge-language-model-based multi-agent systems (LLM-based MAS) have shownsuccess in areas like software development and operating systems, but apply...
-
ET-SEED: EFFICIENT TRAJECTORY-LEVEL SE(3) EQUIVARIANT DIFFUSION POLICY
- Authors: Chenrui Tie, Yue Chen, Ruihai Wu, Boxuan Dong, Zeyi Li, Chongkai Gao, Hao Dong
- Abstract: Imitation learning, e.g., diffusion policy, has been proven effective in various robotic manipulation tasks.However, extensive demonstrations are required for policy robustness and generalization.To reduce the demonstration reliance, we leverage spatial symmetry and propose ET-SEED, an efficient tra...
-
EXPLOITING DISTRIBUTION CONSTRAINTS FOR SCALABLE AND EFFICIENT IMAGE RETRIEVAL
- Authors: Mohammad Omama, Po-han Li, Sandeep Chinchali
- Abstract: Image retrieval is crucial in robotics and computer vision, with downstream applications in robot place recognition and vision-based product recommendations. Modern retrieval systems face two key challenges: scalability and efficiency.State-of-the-art image retrieval systems train specific neural ne...
-
FLIP: Flow-Centric Generative Planning for General-Purpose Manipulation Tasks
- Authors: Chongkai Gao, Haozhuo Zhang, Zhixuan Xu, Cai Zhehao, Lin Shao
- Abstract: We aim to develop a model-based planning framework for world models that can be scaled with increasing model and data budgets for general-purpose manipulation tasks with only language and vision inputs. To this end, we present FLow-CentrIc generative Planning (FLIP), a model-based planning algorithm...
-
Following the Human Thread in Social Navigation
- Authors: Luca Scofano, Alessio Sampieri, Tommaso Campari, Valentino Sacco, Indro Spinelli, Lamberto Ballan, Fabio Galasso
- Abstract: The success of collaboration between humans and robots in shared environments relies on the robot's real-time adaptation to human motion. Specifically, in Social Navigation, the agent should be close enough to assist but ready to back up to let the human move freely, avoiding collisions. Human traje...
-
FOSP: Fine-tuning Offline Safe Policy through World Models
- Authors: Chenyang Cao, Yucheng Xin, Silang Wu, Longxiang He, Zichen Yan, Junbo Tan, Xueqian Wang
- Abstract: Offline Safe Reinforcement Learning (RL) seeks to address safety constraints by learning from static datasets and restricting exploration. However, these approaches heavily rely on the dataset and struggle to generalize to unseen scenarios safely. In this paper, we aim to improve safety during the d...
-
Generalizable Motion Planning via Operator Learning
- Authors: Sharath Matada, Luke Bhan, Yuanyuan Shi, Nikolay Atanasov
- Abstract: In this work, we introduce a planning neural operator (PNO) for predicting the value function of a motion planning problem. We recast value function approximation as learning a single operator from the cost function space to the value functionspace, which is defined by an Eikonal partial differentia...
-
Generalized Behavior Learning from Diverse Demonstrations
- Authors: Varshith Sreeramdass, Rohan Paleja, Letian Chen, Sanne van Waveren, Matthew Gombolay
- Abstract: Diverse behavior policies are valuable in domains requiring quick test-time adaptation or personalized human-robot interaction. Human demonstrations provide rich information regarding task objectives and factors that govern individual behavior variations, which can be used to characterize \it{useful...
-
General Scene Adaptation for Vision-and-Language Navigation
- Authors: Haodong Hong, Yanyuan Qiao, Sen Wang, Jiajun Liu, Qi Wu
- Abstract: Vision-and-Language Navigation (VLN) tasks mainly evaluate agents based on one-time execution of individual instructions across multiple environments, aiming to develop agents capable of functioning in any environment in a zero-shot manner. However, real-world navigation robots often operate in pers...
-
Generating Freeform Endoskeletal Robots
- Authors: Muhan Li, Lingji Kong, Sam Kriegman
- Abstract: The automatic design of embodied agents (e.g. robots) has existed for 31 years and is experiencing a renaissance of interest in the literature. To date however, the field has remained narrowly focused on two kinds of anatomically simple robots: (1) fully rigid, jointed bodies; and (2) fully soft, jo...
-
- Authors: TaiMing Lu, Tianmin Shu, Daniel Khashabi, Alan Yuille, Jieneng Chen
- Abstract: Planning with partial observation is a central challenge in embodied AI. A majority of prior works have tackled this challenge by developing agents that physically explore their environment to update their beliefs about the world state. However, humans can imagine unseen parts of the world through a...
-
Genesis: Advancing Towards Efficient Embodiment Co-Design
- Authors: Haofei Lu, Zhe Wu, Junliang Xing, Jianshu Li, Ruoyu Li, Zhe Li, Yuanchun Shi
- Abstract: Embodiment co-design aims to optimize a robot's morphology and control simultaneously. Previous research has demonstrated its potential for generating environment-adaptive robots.However, the problem is inherently combinatorial and the morphology is changeable and agnostic in its vast search space,...
-
Geometry-aware RL for Manipulation of Varying Shapes and Deformable Objects
- Authors: Tai Hoang, Huy Le, Philipp Becker, Vien A Ngo, Gerhard Neumann
- Abstract: Manipulating objects with varying geometries and deformable objects is a major challenge in robotics. Tasks such as insertion with different objects or cloth hanging require precise control and effective modelling of complex dynamics. In this work, we frame this problem through the lens of a heterog...
-
GEVRM: Goal-Expressive Video Generation Model For Robust Visual Manipulation
- Authors: Hongyin Zhang, Pengxiang Ding, Shangke Lyu, Ying Peng, Donglin Wang
- Abstract: With the rapid development of embodied artificial intelligence, significant progress has been made in vision-language-action (VLA) models for general robot decision-making. However, the majority of existing VLAs fail to account for the inevitable external perturbations encountered during deployment....
-
Graph Neural Networks Gone Hogwild
- Authors: Olga Solodova, Nick Richardson, Deniz Oktay, Ryan P Adams
- Abstract: Graph neural networks (GNNs) appear to be powerful tools to learn state representations for agents in distributed, decentralized multi-agent systems, but generate catastrophically incorrect predictions when nodes update asynchronously during inference. This failure under asynchrony effectively excl...
-
GravMAD: Grounded Spatial Value Maps Guided Action Diffusion for Generalized 3D Manipulation
- Authors: Yangtao Chen, Chen, Junhui Yin, Jing Huo, Pinzhuo Tian, Jieqi Shi, Yang Gao
- Abstract: Robots' ability to follow language instructions and execute diverse 3D tasks is vital in robot learning. Traditional imitation learning-based methods perform well on seen tasks but struggle with novel, unseen ones due to variability. Recent approaches leverage large foundation models to assist in un...
-
Grid Cell-Inspired Fragmentation and Recall for Efficient Map Building
- Authors: Zhang-Wei Hong, Akhilan Boopathy, Jaedong Hwang, Eric Chen, Ila Fiete, Pulkit Agrawal
- Abstract: Animals and robots navigate through environments by building and refining maps of space. These maps enable functions including navigation back to home, planning, search and foraging. Here, we use observations from neuroscience, specifically the observed fragmentation of grid cell map in compartmenta...
-
GROOT-2: Weakly Supervised Multimodal Instruction Following Agents
- Authors: Shaofei Cai, Bowei Zhang, Zihao Wang, Haowei Lin, Xiaojian Ma, Anji Liu, Yitao Liang
- Abstract: Developing agents that can follow multimodal instructions remains a fundamental challenge in robotics and AI. Although large-scale pre-training on unlabeled datasets has enabled agents to learn diverse behaviors, these agents often struggle with following instructions. While augmenting the dataset w...
-
Grounding Video Models to Actions through Goal Conditioned Exploration
- Authors: Yunhao Luo, Yilun Du
- Abstract: Large video models, pretrained on massive quantities of amount of Internet video, provide a rich source of physical knowledge about the dynamics and motions of objects and tasks.However, video models are not grounded in the embodiment of an agent, and do not describe how to actuate the world to rea...
-
HAMSTER: Hierarchical Action Models for Open-World Robot Manipulation
- Authors: Yi Li, Yuquan Deng, Jesse Zhang, Joel Jang, Marius Memmel, Caelan Garrett, Fabio Ramos, Dieter Fox, Anqi Li, Abhishek Gupta, Ankit Goyal
- Abstract: Large models have shown strong open-world generalization to complex problems in vision and language, but they have been relatively more difficult to deploy in robotics. This challenge stems from several factors, the foremost of which is the lack of scalable robotic training data since this requires ...
-
HASARD: A Benchmark for Harnessing Safe Reinforcement Learning with Doom
- Authors: Tristan Tomilin, Meng Fang, Mykola Pechenizkiy
- Abstract: The advancement of safe reinforcement learning (RL) faces numerous obstacles, including the lack of simulation environments, demanding computational requirements, and a lack of widely accepted benchmarks. To address these challenges, we introduce HASARD (A Benchmark for HArnessing SAfe *...
-
HGM³: Hierarchical Generative Masked Motion Modeling with Hard Token Mining
- Authors: Minjae Jeong, Yechan Hwang, Jaejin Lee, Sungyoon Jung, Won Hwa Kim
- Abstract: Text-to-motion generation has significant potential in a wide range of applications including animation, robotics, and AR/VR. While recent works on masked motion models are promising, the task remains challenging due to the inherent ambiguity in text and the complexity of human motion dynamics. To o...
-
ImagineNav: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination
- Authors: Xinxin Zhao, Wenzhe Cai, Likun Tang, Teng Wang
- Abstract: Visual navigation is an essential skill for home-assistance robots, providing the object-searching ability to accomplish long-horizon daily tasks. Many recent approaches use Large Language Models (LLMs) for commonsense inference to improve exploration efficiency. However, the planning process of LLM...
-
Instant Policy: In-Context Imitation Learning via Graph Diffusion
- Authors: Vitalis Vosylius, Edward Johns
- Abstract: Following the impressive capabilities of in-context learning with large transformers, In-Context Imitation Learning (ICIL) is a promising opportunity for robotics. We introduce Instant Policy, which learns new tasks instantly from just one or two demonstrations, achieving ICIL through two key compon...
-
Intelligent Go-Explore: Standing on the Shoulders of Giant Foundation Models
- Authors: Cong Lu, Shengran Hu, Jeff Clune
- Abstract: Go-Explore is a powerful family of algorithms designed to solve hard-exploration problems built on the principle of archiving discovered states, and iteratively returning to and exploring from the most promising states. This approach has led to superhuman performance across a wide variety of challen...
-
Interactive Adjustment for Human Trajectory Prediction with Individual Feedback
- Authors: Jianhua Sun, Yuxuan Li, Liang Chai, Cewu Lu
- Abstract: Human trajectory prediction is fundamental for autonomous driving and service robot. The research community has studied various important aspects of this task and made remarkable progress recently. However, there is an essential perspective which is not well exploited in previous research all along,...
-
Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence
- Authors: Weize Chen, Ziming You, Ran Li, yitong guan, Chen Qian, Chenyang Zhao, Cheng Yang, Ruobing Xie, Zhiyuan Liu, Maosong Sun
- Abstract: The rapid advancement of large language models (LLMs) has paved the way for the development of highly capable autonomous agents. However, existing multi-agent frameworks often struggle with integrating diverse capable third-party agents due to reliance on agents defined within their own ecosystems. ...
-
Joint Reward and Policy Learning with Demonstrations and Human Feedback Improves Alignment
- Authors: Chenliang Li, Siliang Zeng, Zeyi Liao, Jiaxiang Li, Dongyeop Kang, Alfredo Garcia, Mingyi Hong
- Abstract: Aligning to human preferences and/or intentions is an important requirement for contemporary foundation models. To ensure alignment, popular approaches such as reinforcement learning with human feedback (RLHF) break down the task into three stages: (i) a model is computed with supervised fine-tuning...
-
Kinetix: Investigating the Training of General Agents through Open-Ended Physics-Based Control Tasks
- Authors: Michael Matthews, Michael Beukman, Chris Lu, Jakob Foerster
- Abstract: While large models trained with self-supervised learning on offline datasets have shown remarkable capabilities in text and image domains, achieving the same generalisation for agents that act in sequential decision problems remains an open challenge.In this work, we take a step towards this goal by...
-
Language Guided Skill Discovery
- Authors: Seungeun Rho, Laura Smith, Tianyu Li, Sergey Levine, Xue Bin Peng, Sehoon Ha
- Abstract: Skill discovery methods enable agents to learn diverse emergent behaviors without explicit rewards. To make learned skills useful for downstream tasks, obtaining a semantically diverse repertoire of skills is crucial. While some approaches use discriminators to acquire distinguishable skills and oth...
-
LASeR: Towards Diversified and Generalizable Robot Design with Large Language Models
- Authors: Junru Song, Yang Yang, Huan Xiao, Wei Peng, Wen Yao, Feifei Wang
- Abstract: Recent advances in Large Language Models (LLMs) have stimulated a significant paradigm shift in evolutionary optimization, where hand-crafted search heuristics are gradually replaced with LLMs serving as intelligent search operators. However, these studies still bear some notable limitations, includ...
-
Latent Action Pretraining from Videos
- Authors: Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Se June Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, Minjoon Seo
- Abstract: We introduce Latent Action Pretraining for general Action models (LAPA), the first unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. Existing Vision-Language-Action models require action labels typically collected by human teleoperators...
-
Learning Closed-Loop Concept-Guided Policies from Unlabeled Demonstrations
- Authors: Pei Zhou, Ruizhe Liu, Qian Luo, Yibing Song, Fan Wang, Yanchao Yang
- Abstract: Training embodied agents to perform complex robotic tasks presents significant challenges due to the entangled factors of task compositionality, environmental diversity, and dynamic changes. In this work, we introduce a novel imitation learning framework to train closed-loop concept-guided policies ...
-
Learning Geometric Reasoning Networks For Robot Task And Motion Planning
- Authors: Smail Ait Bouhsain, Rachid Alami, Thierry Simeon
- Abstract: Task and Motion Planning (TAMP) is a computationally challenging robotics problem due to the tight coupling of discrete symbolic planning and continuous geometric planning of robot motions. In particular, planning manipulation tasks in complex 3D environments leads to a large number of costly geomet...
-
Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors
- Authors: Ruoxuan Feng, Jiangyu Hu, Wenke Xia, Tianci Gao, Ao Shen, Yuhao Sun, Bin Fang, Di Hu
- Abstract: Visuo-tactile sensors aim to emulate human tactile perception, enabling robots to precisely understand and manipulate objects. Over time, numerous meticulously designed visuo-tactile sensors have been integrated into robotic systems, aiding in completing various tasks. However, the distinct data cha...
-
Learning View-invariant World Models for Visual Robotic Manipulation
- Authors: Jing-Cheng Pang, Nan Tang, Kaiyuan Li, Yuting Tang, Xin-Qiang Cai, Zhen-Yu Zhang, Gang Niu, Masashi Sugiyama, Yang Yu
- Abstract: Robotic manipulation tasks often rely on visual inputs from cameras to perceive the environment. However, previous approaches still suffer from performance degradation when the camera’s viewpoint changes during manipulation. In this paper, we propose ReViWo (Representation learning for View-invarian...
-
Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning
- Authors: Calarina Muslimani, Matthew E Taylor
- Abstract: To create useful reinforcement learning (RL) agents, step zero is to design a suitable reward function that captures the nuances of the task. However, reward engineering can be a difficult and time-consuming process. Instead, human-in-the-loop (HitL) RL methods hold the promise of learning reward f...
-
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
- Authors: Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, Michael Ryoo
- Abstract: LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process state information as visual-textual prompts and respond with policy decisions in text. We propose LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as conversations ...
-
LocoVR: Multiuser Indoor Locomotion Dataset in Virtual Reality
- Authors: Kojiro Takeyama, Yimeng Liu, Misha Sra
- Abstract: Understanding human locomotion is crucial for AI agents such as robots, particularly in complex indoor home environments. Modeling human trajectories in these spaces requires insight into how individuals maneuver around physical obstacles and manage social navigation dynamics. These dynamics include...
-
ManiSkill-HAB: A Benchmark for Low-Level Manipulation in Home Rearrangement Tasks
- Authors: Arth Shukla, Stone Tao, Hao Su
- Abstract: High-quality benchmarks are the foundation for embodied AI research, enabling significant advancements in long-horizon navigation, manipulation and rearrangement tasks. However, as frontier tasks in robotics get more advanced, they require faster simulation speed, more intricate test environments, a...
-
MetaUrban: An Embodied AI Simulation Platform for Urban Micromobility
- Authors: Wayne Wu, Honglin He, Jack He, Yiran Wang, Chenda Duan, Zhizheng Liu, Quanyi Li, Bolei Zhou
- Abstract: Public urban spaces like streetscapes and plazas serve residents and accommodate social life in all its vibrant variations. Recent advances in Robotics and Embodied AI make public urban spaces no longer exclusive to humans. Food delivery bots and electric wheelchairs have started sharing sidewalks w...
-
MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents
- Authors: Junpeng Yue, Xinrun Xu, Börje Karlsson, Zongqing Lu
- Abstract: MLLM agents demonstrate potential for complex embodied tasks by retrieving multimodal task-relevant trajectory data. However, current retrieval methods primarily focus on surface-level similarities of textual or visual cues in trajectories, neglecting their effectiveness for the specific task at han...
-
MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation
- Authors: Donggon Jang, Yucheol Cho, Suin Lee, Taehyeon Kim, DAE SHIK KIM
- Abstract: The fusion of Large Language Models (LLMs) with vision models is pioneering new possibilities in user-interactive vision-language tasks. A notable application is reasoning segmentation, where models generate pixel-level segmentation masks by comprehending implicit meanings in human instructions. How...
-
Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning
- Authors: Baoqi Pei, Yifei Huang, Jilan Xu, Guo Chen, Yuping He, Lijin Yang, Yali Wang, Weidi Xie, Yu Qiao, Fei Wu, Limin Wang
- Abstract: In egocentric video understanding, the motion of hands and objects as well as their interactions play a significant role by nature.However, existing egocentric video representation learning methods mainly focus on aligning video representation with high-level narrations, overlooking the intricate dy...
-
- Authors: Xinyue Wang, Biwei Huang
- Abstract: Generalization in reinforcement learning (RL) remains a significant challenge, especially when agents encounter novel environments with unseen dynamics. Drawing inspiration from human compositional reasoning—where known components are reconfigured to handle new situations—we introduce World Modeling...
-
Motion Control of High-Dimensional Musculoskeletal System with Hierarchical Model-Based Planning
- Authors: Yunyue Wei, Shanning Zhuang, Vincent Zhuang, Yanan Sui
- Abstract: Controlling high-dimensional nonlinear systems presents significant challenges in biological and robotic applications due to the large state and action spaces. While deep reinforcement learning has emerged as the leading approach, it suffers from computationally-intensive and time-consuming, and are...
-
Mr.Steve: Instruction-Following Agents in Minecraft with What-Where-When Memory
- Authors: Junyeong Park, Junmo Cho, Sungjin Ahn
- Abstract: Significant advances have been made in developing general-purpose embodied AI in environments like Minecraft through the adoption of LLM-augmented hierarchical approaches. While these approaches, which combine high-level planners with low-level controllers, show promise, low-level controllers freque...
-
Multi-Robot Motion Planning with Diffusion Models
- Authors: Yorai Shaoul, Itamar Mishani, Shivam Vats, Jiaoyang Li, Maxim Likhachev
- Abstract: Diffusion models have recently been successfully applied to a wide range of robotics applications for learning complex multi-modal behaviors from data. However, prior works have mostly been confined to single-robot and small-scale environments due to the high sample complexity of learning multi-robo...
-
- Authors: Qixin ZHANG, Zongqi Wan, Yu Yang, Li Shen, Dacheng Tao
- Abstract: Coordinating multiple agents to collaboratively maximize submodular functions in unpredictable environments is a critical task with numerous applications in machine learning, robot planning and control. The existing approaches, such as the OSG algorithm, are often hindered by their poor approximati...
-
NeSyC: A Neuro-symbolic Continual Learner For Complex Embodied Tasks in Open Domains
- Authors: Wonje Choi, Jinwoo Park, Sanghyun Ahn, Daehee Lee, Honguk Woo
- Abstract: We explore neuro-symbolic approaches to generalize actionable knowledge, enabling embodied agents to tackle complex tasks more effectively in open-domain environments. A key challenge for embodied agents is the generalization of knowledge across diverse environments and situations, as limited experi...
-
Neural Wave Equation for Irregularly Sampled Sequence Data
- Authors: Arkaprava Majumdar, M Krishna, P. K. Srijith
- Abstract: Sequence labeling problems arise in several real-world applications such as healthcare and robotics. In many such applications, sequence data are irregularly sampled and are of varying complexities. Recently, efforts have been made to develop neural ODE-based architectures to model the evolution of ...
-
- Authors: Anish Abhijit Diwan, Julen Urain, Jens Kober, Jan Peters
- Abstract: This paper introduces a new imitation learning framework based on energy-based generative models capable of learning complex, physics-dependent, robot motion policies through state-only expert motion trajectories. Our algorithm, called Noise-conditioned Energy-based Annealed Rewards (NEAR), construc...
-
Null Counterfactual Factor Interactions for Goal-Conditioned Reinforcement Learning
- Authors: Caleb Chuck, Fan Feng, Carl Qi, Chang Shi, Siddhant Agarwal, Amy Zhang, Scott Niekum
- Abstract: Hindsight relabeling is a powerful tool for overcoming sparsity in goal-conditioned reinforcement learning (GCRL). While effective in some domains like navigation and locomotion, hindsight relabeling can struggle in object-centric domains. For example, suppose that the goal space consists of a robot...
-
Occlusion-aware Non-Rigid Point Cloud Registration via Unsupervised Neural Deformation Correntropy
- Authors: Mingyang Zhao, Gaofeng Meng, Dong-ming Yan
- Abstract: Non-rigid alignment of point clouds is crucial for scene understanding, reconstruction, and various computer vision and robotics tasks. Recent advancements in implicit deformation networks for non-rigid registration have significantly reduced the reliance on large amounts of annotated training data....
-
Offline Hierarchical Reinforcement Learning via Inverse Optimization
- Authors: Carolin Schmidt, Daniele Gammelli, James Harrison, Marco Pavone, Filipe Rodrigues
- Abstract: Hierarchical policies enable strong performance in many sequential decision-making problems, such as those with high-dimensional action spaces, those requiring long-horizon planning, and settings with sparse rewards. However, learning hierarchical policies from static offline datasets presents a sig...
-
Online Neuro-Symbolic Predicate Invention for High-Level Planning
- Authors: Yichao Liang, Nishanth Kumar, Hao Tang, Adrian Weller, Joshua B Tenenbaum, Tom Silver, Joao F. Henriques, Kevin Ellis
- Abstract: Broadly intelligent agents should form task-specific abstractions that selectively expose the essential elements of a task, while abstracting away the complexity of the raw sensorimotor space. In this work, we present Neuro-Symbolic Predicates, a first-order abstraction language that combines the st...
-
PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks
- Authors: Matthew Chang, Gunjan Chhablani, Alexander Clegg, Mikael Dallaire Cote, Ruta Desai, Michal Hlavac, Vladimir Karashchuk, Jacob Krantz, Roozbeh Mottaghi, Priyam Parashar, Siddharth Patki, Ishita Prasad, Xavier Puig, Akshara Rai, Ram Ramrakhya, Daniel Tran, Joanne Truong, John Turner, Eric Undersander, Tsung-Yen Yang
- Abstract: We present a benchmark for Planning And Reasoning Tasks in humaN-Robot collaboration (PARTNR) designed to study human-robot coordination in household activities. PARTNR tasks exhibit characteristics of everyday tasks, such as spatial, temporal, and heterogeneous agent capability constraints. We empl...
-
PEAR: Primitive Enabled Adaptive Relabeling for Boosting Hierarchical Reinforcement Learning
- Authors: Utsav Singh, Vinay Namboodiri
- Abstract: Hierarchical reinforcement learning (HRL) has the potential to solve complex long horizon tasks using temporal abstraction and increased exploration. However, hierarchical agents are difficult to train due to inherent non-stationarity. We present primitive enabled adaptive relabeling (PEAR), a two-p...
-
Physics-informed Temporal Difference Metric Learning for Robot Motion Planning
- Authors: Ruiqi Ni, zherong pan, Ahmed Qureshi
- Abstract: The motion planning problem involves finding a collision-free path from a robot's starting to its target configuration. Recently, self-supervised learning methods have emerged to tackle motion planning problems without requiring expensive expert demonstrations. They solve the Eikonal equation for tr...
-
POGEMA: A Benchmark Platform for Cooperative Multi-Agent Navigation
- Authors: Alexey Skrynnik, Anton Andreychuk, Anatolii Borzilov, Alexander Chernyavskiy, Konstantin Yakovlev, Aleksandr Panov
- Abstract: Multi-agent reinforcement learning (MARL) has recently excelled in solving challenging cooperative and competitive multi-agent problems in various environments with, mostly, few agents and full observability. Moreover, a range of crucial robotics-related tasks, such as multi-robot navigation and obs...
-
Policy Decorator: Model-Agnostic Online Refinement for Large Policy Model
- Authors: Xiu Yuan, Tongzhou Mu, Stone Tao, Yunhao Fang, Zhang, Hao Su
- Abstract: Recent advancements in robot learning have used imitation learning with large models and extensive demonstrations to develop effective policies. However, these models are often limited by the quantity quality, and diversity of demonstrations. This paper explores improving offline-trained imitation l...
-
Predicate Hierarchies Improve Few-Shot State Classification
- Authors: Emily Jin, Joy Hsu, Jiajun Wu
- Abstract: State classification of objects and their relations is core to many long-horizon tasks, particularly in robot planning and manipulation. However, the combinatorial explosion of possible object-predicate combinations, coupled with the need to adapt to novel real-world environments, makes it a desider...
-
Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation
- Authors: Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, Jiangmiao Pang
- Abstract: Current efforts to learn scalable policies in robotic manipulation primarily fall into two categories: one focuses on "action," which involves behavior cloning from extensive collections of robotic data, while the other emphasizes "vision," enhancing model generalization by pre-training representati...
-
Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning
- Authors: Joey Hong, Anca Dragan, Sergey Levine
- Abstract: Value-based reinforcement learning (RL) can in principle learn effective policies for a wide range of multi-turn problems, from games to dialogue to robotic control, including via offline RL from static previously collected datasets. However, despite the widespread use of policy gradient methods to ...
-
Rapidly Adapting Policies to the Real-World via Simulation-Guided Fine-Tuning
- Authors: Patrick Yin, Tyler Westenbroek, Ching-An Cheng, Andrey Kolobov, Abhishek Gupta
- Abstract: Robot learning requires a considerable amount of data to realize the promise of generalization. However, it can be challenging to actually collect the magnitude of high-quality data necessary for generalization entirely in the real world. Simulation can serve as a source of plentiful data, wherein t...
-
- Authors: Charles Dawson, Van Tran, Max Li, Chuchu Fan
- Abstract: Increased deployment of autonomous systems in fields like transportation and robotics have seen a corresponding increase in safety-critical failures. These failures can be difficult to model and debug due to the relative lack of data: compared to tens of thousands of examples from normal operations,...
-
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
- Authors: Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, Jun Zhu
- Abstract: Bimanual manipulation is essential in robotics, yet developing foundation models is extremely challenging due to the inherent complexity of coordinating two robot arms (leading to multi-modal action distributions) and the scarcity of training data. In this paper, we present the Robotics Diffusion Tr...
-
ReGen: Generative Robot Simulation via Inverse Design
- Authors: Peter (Phat) Nguyen, Johnson (Tsun-Hsuan) Wang, Zhang-Wei Hong, Erfan Aasi, Andrew Silva, Guy Rosman, Sertac Karaman, Daniela Rus
- Abstract: Simulation plays a key role in scaling robot learning and validating policies, but constructing simulations remains labor-intensive. In this paper, we introduce ReGen, a generative simulation framework that automates this process using inverse design. Given an agent's behavior (such as a motion traj...
-
REGENT: A Retrieval-Augmented Generalist Agent That Can Act In-Context in New Environments
- Authors: Kaustubh Sridhar, Souradeep Dutta, Dinesh Jayaraman, Insup Lee
- Abstract: Do generalist agents only require large models pre-trained on massive amounts of data to rapidly adapt to new environments? We propose a novel approach to pre-train relatively small models and adapt them to unseen environments via in-context learning, without any finetuning. Our key idea is that ret...
-
Residual Deep Gaussian Processes on Manifolds
- Authors: Kacper Wyrwal, Andreas Krause, Viacheslav (Slava) Borovitskiy
- Abstract: We propose practical deep Gaussian process models on Riemannian manifolds, similar in spirit to residual neural networks.With manifold-to-manifold hidden layers and an arbitrary last layer, they can model manifold- and scalar-valued functions, as well as vector fields.We target data inherently suppo...
-
- Authors: Sumeet Singh, Vikas Sindhwani, Stephen Tu
- Abstract: A crucial design decision for any robot learning pipeline is the choice of policy representation: what type of model should be used to generate the next set of robot actions? Owing to the inherent multi-modal nature of many robotic tasks, combined with the recent successes in generative modeling, re...
-
RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
- Authors: Sergio Gómez Colmenarejo, Jost Springenberg, Jose Enrique Chen, Jonathan Scholz, Raia Hadsell, Claudio Fantacci, Alex Lee, Maria Bauza Villalonga, Yuxiang Zhou, Dushyant Rao, Akhil Raju, Antoine Laurens, Murilo Fernandes Martins, Rugile Pevceviciute, Michiel Blokzijl, Nathan Batchelor, Konrad Zolna, Thomas Lampe, Agrim Gupta, Scott Reed, Abbas Abdolmaleki, David Barker, Joy Ortiz, Martin Riedmiller, Jean-Baptiste Regli, Nicolas Heess, Francesco Nori, Todor Davchev, Oleg O Sushkov, Thomas Rothörl, Misha Denil, Emilio Parisotto, Valentin Dalibard, Martina Zambelli, Yusuf Aytar, Giulia Vezzani, Coline Devin, Oliver Groth, Konstantinos Bousmalis
- Abstract: The ability to leverage heterogeneous robotic experience from different robots and tasks to quickly master novel skills and embodiments has the potential to transform robot learning. Inspired by recent advances in foundation models for vision and language, we propose a multi-embodiment, multi-task g...
-
Robotouille: An Asynchronous Planning Benchmark for LLM Agents
- Authors: Gonzalo Gonzalez-Pumariega, Leong Yean, Neha Sunkara, Sanjiban Choudhury
- Abstract: Effective asynchronous planning, or the ability to efficiently reason and plan over states and actions that must happen in parallel or sequentially, is essential for agents that must account for time delays, reason over diverse long-horizon tasks, and collaborate with other agents. While large langu...
-
Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Dataset
- Authors: Guangqi Jiang, Yifei Sun, Tao Huang, Huanyu Li, Yongyuan Liang, Huazhe Xu
- Abstract: The pre-training of visual representations has enhanced the efficiency of robot learning. Due to the lack of large-scale in-domain robotic datasets, prior works utilize in-the-wild human videos to pre-train robotic visual representation. Despite their promising results, representations from human vi...
-
Robust Gymnasium: A Unified Modular Benchmark for Robust Reinforcement Learning
- Authors: Shangding Gu, Laixi Shi, Muning Wen, Ming Jin, Eric Mazumdar, Yuejie Chi, Adam Wierman, Costas Spanos
- Abstract: Driven by inherent uncertainty and the sim-to-real gap, robust reinforcement learning (RL) seeks to improve resilience against the complexity and variability in agent-environment sequential interactions. Despite the existence of a large number of RL benchmarks, there is a lack of standardized benchm...
-
SafeDiffuser: Safe Planning with Diffusion Probabilistic Models
- Authors: Wei Xiao, Johnson (Tsun-Hsuan) Wang, Chuang Gan, Ramin Hasani, Mathias Lechner, Daniela Rus
- Abstract: Diffusion models have shown promise in data-driven planning. While these planners are commonly employed in applications where decisions are critical, they still lack established safety guarantees. In this paper, we address this limitation by introducing SafeDiffuser, a method to equip diffusion mode...
-
Scalable Decision-Making in Stochastic Environments through Learned Temporal Abstraction
- Authors: Baiting Luo, Ava Pettet, Aron Laszka, Abhishek Dubey, Ayan Mukhopadhyay
- Abstract: Sequential decision-making in high-dimensional continuous action spaces, particularly in stochastic environments, faces significant computational challenges. We explore this challenge in the traditional offline RL setting, where an agent must learn how to make decisions based on data collected throu...
-
Select before Act: Spatially Decoupled Action Repetition for Continuous Control
- Authors: Buqing Nie, Yangqing Fu, Yue Gao
- Abstract: Reinforcement Learning (RL) has achieved remarkable success in various continuous control tasks, such as robot manipulation and locomotion.Different to mainstream RL which makes decisions at individual steps, recent studies have incorporated action repetition into RL, achieving enhanced action persi...
-
Sensor-Invariant Tactile Representation
- Authors: Harsh Gupta, Yuchen Mo, Shengmiao Jin, Wenzhen Yuan
- Abstract: High-resolution tactile sensors have become critical for embodied perception and robotic manipulation. However, a key challenge in the field is the lack of transferability between sensors due to design and manufacturing variations, which result in significant differences in tactile signals. This lim...
-
SmartPretrain: Model-Agnostic and Dataset-Agnostic Representation Learning for Motion Prediction
- Authors: Yang Zhou, Hao Shao, Letian Wang, Steven Waslander, Hongsheng Li, Yu Liu
- Abstract: Predicting the future motion of surrounding agents is essential for autonomous vehicles (AVs) to operate safely in dynamic, human-robot-mixed environments. However, the scarcity of large-scale driving datasets has hindered the development of robust and generalizable motion prediction models, limitin...
-
Solving New Tasks by Adapting Internet Video Knowledge
- Authors: Calvin Luo, Zilai Zeng, Yilun Du, Chen Sun
- Abstract: Video generative models, beyond enabling the production of astounding visual creations, offer a promising pathway for unlocking novel, text-conditioned robotic behaviors, whether utilized as a video planner or as a policy supervisor. When pretrained on internet-scale datasets, such video models int...
-
SonicSim: A customizable simulation platform for speech processing in moving sound source scenarios
- Authors: Kai Li, Wendi Sang, Chang Zeng, Runxuan Yang, Guo Chen, Xiaolin Hu
- Abstract: The systematic evaluation of speech separation and enhancement models under moving sound source conditions typically requires extensive data comprising diverse scenarios. However, real-world datasets often contain insufficient data to meet the training and evaluation requirements of models. Although...
-
SPA: 3D Spatial-Awareness Enables Effective Embodied Representation
- Authors: Haoyi Zhu, Honghui Yang, Yating Wang, Jiange Yang, Limin Wang, Tong He
- Abstract: In this paper, we introduce SPA, a novel representation learning framework that emphasizes the importance of 3D spatial awareness in embodied AI. Our approach leverages differentiable neural rendering on multi-view images to endow a vanilla Vision Transformer (ViT) with intrinsic spatial understandi...
-
SRSA: Skill Retrieval and Adaptation for Robotic Assembly Tasks
- Authors: Yijie Guo, Bingjie Tang, Iretiayo Akinola, Dieter Fox, Abhishek Gupta, Yashraj Narang
- Abstract: Enabling robots to learn novel tasks in a data-efficient manner is a long-standing challenge. Common strategies involve carefully leveraging prior experiences, especially transition data collected on related tasks. Although much progress has been made in developing such strategies for general pick-a...
-
Stabilizing Reinforcement Learning in Differentiable Multiphysics Simulation
- Authors: Eliot Xing, Vernon Luk, Jean Oh
- Abstract: Recent advances in GPU-based parallel simulation have enabled practitioners to collect large amounts of data and train complex control policies using deep reinforcement learning (RL), on commodity GPUs. However, such successes for RL in robotics have been limited to tasks sufficiently simulated by f...
-
- Authors: Kaizhe Hu, Zihang Rui, Yao He, Yuyao Liu, Pu Hua, Huazhe Xu
- Abstract: Visual imitation learning methods demonstrate strong performance, yet they lack generalization when faced with visual input perturbations like variations in lighting and textures. This limitation hampers their practical application in real-world settings. To address this, we propose Stem-OB th...
-
STRAP: Robot Sub-Trajectory Retrieval for Augmented Policy Learning
- Authors: Marius Memmel, Jacob Berg, Bingqing Chen, Abhishek Gupta, Jonathan Francis
- Abstract: Robot learning is witnessing a significant increase in the size, diversity, and complexity of pre-collected datasets, mirroring trends in domains such as natural language processing and computer vision. Many robot learning methods treat such datasets as multi-task expert data and learn a multi-task,...
-
Student-Informed Teacher Training
- Authors: Nico Messikommer, Jiaxu Xing, Elie Aljalbout, Davide Scaramuzza
- Abstract: Imitation learning with a privileged teacher has proven effective for learning complex control behaviors from high-dimensional inputs, such as images. In this framework, a teacher is trained with privileged task information, while a student tries to predict the actions of the teacher with more limit...
-
Subtask-Aware Visual Reward Learning from Segmented Demonstrations
- Authors: Changyeon Kim, Minho Heo, Doohyun Lee, Honglak Lee, Jinwoo Shin, Joseph Lim, Kimin Lee
- Abstract: Reinforcement Learning (RL) agents have demonstrated their potential across various robotic tasks. However, they still heavily rely on human-engineered reward functions, requiring extensive trial-and-error and access to target behavior information, often unavailable in real-world settings. This pape...
-
The Value of Sensory Information to a Robot
- Authors: Arjun Krishna, Edward Hu, Dinesh Jayaraman
- Abstract: A decision-making agent, such as a robot, must observe and react to any new task-relevant information that becomes available from its environment. We seek to study a fundamental scientific question: what value does sensory information hold to an agent at various moments in time during the execution ...
-
ThinkBot: Embodied Instruction Following with Thought Chain Reasoning
- Authors: Guanxing Lu, Ziwei Wang, Changliu Liu, Jiwen Lu, Yansong Tang
- Abstract: Embodied Instruction Following (EIF) requires agents to complete human instruction by interacting objects in complicated surrounding environments. Conventional methods directly consider the sparse human instruction to generate action plans for agents, which usually fail to achieve human goals becaus...
-
Think Then React: Towards Unconstrained Action-to-Reaction Motion Generation
- Authors: Wenhui Tan, Boyuan Li, Chuhao Jin, Wenbing Huang, Xiting Wang, Ruihua Song
- Abstract: Modeling human-like action-to-reaction generation has significant real-world applications, like human-robot interaction and games.Despite recent advancements in single-person motion generation, it is still challenging to well handle action-to-reaction generation, due to the difficulty of directly pr...
-
TOP-ERL: Transformer-based Off-Policy Episodic Reinforcement Learning
- Authors: Ge Li, Dong Tian, Hongyi Zhou, Xinkai Jiang, Rudolf Lioutikov, Gerhard Neumann
- Abstract: This work introduces Transformer-based Off-Policy Episodic Reinforcement Learning (TOP-ERL), a novel algorithm that enables off-policy updates in the ERL framework. In ERL, policies predict entire action trajectories over multiple time steps instead of single actions at every time step. These trajec...
-
TopoDiffusionNet: A Topology-aware Diffusion Model
- Authors: Saumya Gupta, Dimitris Samaras, Chao Chen
- Abstract: Diffusion models excel at creating visually impressive images but often struggle to generate images with a specified topology. The Betti number, which represents the number of structures in an image, is a fundamental measure in topology. Yet, diffusion models fail to satisfy even this basic constrai...
-
TopoGaussian: Inferring Internal Topology Structures from Visual Clues
- Authors: Xiaoyu Xiong, Changyu Hu, Chunru Lin, Pingchuan Ma, Chuang Gan, Tao Du
- Abstract: We present TopoGaussian, a holistic, particle-based pipeline for inferring the interior structure of an opaque object from easily accessible photos and videos as input. Traditional mesh-based approaches require tedious and error-prone mesh filling and fixing process, while typically output rough bou...
-
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
- Authors: Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, Jianwei Yang
- Abstract: Although large vision-language-action (VLA) models pretrained on extensive robot datasets offer promising generalist policies for robotic learning, they still struggle with spatial-temporal dynamics in interactive robotics, making them less effective in handling complex tasks, such as manipulation. ...
-
Understanding Long Videos with Multimodal Language Models
- Authors: Kanchana Ranasinghe, Xiang Li, Kumara Kahatapitiya, Michael Ryoo
- Abstract: Large Language Models (LLMs) have allowed recent LLM-based approaches to achieve excellent performance on long-video understanding benchmarks. We investigate how extensive world knowledge and strong reasoning skills of LLMs influence evaluations on standard long video benchmarks. Surprisingly, we di...
-
Vision Language Models are In-Context Value Learners
- Authors: Yecheng Jason Ma, Joey Hejna, Chuyuan Fu, Dhruv Shah, Jacky Liang, Zhuo Xu, Sean Kirmani, Peng Xu, Danny Driess, Ted Xiao, Osbert Bastani, Dinesh Jayaraman, Wenhao Yu, Tingnan Zhang, Dorsa Sadigh, Fei Xia
- Abstract: Predicting temporal progress from visual trajectories is important for intelligent robots that can learn, adapt, and improve. However, learning such progress estimator, or temporal value function, across different tasks and domains requires both a large amount of diverse data and methods which can s...
-
VisualAgentBench: Towards Large Multimodal Models as Visual Agents
- Authors: Xiao Liu, Tianjie Zhang, Yu Gu, Iat Long Iong, Song XiXuan, Yifan Xu, Shudan Zhang, Hanyu Lai, Jiadai Sun, Xinyue Yang, Yu Yang, Zehan Qi, Shuntian Yao, Xueqiao Sun, Siyi Cheng, Qinkai Zheng, Hao Yu, Hanchen Zhang, Wenyi Hong, Ming Ding, Lihang Pan, Xiaotao Gu, Aohan Zeng, Zhengxiao Du, Chan Hee Song, Yu Su, Yuxiao Dong, Jie Tang
- Abstract: Large Multimodal Models (LMMs) have ushered in a new era in artificial intelligence, merging capabilities in both language and vision to form highly capable visual agents that are postulated to excel across a myriad of tasks. However, existing benchmarks fail to sufficiently challenge or showcase t...
-
VLAS: Vision-Language-Action Model with Speech Instructions for Customized Robot Manipulation
- Authors: Wei Zhao, Pengxiang Ding, Zhang Min, Zhefei Gong, Shuanghao Bai, Han Zhao, Donglin Wang
- Abstract: Vision-language-action models (VLAs) have recently become highly prevalent in robot manipulation due to its end-to-end architecture and impressive performance. However, current VLAs are limited to processing human instructions in textual form, neglecting the more natural speech modality for human in...
-
- Authors: Qingtao Liu, Yu Cui, Zhengnan Sun, Gaofeng Li, Jiming Chen, Qi Ye
- Abstract: Vision and touch are the most commonly used senses in human manipulation. While leveraging human manipulation videos for robotic task pretraining has shown promise in prior works, it is limited to image and language modalities and deployment to simple parallel grippers. In this paper, aiming to addr...
-
What Matters in Learning from Large-Scale Datasets for Robot Manipulation
- Authors: Vaibhav Saxena, Matthew Bronars, Nadun Ranawaka Arachchige, Kuancheng Wang, Woo Shin, Soroush Nasiriany, Ajay Mandlekar, Danfei Xu
- Abstract: Imitation learning from large multi-task demonstration datasets has emerged as a promising path for building generally-capable robots. As a result, 1000s of hours have been spent on building such large-scale datasets around the globe. Despite the continuous growth of such efforts, we still lack a sy...
-
What's the Move? Hybrid Imitation Learning via Salient Points
- Authors: Priya Sundaresan, Hengyuan Hu, Quan Vuong, Jeannette Bohg, Dorsa Sadigh
- Abstract: While imitation learning (IL) offers a promising framework for teaching robots various behaviors, learning complex tasks remains challenging. Existing IL policies struggle to generalize effectively across visual and spatial variations even for simple tasks. In this work, we introduce SPHINX: **S...
-
WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models
- Authors: Shengda Fan, Xin Cong, Yuepeng Fu, Zhong Zhang, Shuyan Zhang, Yuanwei Liu, Yesai Wu, Yankai Lin, Zhiyuan Liu, Maosong Sun
- Abstract: Recent advancements in large language models (LLMs) have driven a revolutionary paradigm shift in process automation from Robotic Process Automation to Agentic Process Automation by automating the workflow orchestration procedure based on LLMs. However, existing LLMs (even the advanced OpenAI GPT-4o...
-
X-Fi: A Modality-Invariant Foundation Model for Multimodal Human Sensing
- Authors: Xinyan Chen, Jianfei Yang
- Abstract: Human sensing, which employs various sensors and advanced deep learning technologies to accurately capture and interpret human body information, has significantly impacted fields like public security and robotics. However, current human sensing primarily depends on modalities such as cameras and LiD...
-
X-Gen: Ego-centric Video Prediction by Watching Exo-centric Videos
- Authors: Jilan Xu, Yifei Huang, Baoqi Pei, Junlin Hou, Qingqiu Li, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie
- Abstract: Generating videos in the first-person perspective has broad application prospects in the field of augmented reality and embodied intelligence.In this work, we explore the cross-view video prediction task, where given an exo-centric video, the first frame of the corresponding ego-centric video, and t...
-
6D Object Pose Tracking in Internet Videos for Robotic Manipulation
- Authors: Georgy Ponimatkin, Martin Cífka, Tomas Soucek, Médéric Fourmy, Yann Labbé, Vladimir Petrik, Josef Sivic
- Abstract: We seek to extract a temporally consistent 6D pose trajectory of a manipulated object from an Internet instructional video. This is a challenging set-up for current 6D pose estimation methods due to uncontrolled capturing conditions, fine-grained dynamic object motions, and the fact that the exact ...
-
ADAM: An Embodied Causal Agent in Open-World Environments
- Authors: Shu Yu, Chaochao Lu
- Abstract: In open-world environments like Minecraft, existing agents face challenges in continuously learning structured knowledge, particularly causality. These challenges stem from the opacity inherent in black-box models and an excessive reliance on prior knowledge during training, which impair their inter...
-
A deep inverse-mapping model for a flapping robotic wing
- Authors: Hadar Sharvit, Raz Karl, Tsevi Beatus
- Abstract: In systems control, the dynamics of a system are governed by modulating its inputs to achieve a desired outcome. For example, to control the thrust of a quad-copter propeller the controller modulates its rotation rate, relying on a straightforward mapping between the input rotation rate and the resu...
-
A Distributional Approach to Uncertainty-Aware Preference Alignment Using Offline Demonstrations
- Authors: Sheng Xu, Bo Yue, Hongyuan Zha, Guiliang Liu
- Abstract: Designing reward functions in Reinforcement Learning (RL) often demands significant task-specific expertise. Offline preference-based Reinforcement Learning (PbRL) provides an effective alternative to address the complexity of reward design by learning policies from offline datasets that contain hum...
-
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation
- Authors: Jiafei Duan, Wilbert Pumacay, Nishanth Kumar, Yi Ru Wang, Shulin Tian, Wentao Yuan, Ranjay Krishna, Dieter Fox, Ajay Mandlekar, Yijie Guo
- Abstract: Robotic manipulation in open-world settings requires not only task execution but also the ability to detect and learn from failures. While recent advances in vision-language models (VLMs) and large language models (LLMs) have improved robots' spatial reasoning and problem-solving abilities, they sti...
-
BaB-ND: Long-Horizon Motion Planning with Branch-and-Bound and Neural Dynamics
- Authors: Keyi Shen, Jiangwei Yu, Jose Barreiros, Huan Zhang, Yunzhu Li
- Abstract: Neural-network-based dynamics models learned from observational data have shown strong predictive capabilities for scene dynamics in robotic manipulation tasks. However, their inherent non-linearity presents significant challenges for effective planning. Current planning methods, often dependent on ...
-
Behavioral Entropy-Guided Dataset Generation for Offline Reinforcement Learning
- Authors: Wesley Suttle, Aamodh Suresh, Carlos Nieto-Granda
- Abstract: Entropy-based objectives are widely used to perform state space exploration in reinforcement learning (RL) and dataset generation for offline RL. Behavioral entropy (BE), a rigorous generalization of classical entropies that incorporates cognitive and perceptual biases of agents, was recently propos...
-
COMBO: Compositional World Models for Embodied Multi-Agent Cooperation
- Authors: Hongxin Zhang, Zeyuan Wang, Qiushi Lyu, Zheyuan Zhang, Sunli Chen, Tianmin Shu, Yilun Du, Behzad Dariush, Kwonjoon Lee, Chuang Gan
- Abstract: In this paper, we investigate the problem of embodied multi-agent cooperation, where decentralized agents must cooperate given only egocentric views of the world. To effectively plan in this setting, in contrast to learning world dynamics in a single-agent scenario, we must simulate world dynamics c...
-
Cross-Embodiment Dexterous Grasping with Reinforcement Learning
- Authors: Haoqi Yuan, Bohan Zhou, Yuhui Fu, Zongqing Lu
- Abstract: Dexterous hands exhibit significant potential for complex real-world grasping tasks. While recent studies have primarily focused on learning policies for specific robotic hands, the development of a universal policy that controls diverse dexterous hands remains largely unexplored.In this work, we st...
-
- Authors: Xueyi Liu, Jianibieke Adalibieke, Qianwei Han, Yuzhe Qin, Li Yi
- Abstract: We address the challenge of developing a generalizable neural tracking controller for dexterous manipulation from human references. This controller aims to manage a dexterous robot hand to manipulate diverse objects for various purposes defined by kinematic human-object interactions. Developing such...
-
Diffusion Policy Policy Optimization
- Authors: Allen Z. Ren, Justin Lidard, Lars Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, Max Simchowitz
- Abstract: We introduce Diffusion Policy Policy Optimization, DPPO, an algorithmic framework including best practices for fine-tuning diffusion-based policies (e.g. Diffusion Policy) in continuous control and robot learning tasks using the policy gradient (PG) method from reinforcement learning (RL). PG method...
-
Dream to Manipulate: Compositional World Models Empowering Robot Imitation Learning with Imagination
- Authors: Leonardo Barcellona, Andrii Zadaianchuk, Davide Allegro, Samuele Papa, Stefano Ghidoni, Efstratios Gavves
- Abstract: A world model provides an agent with a representation of its environment, enabling it to predict the causal consequences of its actions. Current world models typically cannot directly and explicitly imitate the actual environment in front of a robot, often resulting in unrealistic behaviors and hall...
-
- Authors: Wei Hung, Shao-Hua Sun, Ping-Chun Hsieh
- Abstract: Action-constrained reinforcement learning (ACRL) is a generic framework for learning control policies with zero action constraint violation, which is required by various safety-critical and resource-constrained applications. The existing ACRL methods can typically achieve favorable constraint satisf...
-
Efficient Imitation under Misspecification
- Authors: Nicolas Espinosa Dice, Sanjiban Choudhury, Wen Sun, Gokul Swamy
- Abstract: Interactive imitation learning (IL) is a powerful paradigm for learning to make sequences of decisions from an expert demonstrating how to perform a task. Prior work in efficient imitation learning has focused on the realizable setting, where the expert's policy lies within the learner's policy clas...
-
Efficient Model-Based Reinforcement Learning Through Optimistic Thompson Sampling
- Authors: Jasmine Bayrooti, Carl Ek, Amanda Prorok
- Abstract: Learning complex robot behavior through interactions with the environment necessitates principled exploration. Effective strategies should prioritize exploring regions of the state-action space that maximize rewards, with optimistic exploration emerging as a promising direction aligned with this ide...
-
Efficient Residual Learning with Mixture-of-Experts for Universal Dexterous Grasping
- Authors: Ziye Huang, Haoqi Yuan, Yuhui Fu, Zongqing Lu
- Abstract: Universal dexterous grasping across diverse objects presents a fundamental yet formidable challenge in robot learning. Existing approaches using reinforcement learning (RL) to develop policies on extensive object datasets face critical limitations, including complex curriculum design for multi-task ...
-
EMOS: Embodiment-aware Heterogeneous Multi-robot Operating System with LLM Agents
- Authors: Junting Chen, Checheng Yu, Xunzhe Zhou, Tianqi Xu, Yao Mu, Mengkang Hu, Wenqi Shao, Yikai Wang, Guohao Li, Lin Shao
- Abstract: Heterogeneous multi-robot systems (HMRS) have emerged as a powerful ap-proach for tackling complex tasks that single robots cannot manage alone. Currentlarge-language-model-based multi-agent systems (LLM-based MAS) have shownsuccess in areas like software development and operating systems, but apply...
-
FLIP: Flow-Centric Generative Planning for General-Purpose Manipulation Tasks
- Authors: Chongkai Gao, Haozhuo Zhang, Zhixuan Xu, Cai Zhehao, Lin Shao
- Abstract: We aim to develop a model-based planning framework for world models that can be scaled with increasing model and data budgets for general-purpose manipulation tasks with only language and vision inputs. To this end, we present FLow-CentrIc generative Planning (FLIP), a model-based planning algorithm...
-
Following the Human Thread in Social Navigation
- Authors: Luca Scofano, Alessio Sampieri, Tommaso Campari, Valentino Sacco, Indro Spinelli, Lamberto Ballan, Fabio Galasso
- Abstract: The success of collaboration between humans and robots in shared environments relies on the robot's real-time adaptation to human motion. Specifically, in Social Navigation, the agent should be close enough to assist but ready to back up to let the human move freely, avoiding collisions. Human traje...
-
FOSP: Fine-tuning Offline Safe Policy through World Models
- Authors: Chenyang Cao, Yucheng Xin, Silang Wu, Longxiang He, Zichen Yan, Junbo Tan, Xueqian Wang
- Abstract: Offline Safe Reinforcement Learning (RL) seeks to address safety constraints by learning from static datasets and restricting exploration. However, these approaches heavily rely on the dataset and struggle to generalize to unseen scenarios safely. In this paper, we aim to improve safety during the d...
-
Generalized Behavior Learning from Diverse Demonstrations
- Authors: Varshith Sreeramdass, Rohan Paleja, Letian Chen, Sanne van Waveren, Matthew Gombolay
- Abstract: Diverse behavior policies are valuable in domains requiring quick test-time adaptation or personalized human-robot interaction. Human demonstrations provide rich information regarding task objectives and factors that govern individual behavior variations, which can be used to characterize \it{useful...
-
Generating Freeform Endoskeletal Robots
- Authors: Muhan Li, Lingji Kong, Sam Kriegman
- Abstract: The automatic design of embodied agents (e.g. robots) has existed for 31 years and is experiencing a renaissance of interest in the literature. To date however, the field has remained narrowly focused on two kinds of anatomically simple robots: (1) fully rigid, jointed bodies; and (2) fully soft, jo...
-
Genesis: Advancing Towards Efficient Embodiment Co-Design
- Authors: Haofei Lu, Zhe Wu, Junliang Xing, Jianshu Li, Ruoyu Li, Zhe Li, Yuanchun Shi
- Abstract: Embodiment co-design aims to optimize a robot's morphology and control simultaneously. Previous research has demonstrated its potential for generating environment-adaptive robots.However, the problem is inherently combinatorial and the morphology is changeable and agnostic in its vast search space,...
-
Geometry-aware RL for Manipulation of Varying Shapes and Deformable Objects
- Authors: Tai Hoang, Huy Le, Philipp Becker, Vien A Ngo, Gerhard Neumann
- Abstract: Manipulating objects with varying geometries and deformable objects is a major challenge in robotics. Tasks such as insertion with different objects or cloth hanging require precise control and effective modelling of complex dynamics. In this work, we frame this problem through the lens of a heterog...
-
GEVRM: Goal-Expressive Video Generation Model For Robust Visual Manipulation
- Authors: Hongyin Zhang, Pengxiang Ding, Shangke Lyu, Ying Peng, Donglin Wang
- Abstract: With the rapid development of embodied artificial intelligence, significant progress has been made in vision-language-action (VLA) models for general robot decision-making. However, the majority of existing VLAs fail to account for the inevitable external perturbations encountered during deployment....
-
Grounding Video Models to Actions through Goal Conditioned Exploration
- Authors: Yunhao Luo, Yilun Du
- Abstract: Large video models, pretrained on massive quantities of amount of Internet video, provide a rich source of physical knowledge about the dynamics and motions of objects and tasks.However, video models are not grounded in the embodiment of an agent, and do not describe how to actuate the world to rea...
-
HAMSTER: Hierarchical Action Models for Open-World Robot Manipulation
- Authors: Yi Li, Yuquan Deng, Jesse Zhang, Joel Jang, Marius Memmel, Caelan Garrett, Fabio Ramos, Dieter Fox, Anqi Li, Abhishek Gupta, Ankit Goyal
- Abstract: Large models have shown strong open-world generalization to complex problems in vision and language, but they have been relatively more difficult to deploy in robotics. This challenge stems from several factors, the foremost of which is the lack of scalable robotic training data since this requires ...
-
HASARD: A Benchmark for Harnessing Safe Reinforcement Learning with Doom
- Authors: Tristan Tomilin, Meng Fang, Mykola Pechenizkiy
- Abstract: The advancement of safe reinforcement learning (RL) faces numerous obstacles, including the lack of simulation environments, demanding computational requirements, and a lack of widely accepted benchmarks. To address these challenges, we introduce HASARD (A Benchmark for HArnessing SAfe *...
-
HGM³: Hierarchical Generative Masked Motion Modeling with Hard Token Mining
- Authors: Minjae Jeong, Yechan Hwang, Jaejin Lee, Sungyoon Jung, Won Hwa Kim
- Abstract: Text-to-motion generation has significant potential in a wide range of applications including animation, robotics, and AR/VR. While recent works on masked motion models are promising, the task remains challenging due to the inherent ambiguity in text and the complexity of human motion dynamics. To o...
-
Intelligent Go-Explore: Standing on the Shoulders of Giant Foundation Models
- Authors: Cong Lu, Shengran Hu, Jeff Clune
- Abstract: Go-Explore is a powerful family of algorithms designed to solve hard-exploration problems built on the principle of archiving discovered states, and iteratively returning to and exploring from the most promising states. This approach has led to superhuman performance across a wide variety of challen...
-
Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence
- Authors: Weize Chen, Ziming You, Ran Li, yitong guan, Chen Qian, Chenyang Zhao, Cheng Yang, Ruobing Xie, Zhiyuan Liu, Maosong Sun
- Abstract: The rapid advancement of large language models (LLMs) has paved the way for the development of highly capable autonomous agents. However, existing multi-agent frameworks often struggle with integrating diverse capable third-party agents due to reliance on agents defined within their own ecosystems. ...
-
Joint Reward and Policy Learning with Demonstrations and Human Feedback Improves Alignment
- Authors: Chenliang Li, Siliang Zeng, Zeyi Liao, Jiaxiang Li, Dongyeop Kang, Alfredo Garcia, Mingyi Hong
- Abstract: Aligning to human preferences and/or intentions is an important requirement for contemporary foundation models. To ensure alignment, popular approaches such as reinforcement learning with human feedback (RLHF) break down the task into three stages: (i) a model is computed with supervised fine-tuning...
-
Kinetix: Investigating the Training of General Agents through Open-Ended Physics-Based Control Tasks
- Authors: Michael Matthews, Michael Beukman, Chris Lu, Jakob Foerster
- Abstract: While large models trained with self-supervised learning on offline datasets have shown remarkable capabilities in text and image domains, achieving the same generalisation for agents that act in sequential decision problems remains an open challenge.In this work, we take a step towards this goal by...
-
Learning View-invariant World Models for Visual Robotic Manipulation
- Authors: Jing-Cheng Pang, Nan Tang, Kaiyuan Li, Yuting Tang, Xin-Qiang Cai, Zhen-Yu Zhang, Gang Niu, Masashi Sugiyama, Yang Yu
- Abstract: Robotic manipulation tasks often rely on visual inputs from cameras to perceive the environment. However, previous approaches still suffer from performance degradation when the camera’s viewpoint changes during manipulation. In this paper, we propose ReViWo (Representation learning for View-invarian...
-
Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning
- Authors: Calarina Muslimani, Matthew E Taylor
- Abstract: To create useful reinforcement learning (RL) agents, step zero is to design a suitable reward function that captures the nuances of the task. However, reward engineering can be a difficult and time-consuming process. Instead, human-in-the-loop (HitL) RL methods hold the promise of learning reward f...
-
LocoVR: Multiuser Indoor Locomotion Dataset in Virtual Reality
- Authors: Kojiro Takeyama, Yimeng Liu, Misha Sra
- Abstract: Understanding human locomotion is crucial for AI agents such as robots, particularly in complex indoor home environments. Modeling human trajectories in these spaces requires insight into how individuals maneuver around physical obstacles and manage social navigation dynamics. These dynamics include...
-
ManiSkill-HAB: A Benchmark for Low-Level Manipulation in Home Rearrangement Tasks
- Authors: Arth Shukla, Stone Tao, Hao Su
- Abstract: High-quality benchmarks are the foundation for embodied AI research, enabling significant advancements in long-horizon navigation, manipulation and rearrangement tasks. However, as frontier tasks in robotics get more advanced, they require faster simulation speed, more intricate test environments, a...
-
MetaUrban: An Embodied AI Simulation Platform for Urban Micromobility
- Authors: Wayne Wu, Honglin He, Jack He, Yiran Wang, Chenda Duan, Zhizheng Liu, Quanyi Li, Bolei Zhou
- Abstract: Public urban spaces like streetscapes and plazas serve residents and accommodate social life in all its vibrant variations. Recent advances in Robotics and Embodied AI make public urban spaces no longer exclusive to humans. Food delivery bots and electric wheelchairs have started sharing sidewalks w...
-
MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation
- Authors: Donggon Jang, Yucheol Cho, Suin Lee, Taehyeon Kim, DAE SHIK KIM
- Abstract: The fusion of Large Language Models (LLMs) with vision models is pioneering new possibilities in user-interactive vision-language tasks. A notable application is reasoning segmentation, where models generate pixel-level segmentation masks by comprehending implicit meanings in human instructions. How...
-
Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning
- Authors: Baoqi Pei, Yifei Huang, Jilan Xu, Guo Chen, Yuping He, Lijin Yang, Yali Wang, Weidi Xie, Yu Qiao, Fei Wu, Limin Wang
- Abstract: In egocentric video understanding, the motion of hands and objects as well as their interactions play a significant role by nature.However, existing egocentric video representation learning methods mainly focus on aligning video representation with high-level narrations, overlooking the intricate dy...
-
- Authors: Xinyue Wang, Biwei Huang
- Abstract: Generalization in reinforcement learning (RL) remains a significant challenge, especially when agents encounter novel environments with unseen dynamics. Drawing inspiration from human compositional reasoning—where known components are reconfigured to handle new situations—we introduce World Modeling...
-
Motion Control of High-Dimensional Musculoskeletal System with Hierarchical Model-Based Planning
- Authors: Yunyue Wei, Shanning Zhuang, Vincent Zhuang, Yanan Sui
- Abstract: Controlling high-dimensional nonlinear systems presents significant challenges in biological and robotic applications due to the large state and action spaces. While deep reinforcement learning has emerged as the leading approach, it suffers from computationally-intensive and time-consuming, and are...
-
Mr.Steve: Instruction-Following Agents in Minecraft with What-Where-When Memory
- Authors: Junyeong Park, Junmo Cho, Sungjin Ahn
- Abstract: Significant advances have been made in developing general-purpose embodied AI in environments like Minecraft through the adoption of LLM-augmented hierarchical approaches. While these approaches, which combine high-level planners with low-level controllers, show promise, low-level controllers freque...
-
- Authors: Qixin ZHANG, Zongqi Wan, Yu Yang, Li Shen, Dacheng Tao
- Abstract: Coordinating multiple agents to collaboratively maximize submodular functions in unpredictable environments is a critical task with numerous applications in machine learning, robot planning and control. The existing approaches, such as the OSG algorithm, are often hindered by their poor approximati...
-
- Authors: Anish Abhijit Diwan, Julen Urain, Jens Kober, Jan Peters
- Abstract: This paper introduces a new imitation learning framework based on energy-based generative models capable of learning complex, physics-dependent, robot motion policies through state-only expert motion trajectories. Our algorithm, called Noise-conditioned Energy-based Annealed Rewards (NEAR), construc...
-
Null Counterfactual Factor Interactions for Goal-Conditioned Reinforcement Learning
- Authors: Caleb Chuck, Fan Feng, Carl Qi, Chang Shi, Siddhant Agarwal, Amy Zhang, Scott Niekum
- Abstract: Hindsight relabeling is a powerful tool for overcoming sparsity in goal-conditioned reinforcement learning (GCRL). While effective in some domains like navigation and locomotion, hindsight relabeling can struggle in object-centric domains. For example, suppose that the goal space consists of a robot...
-
Offline Hierarchical Reinforcement Learning via Inverse Optimization
- Authors: Carolin Schmidt, Daniele Gammelli, James Harrison, Marco Pavone, Filipe Rodrigues
- Abstract: Hierarchical policies enable strong performance in many sequential decision-making problems, such as those with high-dimensional action spaces, those requiring long-horizon planning, and settings with sparse rewards. However, learning hierarchical policies from static offline datasets presents a sig...
-
Online Neuro-Symbolic Predicate Invention for High-Level Planning
- Authors: Yichao Liang, Nishanth Kumar, Hao Tang, Adrian Weller, Joshua B Tenenbaum, Tom Silver, Joao F. Henriques, Kevin Ellis
- Abstract: Broadly intelligent agents should form task-specific abstractions that selectively expose the essential elements of a task, while abstracting away the complexity of the raw sensorimotor space. In this work, we present Neuro-Symbolic Predicates, a first-order abstraction language that combines the st...
-
PEAR: Primitive Enabled Adaptive Relabeling for Boosting Hierarchical Reinforcement Learning
- Authors: Utsav Singh, Vinay Namboodiri
- Abstract: Hierarchical reinforcement learning (HRL) has the potential to solve complex long horizon tasks using temporal abstraction and increased exploration. However, hierarchical agents are difficult to train due to inherent non-stationarity. We present primitive enabled adaptive relabeling (PEAR), a two-p...
-
POGEMA: A Benchmark Platform for Cooperative Multi-Agent Navigation
- Authors: Alexey Skrynnik, Anton Andreychuk, Anatolii Borzilov, Alexander Chernyavskiy, Konstantin Yakovlev, Aleksandr Panov
- Abstract: Multi-agent reinforcement learning (MARL) has recently excelled in solving challenging cooperative and competitive multi-agent problems in various environments with, mostly, few agents and full observability. Moreover, a range of crucial robotics-related tasks, such as multi-robot navigation and obs...
-
Policy Decorator: Model-Agnostic Online Refinement for Large Policy Model
- Authors: Xiu Yuan, Tongzhou Mu, Stone Tao, Yunhao Fang, Zhang, Hao Su
- Abstract: Recent advancements in robot learning have used imitation learning with large models and extensive demonstrations to develop effective policies. However, these models are often limited by the quantity quality, and diversity of demonstrations. This paper explores improving offline-trained imitation l...
-
Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation
- Authors: Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, Jiangmiao Pang
- Abstract: Current efforts to learn scalable policies in robotic manipulation primarily fall into two categories: one focuses on "action," which involves behavior cloning from extensive collections of robotic data, while the other emphasizes "vision," enhancing model generalization by pre-training representati...
-
PWM: Policy Learning with Multi-Task World Models
- Authors: Ignat Georgiev, Varun Giridhar, Nicklas Hansen, Animesh Garg
- Abstract: Reinforcement Learning (RL) has made significant strides in complex tasks but struggles in multi-task settings with different embodiments. World models methods offer scalability by learning a simulation of the environment, but often rely on inefficient gradient-free optimization methods for policy e...
-
Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning
- Authors: Joey Hong, Anca Dragan, Sergey Levine
- Abstract: Value-based reinforcement learning (RL) can in principle learn effective policies for a wide range of multi-turn problems, from games to dialogue to robotic control, including via offline RL from static previously collected datasets. However, despite the widespread use of policy gradient methods to ...
-
Rapidly Adapting Policies to the Real-World via Simulation-Guided Fine-Tuning
- Authors: Patrick Yin, Tyler Westenbroek, Ching-An Cheng, Andrey Kolobov, Abhishek Gupta
- Abstract: Robot learning requires a considerable amount of data to realize the promise of generalization. However, it can be challenging to actually collect the magnitude of high-quality data necessary for generalization entirely in the real world. Simulation can serve as a source of plentiful data, wherein t...
-
ReGen: Generative Robot Simulation via Inverse Design
- Authors: Peter (Phat) Nguyen, Johnson (Tsun-Hsuan) Wang, Zhang-Wei Hong, Erfan Aasi, Andrew Silva, Guy Rosman, Sertac Karaman, Daniela Rus
- Abstract: Simulation plays a key role in scaling robot learning and validating policies, but constructing simulations remains labor-intensive. In this paper, we introduce ReGen, a generative simulation framework that automates this process using inverse design. Given an agent's behavior (such as a motion traj...
-
RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
- Authors: Sergio Gómez Colmenarejo, Jost Springenberg, Jose Enrique Chen, Jonathan Scholz, Raia Hadsell, Claudio Fantacci, Alex Lee, Maria Bauza Villalonga, Yuxiang Zhou, Dushyant Rao, Akhil Raju, Antoine Laurens, Murilo Fernandes Martins, Rugile Pevceviciute, Michiel Blokzijl, Nathan Batchelor, Konrad Zolna, Thomas Lampe, Agrim Gupta, Scott Reed, Abbas Abdolmaleki, David Barker, Joy Ortiz, Martin Riedmiller, Jean-Baptiste Regli, Nicolas Heess, Francesco Nori, Todor Davchev, Oleg O Sushkov, Thomas Rothörl, Misha Denil, Emilio Parisotto, Valentin Dalibard, Martina Zambelli, Yusuf Aytar, Giulia Vezzani, Coline Devin, Oliver Groth, Konstantinos Bousmalis
- Abstract: The ability to leverage heterogeneous robotic experience from different robots and tasks to quickly master novel skills and embodiments has the potential to transform robot learning. Inspired by recent advances in foundation models for vision and language, we propose a multi-embodiment, multi-task g...
-
Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Dataset
- Authors: Guangqi Jiang, Yifei Sun, Tao Huang, Huanyu Li, Yongyuan Liang, Huazhe Xu
- Abstract: The pre-training of visual representations has enhanced the efficiency of robot learning. Due to the lack of large-scale in-domain robotic datasets, prior works utilize in-the-wild human videos to pre-train robotic visual representation. Despite their promising results, representations from human vi...
-
Robust Gymnasium: A Unified Modular Benchmark for Robust Reinforcement Learning
- Authors: Shangding Gu, Laixi Shi, Muning Wen, Ming Jin, Eric Mazumdar, Yuejie Chi, Adam Wierman, Costas Spanos
- Abstract: Driven by inherent uncertainty and the sim-to-real gap, robust reinforcement learning (RL) seeks to improve resilience against the complexity and variability in agent-environment sequential interactions. Despite the existence of a large number of RL benchmarks, there is a lack of standardized benchm...
-
SafeDiffuser: Safe Planning with Diffusion Probabilistic Models
- Authors: Wei Xiao, Johnson (Tsun-Hsuan) Wang, Chuang Gan, Ramin Hasani, Mathias Lechner, Daniela Rus
- Abstract: Diffusion models have shown promise in data-driven planning. While these planners are commonly employed in applications where decisions are critical, they still lack established safety guarantees. In this paper, we address this limitation by introducing SafeDiffuser, a method to equip diffusion mode...
-
Scalable Decision-Making in Stochastic Environments through Learned Temporal Abstraction
- Authors: Baiting Luo, Ava Pettet, Aron Laszka, Abhishek Dubey, Ayan Mukhopadhyay
- Abstract: Sequential decision-making in high-dimensional continuous action spaces, particularly in stochastic environments, faces significant computational challenges. We explore this challenge in the traditional offline RL setting, where an agent must learn how to make decisions based on data collected throu...
-
Select before Act: Spatially Decoupled Action Repetition for Continuous Control
- Authors: Buqing Nie, Yangqing Fu, Yue Gao
- Abstract: Reinforcement Learning (RL) has achieved remarkable success in various continuous control tasks, such as robot manipulation and locomotion.Different to mainstream RL which makes decisions at individual steps, recent studies have incorporated action repetition into RL, achieving enhanced action persi...
-
SRSA: Skill Retrieval and Adaptation for Robotic Assembly Tasks
- Authors: Yijie Guo, Bingjie Tang, Iretiayo Akinola, Dieter Fox, Abhishek Gupta, Yashraj Narang
- Abstract: Enabling robots to learn novel tasks in a data-efficient manner is a long-standing challenge. Common strategies involve carefully leveraging prior experiences, especially transition data collected on related tasks. Although much progress has been made in developing such strategies for general pick-a...
-
Stabilizing Reinforcement Learning in Differentiable Multiphysics Simulation
- Authors: Eliot Xing, Vernon Luk, Jean Oh
- Abstract: Recent advances in GPU-based parallel simulation have enabled practitioners to collect large amounts of data and train complex control policies using deep reinforcement learning (RL), on commodity GPUs. However, such successes for RL in robotics have been limited to tasks sufficiently simulated by f...
-
STRAP: Robot Sub-Trajectory Retrieval for Augmented Policy Learning
- Authors: Marius Memmel, Jacob Berg, Bingqing Chen, Abhishek Gupta, Jonathan Francis
- Abstract: Robot learning is witnessing a significant increase in the size, diversity, and complexity of pre-collected datasets, mirroring trends in domains such as natural language processing and computer vision. Many robot learning methods treat such datasets as multi-task expert data and learn a multi-task,...
-
Student-Informed Teacher Training
- Authors: Nico Messikommer, Jiaxu Xing, Elie Aljalbout, Davide Scaramuzza
- Abstract: Imitation learning with a privileged teacher has proven effective for learning complex control behaviors from high-dimensional inputs, such as images. In this framework, a teacher is trained with privileged task information, while a student tries to predict the actions of the teacher with more limit...
-
Subtask-Aware Visual Reward Learning from Segmented Demonstrations
- Authors: Changyeon Kim, Minho Heo, Doohyun Lee, Honglak Lee, Jinwoo Shin, Joseph Lim, Kimin Lee
- Abstract: Reinforcement Learning (RL) agents have demonstrated their potential across various robotic tasks. However, they still heavily rely on human-engineered reward functions, requiring extensive trial-and-error and access to target behavior information, often unavailable in real-world settings. This pape...
-
The Value of Sensory Information to a Robot
- Authors: Arjun Krishna, Edward Hu, Dinesh Jayaraman
- Abstract: A decision-making agent, such as a robot, must observe and react to any new task-relevant information that becomes available from its environment. We seek to study a fundamental scientific question: what value does sensory information hold to an agent at various moments in time during the execution ...
-
TOP-ERL: Transformer-based Off-Policy Episodic Reinforcement Learning
- Authors: Ge Li, Dong Tian, Hongyi Zhou, Xinkai Jiang, Rudolf Lioutikov, Gerhard Neumann
- Abstract: This work introduces Transformer-based Off-Policy Episodic Reinforcement Learning (TOP-ERL), a novel algorithm that enables off-policy updates in the ERL framework. In ERL, policies predict entire action trajectories over multiple time steps instead of single actions at every time step. These trajec...
-
TopoDiffusionNet: A Topology-aware Diffusion Model
- Authors: Saumya Gupta, Dimitris Samaras, Chao Chen
- Abstract: Diffusion models excel at creating visually impressive images but often struggle to generate images with a specified topology. The Betti number, which represents the number of structures in an image, is a fundamental measure in topology. Yet, diffusion models fail to satisfy even this basic constrai...
-
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
- Authors: Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, Jianwei Yang
- Abstract: Although large vision-language-action (VLA) models pretrained on extensive robot datasets offer promising generalist policies for robotic learning, they still struggle with spatial-temporal dynamics in interactive robotics, making them less effective in handling complex tasks, such as manipulation. ...
-
- Authors: Qingtao Liu, Yu Cui, Zhengnan Sun, Gaofeng Li, Jiming Chen, Qi Ye
- Abstract: Vision and touch are the most commonly used senses in human manipulation. While leveraging human manipulation videos for robotic task pretraining has shown promise in prior works, it is limited to image and language modalities and deployment to simple parallel grippers. In this paper, aiming to addr...
-
What Matters in Learning from Large-Scale Datasets for Robot Manipulation
- Authors: Vaibhav Saxena, Matthew Bronars, Nadun Ranawaka Arachchige, Kuancheng Wang, Woo Shin, Soroush Nasiriany, Ajay Mandlekar, Danfei Xu
- Abstract: Imitation learning from large multi-task demonstration datasets has emerged as a promising path for building generally-capable robots. As a result, 1000s of hours have been spent on building such large-scale datasets around the globe. Despite the continuous growth of such efforts, we still lack a sy...
-
X-Gen: Ego-centric Video Prediction by Watching Exo-centric Videos
- Authors: Jilan Xu, Yifei Huang, Baoqi Pei, Junlin Hou, Qingqiu Li, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie
- Abstract: Generating videos in the first-person perspective has broad application prospects in the field of augmented reality and embodied intelligence.In this work, we explore the cross-view video prediction task, where given an exo-centric video, the first frame of the corresponding ego-centric video, and t...
-
- Authors: Hengshuo Chu, Xiang Deng, Qi Lv, Xiaoyang Chen, Yinchuan Li, Jianye HAO, Liqiang Nie
- Abstract: 3D Affordance detection is a challenging problem with broad applications on various robotic tasks. Existing methods typically formulate the detection paradigm as a label-based semantic segmentation task.This paradigm relies on predefined labels and lacks the ability to comprehend complex natural lan...
-
- Authors: Xueyan Zou, Yuchen Song, Ri-Zhao Qiu, Xuanbin Peng, Jianglong Ye, Sifei Liu, Xiaolong Wang
- Abstract: We present 3D Spatial MultiModal Memory (M3), a multimodal memory system designed to retain information about medium-sized static scenes through video sources for visual perception. By integrating 3D Gaussian Splatting techniques with foundation models, M3 builds a multimodal memory capable of rende...
-
3D Vision-Language Gaussian Splatting
- Authors: Qucheng Peng, Benjamin Planche, Zhongpai Gao, Meng Zheng, Anwesa Choudhuri, Terrence Chen, Chen Chen, Ziyan Wu
- Abstract: Recent advancements in 3D reconstruction methods and vision-language models have propelled the development of multi-modal 3D scene understanding, which has vital applications in robotics, autonomous driving, and virtual/augmented reality. However, current multi-modal scene understanding approaches h...
-
ADAM: An Embodied Causal Agent in Open-World Environments
- Authors: Shu Yu, Chaochao Lu
- Abstract: In open-world environments like Minecraft, existing agents face challenges in continuously learning structured knowledge, particularly causality. These challenges stem from the opacity inherent in black-box models and an excessive reliance on prior knowledge during training, which impair their inter...
-
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation
- Authors: Jiafei Duan, Wilbert Pumacay, Nishanth Kumar, Yi Ru Wang, Shulin Tian, Wentao Yuan, Ranjay Krishna, Dieter Fox, Ajay Mandlekar, Yijie Guo
- Abstract: Robotic manipulation in open-world settings requires not only task execution but also the ability to detect and learn from failures. While recent advances in vision-language models (VLMs) and large language models (LLMs) have improved robots' spatial reasoning and problem-solving abilities, they sti...
-
- Authors: Long Le, Jason Xie, William Liang, Hung-Ju Wang, Yue Yang, Yecheng Jason Ma, Kyle Vedder, Arjun Krishna, Dinesh Jayaraman, ERIC EATON
- Abstract: Interactive 3D simulated objects are crucial in AR/VR, animations, and robotics, driving immersive experiences and advanced automation.However, creating these articulated objects requires extensive human effort and expertise, limiting their broader applications. To overcome this challenge, we presen...
-
COMBO: Compositional World Models for Embodied Multi-Agent Cooperation
- Authors: Hongxin Zhang, Zeyuan Wang, Qiushi Lyu, Zheyuan Zhang, Sunli Chen, Tianmin Shu, Yilun Du, Behzad Dariush, Kwonjoon Lee, Chuang Gan
- Abstract: In this paper, we investigate the problem of embodied multi-agent cooperation, where decentralized agents must cooperate given only egocentric views of the world. To effectively plan in this setting, in contrast to learning world dynamics in a single-agent scenario, we must simulate world dynamics c...
-
Cross-Embodiment Dexterous Grasping with Reinforcement Learning
- Authors: Haoqi Yuan, Bohan Zhou, Yuhui Fu, Zongqing Lu
- Abstract: Dexterous hands exhibit significant potential for complex real-world grasping tasks. While recent studies have primarily focused on learning policies for specific robotic hands, the development of a universal policy that controls diverse dexterous hands remains largely unexplored.In this work, we st...
-
Data Scaling Laws in Imitation Learning for Robotic Manipulation
- Authors: Fanqi Lin, Yingdong Hu, Pingyue Sheng, Chuan Wen, Jiacheng You, Yang Gao
- Abstract: Data scaling has revolutionized fields like natural language processing and computer vision, providing models with remarkable generalization capabilities. In this paper, we investigate whether similar data scaling laws exist in robotics, particularly in robotic manipulation, and whether appropriate ...
-
DenseGrounding: Improving Dense Language-Vision Semantics for Ego-centric 3D Visual Grounding
- Authors: Henry Zheng, Hao Shi, Qihang Peng, Yong Xien Chng, Rui Huang, Yepeng Weng, zhongchao shi, Gao Huang
- Abstract: Enabling intelligent agents to comprehend and interact with 3D environments through natural language is crucial for advancing robotics and human-computer interaction. A fundamental task in this field is ego-centric 3D visual grounding, where agents locate target objects in real-world 3D spaces based...
-
- Authors: Tianfang Zhu, Dongli Hu, Jiandong Zhou, Kai Du, Anan LI
- Abstract: The brain's ability to transform sensory inputs into motor functions is central to neuroscience and crucial for the development of embodied intelligence. Sensory-motor integration involves complex neural circuits, diverse neuronal types, and intricate intercellular connections. Bridging the gap betw...
-
EmbodiedSAM: Online Segment Any 3D Thing in Real Time
- Authors: Xiuwei Xu, Huangxing Chen, Linqing Zhao, Ziwei Wang, Jie Zhou, Jiwen Lu
- Abstract: Embodied tasks require the agent to fully understand 3D scenes simultaneously with its exploration, so an online, real-time, fine-grained and highly-generalized 3D perception model is desperately needed. Since high-quality 3D data is limited, directly training such a model in 3D is infeasible. Meanw...
-
EMOS: Embodiment-aware Heterogeneous Multi-robot Operating System with LLM Agents
- Authors: Junting Chen, Checheng Yu, Xunzhe Zhou, Tianqi Xu, Yao Mu, Mengkang Hu, Wenqi Shao, Yikai Wang, Guohao Li, Lin Shao
- Abstract: Heterogeneous multi-robot systems (HMRS) have emerged as a powerful ap-proach for tackling complex tasks that single robots cannot manage alone. Currentlarge-language-model-based multi-agent systems (LLM-based MAS) have shownsuccess in areas like software development and operating systems, but apply...
-
EXPLOITING DISTRIBUTION CONSTRAINTS FOR SCALABLE AND EFFICIENT IMAGE RETRIEVAL
- Authors: Mohammad Omama, Po-han Li, Sandeep Chinchali
- Abstract: Image retrieval is crucial in robotics and computer vision, with downstream applications in robot place recognition and vision-based product recommendations. Modern retrieval systems face two key challenges: scalability and efficiency.State-of-the-art image retrieval systems train specific neural ne...
-
FLIP: Flow-Centric Generative Planning for General-Purpose Manipulation Tasks
- Authors: Chongkai Gao, Haozhuo Zhang, Zhixuan Xu, Cai Zhehao, Lin Shao
- Abstract: We aim to develop a model-based planning framework for world models that can be scaled with increasing model and data budgets for general-purpose manipulation tasks with only language and vision inputs. To this end, we present FLow-CentrIc generative Planning (FLIP), a model-based planning algorithm...
-
FOSP: Fine-tuning Offline Safe Policy through World Models
- Authors: Chenyang Cao, Yucheng Xin, Silang Wu, Longxiang He, Zichen Yan, Junbo Tan, Xueqian Wang
- Abstract: Offline Safe Reinforcement Learning (RL) seeks to address safety constraints by learning from static datasets and restricting exploration. However, these approaches heavily rely on the dataset and struggle to generalize to unseen scenarios safely. In this paper, we aim to improve safety during the d...
-
General Scene Adaptation for Vision-and-Language Navigation
- Authors: Haodong Hong, Yanyuan Qiao, Sen Wang, Jiajun Liu, Qi Wu
- Abstract: Vision-and-Language Navigation (VLN) tasks mainly evaluate agents based on one-time execution of individual instructions across multiple environments, aiming to develop agents capable of functioning in any environment in a zero-shot manner. However, real-world navigation robots often operate in pers...
-
Generating Freeform Endoskeletal Robots
- Authors: Muhan Li, Lingji Kong, Sam Kriegman
- Abstract: The automatic design of embodied agents (e.g. robots) has existed for 31 years and is experiencing a renaissance of interest in the literature. To date however, the field has remained narrowly focused on two kinds of anatomically simple robots: (1) fully rigid, jointed bodies; and (2) fully soft, jo...
-
GEVRM: Goal-Expressive Video Generation Model For Robust Visual Manipulation
- Authors: Hongyin Zhang, Pengxiang Ding, Shangke Lyu, Ying Peng, Donglin Wang
- Abstract: With the rapid development of embodied artificial intelligence, significant progress has been made in vision-language-action (VLA) models for general robot decision-making. However, the majority of existing VLAs fail to account for the inevitable external perturbations encountered during deployment....
-
Graph Neural Networks Gone Hogwild
- Authors: Olga Solodova, Nick Richardson, Deniz Oktay, Ryan P Adams
- Abstract: Graph neural networks (GNNs) appear to be powerful tools to learn state representations for agents in distributed, decentralized multi-agent systems, but generate catastrophically incorrect predictions when nodes update asynchronously during inference. This failure under asynchrony effectively excl...
-
GROOT-2: Weakly Supervised Multimodal Instruction Following Agents
- Authors: Shaofei Cai, Bowei Zhang, Zihao Wang, Haowei Lin, Xiaojian Ma, Anji Liu, Yitao Liang
- Abstract: Developing agents that can follow multimodal instructions remains a fundamental challenge in robotics and AI. Although large-scale pre-training on unlabeled datasets has enabled agents to learn diverse behaviors, these agents often struggle with following instructions. While augmenting the dataset w...
-
Grounding Video Models to Actions through Goal Conditioned Exploration
- Authors: Yunhao Luo, Yilun Du
- Abstract: Large video models, pretrained on massive quantities of amount of Internet video, provide a rich source of physical knowledge about the dynamics and motions of objects and tasks.However, video models are not grounded in the embodiment of an agent, and do not describe how to actuate the world to rea...
-
HAMSTER: Hierarchical Action Models for Open-World Robot Manipulation
- Authors: Yi Li, Yuquan Deng, Jesse Zhang, Joel Jang, Marius Memmel, Caelan Garrett, Fabio Ramos, Dieter Fox, Anqi Li, Abhishek Gupta, Ankit Goyal
- Abstract: Large models have shown strong open-world generalization to complex problems in vision and language, but they have been relatively more difficult to deploy in robotics. This challenge stems from several factors, the foremost of which is the lack of scalable robotic training data since this requires ...
-
HASARD: A Benchmark for Harnessing Safe Reinforcement Learning with Doom
- Authors: Tristan Tomilin, Meng Fang, Mykola Pechenizkiy
- Abstract: The advancement of safe reinforcement learning (RL) faces numerous obstacles, including the lack of simulation environments, demanding computational requirements, and a lack of widely accepted benchmarks. To address these challenges, we introduce HASARD (A Benchmark for HArnessing SAfe *...
-
ImagineNav: Prompting Vision-Language Models as Embodied Navigator through Scene Imagination
- Authors: Xinxin Zhao, Wenzhe Cai, Likun Tang, Teng Wang
- Abstract: Visual navigation is an essential skill for home-assistance robots, providing the object-searching ability to accomplish long-horizon daily tasks. Many recent approaches use Large Language Models (LLMs) for commonsense inference to improve exploration efficiency. However, the planning process of LLM...
-
Intelligent Go-Explore: Standing on the Shoulders of Giant Foundation Models
- Authors: Cong Lu, Shengran Hu, Jeff Clune
- Abstract: Go-Explore is a powerful family of algorithms designed to solve hard-exploration problems built on the principle of archiving discovered states, and iteratively returning to and exploring from the most promising states. This approach has led to superhuman performance across a wide variety of challen...
-
Latent Action Pretraining from Videos
- Authors: Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Se June Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, Minjoon Seo
- Abstract: We introduce Latent Action Pretraining for general Action models (LAPA), the first unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. Existing Vision-Language-Action models require action labels typically collected by human teleoperators...
-
Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors
- Authors: Ruoxuan Feng, Jiangyu Hu, Wenke Xia, Tianci Gao, Ao Shen, Yuhao Sun, Bin Fang, Di Hu
- Abstract: Visuo-tactile sensors aim to emulate human tactile perception, enabling robots to precisely understand and manipulate objects. Over time, numerous meticulously designed visuo-tactile sensors have been integrated into robotic systems, aiding in completing various tasks. However, the distinct data cha...
-
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
- Authors: Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, Michael Ryoo
- Abstract: LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process state information as visual-textual prompts and respond with policy decisions in text. We propose LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as conversations ...
-
MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation
- Authors: Donggon Jang, Yucheol Cho, Suin Lee, Taehyeon Kim, DAE SHIK KIM
- Abstract: The fusion of Large Language Models (LLMs) with vision models is pioneering new possibilities in user-interactive vision-language tasks. A notable application is reasoning segmentation, where models generate pixel-level segmentation masks by comprehending implicit meanings in human instructions. How...
-
Occlusion-aware Non-Rigid Point Cloud Registration via Unsupervised Neural Deformation Correntropy
- Authors: Mingyang Zhao, Gaofeng Meng, Dong-ming Yan
- Abstract: Non-rigid alignment of point clouds is crucial for scene understanding, reconstruction, and various computer vision and robotics tasks. Recent advancements in implicit deformation networks for non-rigid registration have significantly reduced the reliance on large amounts of annotated training data....
-
Online Neuro-Symbolic Predicate Invention for High-Level Planning
- Authors: Yichao Liang, Nishanth Kumar, Hao Tang, Adrian Weller, Joshua B Tenenbaum, Tom Silver, Joao F. Henriques, Kevin Ellis
- Abstract: Broadly intelligent agents should form task-specific abstractions that selectively expose the essential elements of a task, while abstracting away the complexity of the raw sensorimotor space. In this work, we present Neuro-Symbolic Predicates, a first-order abstraction language that combines the st...
-
PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks
- Authors: Matthew Chang, Gunjan Chhablani, Alexander Clegg, Mikael Dallaire Cote, Ruta Desai, Michal Hlavac, Vladimir Karashchuk, Jacob Krantz, Roozbeh Mottaghi, Priyam Parashar, Siddharth Patki, Ishita Prasad, Xavier Puig, Akshara Rai, Ram Ramrakhya, Daniel Tran, Joanne Truong, John Turner, Eric Undersander, Tsung-Yen Yang
- Abstract: We present a benchmark for Planning And Reasoning Tasks in humaN-Robot collaboration (PARTNR) designed to study human-robot coordination in household activities. PARTNR tasks exhibit characteristics of everyday tasks, such as spatial, temporal, and heterogeneous agent capability constraints. We empl...
-
PEAR: Primitive Enabled Adaptive Relabeling for Boosting Hierarchical Reinforcement Learning
- Authors: Utsav Singh, Vinay Namboodiri
- Abstract: Hierarchical reinforcement learning (HRL) has the potential to solve complex long horizon tasks using temporal abstraction and increased exploration. However, hierarchical agents are difficult to train due to inherent non-stationarity. We present primitive enabled adaptive relabeling (PEAR), a two-p...
-
Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation
- Authors: Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, Jiangmiao Pang
- Abstract: Current efforts to learn scalable policies in robotic manipulation primarily fall into two categories: one focuses on "action," which involves behavior cloning from extensive collections of robotic data, while the other emphasizes "vision," enhancing model generalization by pre-training representati...
-
RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
- Authors: Sergio Gómez Colmenarejo, Jost Springenberg, Jose Enrique Chen, Jonathan Scholz, Raia Hadsell, Claudio Fantacci, Alex Lee, Maria Bauza Villalonga, Yuxiang Zhou, Dushyant Rao, Akhil Raju, Antoine Laurens, Murilo Fernandes Martins, Rugile Pevceviciute, Michiel Blokzijl, Nathan Batchelor, Konrad Zolna, Thomas Lampe, Agrim Gupta, Scott Reed, Abbas Abdolmaleki, David Barker, Joy Ortiz, Martin Riedmiller, Jean-Baptiste Regli, Nicolas Heess, Francesco Nori, Todor Davchev, Oleg O Sushkov, Thomas Rothörl, Misha Denil, Emilio Parisotto, Valentin Dalibard, Martina Zambelli, Yusuf Aytar, Giulia Vezzani, Coline Devin, Oliver Groth, Konstantinos Bousmalis
- Abstract: The ability to leverage heterogeneous robotic experience from different robots and tasks to quickly master novel skills and embodiments has the potential to transform robot learning. Inspired by recent advances in foundation models for vision and language, we propose a multi-embodiment, multi-task g...
-
Sensor-Invariant Tactile Representation
- Authors: Harsh Gupta, Yuchen Mo, Shengmiao Jin, Wenzhen Yuan
- Abstract: High-resolution tactile sensors have become critical for embodied perception and robotic manipulation. However, a key challenge in the field is the lack of transferability between sensors due to design and manufacturing variations, which result in significant differences in tactile signals. This lim...
-
SmartPretrain: Model-Agnostic and Dataset-Agnostic Representation Learning for Motion Prediction
- Authors: Yang Zhou, Hao Shao, Letian Wang, Steven Waslander, Hongsheng Li, Yu Liu
- Abstract: Predicting the future motion of surrounding agents is essential for autonomous vehicles (AVs) to operate safely in dynamic, human-robot-mixed environments. However, the scarcity of large-scale driving datasets has hindered the development of robust and generalizable motion prediction models, limitin...
-
SPA: 3D Spatial-Awareness Enables Effective Embodied Representation
- Authors: Haoyi Zhu, Honghui Yang, Yating Wang, Jiange Yang, Limin Wang, Tong He
- Abstract: In this paper, we introduce SPA, a novel representation learning framework that emphasizes the importance of 3D spatial awareness in embodied AI. Our approach leverages differentiable neural rendering on multi-view images to endow a vanilla Vision Transformer (ViT) with intrinsic spatial understandi...
-
STRAP: Robot Sub-Trajectory Retrieval for Augmented Policy Learning
- Authors: Marius Memmel, Jacob Berg, Bingqing Chen, Abhishek Gupta, Jonathan Francis
- Abstract: Robot learning is witnessing a significant increase in the size, diversity, and complexity of pre-collected datasets, mirroring trends in domains such as natural language processing and computer vision. Many robot learning methods treat such datasets as multi-task expert data and learn a multi-task,...
-
Student-Informed Teacher Training
- Authors: Nico Messikommer, Jiaxu Xing, Elie Aljalbout, Davide Scaramuzza
- Abstract: Imitation learning with a privileged teacher has proven effective for learning complex control behaviors from high-dimensional inputs, such as images. In this framework, a teacher is trained with privileged task information, while a student tries to predict the actions of the teacher with more limit...
-
Subtask-Aware Visual Reward Learning from Segmented Demonstrations
- Authors: Changyeon Kim, Minho Heo, Doohyun Lee, Honglak Lee, Jinwoo Shin, Joseph Lim, Kimin Lee
- Abstract: Reinforcement Learning (RL) agents have demonstrated their potential across various robotic tasks. However, they still heavily rely on human-engineered reward functions, requiring extensive trial-and-error and access to target behavior information, often unavailable in real-world settings. This pape...
-
The Value of Sensory Information to a Robot
- Authors: Arjun Krishna, Edward Hu, Dinesh Jayaraman
- Abstract: A decision-making agent, such as a robot, must observe and react to any new task-relevant information that becomes available from its environment. We seek to study a fundamental scientific question: what value does sensory information hold to an agent at various moments in time during the execution ...
-
TopoGaussian: Inferring Internal Topology Structures from Visual Clues
- Authors: Xiaoyu Xiong, Changyu Hu, Chunru Lin, Pingchuan Ma, Chuang Gan, Tao Du
- Abstract: We present TopoGaussian, a holistic, particle-based pipeline for inferring the interior structure of an opaque object from easily accessible photos and videos as input. Traditional mesh-based approaches require tedious and error-prone mesh filling and fixing process, while typically output rough bou...
-
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
- Authors: Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, Jianwei Yang
- Abstract: Although large vision-language-action (VLA) models pretrained on extensive robot datasets offer promising generalist policies for robotic learning, they still struggle with spatial-temporal dynamics in interactive robotics, making them less effective in handling complex tasks, such as manipulation. ...
-
Vision Language Models are In-Context Value Learners
- Authors: Yecheng Jason Ma, Joey Hejna, Chuyuan Fu, Dhruv Shah, Jacky Liang, Zhuo Xu, Sean Kirmani, Peng Xu, Danny Driess, Ted Xiao, Osbert Bastani, Dinesh Jayaraman, Wenhao Yu, Tingnan Zhang, Dorsa Sadigh, Fei Xia
- Abstract: Predicting temporal progress from visual trajectories is important for intelligent robots that can learn, adapt, and improve. However, learning such progress estimator, or temporal value function, across different tasks and domains requires both a large amount of diverse data and methods which can s...
-
VisualAgentBench: Towards Large Multimodal Models as Visual Agents
- Authors: Xiao Liu, Tianjie Zhang, Yu Gu, Iat Long Iong, Song XiXuan, Yifan Xu, Shudan Zhang, Hanyu Lai, Jiadai Sun, Xinyue Yang, Yu Yang, Zehan Qi, Shuntian Yao, Xueqiao Sun, Siyi Cheng, Qinkai Zheng, Hao Yu, Hanchen Zhang, Wenyi Hong, Ming Ding, Lihang Pan, Xiaotao Gu, Aohan Zeng, Zhengxiao Du, Chan Hee Song, Yu Su, Yuxiao Dong, Jie Tang
- Abstract: Large Multimodal Models (LMMs) have ushered in a new era in artificial intelligence, merging capabilities in both language and vision to form highly capable visual agents that are postulated to excel across a myriad of tasks. However, existing benchmarks fail to sufficiently challenge or showcase t...
-
VLAS: Vision-Language-Action Model with Speech Instructions for Customized Robot Manipulation
- Authors: Wei Zhao, Pengxiang Ding, Zhang Min, Zhefei Gong, Shuanghao Bai, Han Zhao, Donglin Wang
- Abstract: Vision-language-action models (VLAs) have recently become highly prevalent in robot manipulation due to its end-to-end architecture and impressive performance. However, current VLAs are limited to processing human instructions in textual form, neglecting the more natural speech modality for human in...
-
- Authors: Qingtao Liu, Yu Cui, Zhengnan Sun, Gaofeng Li, Jiming Chen, Qi Ye
- Abstract: Vision and touch are the most commonly used senses in human manipulation. While leveraging human manipulation videos for robotic task pretraining has shown promise in prior works, it is limited to image and language modalities and deployment to simple parallel grippers. In this paper, aiming to addr...
-
What Matters in Learning from Large-Scale Datasets for Robot Manipulation
- Authors: Vaibhav Saxena, Matthew Bronars, Nadun Ranawaka Arachchige, Kuancheng Wang, Woo Shin, Soroush Nasiriany, Ajay Mandlekar, Danfei Xu
- Abstract: Imitation learning from large multi-task demonstration datasets has emerged as a promising path for building generally-capable robots. As a result, 1000s of hours have been spent on building such large-scale datasets around the globe. Despite the continuous growth of such efforts, we still lack a sy...
-
X-Fi: A Modality-Invariant Foundation Model for Multimodal Human Sensing
- Authors: Xinyan Chen, Jianfei Yang
- Abstract: Human sensing, which employs various sensors and advanced deep learning technologies to accurately capture and interpret human body information, has significantly impacted fields like public security and robotics. However, current human sensing primarily depends on modalities such as cameras and LiD...
-
X-Gen: Ego-centric Video Prediction by Watching Exo-centric Videos
- Authors: Jilan Xu, Yifei Huang, Baoqi Pei, Junlin Hou, Qingqiu Li, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie
- Abstract: Generating videos in the first-person perspective has broad application prospects in the field of augmented reality and embodied intelligence.In this work, we explore the cross-view video prediction task, where given an exo-centric video, the first frame of the corresponding ego-centric video, and t...
-
3D Vision-Language Gaussian Splatting
- Authors: Qucheng Peng, Benjamin Planche, Zhongpai Gao, Meng Zheng, Anwesa Choudhuri, Terrence Chen, Chen Chen, Ziyan Wu
- Abstract: Recent advancements in 3D reconstruction methods and vision-language models have propelled the development of multi-modal 3D scene understanding, which has vital applications in robotics, autonomous driving, and virtual/augmented reality. However, current multi-modal scene understanding approaches h...
-
ADAM: An Embodied Causal Agent in Open-World Environments
- Authors: Shu Yu, Chaochao Lu
- Abstract: In open-world environments like Minecraft, existing agents face challenges in continuously learning structured knowledge, particularly causality. These challenges stem from the opacity inherent in black-box models and an excessive reliance on prior knowledge during training, which impair their inter...
-
AdaManip: Adaptive Articulated Object Manipulation Environments and Policy Learning
- Authors: Yuanfei Wang, Xiaojie Zhang, Ruihai Wu, Yu Li, Yan Shen, Mingdong Wu, Zhaofeng He, Yizhou Wang, Hao Dong
- Abstract: Articulated object manipulation is a critical capability for robots to perform various tasks in real-world scenarios. Composed of multiple parts connected by joints, articulated objects are endowed with diverse functional mechanisms through complex relative motions. For example, a safe consists of a...
-
A deep inverse-mapping model for a flapping robotic wing
- Authors: Hadar Sharvit, Raz Karl, Tsevi Beatus
- Abstract: In systems control, the dynamics of a system are governed by modulating its inputs to achieve a desired outcome. For example, to control the thrust of a quad-copter propeller the controller modulates its rotation rate, relying on a straightforward mapping between the input rotation rate and the resu...
-
A Distributional Approach to Uncertainty-Aware Preference Alignment Using Offline Demonstrations
- Authors: Sheng Xu, Bo Yue, Hongyuan Zha, Guiliang Liu
- Abstract: Designing reward functions in Reinforcement Learning (RL) often demands significant task-specific expertise. Offline preference-based Reinforcement Learning (PbRL) provides an effective alternative to address the complexity of reward design by learning policies from offline datasets that contain hum...
-
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation
- Authors: Jiafei Duan, Wilbert Pumacay, Nishanth Kumar, Yi Ru Wang, Shulin Tian, Wentao Yuan, Ranjay Krishna, Dieter Fox, Ajay Mandlekar, Yijie Guo
- Abstract: Robotic manipulation in open-world settings requires not only task execution but also the ability to detect and learn from failures. While recent advances in vision-language models (VLMs) and large language models (LLMs) have improved robots' spatial reasoning and problem-solving abilities, they sti...
-
Behavioral Entropy-Guided Dataset Generation for Offline Reinforcement Learning
- Authors: Wesley Suttle, Aamodh Suresh, Carlos Nieto-Granda
- Abstract: Entropy-based objectives are widely used to perform state space exploration in reinforcement learning (RL) and dataset generation for offline RL. Behavioral entropy (BE), a rigorous generalization of classical entropies that incorporates cognitive and perceptual biases of agents, was recently propos...
-
Bidirectional Decoding: Improving Action Chunking via Closed-Loop Resampling
- Authors: Yuejiang Liu, Jubayer Hamid, Annie Xie, Yoonho Lee, Max Du, Chelsea Finn
- Abstract: Predicting and executing a sequence of actions without intermediate replanning, known as action chunking, is increasingly used in robot learning from human demonstrations. Yet, its reported effects on the learned policy are inconsistent: some studies find it crucial for achieving strong results, whi...
-
Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel
- Authors: Zun Wang, Jialu Li, Yicong Hong, Songze Li, Kunchang Li, Shoubin Yu, Yi Wang, Yu Qiao, Yali Wang, Mohit Bansal, Limin Wang
- Abstract: Creating high-quality data for training robust language-instructed agents is a long-lasting challenge in embodied AI. In this paper, we introduce a Self-Refining Data Flywheel (SRDF) that generates high-quality and large-scale navigational instruction-trajectory pairs by iteratively refining the dat...
-
Breaking Neural Network Scaling Laws with Modularity
- Authors: Akhilan Boopathy, Sunshine Jiang, William Yue, Jaedong Hwang, Abhiram Iyer, Ila Fiete
- Abstract: Modular neural networks outperform nonmodular neural networks on tasks ranging from visual question answering to robotics. These performance improvements are thought to be due to modular networks' superior ability to model the compositional and combinatorial structure of real-world problems. However...
-
CaPo: Cooperative Plan Optimization for Efficient Embodied Multi-Agent Cooperation
- Authors: Jie Liu, Pan Zhou, Yingjun Du, Ah-Hwee Tan, Cees G Snoek, Jan-jakob Sonke, Efstratios Gavves
- Abstract: In this work, we address the cooperation problem among large language model (LLM) based embodied agents, where agents must cooperate to achieve a common goal. Previous methods often execute actions extemporaneously and incoherently, without long-term strategic and cooperative planning, leading to r...
-
COMBO: Compositional World Models for Embodied Multi-Agent Cooperation
- Authors: Hongxin Zhang, Zeyuan Wang, Qiushi Lyu, Zheyuan Zhang, Sunli Chen, Tianmin Shu, Yilun Du, Behzad Dariush, Kwonjoon Lee, Chuang Gan
- Abstract: In this paper, we investigate the problem of embodied multi-agent cooperation, where decentralized agents must cooperate given only egocentric views of the world. To effectively plan in this setting, in contrast to learning world dynamics in a single-agent scenario, we must simulate world dynamics c...
-
Conformalized Interactive Imitation Learning: Handling Expert Shift and Intermittent Feedback
- Authors: Michelle Zhao, Henny Admoni, Reid Simmons, Aaditya Ramdas, Andrea Bajcsy
- Abstract: In interactive imitation learning (IL), uncertainty quantification offers a way for the learner (i.e. robot) to contend with distribution shifts encountered during deployment by actively seeking additional feedback from an expert (i.e. human) online. Prior works use mechanisms like ensemble disagree...
-
Contractive Dynamical Imitation Policies for Efficient Out-of-Sample Recovery
- Authors: Amin Soleimani Abyaneh, Mahrokh Boroujeni, Hsiu-Chin Lin, Giancarlo Ferrari-Trecate
- Abstract: Imitation learning is a data-driven approach to learning policies from expert behavior, but it is prone to unreliable outcomes in out-of-sample (OOS) regions. While previous research on stable dynamical system policies guarantees convergence to a desired state, it often overlooks transient behavior....
-
Cross-Embodiment Dexterous Grasping with Reinforcement Learning
- Authors: Haoqi Yuan, Bohan Zhou, Yuhui Fu, Zongqing Lu
- Abstract: Dexterous hands exhibit significant potential for complex real-world grasping tasks. While recent studies have primarily focused on learning policies for specific robotic hands, the development of a universal policy that controls diverse dexterous hands remains largely unexplored.In this work, we st...
-
Data Scaling Laws in Imitation Learning for Robotic Manipulation
- Authors: Fanqi Lin, Yingdong Hu, Pingyue Sheng, Chuan Wen, Jiacheng You, Yang Gao
- Abstract: Data scaling has revolutionized fields like natural language processing and computer vision, providing models with remarkable generalization capabilities. In this paper, we investigate whether similar data scaling laws exist in robotics, particularly in robotic manipulation, and whether appropriate ...
-
DenseMatcher: Learning 3D Semantic Correspondence for Category-Level Manipulation from One Demo
- Authors: Junzhe Zhu, Yuanchen Ju, Junyi Zhang, Muhan Wang, Zhecheng Yuan, Kaizhe Hu, Huazhe Xu
- Abstract: Dense 3D correspondence can enhance robotic manipulation by enabling the generalization of spatial, functional, and dynamic information from one object to an unseen counterpart. Compared to shape correspondence, semantic correspondence is more effective in generalizing across different object catego...
-
- Authors: Xueyi Liu, Jianibieke Adalibieke, Qianwei Han, Yuzhe Qin, Li Yi
- Abstract: We address the challenge of developing a generalizable neural tracking controller for dexterous manipulation from human references. This controller aims to manage a dexterous robot hand to manipulate diverse objects for various purposes defined by kinematic human-object interactions. Developing such...
-
Diffusion Policy Policy Optimization
- Authors: Allen Z. Ren, Justin Lidard, Lars Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, Max Simchowitz
- Abstract: We introduce Diffusion Policy Policy Optimization, DPPO, an algorithmic framework including best practices for fine-tuning diffusion-based policies (e.g. Diffusion Policy) in continuous control and robot learning tasks using the policy gradient (PG) method from reinforcement learning (RL). PG method...
-
Dobi-SVD: Differential SVD for LLM Compression and Some New Perspectives
- Authors: Qinsi Wang, Jinghan Ke, Masayoshi Tomizuka, Kurt Keutzer, Chenfeng Xu
- Abstract: Large language models (LLMs) have sparked a new wave of AI applications; however, their substantial computational costs and memory demands pose significant challenges to democratizing access to LLMs for a broader audience. Singular Value Decomposition (SVD), a technique studied for decades, offers a...
-
Dream to Manipulate: Compositional World Models Empowering Robot Imitation Learning with Imagination
- Authors: Leonardo Barcellona, Andrii Zadaianchuk, Davide Allegro, Samuele Papa, Stefano Ghidoni, Efstratios Gavves
- Abstract: A world model provides an agent with a representation of its environment, enabling it to predict the causal consequences of its actions. Current world models typically cannot directly and explicitly imitate the actual environment in front of a robot, often resulting in unrealistic behaviors and hall...
-
- Authors: Wei Hung, Shao-Hua Sun, Ping-Chun Hsieh
- Abstract: Action-constrained reinforcement learning (ACRL) is a generic framework for learning control policies with zero action constraint violation, which is required by various safety-critical and resource-constrained applications. The existing ACRL methods can typically achieve favorable constraint satisf...
-
Efficient Active Imitation Learning with Random Network Distillation
- Authors: Emilien Biré, Anthony Kobanda, Ludovic Denoyer, Rémy Portelas
- Abstract: Developing agents for complex and underspecified tasks, where no clear objective exists, remains challenging but offers many opportunities. This is especially true in video games, where simulated players (bots) need to play realistically, and there is no clear reward to evaluate them. While imitatio...
-
Efficient Diffusion Transformer Policies with Mixture of Expert Denoisers for Multitask Learning
- Authors: Moritz Reuss, Jyothish Pari, Pulkit Agrawal, Rudolf Lioutikov
- Abstract: Diffusion Policies have become widely used in Imitation Learning, offering several appealing properties, such as generating multimodal and discontinuous behavior.As models are becoming larger to capture more complex capabilities, their computational demands increase, as shown by recent scaling laws....
-
Efficient Imitation under Misspecification
- Authors: Nicolas Espinosa Dice, Sanjiban Choudhury, Wen Sun, Gokul Swamy
- Abstract: Interactive imitation learning (IL) is a powerful paradigm for learning to make sequences of decisions from an expert demonstrating how to perform a task. Prior work in efficient imitation learning has focused on the realizable setting, where the expert's policy lies within the learner's policy clas...
-
Efficient Model-Based Reinforcement Learning Through Optimistic Thompson Sampling
- Authors: Jasmine Bayrooti, Carl Ek, Amanda Prorok
- Abstract: Learning complex robot behavior through interactions with the environment necessitates principled exploration. Effective strategies should prioritize exploring regions of the state-action space that maximize rewards, with optimistic exploration emerging as a promising direction aligned with this ide...
-
Efficient Residual Learning with Mixture-of-Experts for Universal Dexterous Grasping
- Authors: Ziye Huang, Haoqi Yuan, Yuhui Fu, Zongqing Lu
- Abstract: Universal dexterous grasping across diverse objects presents a fundamental yet formidable challenge in robot learning. Existing approaches using reinforcement learning (RL) to develop policies on extensive object datasets face critical limitations, including complex curriculum design for multi-task ...
-
ET-SEED: EFFICIENT TRAJECTORY-LEVEL SE(3) EQUIVARIANT DIFFUSION POLICY
- Authors: Chenrui Tie, Yue Chen, Ruihai Wu, Boxuan Dong, Zeyi Li, Chongkai Gao, Hao Dong
- Abstract: Imitation learning, e.g., diffusion policy, has been proven effective in various robotic manipulation tasks.However, extensive demonstrations are required for policy robustness and generalization.To reduce the demonstration reliance, we leverage spatial symmetry and propose ET-SEED, an efficient tra...
-
FLIP: Flow-Centric Generative Planning for General-Purpose Manipulation Tasks
- Authors: Chongkai Gao, Haozhuo Zhang, Zhixuan Xu, Cai Zhehao, Lin Shao
- Abstract: We aim to develop a model-based planning framework for world models that can be scaled with increasing model and data budgets for general-purpose manipulation tasks with only language and vision inputs. To this end, we present FLow-CentrIc generative Planning (FLIP), a model-based planning algorithm...
-
Following the Human Thread in Social Navigation
- Authors: Luca Scofano, Alessio Sampieri, Tommaso Campari, Valentino Sacco, Indro Spinelli, Lamberto Ballan, Fabio Galasso
- Abstract: The success of collaboration between humans and robots in shared environments relies on the robot's real-time adaptation to human motion. Specifically, in Social Navigation, the agent should be close enough to assist but ready to back up to let the human move freely, avoiding collisions. Human traje...
-
FOSP: Fine-tuning Offline Safe Policy through World Models
- Authors: Chenyang Cao, Yucheng Xin, Silang Wu, Longxiang He, Zichen Yan, Junbo Tan, Xueqian Wang
- Abstract: Offline Safe Reinforcement Learning (RL) seeks to address safety constraints by learning from static datasets and restricting exploration. However, these approaches heavily rely on the dataset and struggle to generalize to unseen scenarios safely. In this paper, we aim to improve safety during the d...
-
Generalizable Motion Planning via Operator Learning
- Authors: Sharath Matada, Luke Bhan, Yuanyuan Shi, Nikolay Atanasov
- Abstract: In this work, we introduce a planning neural operator (PNO) for predicting the value function of a motion planning problem. We recast value function approximation as learning a single operator from the cost function space to the value functionspace, which is defined by an Eikonal partial differentia...
-
Generalized Behavior Learning from Diverse Demonstrations
- Authors: Varshith Sreeramdass, Rohan Paleja, Letian Chen, Sanne van Waveren, Matthew Gombolay
- Abstract: Diverse behavior policies are valuable in domains requiring quick test-time adaptation or personalized human-robot interaction. Human demonstrations provide rich information regarding task objectives and factors that govern individual behavior variations, which can be used to characterize \it{useful...
-
General Scene Adaptation for Vision-and-Language Navigation
- Authors: Haodong Hong, Yanyuan Qiao, Sen Wang, Jiajun Liu, Qi Wu
- Abstract: Vision-and-Language Navigation (VLN) tasks mainly evaluate agents based on one-time execution of individual instructions across multiple environments, aiming to develop agents capable of functioning in any environment in a zero-shot manner. However, real-world navigation robots often operate in pers...
-
Generating Freeform Endoskeletal Robots
- Authors: Muhan Li, Lingji Kong, Sam Kriegman
- Abstract: The automatic design of embodied agents (e.g. robots) has existed for 31 years and is experiencing a renaissance of interest in the literature. To date however, the field has remained narrowly focused on two kinds of anatomically simple robots: (1) fully rigid, jointed bodies; and (2) fully soft, jo...
-
Geometry-aware RL for Manipulation of Varying Shapes and Deformable Objects
- Authors: Tai Hoang, Huy Le, Philipp Becker, Vien A Ngo, Gerhard Neumann
- Abstract: Manipulating objects with varying geometries and deformable objects is a major challenge in robotics. Tasks such as insertion with different objects or cloth hanging require precise control and effective modelling of complex dynamics. In this work, we frame this problem through the lens of a heterog...
-
GEVRM: Goal-Expressive Video Generation Model For Robust Visual Manipulation
- Authors: Hongyin Zhang, Pengxiang Ding, Shangke Lyu, Ying Peng, Donglin Wang
- Abstract: With the rapid development of embodied artificial intelligence, significant progress has been made in vision-language-action (VLA) models for general robot decision-making. However, the majority of existing VLAs fail to account for the inevitable external perturbations encountered during deployment....
-
GravMAD: Grounded Spatial Value Maps Guided Action Diffusion for Generalized 3D Manipulation
- Authors: Yangtao Chen, Chen, Junhui Yin, Jing Huo, Pinzhuo Tian, Jieqi Shi, Yang Gao
- Abstract: Robots' ability to follow language instructions and execute diverse 3D tasks is vital in robot learning. Traditional imitation learning-based methods perform well on seen tasks but struggle with novel, unseen ones due to variability. Recent approaches leverage large foundation models to assist in un...
-
GROOT-2: Weakly Supervised Multimodal Instruction Following Agents
- Authors: Shaofei Cai, Bowei Zhang, Zihao Wang, Haowei Lin, Xiaojian Ma, Anji Liu, Yitao Liang
- Abstract: Developing agents that can follow multimodal instructions remains a fundamental challenge in robotics and AI. Although large-scale pre-training on unlabeled datasets has enabled agents to learn diverse behaviors, these agents often struggle with following instructions. While augmenting the dataset w...
-
HASARD: A Benchmark for Harnessing Safe Reinforcement Learning with Doom
- Authors: Tristan Tomilin, Meng Fang, Mykola Pechenizkiy
- Abstract: The advancement of safe reinforcement learning (RL) faces numerous obstacles, including the lack of simulation environments, demanding computational requirements, and a lack of widely accepted benchmarks. To address these challenges, we introduce HASARD (A Benchmark for HArnessing SAfe *...
-
HGM³: Hierarchical Generative Masked Motion Modeling with Hard Token Mining
- Authors: Minjae Jeong, Yechan Hwang, Jaejin Lee, Sungyoon Jung, Won Hwa Kim
- Abstract: Text-to-motion generation has significant potential in a wide range of applications including animation, robotics, and AR/VR. While recent works on masked motion models are promising, the task remains challenging due to the inherent ambiguity in text and the complexity of human motion dynamics. To o...
-
Instant Policy: In-Context Imitation Learning via Graph Diffusion
- Authors: Vitalis Vosylius, Edward Johns
- Abstract: Following the impressive capabilities of in-context learning with large transformers, In-Context Imitation Learning (ICIL) is a promising opportunity for robotics. We introduce Instant Policy, which learns new tasks instantly from just one or two demonstrations, achieving ICIL through two key compon...
-
Intelligent Go-Explore: Standing on the Shoulders of Giant Foundation Models
- Authors: Cong Lu, Shengran Hu, Jeff Clune
- Abstract: Go-Explore is a powerful family of algorithms designed to solve hard-exploration problems built on the principle of archiving discovered states, and iteratively returning to and exploring from the most promising states. This approach has led to superhuman performance across a wide variety of challen...
-
Joint Reward and Policy Learning with Demonstrations and Human Feedback Improves Alignment
- Authors: Chenliang Li, Siliang Zeng, Zeyi Liao, Jiaxiang Li, Dongyeop Kang, Alfredo Garcia, Mingyi Hong
- Abstract: Aligning to human preferences and/or intentions is an important requirement for contemporary foundation models. To ensure alignment, popular approaches such as reinforcement learning with human feedback (RLHF) break down the task into three stages: (i) a model is computed with supervised fine-tuning...
-
Kinetix: Investigating the Training of General Agents through Open-Ended Physics-Based Control Tasks
- Authors: Michael Matthews, Michael Beukman, Chris Lu, Jakob Foerster
- Abstract: While large models trained with self-supervised learning on offline datasets have shown remarkable capabilities in text and image domains, achieving the same generalisation for agents that act in sequential decision problems remains an open challenge.In this work, we take a step towards this goal by...
-
Learning Closed-Loop Concept-Guided Policies from Unlabeled Demonstrations
- Authors: Pei Zhou, Ruizhe Liu, Qian Luo, Yibing Song, Fan Wang, Yanchao Yang
- Abstract: Training embodied agents to perform complex robotic tasks presents significant challenges due to the entangled factors of task compositionality, environmental diversity, and dynamic changes. In this work, we introduce a novel imitation learning framework to train closed-loop concept-guided policies ...
-
Learning Geometric Reasoning Networks For Robot Task And Motion Planning
- Authors: Smail Ait Bouhsain, Rachid Alami, Thierry Simeon
- Abstract: Task and Motion Planning (TAMP) is a computationally challenging robotics problem due to the tight coupling of discrete symbolic planning and continuous geometric planning of robot motions. In particular, planning manipulation tasks in complex 3D environments leads to a large number of costly geomet...
-
Learning Unified Static-Dynamic Representation across Multiple Visuo-tactile Sensors
- Authors: Ruoxuan Feng, Jiangyu Hu, Wenke Xia, Tianci Gao, Ao Shen, Yuhao Sun, Bin Fang, Di Hu
- Abstract: Visuo-tactile sensors aim to emulate human tactile perception, enabling robots to precisely understand and manipulate objects. Over time, numerous meticulously designed visuo-tactile sensors have been integrated into robotic systems, aiding in completing various tasks. However, the distinct data cha...
-
Learning View-invariant World Models for Visual Robotic Manipulation
- Authors: Jing-Cheng Pang, Nan Tang, Kaiyuan Li, Yuting Tang, Xin-Qiang Cai, Zhen-Yu Zhang, Gang Niu, Masashi Sugiyama, Yang Yu
- Abstract: Robotic manipulation tasks often rely on visual inputs from cameras to perceive the environment. However, previous approaches still suffer from performance degradation when the camera’s viewpoint changes during manipulation. In this paper, we propose ReViWo (Representation learning for View-invarian...
-
Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning
- Authors: Calarina Muslimani, Matthew E Taylor
- Abstract: To create useful reinforcement learning (RL) agents, step zero is to design a suitable reward function that captures the nuances of the task. However, reward engineering can be a difficult and time-consuming process. Instead, human-in-the-loop (HitL) RL methods hold the promise of learning reward f...
-
LLaRA: Supercharging Robot Learning Data for Vision-Language Policy
- Authors: Xiang Li, Cristina Mata, Jongwoo Park, Kumara Kahatapitiya, Yoo Jang, Jinghuan Shang, Kanchana Ranasinghe, Ryan Burgert, Mu Cai, Yong Jae Lee, Michael Ryoo
- Abstract: LLMs with visual inputs, i.e., Vision Language Models (VLMs), have the capacity to process state information as visual-textual prompts and respond with policy decisions in text. We propose LLaRA: Large Language and Robotics Assistant, a framework that formulates robot action policy as conversations ...
-
ManiSkill-HAB: A Benchmark for Low-Level Manipulation in Home Rearrangement Tasks
- Authors: Arth Shukla, Stone Tao, Hao Su
- Abstract: High-quality benchmarks are the foundation for embodied AI research, enabling significant advancements in long-horizon navigation, manipulation and rearrangement tasks. However, as frontier tasks in robotics get more advanced, they require faster simulation speed, more intricate test environments, a...
-
MetaUrban: An Embodied AI Simulation Platform for Urban Micromobility
- Authors: Wayne Wu, Honglin He, Jack He, Yiran Wang, Chenda Duan, Zhizheng Liu, Quanyi Li, Bolei Zhou
- Abstract: Public urban spaces like streetscapes and plazas serve residents and accommodate social life in all its vibrant variations. Recent advances in Robotics and Embodied AI make public urban spaces no longer exclusive to humans. Food delivery bots and electric wheelchairs have started sharing sidewalks w...
-
MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents
- Authors: Junpeng Yue, Xinrun Xu, Börje Karlsson, Zongqing Lu
- Abstract: MLLM agents demonstrate potential for complex embodied tasks by retrieving multimodal task-relevant trajectory data. However, current retrieval methods primarily focus on surface-level similarities of textual or visual cues in trajectories, neglecting their effectiveness for the specific task at han...
-
Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning
- Authors: Baoqi Pei, Yifei Huang, Jilan Xu, Guo Chen, Yuping He, Lijin Yang, Yali Wang, Weidi Xie, Yu Qiao, Fei Wu, Limin Wang
- Abstract: In egocentric video understanding, the motion of hands and objects as well as their interactions play a significant role by nature.However, existing egocentric video representation learning methods mainly focus on aligning video representation with high-level narrations, overlooking the intricate dy...
-
- Authors: Xinyue Wang, Biwei Huang
- Abstract: Generalization in reinforcement learning (RL) remains a significant challenge, especially when agents encounter novel environments with unseen dynamics. Drawing inspiration from human compositional reasoning—where known components are reconfigured to handle new situations—we introduce World Modeling...
-
Motion Control of High-Dimensional Musculoskeletal System with Hierarchical Model-Based Planning
- Authors: Yunyue Wei, Shanning Zhuang, Vincent Zhuang, Yanan Sui
- Abstract: Controlling high-dimensional nonlinear systems presents significant challenges in biological and robotic applications due to the large state and action spaces. While deep reinforcement learning has emerged as the leading approach, it suffers from computationally-intensive and time-consuming, and are...
-
Multi-Robot Motion Planning with Diffusion Models
- Authors: Yorai Shaoul, Itamar Mishani, Shivam Vats, Jiaoyang Li, Maxim Likhachev
- Abstract: Diffusion models have recently been successfully applied to a wide range of robotics applications for learning complex multi-modal behaviors from data. However, prior works have mostly been confined to single-robot and small-scale environments due to the high sample complexity of learning multi-robo...
-
- Authors: Qixin ZHANG, Zongqi Wan, Yu Yang, Li Shen, Dacheng Tao
- Abstract: Coordinating multiple agents to collaboratively maximize submodular functions in unpredictable environments is a critical task with numerous applications in machine learning, robot planning and control. The existing approaches, such as the OSG algorithm, are often hindered by their poor approximati...
-
Neural Wave Equation for Irregularly Sampled Sequence Data
- Authors: Arkaprava Majumdar, M Krishna, P. K. Srijith
- Abstract: Sequence labeling problems arise in several real-world applications such as healthcare and robotics. In many such applications, sequence data are irregularly sampled and are of varying complexities. Recently, efforts have been made to develop neural ODE-based architectures to model the evolution of ...
-
- Authors: Anish Abhijit Diwan, Julen Urain, Jens Kober, Jan Peters
- Abstract: This paper introduces a new imitation learning framework based on energy-based generative models capable of learning complex, physics-dependent, robot motion policies through state-only expert motion trajectories. Our algorithm, called Noise-conditioned Energy-based Annealed Rewards (NEAR), construc...
-
Null Counterfactual Factor Interactions for Goal-Conditioned Reinforcement Learning
- Authors: Caleb Chuck, Fan Feng, Carl Qi, Chang Shi, Siddhant Agarwal, Amy Zhang, Scott Niekum
- Abstract: Hindsight relabeling is a powerful tool for overcoming sparsity in goal-conditioned reinforcement learning (GCRL). While effective in some domains like navigation and locomotion, hindsight relabeling can struggle in object-centric domains. For example, suppose that the goal space consists of a robot...
-
Offline Hierarchical Reinforcement Learning via Inverse Optimization
- Authors: Carolin Schmidt, Daniele Gammelli, James Harrison, Marco Pavone, Filipe Rodrigues
- Abstract: Hierarchical policies enable strong performance in many sequential decision-making problems, such as those with high-dimensional action spaces, those requiring long-horizon planning, and settings with sparse rewards. However, learning hierarchical policies from static offline datasets presents a sig...
-
Online Neuro-Symbolic Predicate Invention for High-Level Planning
- Authors: Yichao Liang, Nishanth Kumar, Hao Tang, Adrian Weller, Joshua B Tenenbaum, Tom Silver, Joao F. Henriques, Kevin Ellis
- Abstract: Broadly intelligent agents should form task-specific abstractions that selectively expose the essential elements of a task, while abstracting away the complexity of the raw sensorimotor space. In this work, we present Neuro-Symbolic Predicates, a first-order abstraction language that combines the st...
-
PEAR: Primitive Enabled Adaptive Relabeling for Boosting Hierarchical Reinforcement Learning
- Authors: Utsav Singh, Vinay Namboodiri
- Abstract: Hierarchical reinforcement learning (HRL) has the potential to solve complex long horizon tasks using temporal abstraction and increased exploration. However, hierarchical agents are difficult to train due to inherent non-stationarity. We present primitive enabled adaptive relabeling (PEAR), a two-p...
-
Physics-informed Temporal Difference Metric Learning for Robot Motion Planning
- Authors: Ruiqi Ni, zherong pan, Ahmed Qureshi
- Abstract: The motion planning problem involves finding a collision-free path from a robot's starting to its target configuration. Recently, self-supervised learning methods have emerged to tackle motion planning problems without requiring expensive expert demonstrations. They solve the Eikonal equation for tr...
-
POGEMA: A Benchmark Platform for Cooperative Multi-Agent Navigation
- Authors: Alexey Skrynnik, Anton Andreychuk, Anatolii Borzilov, Alexander Chernyavskiy, Konstantin Yakovlev, Aleksandr Panov
- Abstract: Multi-agent reinforcement learning (MARL) has recently excelled in solving challenging cooperative and competitive multi-agent problems in various environments with, mostly, few agents and full observability. Moreover, a range of crucial robotics-related tasks, such as multi-robot navigation and obs...
-
Policy Decorator: Model-Agnostic Online Refinement for Large Policy Model
- Authors: Xiu Yuan, Tongzhou Mu, Stone Tao, Yunhao Fang, Zhang, Hao Su
- Abstract: Recent advancements in robot learning have used imitation learning with large models and extensive demonstrations to develop effective policies. However, these models are often limited by the quantity quality, and diversity of demonstrations. This paper explores improving offline-trained imitation l...
-
Predicate Hierarchies Improve Few-Shot State Classification
- Authors: Emily Jin, Joy Hsu, Jiajun Wu
- Abstract: State classification of objects and their relations is core to many long-horizon tasks, particularly in robot planning and manipulation. However, the combinatorial explosion of possible object-predicate combinations, coupled with the need to adapt to novel real-world environments, makes it a desider...
-
PWM: Policy Learning with Multi-Task World Models
- Authors: Ignat Georgiev, Varun Giridhar, Nicklas Hansen, Animesh Garg
- Abstract: Reinforcement Learning (RL) has made significant strides in complex tasks but struggles in multi-task settings with different embodiments. World models methods offer scalability by learning a simulation of the environment, but often rely on inefficient gradient-free optimization methods for policy e...
-
Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning
- Authors: Joey Hong, Anca Dragan, Sergey Levine
- Abstract: Value-based reinforcement learning (RL) can in principle learn effective policies for a wide range of multi-turn problems, from games to dialogue to robotic control, including via offline RL from static previously collected datasets. However, despite the widespread use of policy gradient methods to ...
-
Rapidly Adapting Policies to the Real-World via Simulation-Guided Fine-Tuning
- Authors: Patrick Yin, Tyler Westenbroek, Ching-An Cheng, Andrey Kolobov, Abhishek Gupta
- Abstract: Robot learning requires a considerable amount of data to realize the promise of generalization. However, it can be challenging to actually collect the magnitude of high-quality data necessary for generalization entirely in the real world. Simulation can serve as a source of plentiful data, wherein t...
-
- Authors: Charles Dawson, Van Tran, Max Li, Chuchu Fan
- Abstract: Increased deployment of autonomous systems in fields like transportation and robotics have seen a corresponding increase in safety-critical failures. These failures can be difficult to model and debug due to the relative lack of data: compared to tens of thousands of examples from normal operations,...
-
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
- Authors: Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, Jun Zhu
- Abstract: Bimanual manipulation is essential in robotics, yet developing foundation models is extremely challenging due to the inherent complexity of coordinating two robot arms (leading to multi-modal action distributions) and the scarcity of training data. In this paper, we present the Robotics Diffusion Tr...
-
ReGen: Generative Robot Simulation via Inverse Design
- Authors: Peter (Phat) Nguyen, Johnson (Tsun-Hsuan) Wang, Zhang-Wei Hong, Erfan Aasi, Andrew Silva, Guy Rosman, Sertac Karaman, Daniela Rus
- Abstract: Simulation plays a key role in scaling robot learning and validating policies, but constructing simulations remains labor-intensive. In this paper, we introduce ReGen, a generative simulation framework that automates this process using inverse design. Given an agent's behavior (such as a motion traj...
-
REGENT: A Retrieval-Augmented Generalist Agent That Can Act In-Context in New Environments
- Authors: Kaustubh Sridhar, Souradeep Dutta, Dinesh Jayaraman, Insup Lee
- Abstract: Do generalist agents only require large models pre-trained on massive amounts of data to rapidly adapt to new environments? We propose a novel approach to pre-train relatively small models and adapt them to unseen environments via in-context learning, without any finetuning. Our key idea is that ret...
-
- Authors: Sumeet Singh, Vikas Sindhwani, Stephen Tu
- Abstract: A crucial design decision for any robot learning pipeline is the choice of policy representation: what type of model should be used to generate the next set of robot actions? Owing to the inherent multi-modal nature of many robotic tasks, combined with the recent successes in generative modeling, re...
-
RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
- Authors: Sergio Gómez Colmenarejo, Jost Springenberg, Jose Enrique Chen, Jonathan Scholz, Raia Hadsell, Claudio Fantacci, Alex Lee, Maria Bauza Villalonga, Yuxiang Zhou, Dushyant Rao, Akhil Raju, Antoine Laurens, Murilo Fernandes Martins, Rugile Pevceviciute, Michiel Blokzijl, Nathan Batchelor, Konrad Zolna, Thomas Lampe, Agrim Gupta, Scott Reed, Abbas Abdolmaleki, David Barker, Joy Ortiz, Martin Riedmiller, Jean-Baptiste Regli, Nicolas Heess, Francesco Nori, Todor Davchev, Oleg O Sushkov, Thomas Rothörl, Misha Denil, Emilio Parisotto, Valentin Dalibard, Martina Zambelli, Yusuf Aytar, Giulia Vezzani, Coline Devin, Oliver Groth, Konstantinos Bousmalis
- Abstract: The ability to leverage heterogeneous robotic experience from different robots and tasks to quickly master novel skills and embodiments has the potential to transform robot learning. Inspired by recent advances in foundation models for vision and language, we propose a multi-embodiment, multi-task g...
-
Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Dataset
- Authors: Guangqi Jiang, Yifei Sun, Tao Huang, Huanyu Li, Yongyuan Liang, Huazhe Xu
- Abstract: The pre-training of visual representations has enhanced the efficiency of robot learning. Due to the lack of large-scale in-domain robotic datasets, prior works utilize in-the-wild human videos to pre-train robotic visual representation. Despite their promising results, representations from human vi...
-
Robust Gymnasium: A Unified Modular Benchmark for Robust Reinforcement Learning
- Authors: Shangding Gu, Laixi Shi, Muning Wen, Ming Jin, Eric Mazumdar, Yuejie Chi, Adam Wierman, Costas Spanos
- Abstract: Driven by inherent uncertainty and the sim-to-real gap, robust reinforcement learning (RL) seeks to improve resilience against the complexity and variability in agent-environment sequential interactions. Despite the existence of a large number of RL benchmarks, there is a lack of standardized benchm...
-
Scalable Decision-Making in Stochastic Environments through Learned Temporal Abstraction
- Authors: Baiting Luo, Ava Pettet, Aron Laszka, Abhishek Dubey, Ayan Mukhopadhyay
- Abstract: Sequential decision-making in high-dimensional continuous action spaces, particularly in stochastic environments, faces significant computational challenges. We explore this challenge in the traditional offline RL setting, where an agent must learn how to make decisions based on data collected throu...
-
Select before Act: Spatially Decoupled Action Repetition for Continuous Control
- Authors: Buqing Nie, Yangqing Fu, Yue Gao
- Abstract: Reinforcement Learning (RL) has achieved remarkable success in various continuous control tasks, such as robot manipulation and locomotion.Different to mainstream RL which makes decisions at individual steps, recent studies have incorporated action repetition into RL, achieving enhanced action persi...
-
SmartPretrain: Model-Agnostic and Dataset-Agnostic Representation Learning for Motion Prediction
- Authors: Yang Zhou, Hao Shao, Letian Wang, Steven Waslander, Hongsheng Li, Yu Liu
- Abstract: Predicting the future motion of surrounding agents is essential for autonomous vehicles (AVs) to operate safely in dynamic, human-robot-mixed environments. However, the scarcity of large-scale driving datasets has hindered the development of robust and generalizable motion prediction models, limitin...
-
Solving New Tasks by Adapting Internet Video Knowledge
- Authors: Calvin Luo, Zilai Zeng, Yilun Du, Chen Sun
- Abstract: Video generative models, beyond enabling the production of astounding visual creations, offer a promising pathway for unlocking novel, text-conditioned robotic behaviors, whether utilized as a video planner or as a policy supervisor. When pretrained on internet-scale datasets, such video models int...
-
SPA: 3D Spatial-Awareness Enables Effective Embodied Representation
- Authors: Haoyi Zhu, Honghui Yang, Yating Wang, Jiange Yang, Limin Wang, Tong He
- Abstract: In this paper, we introduce SPA, a novel representation learning framework that emphasizes the importance of 3D spatial awareness in embodied AI. Our approach leverages differentiable neural rendering on multi-view images to endow a vanilla Vision Transformer (ViT) with intrinsic spatial understandi...
-
SRSA: Skill Retrieval and Adaptation for Robotic Assembly Tasks
- Authors: Yijie Guo, Bingjie Tang, Iretiayo Akinola, Dieter Fox, Abhishek Gupta, Yashraj Narang
- Abstract: Enabling robots to learn novel tasks in a data-efficient manner is a long-standing challenge. Common strategies involve carefully leveraging prior experiences, especially transition data collected on related tasks. Although much progress has been made in developing such strategies for general pick-a...
-
Stabilizing Reinforcement Learning in Differentiable Multiphysics Simulation
- Authors: Eliot Xing, Vernon Luk, Jean Oh
- Abstract: Recent advances in GPU-based parallel simulation have enabled practitioners to collect large amounts of data and train complex control policies using deep reinforcement learning (RL), on commodity GPUs. However, such successes for RL in robotics have been limited to tasks sufficiently simulated by f...
-
- Authors: Kaizhe Hu, Zihang Rui, Yao He, Yuyao Liu, Pu Hua, Huazhe Xu
- Abstract: Visual imitation learning methods demonstrate strong performance, yet they lack generalization when faced with visual input perturbations like variations in lighting and textures. This limitation hampers their practical application in real-world settings. To address this, we propose Stem-OB th...
-
STRAP: Robot Sub-Trajectory Retrieval for Augmented Policy Learning
- Authors: Marius Memmel, Jacob Berg, Bingqing Chen, Abhishek Gupta, Jonathan Francis
- Abstract: Robot learning is witnessing a significant increase in the size, diversity, and complexity of pre-collected datasets, mirroring trends in domains such as natural language processing and computer vision. Many robot learning methods treat such datasets as multi-task expert data and learn a multi-task,...
-
Student-Informed Teacher Training
- Authors: Nico Messikommer, Jiaxu Xing, Elie Aljalbout, Davide Scaramuzza
- Abstract: Imitation learning with a privileged teacher has proven effective for learning complex control behaviors from high-dimensional inputs, such as images. In this framework, a teacher is trained with privileged task information, while a student tries to predict the actions of the teacher with more limit...
-
Subtask-Aware Visual Reward Learning from Segmented Demonstrations
- Authors: Changyeon Kim, Minho Heo, Doohyun Lee, Honglak Lee, Jinwoo Shin, Joseph Lim, Kimin Lee
- Abstract: Reinforcement Learning (RL) agents have demonstrated their potential across various robotic tasks. However, they still heavily rely on human-engineered reward functions, requiring extensive trial-and-error and access to target behavior information, often unavailable in real-world settings. This pape...
-
TOP-ERL: Transformer-based Off-Policy Episodic Reinforcement Learning
- Authors: Ge Li, Dong Tian, Hongyi Zhou, Xinkai Jiang, Rudolf Lioutikov, Gerhard Neumann
- Abstract: This work introduces Transformer-based Off-Policy Episodic Reinforcement Learning (TOP-ERL), a novel algorithm that enables off-policy updates in the ERL framework. In ERL, policies predict entire action trajectories over multiple time steps instead of single actions at every time step. These trajec...
-
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
- Authors: Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, Jianwei Yang
- Abstract: Although large vision-language-action (VLA) models pretrained on extensive robot datasets offer promising generalist policies for robotic learning, they still struggle with spatial-temporal dynamics in interactive robotics, making them less effective in handling complex tasks, such as manipulation. ...
-
Vision Language Models are In-Context Value Learners
- Authors: Yecheng Jason Ma, Joey Hejna, Chuyuan Fu, Dhruv Shah, Jacky Liang, Zhuo Xu, Sean Kirmani, Peng Xu, Danny Driess, Ted Xiao, Osbert Bastani, Dinesh Jayaraman, Wenhao Yu, Tingnan Zhang, Dorsa Sadigh, Fei Xia
- Abstract: Predicting temporal progress from visual trajectories is important for intelligent robots that can learn, adapt, and improve. However, learning such progress estimator, or temporal value function, across different tasks and domains requires both a large amount of diverse data and methods which can s...
-
- Authors: Qingtao Liu, Yu Cui, Zhengnan Sun, Gaofeng Li, Jiming Chen, Qi Ye
- Abstract: Vision and touch are the most commonly used senses in human manipulation. While leveraging human manipulation videos for robotic task pretraining has shown promise in prior works, it is limited to image and language modalities and deployment to simple parallel grippers. In this paper, aiming to addr...
-
What Matters in Learning from Large-Scale Datasets for Robot Manipulation
- Authors: Vaibhav Saxena, Matthew Bronars, Nadun Ranawaka Arachchige, Kuancheng Wang, Woo Shin, Soroush Nasiriany, Ajay Mandlekar, Danfei Xu
- Abstract: Imitation learning from large multi-task demonstration datasets has emerged as a promising path for building generally-capable robots. As a result, 1000s of hours have been spent on building such large-scale datasets around the globe. Despite the continuous growth of such efforts, we still lack a sy...
-
What's the Move? Hybrid Imitation Learning via Salient Points
- Authors: Priya Sundaresan, Hengyuan Hu, Quan Vuong, Jeannette Bohg, Dorsa Sadigh
- Abstract: While imitation learning (IL) offers a promising framework for teaching robots various behaviors, learning complex tasks remains challenging. Existing IL policies struggle to generalize effectively across visual and spatial variations even for simple tasks. In this work, we introduce SPHINX: **S...
-
X-Fi: A Modality-Invariant Foundation Model for Multimodal Human Sensing
- Authors: Xinyan Chen, Jianfei Yang
- Abstract: Human sensing, which employs various sensors and advanced deep learning technologies to accurately capture and interpret human body information, has significantly impacted fields like public security and robotics. However, current human sensing primarily depends on modalities such as cameras and LiD...
-
6D Object Pose Tracking in Internet Videos for Robotic Manipulation
- Authors: Georgy Ponimatkin, Martin Cífka, Tomas Soucek, Médéric Fourmy, Yann Labbé, Vladimir Petrik, Josef Sivic
- Abstract: We seek to extract a temporally consistent 6D pose trajectory of a manipulated object from an Internet instructional video. This is a challenging set-up for current 6D pose estimation methods due to uncontrolled capturing conditions, fine-grained dynamic object motions, and the fact that the exact ...
-
AdaManip: Adaptive Articulated Object Manipulation Environments and Policy Learning
- Authors: Yuanfei Wang, Xiaojie Zhang, Ruihai Wu, Yu Li, Yan Shen, Mingdong Wu, Zhaofeng He, Yizhou Wang, Hao Dong
- Abstract: Articulated object manipulation is a critical capability for robots to perform various tasks in real-world scenarios. Composed of multiple parts connected by joints, articulated objects are endowed with diverse functional mechanisms through complex relative motions. For example, a safe consists of a...
-
AHA: A Vision-Language-Model for Detecting and Reasoning Over Failures in Robotic Manipulation
- Authors: Jiafei Duan, Wilbert Pumacay, Nishanth Kumar, Yi Ru Wang, Shulin Tian, Wentao Yuan, Ranjay Krishna, Dieter Fox, Ajay Mandlekar, Yijie Guo
- Abstract: Robotic manipulation in open-world settings requires not only task execution but also the ability to detect and learn from failures. While recent advances in vision-language models (VLMs) and large language models (LLMs) have improved robots' spatial reasoning and problem-solving abilities, they sti...
-
- Authors: Long Le, Jason Xie, William Liang, Hung-Ju Wang, Yue Yang, Yecheng Jason Ma, Kyle Vedder, Arjun Krishna, Dinesh Jayaraman, ERIC EATON
- Abstract: Interactive 3D simulated objects are crucial in AR/VR, animations, and robotics, driving immersive experiences and advanced automation.However, creating these articulated objects requires extensive human effort and expertise, limiting their broader applications. To overcome this challenge, we presen...
-
BaB-ND: Long-Horizon Motion Planning with Branch-and-Bound and Neural Dynamics
- Authors: Keyi Shen, Jiangwei Yu, Jose Barreiros, Huan Zhang, Yunzhu Li
- Abstract: Neural-network-based dynamics models learned from observational data have shown strong predictive capabilities for scene dynamics in robotic manipulation tasks. However, their inherent non-linearity presents significant challenges for effective planning. Current planning methods, often dependent on ...
-
- Authors: Ruochen Jiao, Shaoyuan Xie, Justin Yue, TAKAMI SATO, Lixu Wang, Yixuan Wang, Qi Alfred Chen, Qi Zhu
- Abstract: Large Language Models (LLMs) have shown significant promise in real-world decision-making tasks for embodied artificial intelligence, especially when fine-tuned to leverage their inherent common sense and reasoning abilities while being tailored to specific applications. However, this fine-tuning pr...
-
Conformalized Interactive Imitation Learning: Handling Expert Shift and Intermittent Feedback
- Authors: Michelle Zhao, Henny Admoni, Reid Simmons, Aaditya Ramdas, Andrea Bajcsy
- Abstract: In interactive imitation learning (IL), uncertainty quantification offers a way for the learner (i.e. robot) to contend with distribution shifts encountered during deployment by actively seeking additional feedback from an expert (i.e. human) online. Prior works use mechanisms like ensemble disagree...
-
Contractive Dynamical Imitation Policies for Efficient Out-of-Sample Recovery
- Authors: Amin Soleimani Abyaneh, Mahrokh Boroujeni, Hsiu-Chin Lin, Giancarlo Ferrari-Trecate
- Abstract: Imitation learning is a data-driven approach to learning policies from expert behavior, but it is prone to unreliable outcomes in out-of-sample (OOS) regions. While previous research on stable dynamical system policies guarantees convergence to a desired state, it often overlooks transient behavior....
-
Cross-Embodiment Dexterous Grasping with Reinforcement Learning
- Authors: Haoqi Yuan, Bohan Zhou, Yuhui Fu, Zongqing Lu
- Abstract: Dexterous hands exhibit significant potential for complex real-world grasping tasks. While recent studies have primarily focused on learning policies for specific robotic hands, the development of a universal policy that controls diverse dexterous hands remains largely unexplored.In this work, we st...
-
Data Scaling Laws in Imitation Learning for Robotic Manipulation
- Authors: Fanqi Lin, Yingdong Hu, Pingyue Sheng, Chuan Wen, Jiacheng You, Yang Gao
- Abstract: Data scaling has revolutionized fields like natural language processing and computer vision, providing models with remarkable generalization capabilities. In this paper, we investigate whether similar data scaling laws exist in robotics, particularly in robotic manipulation, and whether appropriate ...
-
DenseMatcher: Learning 3D Semantic Correspondence for Category-Level Manipulation from One Demo
- Authors: Junzhe Zhu, Yuanchen Ju, Junyi Zhang, Muhan Wang, Zhecheng Yuan, Kaizhe Hu, Huazhe Xu
- Abstract: Dense 3D correspondence can enhance robotic manipulation by enabling the generalization of spatial, functional, and dynamic information from one object to an unseen counterpart. Compared to shape correspondence, semantic correspondence is more effective in generalizing across different object catego...
-
- Authors: Xueyi Liu, Jianibieke Adalibieke, Qianwei Han, Yuzhe Qin, Li Yi
- Abstract: We address the challenge of developing a generalizable neural tracking controller for dexterous manipulation from human references. This controller aims to manage a dexterous robot hand to manipulate diverse objects for various purposes defined by kinematic human-object interactions. Developing such...
-
Diffusion Policy Policy Optimization
- Authors: Allen Z. Ren, Justin Lidard, Lars Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, Max Simchowitz
- Abstract: We introduce Diffusion Policy Policy Optimization, DPPO, an algorithmic framework including best practices for fine-tuning diffusion-based policies (e.g. Diffusion Policy) in continuous control and robot learning tasks using the policy gradient (PG) method from reinforcement learning (RL). PG method...
-
Direct Multi-agent Motion Generation Preference Alignment with Implicit Feedback from Demonstrations
- Authors: Thomas Tian, Kratarth Goel
- Abstract: Recent advancements in Large Language Models (LLMs) have transformed motion generation models in embodied applications such as autonomous driving and robotic manipulation. While LLM-type motion models benefit from scalability and efficient formulation, there remains a discrepancy between their token...
-
Dream to Manipulate: Compositional World Models Empowering Robot Imitation Learning with Imagination
- Authors: Leonardo Barcellona, Andrii Zadaianchuk, Davide Allegro, Samuele Papa, Stefano Ghidoni, Efstratios Gavves
- Abstract: A world model provides an agent with a representation of its environment, enabling it to predict the causal consequences of its actions. Current world models typically cannot directly and explicitly imitate the actual environment in front of a robot, often resulting in unrealistic behaviors and hall...
-
Efficient Residual Learning with Mixture-of-Experts for Universal Dexterous Grasping
- Authors: Ziye Huang, Haoqi Yuan, Yuhui Fu, Zongqing Lu
- Abstract: Universal dexterous grasping across diverse objects presents a fundamental yet formidable challenge in robot learning. Existing approaches using reinforcement learning (RL) to develop policies on extensive object datasets face critical limitations, including complex curriculum design for multi-task ...
-
EMOS: Embodiment-aware Heterogeneous Multi-robot Operating System with LLM Agents
- Authors: Junting Chen, Checheng Yu, Xunzhe Zhou, Tianqi Xu, Yao Mu, Mengkang Hu, Wenqi Shao, Yikai Wang, Guohao Li, Lin Shao
- Abstract: Heterogeneous multi-robot systems (HMRS) have emerged as a powerful ap-proach for tackling complex tasks that single robots cannot manage alone. Currentlarge-language-model-based multi-agent systems (LLM-based MAS) have shownsuccess in areas like software development and operating systems, but apply...
-
ET-SEED: EFFICIENT TRAJECTORY-LEVEL SE(3) EQUIVARIANT DIFFUSION POLICY
- Authors: Chenrui Tie, Yue Chen, Ruihai Wu, Boxuan Dong, Zeyi Li, Chongkai Gao, Hao Dong
- Abstract: Imitation learning, e.g., diffusion policy, has been proven effective in various robotic manipulation tasks.However, extensive demonstrations are required for policy robustness and generalization.To reduce the demonstration reliance, we leverage spatial symmetry and propose ET-SEED, an efficient tra...
-
FLIP: Flow-Centric Generative Planning for General-Purpose Manipulation Tasks
- Authors: Chongkai Gao, Haozhuo Zhang, Zhixuan Xu, Cai Zhehao, Lin Shao
- Abstract: We aim to develop a model-based planning framework for world models that can be scaled with increasing model and data budgets for general-purpose manipulation tasks with only language and vision inputs. To this end, we present FLow-CentrIc generative Planning (FLIP), a model-based planning algorithm...
-
Geometry-aware RL for Manipulation of Varying Shapes and Deformable Objects
- Authors: Tai Hoang, Huy Le, Philipp Becker, Vien A Ngo, Gerhard Neumann
- Abstract: Manipulating objects with varying geometries and deformable objects is a major challenge in robotics. Tasks such as insertion with different objects or cloth hanging require precise control and effective modelling of complex dynamics. In this work, we frame this problem through the lens of a heterog...
-
GEVRM: Goal-Expressive Video Generation Model For Robust Visual Manipulation
- Authors: Hongyin Zhang, Pengxiang Ding, Shangke Lyu, Ying Peng, Donglin Wang
- Abstract: With the rapid development of embodied artificial intelligence, significant progress has been made in vision-language-action (VLA) models for general robot decision-making. However, the majority of existing VLAs fail to account for the inevitable external perturbations encountered during deployment....
-
GravMAD: Grounded Spatial Value Maps Guided Action Diffusion for Generalized 3D Manipulation
- Authors: Yangtao Chen, Chen, Junhui Yin, Jing Huo, Pinzhuo Tian, Jieqi Shi, Yang Gao
- Abstract: Robots' ability to follow language instructions and execute diverse 3D tasks is vital in robot learning. Traditional imitation learning-based methods perform well on seen tasks but struggle with novel, unseen ones due to variability. Recent approaches leverage large foundation models to assist in un...
-
GROOT-2: Weakly Supervised Multimodal Instruction Following Agents
- Authors: Shaofei Cai, Bowei Zhang, Zihao Wang, Haowei Lin, Xiaojian Ma, Anji Liu, Yitao Liang
- Abstract: Developing agents that can follow multimodal instructions remains a fundamental challenge in robotics and AI. Although large-scale pre-training on unlabeled datasets has enabled agents to learn diverse behaviors, these agents often struggle with following instructions. While augmenting the dataset w...
-
HAMSTER: Hierarchical Action Models for Open-World Robot Manipulation
- Authors: Yi Li, Yuquan Deng, Jesse Zhang, Joel Jang, Marius Memmel, Caelan Garrett, Fabio Ramos, Dieter Fox, Anqi Li, Abhishek Gupta, Ankit Goyal
- Abstract: Large models have shown strong open-world generalization to complex problems in vision and language, but they have been relatively more difficult to deploy in robotics. This challenge stems from several factors, the foremost of which is the lack of scalable robotic training data since this requires ...
-
Kinetix: Investigating the Training of General Agents through Open-Ended Physics-Based Control Tasks
- Authors: Michael Matthews, Michael Beukman, Chris Lu, Jakob Foerster
- Abstract: While large models trained with self-supervised learning on offline datasets have shown remarkable capabilities in text and image domains, achieving the same generalisation for agents that act in sequential decision problems remains an open challenge.In this work, we take a step towards this goal by...
-
Language Guided Skill Discovery
- Authors: Seungeun Rho, Laura Smith, Tianyu Li, Sergey Levine, Xue Bin Peng, Sehoon Ha
- Abstract: Skill discovery methods enable agents to learn diverse emergent behaviors without explicit rewards. To make learned skills useful for downstream tasks, obtaining a semantically diverse repertoire of skills is crucial. While some approaches use discriminators to acquire distinguishable skills and oth...
-
Latent Action Pretraining from Videos
- Authors: Seonghyeon Ye, Joel Jang, Byeongguk Jeon, Se June Joo, Jianwei Yang, Baolin Peng, Ajay Mandlekar, Reuben Tan, Yu-Wei Chao, Bill Yuchen Lin, Lars Liden, Kimin Lee, Jianfeng Gao, Luke Zettlemoyer, Dieter Fox, Minjoon Seo
- Abstract: We introduce Latent Action Pretraining for general Action models (LAPA), the first unsupervised method for pretraining Vision-Language-Action (VLA) models without ground-truth robot action labels. Existing Vision-Language-Action models require action labels typically collected by human teleoperators...
-
Learning Closed-Loop Concept-Guided Policies from Unlabeled Demonstrations
- Authors: Pei Zhou, Ruizhe Liu, Qian Luo, Yibing Song, Fan Wang, Yanchao Yang
- Abstract: Training embodied agents to perform complex robotic tasks presents significant challenges due to the entangled factors of task compositionality, environmental diversity, and dynamic changes. In this work, we introduce a novel imitation learning framework to train closed-loop concept-guided policies ...
-
Learning Geometric Reasoning Networks For Robot Task And Motion Planning
- Authors: Smail Ait Bouhsain, Rachid Alami, Thierry Simeon
- Abstract: Task and Motion Planning (TAMP) is a computationally challenging robotics problem due to the tight coupling of discrete symbolic planning and continuous geometric planning of robot motions. In particular, planning manipulation tasks in complex 3D environments leads to a large number of costly geomet...
-
Learning View-invariant World Models for Visual Robotic Manipulation
- Authors: Jing-Cheng Pang, Nan Tang, Kaiyuan Li, Yuting Tang, Xin-Qiang Cai, Zhen-Yu Zhang, Gang Niu, Masashi Sugiyama, Yang Yu
- Abstract: Robotic manipulation tasks often rely on visual inputs from cameras to perceive the environment. However, previous approaches still suffer from performance degradation when the camera’s viewpoint changes during manipulation. In this paper, we propose ReViWo (Representation learning for View-invarian...
-
ManiSkill-HAB: A Benchmark for Low-Level Manipulation in Home Rearrangement Tasks
- Authors: Arth Shukla, Stone Tao, Hao Su
- Abstract: High-quality benchmarks are the foundation for embodied AI research, enabling significant advancements in long-horizon navigation, manipulation and rearrangement tasks. However, as frontier tasks in robotics get more advanced, they require faster simulation speed, more intricate test environments, a...
-
Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning
- Authors: Baoqi Pei, Yifei Huang, Jilan Xu, Guo Chen, Yuping He, Lijin Yang, Yali Wang, Weidi Xie, Yu Qiao, Fei Wu, Limin Wang
- Abstract: In egocentric video understanding, the motion of hands and objects as well as their interactions play a significant role by nature.However, existing egocentric video representation learning methods mainly focus on aligning video representation with high-level narrations, overlooking the intricate dy...
-
- Authors: Xinyue Wang, Biwei Huang
- Abstract: Generalization in reinforcement learning (RL) remains a significant challenge, especially when agents encounter novel environments with unseen dynamics. Drawing inspiration from human compositional reasoning—where known components are reconfigured to handle new situations—we introduce World Modeling...
-
Occlusion-aware Non-Rigid Point Cloud Registration via Unsupervised Neural Deformation Correntropy
- Authors: Mingyang Zhao, Gaofeng Meng, Dong-ming Yan
- Abstract: Non-rigid alignment of point clouds is crucial for scene understanding, reconstruction, and various computer vision and robotics tasks. Recent advancements in implicit deformation networks for non-rigid registration have significantly reduced the reliance on large amounts of annotated training data....
-
Physics-informed Temporal Difference Metric Learning for Robot Motion Planning
- Authors: Ruiqi Ni, zherong pan, Ahmed Qureshi
- Abstract: The motion planning problem involves finding a collision-free path from a robot's starting to its target configuration. Recently, self-supervised learning methods have emerged to tackle motion planning problems without requiring expensive expert demonstrations. They solve the Eikonal equation for tr...
-
Predicate Hierarchies Improve Few-Shot State Classification
- Authors: Emily Jin, Joy Hsu, Jiajun Wu
- Abstract: State classification of objects and their relations is core to many long-horizon tasks, particularly in robot planning and manipulation. However, the combinatorial explosion of possible object-predicate combinations, coupled with the need to adapt to novel real-world environments, makes it a desider...
-
Predictive Inverse Dynamics Models are Scalable Learners for Robotic Manipulation
- Authors: Yang Tian, Sizhe Yang, Jia Zeng, Ping Wang, Dahua Lin, Hao Dong, Jiangmiao Pang
- Abstract: Current efforts to learn scalable policies in robotic manipulation primarily fall into two categories: one focuses on "action," which involves behavior cloning from extensive collections of robotic data, while the other emphasizes "vision," enhancing model generalization by pre-training representati...
-
Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning
- Authors: Joey Hong, Anca Dragan, Sergey Levine
- Abstract: Value-based reinforcement learning (RL) can in principle learn effective policies for a wide range of multi-turn problems, from games to dialogue to robotic control, including via offline RL from static previously collected datasets. However, despite the widespread use of policy gradient methods to ...
-
Rapidly Adapting Policies to the Real-World via Simulation-Guided Fine-Tuning
- Authors: Patrick Yin, Tyler Westenbroek, Ching-An Cheng, Andrey Kolobov, Abhishek Gupta
- Abstract: Robot learning requires a considerable amount of data to realize the promise of generalization. However, it can be challenging to actually collect the magnitude of high-quality data necessary for generalization entirely in the real world. Simulation can serve as a source of plentiful data, wherein t...
-
RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation
- Authors: Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, Jun Zhu
- Abstract: Bimanual manipulation is essential in robotics, yet developing foundation models is extremely challenging due to the inherent complexity of coordinating two robot arms (leading to multi-modal action distributions) and the scarcity of training data. In this paper, we present the Robotics Diffusion Tr...
-
ReGen: Generative Robot Simulation via Inverse Design
- Authors: Peter (Phat) Nguyen, Johnson (Tsun-Hsuan) Wang, Zhang-Wei Hong, Erfan Aasi, Andrew Silva, Guy Rosman, Sertac Karaman, Daniela Rus
- Abstract: Simulation plays a key role in scaling robot learning and validating policies, but constructing simulations remains labor-intensive. In this paper, we introduce ReGen, a generative simulation framework that automates this process using inverse design. Given an agent's behavior (such as a motion traj...
-
RoboCat: A Self-Improving Generalist Agent for Robotic Manipulation
- Authors: Sergio Gómez Colmenarejo, Jost Springenberg, Jose Enrique Chen, Jonathan Scholz, Raia Hadsell, Claudio Fantacci, Alex Lee, Maria Bauza Villalonga, Yuxiang Zhou, Dushyant Rao, Akhil Raju, Antoine Laurens, Murilo Fernandes Martins, Rugile Pevceviciute, Michiel Blokzijl, Nathan Batchelor, Konrad Zolna, Thomas Lampe, Agrim Gupta, Scott Reed, Abbas Abdolmaleki, David Barker, Joy Ortiz, Martin Riedmiller, Jean-Baptiste Regli, Nicolas Heess, Francesco Nori, Todor Davchev, Oleg O Sushkov, Thomas Rothörl, Misha Denil, Emilio Parisotto, Valentin Dalibard, Martina Zambelli, Yusuf Aytar, Giulia Vezzani, Coline Devin, Oliver Groth, Konstantinos Bousmalis
- Abstract: The ability to leverage heterogeneous robotic experience from different robots and tasks to quickly master novel skills and embodiments has the potential to transform robot learning. Inspired by recent advances in foundation models for vision and language, we propose a multi-embodiment, multi-task g...
-
Robots Pre-train Robots: Manipulation-Centric Robotic Representation from Large-Scale Robot Dataset
- Authors: Guangqi Jiang, Yifei Sun, Tao Huang, Huanyu Li, Yongyuan Liang, Huazhe Xu
- Abstract: The pre-training of visual representations has enhanced the efficiency of robot learning. Due to the lack of large-scale in-domain robotic datasets, prior works utilize in-the-wild human videos to pre-train robotic visual representation. Despite their promising results, representations from human vi...
-
SafeDiffuser: Safe Planning with Diffusion Probabilistic Models
- Authors: Wei Xiao, Johnson (Tsun-Hsuan) Wang, Chuang Gan, Ramin Hasani, Mathias Lechner, Daniela Rus
- Abstract: Diffusion models have shown promise in data-driven planning. While these planners are commonly employed in applications where decisions are critical, they still lack established safety guarantees. In this paper, we address this limitation by introducing SafeDiffuser, a method to equip diffusion mode...
-
Scalable Decision-Making in Stochastic Environments through Learned Temporal Abstraction
- Authors: Baiting Luo, Ava Pettet, Aron Laszka, Abhishek Dubey, Ayan Mukhopadhyay
- Abstract: Sequential decision-making in high-dimensional continuous action spaces, particularly in stochastic environments, faces significant computational challenges. We explore this challenge in the traditional offline RL setting, where an agent must learn how to make decisions based on data collected throu...
-
Select before Act: Spatially Decoupled Action Repetition for Continuous Control
- Authors: Buqing Nie, Yangqing Fu, Yue Gao
- Abstract: Reinforcement Learning (RL) has achieved remarkable success in various continuous control tasks, such as robot manipulation and locomotion.Different to mainstream RL which makes decisions at individual steps, recent studies have incorporated action repetition into RL, achieving enhanced action persi...
-
Sensor-Invariant Tactile Representation
- Authors: Harsh Gupta, Yuchen Mo, Shengmiao Jin, Wenzhen Yuan
- Abstract: High-resolution tactile sensors have become critical for embodied perception and robotic manipulation. However, a key challenge in the field is the lack of transferability between sensors due to design and manufacturing variations, which result in significant differences in tactile signals. This lim...
-
SRSA: Skill Retrieval and Adaptation for Robotic Assembly Tasks
- Authors: Yijie Guo, Bingjie Tang, Iretiayo Akinola, Dieter Fox, Abhishek Gupta, Yashraj Narang
- Abstract: Enabling robots to learn novel tasks in a data-efficient manner is a long-standing challenge. Common strategies involve carefully leveraging prior experiences, especially transition data collected on related tasks. Although much progress has been made in developing such strategies for general pick-a...
-
Stabilizing Reinforcement Learning in Differentiable Multiphysics Simulation
- Authors: Eliot Xing, Vernon Luk, Jean Oh
- Abstract: Recent advances in GPU-based parallel simulation have enabled practitioners to collect large amounts of data and train complex control policies using deep reinforcement learning (RL), on commodity GPUs. However, such successes for RL in robotics have been limited to tasks sufficiently simulated by f...
-
Student-Informed Teacher Training
- Authors: Nico Messikommer, Jiaxu Xing, Elie Aljalbout, Davide Scaramuzza
- Abstract: Imitation learning with a privileged teacher has proven effective for learning complex control behaviors from high-dimensional inputs, such as images. In this framework, a teacher is trained with privileged task information, while a student tries to predict the actions of the teacher with more limit...
-
Subtask-Aware Visual Reward Learning from Segmented Demonstrations
- Authors: Changyeon Kim, Minho Heo, Doohyun Lee, Honglak Lee, Jinwoo Shin, Joseph Lim, Kimin Lee
- Abstract: Reinforcement Learning (RL) agents have demonstrated their potential across various robotic tasks. However, they still heavily rely on human-engineered reward functions, requiring extensive trial-and-error and access to target behavior information, often unavailable in real-world settings. This pape...
-
TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
- Authors: Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, Jianwei Yang
- Abstract: Although large vision-language-action (VLA) models pretrained on extensive robot datasets offer promising generalist policies for robotic learning, they still struggle with spatial-temporal dynamics in interactive robotics, making them less effective in handling complex tasks, such as manipulation. ...
-
Vision Language Models are In-Context Value Learners
- Authors: Yecheng Jason Ma, Joey Hejna, Chuyuan Fu, Dhruv Shah, Jacky Liang, Zhuo Xu, Sean Kirmani, Peng Xu, Danny Driess, Ted Xiao, Osbert Bastani, Dinesh Jayaraman, Wenhao Yu, Tingnan Zhang, Dorsa Sadigh, Fei Xia
- Abstract: Predicting temporal progress from visual trajectories is important for intelligent robots that can learn, adapt, and improve. However, learning such progress estimator, or temporal value function, across different tasks and domains requires both a large amount of diverse data and methods which can s...
-
VLAS: Vision-Language-Action Model with Speech Instructions for Customized Robot Manipulation
- Authors: Wei Zhao, Pengxiang Ding, Zhang Min, Zhefei Gong, Shuanghao Bai, Han Zhao, Donglin Wang
- Abstract: Vision-language-action models (VLAs) have recently become highly prevalent in robot manipulation due to its end-to-end architecture and impressive performance. However, current VLAs are limited to processing human instructions in textual form, neglecting the more natural speech modality for human in...
-
- Authors: Qingtao Liu, Yu Cui, Zhengnan Sun, Gaofeng Li, Jiming Chen, Qi Ye
- Abstract: Vision and touch are the most commonly used senses in human manipulation. While leveraging human manipulation videos for robotic task pretraining has shown promise in prior works, it is limited to image and language modalities and deployment to simple parallel grippers. In this paper, aiming to addr...
-
What Matters in Learning from Large-Scale Datasets for Robot Manipulation
- Authors: Vaibhav Saxena, Matthew Bronars, Nadun Ranawaka Arachchige, Kuancheng Wang, Woo Shin, Soroush Nasiriany, Ajay Mandlekar, Danfei Xu
- Abstract: Imitation learning from large multi-task demonstration datasets has emerged as a promising path for building generally-capable robots. As a result, 1000s of hours have been spent on building such large-scale datasets around the globe. Despite the continuous growth of such efforts, we still lack a sy...
-
What's the Move? Hybrid Imitation Learning via Salient Points
- Authors: Priya Sundaresan, Hengyuan Hu, Quan Vuong, Jeannette Bohg, Dorsa Sadigh
- Abstract: While imitation learning (IL) offers a promising framework for teaching robots various behaviors, learning complex tasks remains challenging. Existing IL policies struggle to generalize effectively across visual and spatial variations even for simple tasks. In this work, we introduce SPHINX: **S...
-
- Authors: Hengshuo Chu, Xiang Deng, Qi Lv, Xiaoyang Chen, Yinchuan Li, Jianye HAO, Liqiang Nie
- Abstract: 3D Affordance detection is a challenging problem with broad applications on various robotic tasks. Existing methods typically formulate the detection paradigm as a label-based semantic segmentation task.This paradigm relies on predefined labels and lacks the ability to comprehend complex natural lan...
-
ADAM: An Embodied Causal Agent in Open-World Environments
- Authors: Shu Yu, Chaochao Lu
- Abstract: In open-world environments like Minecraft, existing agents face challenges in continuously learning structured knowledge, particularly causality. These challenges stem from the opacity inherent in black-box models and an excessive reliance on prior knowledge during training, which impair their inter...
-
AgentOccam: A Simple Yet Strong Baseline for LLM-Based Web Agents
- Authors: Ke Yang, Yao Liu, Sapana Chaudhary, Rasool Fakoor, Pratik A Chaudhari, George Karypis, Huzefa Rangwala
- Abstract: Autonomy via agents based on large language models (LLMs) that can carry out personalized yet standardized tasks presents a significant opportunity to drive human efficiency. There is an emerging need and interest in automating web tasks (e.g., booking a hotel for a given date within a budget). Bei...
-
Bootstrapping Language-Guided Navigation Learning with Self-Refining Data Flywheel
- Authors: Zun Wang, Jialu Li, Yicong Hong, Songze Li, Kunchang Li, Shoubin Yu, Yi Wang, Yu Qiao, Yali Wang, Mohit Bansal, Limin Wang
- Abstract: Creating high-quality data for training robust language-instructed agents is a long-lasting challenge in embodied AI. In this paper, we introduce a Self-Refining Data Flywheel (SRDF) that generates high-quality and large-scale navigational instruction-trajectory pairs by iteratively refining the dat...
-
DenseGrounding: Improving Dense Language-Vision Semantics for Ego-centric 3D Visual Grounding
- Authors: Henry Zheng, Hao Shi, Qihang Peng, Yong Xien Chng, Rui Huang, Yepeng Weng, zhongchao shi, Gao Huang
- Abstract: Enabling intelligent agents to comprehend and interact with 3D environments through natural language is crucial for advancing robotics and human-computer interaction. A fundamental task in this field is ego-centric 3D visual grounding, where agents locate target objects in real-world 3D spaces based...
-
- Authors: Xueyi Liu, Jianibieke Adalibieke, Qianwei Han, Yuzhe Qin, Li Yi
- Abstract: We address the challenge of developing a generalizable neural tracking controller for dexterous manipulation from human references. This controller aims to manage a dexterous robot hand to manipulate diverse objects for various purposes defined by kinematic human-object interactions. Developing such...
-
Efficient Model-Based Reinforcement Learning Through Optimistic Thompson Sampling
- Authors: Jasmine Bayrooti, Carl Ek, Amanda Prorok
- Abstract: Learning complex robot behavior through interactions with the environment necessitates principled exploration. Effective strategies should prioritize exploring regions of the state-action space that maximize rewards, with optimistic exploration emerging as a promising direction aligned with this ide...
-
Following the Human Thread in Social Navigation
- Authors: Luca Scofano, Alessio Sampieri, Tommaso Campari, Valentino Sacco, Indro Spinelli, Lamberto Ballan, Fabio Galasso
- Abstract: The success of collaboration between humans and robots in shared environments relies on the robot's real-time adaptation to human motion. Specifically, in Social Navigation, the agent should be close enough to assist but ready to back up to let the human move freely, avoiding collisions. Human traje...
-
Generalized Behavior Learning from Diverse Demonstrations
- Authors: Varshith Sreeramdass, Rohan Paleja, Letian Chen, Sanne van Waveren, Matthew Gombolay
- Abstract: Diverse behavior policies are valuable in domains requiring quick test-time adaptation or personalized human-robot interaction. Human demonstrations provide rich information regarding task objectives and factors that govern individual behavior variations, which can be used to characterize \it{useful...
-
Geometry-aware RL for Manipulation of Varying Shapes and Deformable Objects
- Authors: Tai Hoang, Huy Le, Philipp Becker, Vien A Ngo, Gerhard Neumann
- Abstract: Manipulating objects with varying geometries and deformable objects is a major challenge in robotics. Tasks such as insertion with different objects or cloth hanging require precise control and effective modelling of complex dynamics. In this work, we frame this problem through the lens of a heterog...
-
Internet of Agents: Weaving a Web of Heterogeneous Agents for Collaborative Intelligence
- Authors: Weize Chen, Ziming You, Ran Li, yitong guan, Chen Qian, Chenyang Zhao, Cheng Yang, Ruobing Xie, Zhiyuan Liu, Maosong Sun
- Abstract: The rapid advancement of large language models (LLMs) has paved the way for the development of highly capable autonomous agents. However, existing multi-agent frameworks often struggle with integrating diverse capable third-party agents due to reliance on agents defined within their own ecosystems. ...
-
Learning View-invariant World Models for Visual Robotic Manipulation
- Authors: Jing-Cheng Pang, Nan Tang, Kaiyuan Li, Yuting Tang, Xin-Qiang Cai, Zhen-Yu Zhang, Gang Niu, Masashi Sugiyama, Yang Yu
- Abstract: Robotic manipulation tasks often rely on visual inputs from cameras to perceive the environment. However, previous approaches still suffer from performance degradation when the camera’s viewpoint changes during manipulation. In this paper, we propose ReViWo (Representation learning for View-invarian...
-
Leveraging Sub-Optimal Data for Human-in-the-Loop Reinforcement Learning
- Authors: Calarina Muslimani, Matthew E Taylor
- Abstract: To create useful reinforcement learning (RL) agents, step zero is to design a suitable reward function that captures the nuances of the task. However, reward engineering can be a difficult and time-consuming process. Instead, human-in-the-loop (HitL) RL methods hold the promise of learning reward f...
-
LocoVR: Multiuser Indoor Locomotion Dataset in Virtual Reality
- Authors: Kojiro Takeyama, Yimeng Liu, Misha Sra
- Abstract: Understanding human locomotion is crucial for AI agents such as robots, particularly in complex indoor home environments. Modeling human trajectories in these spaces requires insight into how individuals maneuver around physical obstacles and manage social navigation dynamics. These dynamics include...
-
MetaUrban: An Embodied AI Simulation Platform for Urban Micromobility
- Authors: Wayne Wu, Honglin He, Jack He, Yiran Wang, Chenda Duan, Zhizheng Liu, Quanyi Li, Bolei Zhou
- Abstract: Public urban spaces like streetscapes and plazas serve residents and accommodate social life in all its vibrant variations. Recent advances in Robotics and Embodied AI make public urban spaces no longer exclusive to humans. Food delivery bots and electric wheelchairs have started sharing sidewalks w...
-
MLLM as Retriever: Interactively Learning Multimodal Retrieval for Embodied Agents
- Authors: Junpeng Yue, Xinrun Xu, Börje Karlsson, Zongqing Lu
- Abstract: MLLM agents demonstrate potential for complex embodied tasks by retrieving multimodal task-relevant trajectory data. However, current retrieval methods primarily focus on surface-level similarities of textual or visual cues in trajectories, neglecting their effectiveness for the specific task at han...
-
MMR: A Large-scale Benchmark Dataset for Multi-target and Multi-granularity Reasoning Segmentation
- Authors: Donggon Jang, Yucheol Cho, Suin Lee, Taehyeon Kim, DAE SHIK KIM
- Abstract: The fusion of Large Language Models (LLMs) with vision models is pioneering new possibilities in user-interactive vision-language tasks. A notable application is reasoning segmentation, where models generate pixel-level segmentation masks by comprehending implicit meanings in human instructions. How...
-
Modeling Fine-Grained Hand-Object Dynamics for Egocentric Video Representation Learning
- Authors: Baoqi Pei, Yifei Huang, Jilan Xu, Guo Chen, Yuping He, Lijin Yang, Yali Wang, Weidi Xie, Yu Qiao, Fei Wu, Limin Wang
- Abstract: In egocentric video understanding, the motion of hands and objects as well as their interactions play a significant role by nature.However, existing egocentric video representation learning methods mainly focus on aligning video representation with high-level narrations, overlooking the intricate dy...
-
Null Counterfactual Factor Interactions for Goal-Conditioned Reinforcement Learning
- Authors: Caleb Chuck, Fan Feng, Carl Qi, Chang Shi, Siddhant Agarwal, Amy Zhang, Scott Niekum
- Abstract: Hindsight relabeling is a powerful tool for overcoming sparsity in goal-conditioned reinforcement learning (GCRL). While effective in some domains like navigation and locomotion, hindsight relabeling can struggle in object-centric domains. For example, suppose that the goal space consists of a robot...
-
PARTNR: A Benchmark for Planning and Reasoning in Embodied Multi-agent Tasks
- Authors: Matthew Chang, Gunjan Chhablani, Alexander Clegg, Mikael Dallaire Cote, Ruta Desai, Michal Hlavac, Vladimir Karashchuk, Jacob Krantz, Roozbeh Mottaghi, Priyam Parashar, Siddharth Patki, Ishita Prasad, Xavier Puig, Akshara Rai, Ram Ramrakhya, Daniel Tran, Joanne Truong, John Turner, Eric Undersander, Tsung-Yen Yang
- Abstract: We present a benchmark for Planning And Reasoning Tasks in humaN-Robot collaboration (PARTNR) designed to study human-robot coordination in household activities. PARTNR tasks exhibit characteristics of everyday tasks, such as spatial, temporal, and heterogeneous agent capability constraints. We empl...
-
Policy Decorator: Model-Agnostic Online Refinement for Large Policy Model
- Authors: Xiu Yuan, Tongzhou Mu, Stone Tao, Yunhao Fang, Zhang, Hao Su
- Abstract: Recent advancements in robot learning have used imitation learning with large models and extensive demonstrations to develop effective policies. However, these models are often limited by the quantity quality, and diversity of demonstrations. This paper explores improving offline-trained imitation l...
-
Robotouille: An Asynchronous Planning Benchmark for LLM Agents
- Authors: Gonzalo Gonzalez-Pumariega, Leong Yean, Neha Sunkara, Sanjiban Choudhury
- Abstract: Effective asynchronous planning, or the ability to efficiently reason and plan over states and actions that must happen in parallel or sequentially, is essential for agents that must account for time delays, reason over diverse long-horizon tasks, and collaborate with other agents. While large langu...
-
Robust Gymnasium: A Unified Modular Benchmark for Robust Reinforcement Learning
- Authors: Shangding Gu, Laixi Shi, Muning Wen, Ming Jin, Eric Mazumdar, Yuejie Chi, Adam Wierman, Costas Spanos
- Abstract: Driven by inherent uncertainty and the sim-to-real gap, robust reinforcement learning (RL) seeks to improve resilience against the complexity and variability in agent-environment sequential interactions. Despite the existence of a large number of RL benchmarks, there is a lack of standardized benchm...
-
SmartPretrain: Model-Agnostic and Dataset-Agnostic Representation Learning for Motion Prediction
- Authors: Yang Zhou, Hao Shao, Letian Wang, Steven Waslander, Hongsheng Li, Yu Liu
- Abstract: Predicting the future motion of surrounding agents is essential for autonomous vehicles (AVs) to operate safely in dynamic, human-robot-mixed environments. However, the scarcity of large-scale driving datasets has hindered the development of robust and generalizable motion prediction models, limitin...
-
Stabilizing Reinforcement Learning in Differentiable Multiphysics Simulation
- Authors: Eliot Xing, Vernon Luk, Jean Oh
- Abstract: Recent advances in GPU-based parallel simulation have enabled practitioners to collect large amounts of data and train complex control policies using deep reinforcement learning (RL), on commodity GPUs. However, such successes for RL in robotics have been limited to tasks sufficiently simulated by f...
-
Subtask-Aware Visual Reward Learning from Segmented Demonstrations
- Authors: Changyeon Kim, Minho Heo, Doohyun Lee, Honglak Lee, Jinwoo Shin, Joseph Lim, Kimin Lee
- Abstract: Reinforcement Learning (RL) agents have demonstrated their potential across various robotic tasks. However, they still heavily rely on human-engineered reward functions, requiring extensive trial-and-error and access to target behavior information, often unavailable in real-world settings. This pape...
-
Think Then React: Towards Unconstrained Action-to-Reaction Motion Generation
- Authors: Wenhui Tan, Boyuan Li, Chuhao Jin, Wenbing Huang, Xiting Wang, Ruihua Song
- Abstract: Modeling human-like action-to-reaction generation has significant real-world applications, like human-robot interaction and games.Despite recent advancements in single-person motion generation, it is still challenging to well handle action-to-reaction generation, due to the difficulty of directly pr...
-
VisualAgentBench: Towards Large Multimodal Models as Visual Agents
- Authors: Xiao Liu, Tianjie Zhang, Yu Gu, Iat Long Iong, Song XiXuan, Yifan Xu, Shudan Zhang, Hanyu Lai, Jiadai Sun, Xinyue Yang, Yu Yang, Zehan Qi, Shuntian Yao, Xueqiao Sun, Siyi Cheng, Qinkai Zheng, Hao Yu, Hanchen Zhang, Wenyi Hong, Ming Ding, Lihang Pan, Xiaotao Gu, Aohan Zeng, Zhengxiao Du, Chan Hee Song, Yu Su, Yuxiao Dong, Jie Tang
- Abstract: Large Multimodal Models (LMMs) have ushered in a new era in artificial intelligence, merging capabilities in both language and vision to form highly capable visual agents that are postulated to excel across a myriad of tasks. However, existing benchmarks fail to sufficiently challenge or showcase t...
-
VLAS: Vision-Language-Action Model with Speech Instructions for Customized Robot Manipulation
- Authors: Wei Zhao, Pengxiang Ding, Zhang Min, Zhefei Gong, Shuanghao Bai, Han Zhao, Donglin Wang
- Abstract: Vision-language-action models (VLAs) have recently become highly prevalent in robot manipulation due to its end-to-end architecture and impressive performance. However, current VLAs are limited to processing human instructions in textual form, neglecting the more natural speech modality for human in...
-
X-Gen: Ego-centric Video Prediction by Watching Exo-centric Videos
- Authors: Jilan Xu, Yifei Huang, Baoqi Pei, Junlin Hou, Qingqiu Li, Guo Chen, Yuejie Zhang, Rui Feng, Weidi Xie
- Abstract: Generating videos in the first-person perspective has broad application prospects in the field of augmented reality and embodied intelligence.In this work, we explore the cross-view video prediction task, where given an exo-centric video, the first frame of the corresponding ego-centric video, and t...