ERRA: An Embodied Representation and Reasoning Architecture for Long-horizon Language-conditioned Manipulation Tasks
Abstract
This letter introduces ERRA, an embodied learning architecture that enables robots to jointly obtain three fundamental capabilities (reasoning, planning, and interaction) for solving long-horizon language-conditioned manipulation tasks. ERRA is based on tightly coupled probabilistic inferences at two granularity levels. Coarse-resolution inference is formulated as sequence generation through a large language model, which infers action language from the natural language instruction and the environment state. The robot then zooms in to the fine-resolution inference part to perform the concrete action corresponding to the action language. Fine-resolution inference is constructed as a Markov decision process, which takes the action language and environmental sensing as observations and outputs the action. The results of action execution in the environment provide feedback for subsequent coarse-resolution reasoning. Such coarse-to-fine inference allows the robot to decompose and achieve long-horizon tasks interactively. In extensive experiments, we show that ERRA can complete various long-horizon manipulation tasks specified by abstract language instructions. We also demonstrate successful generalization to novel but similar natural language instructions.
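To make the coarse-to-fine loop concrete, the following is a minimal sketch of the interaction described above; it is not the authors' implementation, and all names (infer_action_language, FinePolicy, run_episode, the stubbed state strings) are hypothetical placeholders standing in for the paper's LLM-based coarse reasoner and MDP-based fine-resolution policy.

```python
# Hypothetical sketch of ERRA-style coarse-to-fine inference (not the paper's code).

def infer_action_language(instruction: str, state_description: str) -> str:
    """Coarse-resolution inference: in the paper, a large language model maps
    the task instruction plus a textual environment state to the next sub-task
    ('action language'). Stubbed here with a trivial rule."""
    if "on the table" in state_description:
        return "pick up the cup"
    return "done"


class FinePolicy:
    """Fine-resolution inference: an MDP policy conditioned on the action
    language and environmental sensing, outputting a concrete robot action."""

    def act(self, action_language: str, observation: dict) -> str:
        # A learned policy would go here; we return a placeholder action string.
        return f"execute low-level motion for: {action_language}"


def run_episode(instruction: str, max_steps: int = 10) -> None:
    policy = FinePolicy()
    state_description = "a cup is on the table"  # textual environment state
    for _ in range(max_steps):
        # 1) Coarse step: infer the next sub-task from the instruction and state.
        action_language = infer_action_language(instruction, state_description)
        if action_language == "done":
            break
        # 2) Fine step: turn the sub-task into a concrete action and execute it.
        observation = {"rgb": None, "proprio": None}  # stand-in sensing
        print(policy.act(action_language, observation))
        # 3) Execution feedback updates the state used by the next coarse step.
        state_description = "the cup is in the gripper"


if __name__ == "__main__":
    run_episode("put the cup in the drawer")
```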