Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Modern machine learning models can solve specific tasks with small datasets.
- Generalization capabilities of models are important in robotics due to difficulty of collecting data.
- Robotics Transformer model class exhibits promising scalable model properties.
- Study of model classes and their ability to generalize based on data size, model size, and data diversity.
Paper Content
Introduction
- End-to-end robotic learning typically involves collecting task-specific data
- Recent years have seen a transformation in vision, NLP, and other domains, away from siloed, small-scale datasets and models and towards large, general models pre-trained on broad, large datasets
- Keys to success of such models lie with open-ended task-agnostic training, combined with high-capacity architectures
- Can we train a single, capable, large multi-task backbone model on data consisting of a wide variety of robotic tasks?
- Assembling the right dataset and designing the right model are two main challenges
- RT-1 is a 35M parameter model that takes images and natural language instructions and outputs discretized base and arm actions
- RT-1’s large-scale, real-world training and evaluation show impressive generalization, robustness, and ability to learn from diverse data
- RT-1 can perform over 700 training instructions at 97% success rate and can generalize to new tasks, distractors, and backgrounds better than the next best baseline
- RT-1 can incorporate data from simulation or other robot types, retaining performance on the original tasks and improving generalization to new scenarios
Preprint 2 related work
- Recent works have proposed Transformer-based policies for robotic control
- Transformer used to map language and vision observations to robot actions
- Transformer-based policies used to generalize across robot morphologies and other modalities
- Focus of work is on generalizable and robust real-world robotic manipulation at scale
- Existing works on real-world Transformer-based robotic manipulation focus on learning tasks from demonstrations
- Robotics has a long history of multi-task and language-conditioned learning
Preliminaries
- Aim to learn robot policies to solve language-conditioned tasks from vision
- Sequential decision-making environment
- Policy produces action distribution from language instruction and initial image observation
- Interaction ends when termination condition is achieved
- Goal is to maximize average reward
- Use Transformer to parameterize policy
- Imitation learning to train policy on dataset of successful episodes
System overview
- Goal: build and demonstrate a general robot learning system
- Robot: mobile manipulator with 7 degree-of-freedom arm, two-fingered gripper, and mobile base
- Environments: two real office kitchens and a training environment modelled off these real kitchens
- Data: human-provided demonstrations, annotated with textual description of instruction
- Instructions split into skills (verbs) and objects (nouns)
- Data set: 130k individual demonstrations, 700 distinct task instructions, variety of objects
- Network architecture: Robotics Transformer 1 (RT-1)
- RT-1 takes images and text as input, outputs action for robot at each time step
- Actions: 7 dimensions for arm movement, 3 dimensions for base movement, discrete dimension to switch between 3 modes
- RT-1 performs closed-loop control, commands actions at 3 Hz
Rt-1: robotics transformer
- Tokenize images, text, and actions
- Describe RT-1 model architecture
- Attain runtime speed for real-time control
- Describe data collection procedure and skills/instructions in dataset
Model
- Model is built on Transformer architecture
- Takes history of images and task description as input
- Outputs tokenized actions
- Data-efficient and compact tokenization of images and language instruction
- Images passed through EfficientNet-B3 model
- Language instruction embedded via Universal Sentence Encoder
- FiLM layers added to pretrained EfficientNet to condition image encoder
- TokenLearner used to map large number of tokens to smaller number
- Transformer decoder-only sequence model with 8 self-attention layers
- Actions discretized into 256 bins
- Categorical cross-entropy entropy objective used
- Inference speed limited by model size
- Two techniques used to speed up inference
Data
- Aim to build a system with high performance, generalization and robustness
- Collected ∼130k robot demonstrations over 17 months in a series of office kitchen segments
Skills and instructions.
- Definition of a task is inconsistent in literature
- Count number of language instructions system can perform
- RT-1 can perform over 700 language instructions in multiple realistic office kitchen environments
- Group instructions by verbs used in them (referred to as skills)
- Skills include picking, placing, opening/closing drawers, getting items in/out drawers, placing elongated items up-right, knocking them over, pulling napkins and opening jars
- Expanded object diversity for “pick” skill to test generalization to new instructions and ability to perform many tasks
Preprint
Experiments
- Experiments seek to answer questions about RT-1’s ability to learn and generalize
- Comparing RT-1 to two baseline architectures: Gato and BC-Z
- Gato and BC-Z are trained on data described in Sec. 5.2
- RT-1, Gato and BC-Z are compared in terms of design decisions
- Evaluating success rate in experiments to measure performance on training instructions, generalization to unseen instructions, robustness to backgrounds and distractors, and performance in long-horizon scenarios
- Over 3000 real-world trials conducted, making it one of the largest scale evaluations of a robot learning system to-date
Experimental setup
- Evaluated RT-1 with mobile manipulators from Everyday Robots in three environments
- Two real office kitchens and a training environment modelled off these real kitchens
- Evaluated performance on training tasks and generalization to new tasks, robustness to unseen environments, and performance when chained together for long-horizon tasks
- Tested over 200 tasks in evaluation
- Evaluated generalization to unseen tasks with 21 novel, unseen instructions
- Evaluated robustness with 30 real-world tasks for distractor robustness and 22 tasks for background robustness
- Evaluated generalization to long-horizon scenarios with 15 instructions in two real kitchens
Can rt-1 learn to perform a large number of instructions, and to generalize to new tasks, objects and environments?
- RT-1 outperforms prior models significantly on seen tasks, generalization to unseen tasks, and robustness to distractors and backgrounds
- RT-1 successfully performs 97% of more than 200 instructions on seen tasks
- RT-1 is able to generalize to novel instructions, performing 76% of never-before-seen instructions
- RT-1 is quite robust, successfully executing 83% of distractor robustness tasks and 59% of background robustness tasks
- RT-1 is able to generalize to realistic instructions in a real kitchen
- RT-1 is the most robust on all levels of generalization
Can we push the resulting model further by incorporating heterogeneous
- RT-1 can incorporate and learn from vastly different data sources
- RT-1 can absorb both real and simulation data
- RT-1 does not lose performance when adding simulation data
- RT-1 can absorb data from different robots
- RT-1 minimally impacts standard classroom evaluation performance
- RT-1 results in almost a 2x improvement in generalization to the Binpicking evaluation
- RT-1 can acquire new skills through observing other robots’ experiences
- RT-1 can combine many more multi-robot datasets to enhance the robot capabilities
How do various methods generalize long-horizon robotic scenarios?
- Evaluated method in two real kitchens
- SayCan combines low-level instructions to perform high-level instructions
- RT-1 performs best with 67% success rate in Kitchen1
- RT-1 able to operate unseen drawers in Kitchen2
- Data diversity more essential than data quantity
Conclusions, limitations and future work
- Developed RT-1, a robot learning method that can absorb large amounts of data and scale with data quantity and diversity
- Trained RT-1 on 130k episodes collected over 17 months with 13 robots
- RT-1 can perform over 700 instructions at 97% success rate and generalize to new tasks, objects and environments better than previously published baselines
- RT-1 can absorb heterogeneous data from simulation and other robot morphologies without sacrificing original-tasks performance
- RT-1 can execute very long-horizon tasks with as many as 50 steps
- Limitations include imitation learning, limited generalization to new instructions, and only covers a small portion of possible robotic manipulation tasks
- Open-sourced the code for RT-1
- Leveraged simulation for “real to sim” transfer to evaluate model performance
- Evaluated on real-world randomized scenes and over 3000 total rollouts
- RT-1 shows high-performance and robustness and can learn from heterogenous data
- Evaluated on 744 seen tasks and 53 unseen tasks
- Tested three tasks with incrementally more distractor objects added to the scene
- Tested six tasks with incrementally more challenging backgrounds and counter textures
- Tested in a real office kitchen with a variety of skills
- Tested RT-1’s ability to absorb both real and simulation data