Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

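By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks zero-shot or with small task-specific datasets.
Generalization capabilities of models are especially important in robotics due to the difficulty of collecting real-world robotic data.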
Robotics Transformer model class exhibits promising scalable model properties.
Study of model classes and their ability to generalize based on data size, model size, and data diversity.

Paper Content

Introduction

End-to-end robotic learning typically involves collecting task-specific data
Recent years have seen a transformation in vision, NLP, and other domains, away from siloed, small-scale datasets and models and towards large, general models pre-trained on broad, large datasets
Keys to success of such models lie with open-ended task-agnostic training, combined with high-capacity architectures
Can we train a single, capable, large multi-task backbone model on data consisting of a wide variety of robotic tasks?
Assembling the right dataset and designing the right model are two main challenges
RT-1 is a 35M parameter model that takes images and natural language instructions and outputs discretized base and arm actions
RT-1’s large-scale, real-world training and evaluation show impressive generalization, robustness, and ability to learn from diverse data
RT-1 can perform over 700 training instructions at 97% success rate and can generalize to new tasks, distractors, and backgrounds better than the next best baseline
RT-1 can incorporate data from simulation or other robot types, retaining performance on the original tasks and improving generalization to new scenarios

Related work

Recent works have proposed Transformer-based policies for robotic control
Transformer used to map language and vision observations to robot actions
Transformer-based policies used to generalize across robot morphologies and other modalities
Focus of work is on generalizable and robust real-world robotic manipulation at scale
Existing works on real-world Transformer-based robotic manipulation focus on learning tasks from demonstrations
Robotics has a long history of multi-task and language-conditioned learning

Preliminaries

Aim to learn robot policies to solve language-conditioned tasks from vision
Sequential decision-making environment
Policy produces action distribution from language instruction and initial image observation
Interaction ends when termination condition is achieved
Goal is to maximize average reward
Use Transformer to parameterize policy
Imitation learning to train policy on dataset of successful episodes
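
The imitation learning objective above can be written as a negative log-likelihood over the demonstration dataset. The notation here is chosen for illustration and may differ from the paper's: i is the language instruction, x_j are image observations, a_t is the action at step t, D is the dataset of successful episodes, and θ are the Transformer's parameters.

```latex
\[
\mathcal{L}(\theta) = -\,\mathbb{E}_{(i,\,\{x_j\},\,\{a_j\}) \sim \mathcal{D}}
\left[ \sum_{t=0}^{T} \log \pi_\theta\!\left(a_t \mid i, \{x_j\}_{j=0}^{t}\right) \right]
\]
```

Minimizing this loss pushes the policy to assign high probability to the demonstrated action at every step, given the instruction and the images seen so far.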

System overview

Goal: build and demonstrate a general robot learning system
Robot: mobile manipulator with 7 degree-of-freedom arm, two-fingered gripper, and mobile base
Environments: two real office kitchens and a training environment modelled off these real kitchens
Data: human-provided demonstrations, annotated with textual description of instruction
Instructions split into skills (verbs) and objects (nouns)
Data set: 130k individual demonstrations, 700 distinct task instructions, variety of objects
Network architecture: Robotics Transformer 1 (RT-1)
RT-1 takes images and text as input, outputs action for robot at each time step
Actions: 7 dimensions for arm movement, 3 dimensions for base movement, discrete dimension to switch between 3 modes
RT-1 performs closed-loop control, commands actions at 3 Hz
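
The action space described above can be sketched as a small data structure. This is only an illustration: the class and field names are hypothetical, and the mode names are an assumption based on the paper's description of the three modes (controlling the arm, controlling the base, or terminating the episode).

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import List


class Mode(IntEnum):
    """Three-way discrete mode switch (identifiers are illustrative)."""
    CONTROL_ARM = 0
    CONTROL_BASE = 1
    TERMINATE_EPISODE = 2


@dataclass
class RobotAction:
    arm: List[float]   # 7 values for arm movement
    base: List[float]  # 3 values for base movement
    mode: Mode         # which of the 3 modes is active

    def __post_init__(self) -> None:
        if len(self.arm) != 7:
            raise ValueError("arm must have 7 dimensions")
        if len(self.base) != 3:
            raise ValueError("base must have 3 dimensions")

    def flatten(self) -> List[float]:
        """Concatenate into one 11-dimensional vector: 7 + 3 + 1."""
        return [*self.arm, *self.base, float(self.mode)]


CONTROL_HZ = 3.0                      # closed-loop control rate from the text
CONTROL_PERIOD_S = 1.0 / CONTROL_HZ   # about 0.333 s between commanded actions

action = RobotAction(arm=[0.0] * 7, base=[0.0] * 3, mode=Mode.CONTROL_ARM)
print(len(action.flatten()))
```

At 3 Hz the policy has roughly a third of a second to produce each action, which is why inference speed matters for the model design.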

RT-1: Robotics Transformer

Tokenize images, text, and actions
Describe RT-1 model architecture
Attain runtime speed for real-time control
Describe data collection procedure and skills/instructions in dataset

Model

Model is built on Transformer architecture
Takes history of images and task description as input
Outputs tokenized actions
Data-efficient and compact tokenization of images and language instruction
Images passed through EfficientNet-B3 model
Language instruction embedded via Universal Sentence Encoder
FiLM layers added to pretrained EfficientNet to condition image encoder
TokenLearner used to map large number of tokens to smaller number
Transformer decoder-only sequence model with 8 self-attention layers
Actions discretized into 256 bins
Categorical cross-entropy objective used
Inference speed limited by model size
Two techniques used to speed up inference
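
The discretization of actions into 256 bins can be sketched as follows. This is a minimal sketch assuming uniform bins over a fixed range per action dimension; the function names and the [-1, 1] bounds in the usage lines are hypothetical and not taken from the RT-1 code.

```python
import math
from typing import List, Sequence

NUM_BINS = 256  # number of bins per action dimension, as stated above


def discretize(values: Sequence[float], low: float, high: float,
               num_bins: int = NUM_BINS) -> List[int]:
    """Map continuous values in [low, high] to integer bins 0..num_bins-1."""
    bins = []
    for v in values:
        v = min(max(v, low), high)             # clip to the valid range
        scaled = (v - low) / (high - low)      # now in [0, 1]
        index = int(math.floor(scaled * num_bins))
        bins.append(min(index, num_bins - 1))  # v == high falls in the last bin
    return bins


def undiscretize(bins: Sequence[int], low: float, high: float,
                 num_bins: int = NUM_BINS) -> List[float]:
    """Map bin indices back to the centre of each bin."""
    width = (high - low) / num_bins
    return [low + (b + 0.5) * width for b in bins]


tokens = discretize([-1.0, 0.0, 1.0], low=-1.0, high=1.0)
print(tokens)  # [0, 128, 255]
print(undiscretize(tokens, low=-1.0, high=1.0))
```

Treating each dimension as a 256-way classification target is what makes a categorical cross-entropy objective applicable. The price is a quantization error of at most half a bin width per dimension.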

Data

Aim to build a system with high performance, generalization and robustness
Collected ∼130k robot demonstrations over 17 months in a series of office kitchen segments

Skills and instructions.

Definition of a task is inconsistent in literature
Count number of language instructions system can perform
RT-1 can perform over 700 language instructions in multiple realistic office kitchen environments
Group instructions by verbs used in them (referred to as skills)
Skills include picking, placing, opening/closing drawers, getting items in/out of drawers, placing elongated items upright, knocking them over, pulling napkins and opening jars
Expanded object diversity for “pick” skill to test generalization to new instructions and ability to perform many tasks
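
Grouping instructions by verb, as described above, can be illustrated with a short sketch. It assumes the skill is simply the first word of each instruction, which is a simplification, and the example instructions are made up rather than taken from the dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def group_by_skill(instructions: Iterable[str]) -> Dict[str, List[str]]:
    """Group instructions by their leading verb, treated here as the skill."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for text in instructions:
        words = text.strip().lower().split()
        if not words:
            continue  # skip empty strings
        groups[words[0]].append(text)
    return dict(groups)


# Made-up example instructions, not taken from the dataset.
examples = [
    "pick apple",
    "pick water bottle",
    "open top drawer",
    "close top drawer",
    "place can upright",
]
skills = group_by_skill(examples)
for verb, items in sorted(skills.items()):
    print(verb, len(items))
```

Counting instructions rather than "tasks" sidesteps the inconsistent definition of a task, and the per-verb grouping gives a coarse view of which skills have the most object diversity.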

Experiments

Experiments seek to answer questions about RT-1’s ability to learn and generalize
Comparing RT-1 to two baseline architectures: Gato and BC-Z
Gato and BC-Z are trained on data described in Sec. 5.2
RT-1, Gato and BC-Z are compared in terms of design decisions
Evaluating success rate in experiments to measure performance on training instructions, generalization to unseen instructions, robustness to backgrounds and distractors, and performance in long-horizon scenarios
Over 3000 real-world trials conducted, making it one of the largest-scale evaluations of a robot learning system to date

Experimental setup

Evaluated RT-1 with mobile manipulators from Everyday Robots in three environments
Two real office kitchens and a training environment modelled off these real kitchens
Evaluated performance on training tasks and generalization to new tasks, robustness to unseen environments, and performance when chained together for long-horizon tasks
Tested over 200 tasks in evaluation
Evaluated generalization to unseen tasks with 21 novel, unseen instructions
Evaluated robustness with 30 real-world tasks for distractor robustness and 22 tasks for background robustness
Evaluated generalization to long-horizon scenarios with 15 instructions in two real kitchens

Can RT-1 learn to perform a large number of instructions, and to generalize to new tasks, objects and environments?

RT-1 outperforms prior models significantly on seen tasks, generalization to unseen tasks, and robustness to distractors and backgrounds
RT-1 successfully performs 97% of more than 200 instructions on seen tasks
RT-1 is able to generalize to novel instructions, performing 76% of never-before-seen instructions
RT-1 is quite robust, successfully executing 83% of distractor robustness tasks and 59% of background robustness tasks
RT-1 is able to generalize to realistic instructions in a real kitchen
RT-1 is the most robust on all levels of generalization

Can we push the resulting model further by incorporating heterogeneous data sources?

RT-1 can incorporate and learn from vastly different data sources
RT-1 can absorb both real and simulation data
RT-1 does not lose performance when adding simulation data
RT-1 can absorb data from different robots
Adding data from a different robot minimally impacts standard classroom evaluation performance
Adding this data results in almost a 2x improvement in generalization to the Bin-picking evaluation
RT-1 can acquire new skills through observing other robots’ experiences
RT-1 can combine many more multi-robot datasets to enhance the robot capabilities
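
One simple way to train on several data sources at once is to draw each training example from a weighted mixture of datasets. The sketch below illustrates that general idea only: the function name, source names and weights are hypothetical, and it is not the mixing procedure or ratio used in the paper.

```python
import random
from typing import Dict, List, Sequence, Tuple


def sample_mixed_batch(sources: Dict[str, Sequence],
                       weights: Dict[str, float],
                       batch_size: int,
                       rng: random.Random) -> List[Tuple[str, object]]:
    """Draw a batch where each example's source is chosen by weight."""
    names = list(sources)
    probs = [weights[name] for name in names]
    batch = []
    for _ in range(batch_size):
        name = rng.choices(names, weights=probs, k=1)[0]  # pick a source
        batch.append((name, rng.choice(sources[name])))   # pick an episode
    return batch


# Hypothetical placeholder datasets and weights.
sources = {
    "real_robot": ["real_ep_0", "real_ep_1", "real_ep_2"],
    "simulation": ["sim_ep_0", "sim_ep_1"],
    "other_robot": ["other_ep_0"],
}
weights = {"real_robot": 0.6, "simulation": 0.2, "other_robot": 0.2}

batch = sample_mixed_batch(sources, weights, batch_size=8, rng=random.Random(0))
print([name for name, _ in batch])
```

Sampling by weight rather than concatenating datasets keeps a small source from being drowned out by a large one, at the cost of one more hyperparameter per source.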

How do various methods generalize to long-horizon robotic scenarios?

Evaluated method in two real kitchens
SayCan combines low-level instructions to perform high-level instructions
RT-1 performs best with 67% success rate in Kitchen1
RT-1 able to operate unseen drawers in Kitchen2
Data diversity more essential than data quantity

Conclusions, limitations and future work

Developed RT-1, a robot learning method that can absorb large amounts of data and scale with data quantity and diversity
Trained RT-1 on 130k episodes collected over 17 months with 13 robots
RT-1 can perform over 700 instructions at 97% success rate and generalize to new tasks, objects and environments better than previously published baselines
RT-1 can absorb heterogeneous data from simulation and other robot morphologies without sacrificing original-tasks performance
RT-1 can execute very long-horizon tasks with as many as 50 steps
Limitations include reliance on imitation learning, limited generalization to new instructions, and coverage of only a small portion of possible robotic manipulation tasks
Open-sourced the code for RT-1
Leveraged simulation for “real to sim” transfer to evaluate model performance
Evaluated on real-world randomized scenes and over 3000 total rollouts
RT-1 shows high performance and robustness and can learn from heterogeneous data
Evaluated on 744 seen tasks and 53 unseen tasks
Tested three tasks with incrementally more distractor objects added to the scene
Tested six tasks with incrementally more challenging backgrounds and counter textures
Tested in a real office kitchen with a variety of skills
Tested RT-1’s ability to absorb both real and simulation data
