

  • By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks zero-shot or with small task-specific datasets.
  • Generalization is especially important in robotics because real-world data is difficult to collect.
  • The Robotics Transformer model class exhibits promising scalability properties.
  • The paper studies how different model classes generalize as a function of data size, model size, and data diversity.

Paper Content


  • End-to-end robotic learning typically involves collecting task-specific data
  • Recent years have seen a transformation in vision, NLP, and other domains, away from siloed, small-scale datasets and models and towards large, general models pre-trained on broad, large datasets
  • The keys to the success of such models are open-ended, task-agnostic training combined with high-capacity architectures
  • Can we train a single, capable, large multi-task backbone model on data consisting of a wide variety of robotic tasks?
  • Assembling the right dataset and designing the right model are two main challenges
  • RT-1 is a 35M parameter model that takes images and natural language instructions and outputs discretized base and arm actions
  • RT-1’s large-scale, real-world training and evaluation show impressive generalization, robustness, and ability to learn from diverse data
  • RT-1 can perform over 700 training instructions at 97% success rate and can generalize to new tasks, distractors, and backgrounds better than the next best baseline
  • RT-1 can incorporate data from simulation or other robot types, retaining performance on the original tasks and improving generalization to new scenarios
  • Recent works have proposed Transformer-based policies for robotic control
  • Transformer used to map language and vision observations to robot actions
  • Transformer-based policies used to generalize across robot morphologies and other modalities
  • Focus of work is on generalizable and robust real-world robotic manipulation at scale
  • Existing works on real-world Transformer-based robotic manipulation focus on learning tasks from demonstrations
  • Robotics has a long history of multi-task and language-conditioned learning


  • Aim to learn robot policies to solve language-conditioned tasks from vision
  • Sequential decision-making environment
  • Policy produces action distribution from language instruction and initial image observation
  • Interaction ends when termination condition is achieved
  • Goal is to maximize average reward
  • Use Transformer to parameterize policy
  • Imitation learning to train policy on dataset of successful episodes
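
The imitation-learning setup above can be written as a behavioral-cloning objective. This is a sketch with notation assumed for illustration: $i$ is the language instruction, $x_j$ the image observed at step $j$, $a_t$ the demonstrated action, $\mathcal{D}$ the dataset of successful episodes, and $\pi_\theta$ the Transformer-parameterized policy.

```latex
\theta^{*} = \arg\min_{\theta} \;
\mathbb{E}_{(i,\, x_{0:T},\, a_{0:T}) \sim \mathcal{D}}
\left[ -\sum_{t=0}^{T} \log \pi_{\theta}\!\left(a_t \mid i,\, \{x_j\}_{j=0}^{t}\right) \right]
```

Minimizing this negative log-likelihood pushes the policy to assign high probability to the demonstrated action at every step, given the instruction and the images seen so far.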

System overview

  • Goal: build and demonstrate a general robot learning system
  • Robot: mobile manipulator with 7 degree-of-freedom arm, two-fingered gripper, and mobile base
  • Environments: two real office kitchens and a training environment modelled off these real kitchens
  • Data: human-provided demonstrations, annotated with textual description of instruction
  • Instructions split into skills (verbs) and objects (nouns)
  • Data set: 130k individual demonstrations, 700 distinct task instructions, variety of objects
  • Network architecture: Robotics Transformer 1 (RT-1)
  • RT-1 takes images and text as input, outputs action for robot at each time step
  • Actions: 7 dimensions for arm movement, 3 dimensions for base movement, discrete dimension to switch between 3 modes
  • RT-1 performs closed-loop control, commands actions at 3 Hz
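
The action space and control loop above can be sketched as a small data structure. The class, field, and function names below are hypothetical. The three mode names (control arm, control base, terminate) are an assumption based on the paper rather than on this summary, which only says there are 3 modes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Tuple


class Mode(Enum):
    """The three discrete modes (names are an assumption, see lead-in)."""
    CONTROL_ARM = 0
    CONTROL_BASE = 1
    TERMINATE = 2


@dataclass(frozen=True)
class RobotAction:
    arm: Tuple[float, ...]   # 7 values for arm movement
    base: Tuple[float, ...]  # 3 values for base movement
    mode: Mode               # discrete switch between the 3 modes

    def __post_init__(self) -> None:
        if len(self.arm) != 7:
            raise ValueError("arm must have 7 dimensions")
        if len(self.base) != 3:
            raise ValueError("base must have 3 dimensions")


CONTROL_HZ = 3.0                  # control rate stated in the text
STEP_PERIOD_S = 1.0 / CONTROL_HZ  # time budget per closed-loop step


def run_closed_loop(policy: Callable, observe: Callable, max_steps: int) -> int:
    """Query the policy once per step until it emits TERMINATE.

    Returns the number of policy queries made.
    """
    for step in range(max_steps):
        action = policy(observe())
        if action.mode is Mode.TERMINATE:
            return step + 1
        # A real controller would send `action` to the robot here and wait
        # out the remainder of STEP_PERIOD_S before observing again.
    return max_steps
```

At 3 Hz the whole pipeline (observation, inference, command) has roughly a third of a second per step, which is why inference speed matters for this model.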

RT-1: Robotics Transformer

  • Tokenize images, text, and actions
  • Describe RT-1 model architecture
  • Attain runtime speed for real-time control
  • Describe data collection procedure and skills/instructions in dataset


  • Model is built on Transformer architecture
  • Takes history of images and task description as input
  • Outputs tokenized actions
  • Data-efficient and compact tokenization of images and language instruction
  • Images passed through EfficientNet-B3 model
  • Language instruction embedded via Universal Sentence Encoder
  • FiLM layers added to pretrained EfficientNet to condition image encoder
  • TokenLearner used to map large number of tokens to smaller number
  • Transformer decoder-only sequence model with 8 self-attention layers
  • Actions discretized into 256 bins
  • Categorical cross-entropy objective used
  • Inference speed limited by model size
  • Two techniques used to speed up inference
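
The action tokenization and training objective above can be illustrated with a minimal sketch. It assumes uniformly spaced bins over a known range for each action dimension. The `low` and `high` bounds and all function names are hypothetical, and this is not the authors' implementation.

```python
import math
from typing import Sequence

NUM_BINS = 256  # number of bins per action dimension, as stated in the text


def discretize(value: float, low: float, high: float) -> int:
    """Map a continuous value in [low, high] to a bin index in [0, NUM_BINS - 1]."""
    clipped = min(max(value, low), high)
    fraction = (clipped - low) / (high - low)
    # The upper bound itself would land on index NUM_BINS, so clamp it.
    return min(int(fraction * NUM_BINS), NUM_BINS - 1)


def undiscretize(index: int, low: float, high: float) -> float:
    """Return the centre of bin `index` as a continuous value."""
    return low + (index + 0.5) * (high - low) / NUM_BINS


def cross_entropy(logits: Sequence[float], target: int) -> float:
    """Categorical cross-entropy for one action dimension, from raw logits."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[target]
```

With this scheme each action dimension becomes a 256-way classification problem, so the model predicts one token per dimension and the per-dimension losses can be summed or averaged.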


  • Aim to build a system with high performance, generalization and robustness
  • Collected ∼130k robot demonstrations over 17 months in a series of office kitchen segments

Skills and instructions.

  • Definition of a task is inconsistent in literature
  • Count number of language instructions system can perform
  • RT-1 can perform over 700 language instructions in multiple realistic office kitchen environments
  • Group instructions by verbs used in them (referred to as skills)
  • Skills include picking, placing, opening and closing drawers, getting items in and out of drawers, placing elongated items upright, knocking them over, pulling napkins, and opening jars
  • Expanded object diversity for “pick” skill to test generalization to new instructions and ability to perform many tasks
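
Grouping instructions by their verb can be sketched as follows. The example instructions are made up for illustration, and taking the first word as the verb is a simplifying assumption. Multi-word skills such as "knock over" would need a proper parser.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_skill(instructions: List[str]) -> Dict[str, List[str]]:
    """Group instructions by their leading word, treated here as the skill verb."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for instruction in instructions:
        words = instruction.split()
        if not words:
            continue  # skip empty instructions
        groups[words[0].lower()].append(instruction)
    return dict(groups)


# Hypothetical instructions, not taken from the dataset.
example = ["pick apple", "open top drawer", "pick sponge", "place can upright"]
skills = group_by_skill(example)
```

Counting the keys of `skills` gives the number of skills, while counting the instructions gives the number the text uses to measure how many tasks the system can perform.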



  • Experiments seek to answer questions about RT-1’s ability to learn and generalize
  • Comparing RT-1 to two baseline architectures: Gato and BC-Z
  • Gato and BC-Z are trained on data described in Sec. 5.2
  • RT-1, Gato and BC-Z are compared in terms of design decisions
  • Evaluating success rate in experiments to measure performance on training instructions, generalization to unseen instructions, robustness to backgrounds and distractors, and performance in long-horizon scenarios
  • Over 3000 real-world trials conducted, making it one of the largest-scale evaluations of a robot learning system to date

Experimental setup

  • Evaluated RT-1 with mobile manipulators from Everyday Robots in three environments
  • Two real office kitchens and a training environment modelled off these real kitchens
  • Evaluated performance on training tasks and generalization to new tasks, robustness to unseen environments, and performance when chained together for long-horizon tasks
  • Tested over 200 tasks in evaluation
  • Evaluated generalization to unseen tasks with 21 novel, unseen instructions
  • Evaluated robustness with 30 real-world tasks for distractor robustness and 22 tasks for background robustness
  • Evaluated generalization to long-horizon scenarios with 15 instructions in two real kitchens
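
Success rate per evaluation category can be computed from trial logs as sketched below. The category labels and the log format (a category name paired with a success flag) are hypothetical, and no real trial outcomes are reproduced here.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def success_rates(trials: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Return the fraction of successful trials for each category."""
    counts: Dict[str, List[int]] = defaultdict(lambda: [0, 0])
    for category, succeeded in trials:
        counts[category][0] += int(succeeded)  # successes
        counts[category][1] += 1               # total trials
    return {cat: wins / total for cat, (wins, total) in counts.items()}
```

Keeping the categories separate (seen tasks, unseen tasks, distractors, backgrounds, long-horizon) matters because a single pooled number would hide where a policy fails to generalize.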

Can RT-1 learn to perform a large number of instructions, and to generalize to new tasks, objects and environments?

  • RT-1 outperforms prior models significantly on seen tasks, generalization to unseen tasks, and robustness to distractors and backgrounds
  • RT-1 successfully performs 97% of more than 200 instructions on seen tasks
  • RT-1 is able to generalize to novel instructions, performing 76% of never-before-seen instructions
  • RT-1 is quite robust, successfully executing 83% of distractor robustness tasks and 59% of background robustness tasks
  • RT-1 is able to generalize to realistic instructions in a real kitchen
  • RT-1 is the most robust on all levels of generalization

Can we push the resulting model further by incorporating heterogeneous data sources?

  • RT-1 can incorporate and learn from vastly different data sources
  • RT-1 can absorb both real and simulation data
  • RT-1 does not lose performance when adding simulation data
  • RT-1 can absorb data from different robots
  • Mixing in data from a different robot minimally impacts performance on the standard Classroom evaluation
  • Mixing in data from a different robot yields almost a 2x improvement in generalization to the Binpicking evaluation
  • RT-1 can acquire new skills through observing other robots’ experiences
  • RT-1 can combine many more multi-robot datasets to enhance the robot capabilities

How do various methods generalize to long-horizon robotic scenarios?

  • Evaluated method in two real kitchens
  • SayCan combines low-level instructions to perform high-level instructions
  • RT-1 performs best with 67% success rate in Kitchen1
  • RT-1 able to operate unseen drawers in Kitchen2
  • Data diversity more essential than data quantity
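
Chaining low-level instructions makes long-horizon success sensitive to per-step reliability. The sketch below is a back-of-the-envelope model, not the paper's analysis. It assumes that every step succeeds independently with the same probability, which real rollouts need not satisfy. The function name is hypothetical.

```python
def chained_success(per_step_success: float, num_steps: int) -> float:
    """Probability that all steps succeed, assuming independent steps."""
    if not 0.0 <= per_step_success <= 1.0:
        raise ValueError("per_step_success must be in [0, 1]")
    if num_steps < 0:
        raise ValueError("num_steps must be non-negative")
    return per_step_success ** num_steps


# Illustrative inputs only: two steps at 90% each succeed together about 81% of the time.
two_step = chained_success(0.9, 2)
```

Under this simplified model even a small per-step failure rate compounds quickly as the number of chained steps grows, which is why long-horizon evaluations are a demanding test of a low-level policy.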

Conclusions, limitations and future work

  • Developed RT-1, a robot learning method that can absorb large amounts of data and scale with data quantity and diversity
  • Trained RT-1 on 130k episodes collected over 17 months with 13 robots
  • RT-1 can perform over 700 instructions at 97% success rate and generalize to new tasks, objects and environments better than previously published baselines
  • RT-1 can absorb heterogeneous data from simulation and other robot morphologies without sacrificing original-tasks performance
  • RT-1 can execute very long-horizon tasks with as many as 50 steps
  • Limitations include reliance on imitation learning, limited generalization to new instructions, and coverage of only a small portion of possible robotic manipulation tasks
  • Open-sourced the code for RT-1
  • Leveraged simulation for “real to sim” transfer to evaluate model performance
  • Evaluated on real-world randomized scenes and over 3000 total rollouts
  • RT-1 shows high performance and robustness and can learn from heterogeneous data
  • Evaluated on 744 seen tasks and 53 unseen tasks
  • Tested three tasks with incrementally more distractor objects added to the scene
  • Tested six tasks with incrementally more challenging backgrounds and counter textures
  • Tested in a real office kitchen with a variety of skills
  • Tested RT-1’s ability to absorb both real and simulation data