Grasping real-world objects is considered one of the most iconic examples of the current limits of machine intelligence. While humans can easily grasp and pick up objects they’ve never seen before, even the most advanced robotic arms can’t manipulate objects that they weren’t trained to handle. Recent developments in reinforcement learning (RL) have allowed for the creation of robots with better manipulation skills, but even state-of-the-art technology leaves much to be desired. A key research challenge is the scarcity of real-world training data, as even the largest research institutions operate no more than a few dozen robots.
RCAN, a new paper from X (previously Google X), Google, and DeepMind, presents a novel technique which allows robots to learn grasping from simulation and apply the learned skills to real-world situations. The paper vastly improves on domain randomization, a popular technique for training in simulation, by using pix2pix, a GAN-based technique to convert images to different styles, and combining it with QT-Opt, a state-of-the-art RL method first presented in 2018. The result is a grasping robot with state-of-the-art capabilities – the RCAN robot can pick up 70% of objects in a tray without any real-world experience and reaches top results (91%) using 99% less real-world data than equivalent algorithms.
Background
RCAN is relatively simple and elegant but builds on several previous breakthroughs in RL and GANs. We’ll detail these past achievements in short to allow readers to thoroughly understand how RCAN works.
Domain randomization
Domain randomization is the idea of generating a complex environment in simulation, running experiments in the simulated environment (for instance with a robot), and finally applying the lessons to real-world tasks. While the concept itself is rather trivial, the implementation challenge is often immense – how to create a simulated environment which both faithfully emulates the real world and allows the actor in the simulation (the robot) to gain valuable real-world experience.
While the concept of domain randomization has been around for a while and has been applied in physics and other fields, it was only popularized in the context of robotic machine learning by OpenAI in a 2017 paper. In the OpenAI study, the researchers laid objects on a tray with randomized texture, varied their pose, randomized the relative position between the objects, and applied diverse lighting conditions and different camera angles. They then trained a simulation robot to pick up the objects in the various conditions and were able to achieve impressive results in real-life object fetching experiments.
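To make this concrete, domain randomization boils down to re-sampling scene parameters before every simulated episode. The sketch below is a minimal, hypothetical example; the simulator handle (`sim`) and the parameter ranges are illustrative assumptions, not taken from the OpenAI or RCAN code:

```python
import random

def randomize_scene(sim):
    """Randomize a simulated grasping scene before each training episode.
    `sim` is a hypothetical simulator handle; all ranges are illustrative."""
    for obj in sim.objects:
        obj.texture = random.choice(sim.texture_library)     # random surface texture
        obj.position = [random.uniform(-0.2, 0.2),           # random pose on the tray
                        random.uniform(-0.2, 0.2),
                        0.0]
        obj.yaw = random.uniform(0, 360)
    sim.light.intensity = random.uniform(0.3, 1.5)           # varied lighting conditions
    sim.light.direction = [random.uniform(-1, 1) for _ in range(3)]
    sim.camera.height = random.uniform(0.6, 1.0)             # different camera viewpoints
    sim.camera.tilt = random.uniform(-15, 15)
```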
QT-Opt
QT-Opt is a reinforcement learning algorithm which allows robots to improve their grasping capability by learning from hundreds of thousands of real-world grasping examples. At its heart is a large (1.2 million parameters) CNN which represents the robot’s grasping logic (its Q-function).
RL algorithms are often divided into two categories:
- Open-loop systems execute a policy while ignoring the environmental consequences of the agent’s activity. In robotics, an example of an open-loop system would be an algorithm which attempts to grasp an object by finding an ideal position and pose for the grasping arm and then aiming for that location, regardless of possible interruptions along the way.
- In closed-loop systems, the policy adapts itself based on the real-time performance of the arm, incorporating the results of its actions into the algorithm’s logic. In robotics, an example of a closed-loop system would be an algorithm that recalculates the grasping arm’s movement if the arm is blocked or suffers any other setback in the grasping process.
Closed-loop systems are more robust and have the potential for better results in practice, but also tend to be harder to train. QT-Opt successfully trains a neural network in a closed-loop system, allowing the robot to learn useful techniques like repositioning objects for a better hold, regrasping after a failed attempt, and reacting to disturbances mid-grasp.
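The closed-loop behaviour comes from re-choosing the action from a fresh observation at every step: QT-Opt scores candidate arm motions with its learned Q-function and picks the best one using a derivative-free cross-entropy search. The sketch below illustrates that idea only; the network and environment interfaces (`q_network.predict`, `env.step`) are placeholders, and the search is heavily simplified relative to the paper:

```python
import numpy as np

def select_action(q_network, image, robot_state, num_samples=64, iters=3, elite_frac=0.1):
    """Closed-loop action selection: maximize Q(s, a) over candidate arm motions
    with a simplified cross-entropy-style search (QT-Opt uses a similar derivative-free method)."""
    mean, std = np.zeros(4), np.ones(4)          # e.g. (dx, dy, dz, gripper) action space
    n_elite = max(1, int(num_samples * elite_frac))
    for _ in range(iters):
        candidates = np.random.normal(mean, std, size=(num_samples, 4))
        scores = q_network.predict(image, robot_state, candidates)   # Q-value per candidate (placeholder call)
        elite = candidates[np.argsort(scores)[-n_elite:]]            # keep the best-scoring motions
        mean, std = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mean                                   # best incremental motion found

def run_episode(env, q_network, max_steps=20):
    """Because the action is re-chosen from a fresh observation at every step,
    the policy can react if the arm is blocked or an object shifts."""
    obs = env.reset()
    for _ in range(max_steps):
        action = select_action(q_network, obs["image"], obs["robot_state"])
        obs, done = env.step(action)
        if done:
            break
```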

In its vanilla form, QT-Opt is an off-policy algorithm, meaning in this case that the robot’s policy isn’t trained on its own grasping attempts as it makes them, but only on previously collected grasping data. The off-policy form allows it to achieve a state-of-the-art 87% grasp success rate on a common bin-emptying challenge after training on 580,000 real-world examples. When the researchers added on-policy learning, the measured success rate initially dipped to 85% after 5,000 real-world grasps, but ultimately reached 96% after 28,000 real-world grasps.
| | Off-Policy Training (580k samples) | Off-Policy + 5,000 on-policy samples | Off-Policy + 28,000 on-policy samples |
|---|---|---|---|
| Grasp Success Rate | 87% | 85% | 96% |
cGAN
Generative Adversarial Networks (GANs) are systems which consist of two neural networks – a generator and a discriminator. The generator learns to create a fake but believable output, and the discriminator learns to discern which outputs are fake and which are real. The most common use case of GANs is in image generation, wherein the generator aims to create an image of a certain style or with certain characteristics.
In common image GANs, the generator learns by receiving a noise vector as input and trying to turn the noise into a believable image. If the discriminator accepts the image, the generator neural network receives positive feedback, whereas if the discriminator rejects the image then the generator neural network receives negative feedback. The discriminator trains with real images (true samples) and the outputs of the generator network (false samples).
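In code, one adversarial training step of a vanilla image GAN could look like the PyTorch-style sketch below. It is a minimal illustration, not any specific paper’s implementation: the `generator` and `discriminator` modules are placeholders, and the discriminator is assumed to end in a sigmoid so its output can be read as a real/fake probability.

```python
import torch
import torch.nn.functional as F

def gan_training_step(generator, discriminator, g_opt, d_opt, real_images, noise_dim=100):
    """One adversarial update: the discriminator learns to separate real from fake,
    and the generator learns to fool it. Network definitions are assumed elsewhere."""
    batch = real_images.size(0)
    noise = torch.randn(batch, noise_dim)

    # Discriminator step: real images should score 1, generated images should score 0.
    fake_images = generator(noise).detach()
    d_loss = F.binary_cross_entropy(discriminator(real_images), torch.ones(batch, 1)) \
           + F.binary_cross_entropy(discriminator(fake_images), torch.zeros(batch, 1))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator step: positive feedback when the discriminator accepts its output as real.
    fake_images = generator(noise)
    g_loss = F.binary_cross_entropy(discriminator(fake_images), torch.ones(batch, 1))
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
    return d_loss.item(), g_loss.item()
```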
Image GANs have been shown to produce believable and intriguing results, as exemplified in several recent papers.
In 2014, Mirza and Osindero expanded on the concept of GANs with cGANs (conditional GANs). In cGANs, the system receives not only noise and real images but also a third kind of input – a label – which is the condition on which the network is trained. The label can be a class or even an entire image of a certain style, and it guides the generator toward producing output consistent with that condition. In cGANs, the generator receives as input both a noise vector and the label, and the discriminator receives as input both the label and a true/false sample.
pix2pix
In 2016, a team from UC Berkeley presented pix2pix, a technique which uses cGANs for image-to-image translation. The generator receives an image from a source domain (the “label” in cGAN terms) and learns to produce the corresponding image in a target style, while the discriminator judges pairs of source and output images rather than standalone samples.
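The hedged sketch below shows how the pix2pix idea differs from the vanilla GAN step above: the discriminator sees the conditioning image together with a real or generated output, and the generator is additionally pulled toward the paired target with an L1 reconstruction term (a standard part of pix2pix). The `generator` and `discriminator` modules are placeholders; the discriminator here is assumed to concatenate its two image inputs internally.

```python
import torch
import torch.nn.functional as F

def pix2pix_step(generator, discriminator, g_opt, d_opt, input_img, target_img, l1_weight=100.0):
    """One pix2pix-style update on a paired example (source-domain image, target-domain image)."""
    batch = input_img.size(0)
    fake = generator(input_img)                       # candidate target-domain translation

    # Discriminator judges (conditioning image, output image) pairs, not lone images.
    d_loss = F.binary_cross_entropy(discriminator(input_img, target_img), torch.ones(batch, 1)) \
           + F.binary_cross_entropy(discriminator(input_img, fake.detach()), torch.zeros(batch, 1))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator is rewarded for fooling the discriminator and for staying close to the paired target.
    g_loss = F.binary_cross_entropy(discriminator(input_img, fake), torch.ones(batch, 1)) \
           + l1_weight * F.l1_loss(fake, target_img)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()
```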

How RCAN works
Now that we’ve established the previous research which RCAN builds upon, we can describe its mechanism.
The key insight in RCAN (Randomized-to-Canonical Adaptation Networks) is that despite Kalashnikov’s success with QT-Opt, there seems to be a limit to the effectiveness of training grasping robots directly on raw camera images. The near-infinite variety of possible lighting conditions and poses results in subtle variations which confuse the CNN, and training a CNN to handle these distinctions directly seems to require too much training data.
Therefore, RCAN divides both the training and policy execution (inference) process into two stages:
- The image observed by the robot is translated to a specific image style, known as a canonical style. The canonical style was designed to show clear distinctions between objects and object components by presenting them in different colors (see image).
- The robot attempts to grasp objects by looking at the canonical version of its environment, thus gaining experience at grasping objects with a canonical-style view of the world.
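Put together, policy execution becomes a two-stage pipeline per time step. The sketch below is illustrative only; `camera`, `arm`, `rcan_generator`, and `q_policy` are placeholder names, not the paper’s API:

```python
def grasp_step(camera, arm, rcan_generator, q_policy):
    """Stage 1: translate the raw observation into the canonical style.
    Stage 2: choose the next arm motion from the canonical view only."""
    raw_image = camera.capture()                 # real-world (or randomized-sim) image
    canonical = rcan_generator(raw_image)        # pix2pix-style randomized-to-canonical translation
    action = q_policy.select_action(canonical)   # QT-Opt-style closed-loop action selection
    arm.execute(action)
```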

In total, the RCAN team created a system with four distinct stages:
- Generating training data for image translation – The RCAN team applies domain randomization, generating a wide variety of robot grasping scenarios. They then translate each image into a canonical version of itself, a process which doesn’t require a specialized algorithm thanks to the simulator’s inherent knowledge of every object’s position. Naturally, this knowledge doesn’t exist when the robot operates in the real world, and it’s therefore necessary to generate data to train an image translation module for the robot (see the sketch after this list).
- Image translation – The translation from simulated images to the canonical style is done with a pix2pix cGAN, which receives as input the simulated images (the “label” in cGAN terms) and their canonical versions (the “real image” in cGAN terms) and learns to generate a canonical version of a given image.

- QT-Opt training in simulation – As in QT-Opt, images are simulated with varied lighting, poses, and so on, and the robot trains on the simulated images. Unlike QT-Opt, the robot doesn’t learn its grasping technique on the raw simulated images but on their canonical versions, which are created via the pix2pix image translator.
- Grasping in the real world – After training, the robot attempts to grasp objects in the real world by first translating the real-world raw image to a canonical version, and then running its learned policy on the canonical version.
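For the first stage, generating the adaptation network’s training data amounts to rendering every randomized scene a second time in the canonical style, which the simulator can do because it knows the true position of every object. The sketch below illustrates this pairing; the `sim` rendering calls (and the reuse of `randomize_scene` from the earlier sketch) are assumptions, not the paper’s code:

```python
def generate_translation_dataset(sim, num_scenes=100_000):
    """Build (randomized image, canonical image) pairs for training the
    pix2pix adaptation network. The simulator API here is hypothetical."""
    pairs = []
    for _ in range(num_scenes):
        randomize_scene(sim)                       # textures, lighting, poses, camera (as sketched above)
        randomized_img = sim.render()              # what a domain-randomized episode would observe
        canonical_img = sim.render_canonical()     # the same scene re-rendered in the fixed canonical style
        pairs.append((randomized_img, canonical_img))
    return pairs
```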
To further improve performance, the RCAN team added two additional types of images to the canonical simulation – a translation of images to a mask version, which clearly differentiates between different objects in the scene, and a translation of images to a depth analysis version. These are used as input to the policy CNN both in the training phase and in the policy execution (inference) phase.
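Following that description, the adaptation network can be thought of as producing three aligned outputs per frame, all of which feed the policy CNN. The sketch below is a hedged illustration of such a multi-headed generator; the encoder/decoder internals and the 64-channel feature width are placeholders, not the paper’s architecture:

```python
import torch.nn as nn

class MultiOutputTranslator(nn.Module):
    """Illustrative multi-output translator: canonical RGB image, object mask, and depth map.
    `encoder` and `decoder` are placeholder sub-networks supplied by the caller."""
    def __init__(self, encoder, decoder, feature_channels=64):
        super().__init__()
        self.encoder, self.decoder = encoder, decoder
        self.rgb_head = nn.Conv2d(feature_channels, 3, kernel_size=1)    # canonical-style image
        self.mask_head = nn.Conv2d(feature_channels, 1, kernel_size=1)   # object/background mask
        self.depth_head = nn.Conv2d(feature_channels, 1, kernel_size=1)  # per-pixel depth estimate

    def forward(self, image):
        features = self.decoder(self.encoder(image))
        return self.rgb_head(features), self.mask_head(features), self.depth_head(features)
```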

Results
RCAN achieves a 70% success rate in grasping real-world objects without any real-world training and reaches a 91% success rate after 5,000 real-world grasps, surpassing the previous state-of-the-art result of 85%. Its benefits are more limited after 28,000 real-world grasps, where it reaches 94%, trailing the state-of-the-art result of 96% achieved by Kalashnikov et al. with QT-Opt.
| | 0 Grasps | 5,000 Grasps | 28,000 Grasps |
|---|---|---|---|
| Kalashnikov et al. | Not Applicable | 85% | 96% |
| RCAN | 70% | 91% | 94% |
Therefore, it appears that the key value of RCAN lies in cases where real-world training data is scarce, a major accomplishment considering how difficult learning from little data generally is in machine learning.
Compute & Equipment
The real-world experiments were performed using an unspecified number of Kuka IIWA grasping robots. The simulator uses the Bullet physics engine.
Implementation Details
The team has not provided implementation details for RCAN and has not indicated that it will open-source the code.
Conclusion
RCAN is a tour de force of applying training in simulation to solve real-world problems, achieving excellent results while relying on very little training data. For robots to be useful in previously unknown environments they’ll have to learn based on only small amounts of training data, and with these results, RCAN may provide a hint on how to bring robots closer to practicality.