PDC: Emergent Active Perception and Dexterity of Simulated Humanoids from Visual Reinforcement Learning

PDC is a framework for learning vision-centric dexterous full-body control for simulated humanoids. We construct a system of visual cues, akin to those common in video games and VR systems. These cues indicate which objects the agent should interact with, how to perform the interaction, and what to do with the objects afterwards.

We observe that active perception (search and gaze control), in addition to dexterity, emerges when the agent is trained on a diverse set of tasks, within a diverse set of scenes, directly from partially observable visual input.


Tabletop

First, the PDC agent learns to master the tabletop task. This task serves two purposes. (1) It enables systematic evaluation of the importance of the various visual cues. (2) Through further fine-tuning, it provides a stepping stone towards more complex tasks.

In a massively parallelized physics simulator, the PDC agent learns through trial and error how to search for objects and how to extract the commands provided in its visual field. The target object is marked with a 2D green overlay (a segmentation mask), and the target location for transporting the object is shown with a 3D visual marker. Onscreen indicators tell the agent which hand to use: white means "this hand should not be used", purple means "get ready", and blue means "make contact". The agent learns to transport objects and to release them on command.
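To make the cue protocol concrete, here is a minimal sketch of how such per-hand indicators could be encoded and rendered into the agent's egocentric frame. The names, RGB values, and screen placement are illustrative assumptions, not PDC's exact implementation.

```python
from enum import Enum

# Hypothetical encoding of the per-hand cues described above; the exact
# colors and screen placement used in PDC may differ.
class HandCommand(Enum):
    DO_NOT_USE = "white"     # this hand should not be used
    GET_READY = "purple"     # approach / pre-grasp
    MAKE_CONTACT = "blue"    # grasp the object

CUE_COLORS = {
    HandCommand.DO_NOT_USE: (255, 255, 255),
    HandCommand.GET_READY: (128, 0, 128),
    HandCommand.MAKE_CONTACT: (0, 0, 255),
}

def render_hand_cues(frame, left: HandCommand, right: HandCommand):
    """Overlay one indicator patch per hand onto an (H, W, 3) uint8 frame.

    The cues live purely in pixel space, so the policy must read them
    from its visual input rather than from privileged state.
    """
    _, w, _ = frame.shape
    s = 8  # patch size in pixels
    frame[:s, :s] = CUE_COLORS[left]       # top-left patch: left-hand cue
    frame[:s, w - s:] = CUE_COLORS[right]  # top-right patch: right-hand cue
    return frame
```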

We show how the agent can be instructed, via visual cues, which hand(s) to grab with. Our system supports the left hand (top), the right hand (center), and both hands (bottom).

OMOMO

Through similar training, the same controller adapts to objects from the OMOMO and OakInk datasets.

OakInk

Unseen objects

By training with diverse objects and color-randomization techniques, our agent generalizes to previously unseen objects. Notice how the agent releases the object when the visual marker reverts from blue back to purple.
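As a rough sketch, color randomization at episode reset might look like the following; the function and object names here are hypothetical, and the paper's exact randomization scheme may differ:

```python
import numpy as np

def randomize_object_color(rng: np.random.Generator) -> np.ndarray:
    """Sample a random RGB albedo in [0, 1] for a manipulated object.

    Randomizing colors at training time keeps the policy from keying on
    a fixed appearance, leaving the green segmentation overlay as the
    reliable signal for "this is the target".
    """
    return rng.uniform(0.0, 1.0, size=3)

# At each episode reset, every object gets a fresh color:
rng = np.random.default_rng(seed=0)
episode_colors = {name: randomize_object_color(rng)
                  for name in ["mug", "bowl", "bottle"]}  # hypothetical objects
```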

Kitchens: object transport

Once the agent has mastered the tabletop setting, we continue training it in a more complex kitchen scene.

The kitchens are diverse: they span multiple configurations (galley, L-shaped, U-shaped, island, etc.) and contain a variety of objects. We randomize the textures, the placement of scene elements (cabinets, drawers, fridge, oven, etc.), the object placements, and the initial humanoid pose.
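A hedged sketch of what such per-episode scene randomization could look like; the field names, layout list, and value ranges below are assumptions for illustration, not the actual training configuration:

```python
import random
from dataclasses import dataclass

LAYOUTS = ["galley", "L-shaped", "U-shaped", "island"]

@dataclass
class KitchenEpisode:
    layout: str           # overall kitchen configuration
    texture_seed: int     # seed for randomized surface textures
    element_slots: dict   # which slot each element (fridge, oven, ...) occupies
    object_poses: dict    # initial object positions
    humanoid_pose: tuple  # initial humanoid root position

def sample_episode(rng: random.Random) -> KitchenEpisode:
    return KitchenEpisode(
        layout=rng.choice(LAYOUTS),
        texture_seed=rng.randrange(2**31),
        element_slots={"fridge": rng.randrange(4), "oven": rng.randrange(4)},
        object_poses={"mug": (rng.uniform(-1, 1), rng.uniform(-1, 1), 0.9)},
        humanoid_pose=(rng.uniform(-2, 2), rng.uniform(-2, 2), 0.0),
    )

# e.g. sample_episode(random.Random(0)) -> a fresh randomized kitchen episode
```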

The agent grabs a target object and transports it to the designated location, indicated by the 3D marker. It then releases the object on command and moves on to grab the next target object.

Drawers

In addition to the object grab-and-transport task, the agent is also trained to open articulated drawers. Here, the drawers are spring-loaded: to keep a drawer from closing, the agent must pull it open and hold it open.
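For intuition, the spring mechanism can be modeled as a preloaded linear spring along the drawer's prismatic joint; the constants below are illustrative, not the simulator's actual values:

```python
def drawer_spring_force(opening: float, k: float = 40.0, preload: float = 5.0) -> float:
    """Restoring force (N) pulling a spring-loaded drawer shut.

    Modeled as a preloaded linear spring along the drawer's prismatic
    joint: F = -(preload + k * opening). The force is always negative
    (toward "closed"), so the drawer shuts on its own unless the agent
    actively holds it open.
    """
    return -(preload + k * opening)

# Applied along the drawer's sliding axis each physics step:
# opening = 0.0 m -> -5 N   (preload keeps the drawer shut)
# opening = 0.3 m -> -17 N  (harder to hold the further it is open)
```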

The top-right agent is spawned at a distance, with the drawer behind it. The agent searches for the drawer, navigates towards it, grabs the handle, and pulls it open.

Emergent Search

Here we demonstrate a scenario in which the object is not in view. Although the agent initially focuses on the marker, it quickly switches to scanning the room for the target object. Once the object has been grabbed, the agent turns straight back, remembering where it last saw the marker.

Here we demonstrate a scenario in which the marker is not in view. Once the object has been lifted, the agent scans the countertop for the target marker.

Memory and Occlusion

As the agent approaches the object, its view becomes occluded by the top cabinet, hiding the target object. Because our agent has a memory module (a GRU), it retains knowledge of the previously observed scene and grasps the object despite the occlusion.
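A minimal sketch of such a recurrent policy, assuming a PyTorch-style implementation; the encoder, layer sizes, and action dimension are placeholder assumptions, not PDC's actual architecture:

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """Visual policy with GRU memory (illustrative sketch)."""

    def __init__(self, feat_dim=512, hidden_dim=256, act_dim=69):
        super().__init__()
        # Stand-in for a convolutional image encoder producing feat_dim features.
        self.encoder = nn.Linear(feat_dim, hidden_dim)
        self.gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, act_dim)

    def forward(self, feats, h0=None):
        # feats: (batch, time, feat_dim) per-frame visual features.
        z = torch.relu(self.encoder(feats))
        # The GRU hidden state carries information about previously seen
        # parts of the scene, so a briefly occluded target can still be
        # acted upon from memory.
        out, hT = self.gru(z, h0)
        return self.head(out), hT  # per-step actions + updated memory state
```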

Failure cases

Because the model is trained to adhere to the hand indicators, it requires a method for determining which hand(s) to use. Here, we show two examples of large objects that the agent fails to grab using one hand, yet succeeds with when instructed to use both.