First, the PDC agent learns to master the tabletop task. This task serves two purposes: (1) it enables systematic evaluation of the importance of the various visual cues, and (2) through further fine-tuning, it provides a stepping stone toward more complex tasks.
In a massively parallelized physics simulator, the PDC agent learns through trial and error how to search for objects and extract the commands provided in the visual field. The target object is marked with a green 2D overlay (segmentation mask), and the target location for transporting the object is indicated by a 3D visual marker. On-screen indicators tell the agent which hand to use: white means "this hand should not be used," purple means "get ready," and blue means "make contact." The agent learns to transport objects and release them on command.
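To make the indicator convention concrete, here is a minimal sketch of how such a color scheme could be encoded. The class name, enum values, and RGB triples are our own illustration, not the system's actual code.

```python
from enum import Enum

# Hypothetical encoding of the on-screen hand indicators described above;
# names and RGB values are illustrative assumptions, not the PDC codebase.
class HandCue(Enum):
    INACTIVE = "white"   # "this hand should not be used"
    READY = "purple"     # "get ready"
    CONTACT = "blue"     # "make contact"

CUE_RGB = {
    HandCue.INACTIVE: (1.0, 1.0, 1.0),
    HandCue.READY:    (0.5, 0.0, 0.5),
    HandCue.CONTACT:  (0.0, 0.0, 1.0),
}

def indicator_colors(left: HandCue, right: HandCue):
    """Return the RGB color rendered for each hand's on-screen indicator."""
    return CUE_RGB[left], CUE_RGB[right]
```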
We show how the agent can be instructed, via visual cues, which hand to grab with. Our system supports the left hand (top), the right hand (center), and both hands (bottom).
Through similar training, the same controller can adapt to support objects from the OMOMO and OakInk datasets.
By training with diverse objects and color randomization, our agent generalizes to new, unseen objects. Notice how the agent releases the object when the visual marker reverts from blue to purple.
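As an illustration of the color randomization mentioned above, the sketch below resamples each object's color at the start of every episode; the function name and value ranges are assumptions, not the actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def randomize_object_colors(num_objects: int) -> np.ndarray:
    """Sample a fresh RGB albedo per object at episode reset (illustrative).

    Randomizing object colors during training discourages the policy from
    keying on appearance, so the command colors (green mask, purple/blue
    marker) remain reliable cues while object color does not.
    """
    return rng.uniform(low=0.0, high=1.0, size=(num_objects, 3))
```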
Once the agent has mastered the tabletop setting, we continue training it in a more complex kitchen scene.
The kitchens are diverse: they span multiple configurations (galley, L-shaped, U-shaped, island, etc.) and contain a variety of objects. We randomize the textures, the placement of scene elements (cabinets, drawers, fridge, oven, etc.), the object placements, and the initial humanoid pose.
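One plausible way to express this per-episode randomization is a sampled scene configuration, as sketched below. The field names, jitter ranges, and layout list mirror the description above but are otherwise our assumptions.

```python
import random

KITCHEN_LAYOUTS = ["galley", "l_shaped", "u_shaped", "island"]

def sample_kitchen_episode(max_objects: int = 4) -> dict:
    """Sample one randomized kitchen episode (illustrative sketch)."""
    return {
        "layout": random.choice(KITCHEN_LAYOUTS),
        "texture_seed": random.randrange(2**31),           # drives texture randomization
        "element_jitter_m": {e: random.uniform(-0.1, 0.1)  # shift scene elements
                             for e in ("cabinets", "drawers", "fridge", "oven")},
        "object_xy": [(random.uniform(0.0, 2.0), random.uniform(0.0, 1.0))
                      for _ in range(random.randint(1, max_objects))],
        "humanoid_yaw_rad": random.uniform(-3.1416, 3.1416),  # initial facing
    }
```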
The agent grabs a target object and transports it to the designated location, indicated by the 3D marker. It then releases the object on command before grabbing the next target object.
In addition to the object grab-and-transport task, the agent is also trained to open articulated drawers. Here, the drawers are spring-loaded: to keep a drawer from snapping shut, the agent must pull it open and hold it open.
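For intuition, the drawer's closing tendency can be modeled as a simple linear spring on the drawer's prismatic joint; the sketch below uses an assumed stiffness, not a value from our setup.

```python
def drawer_spring_force(opening_m: float, stiffness: float = 50.0) -> float:
    """Restoring force (N) pulling the drawer back toward closed (0 m).

    Unless the agent keeps applying an opposing pull of at least
    stiffness * opening_m, the spring snaps the drawer shut.
    """
    return -stiffness * opening_m
```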
The top-right agent is spawned at a distance, with the drawer behind it. The agent searches for the drawer, navigates toward it, grabs the handle, and pulls it open.
Here we demonstrate a scenario where the object is not in view. Although the agent initially focuses on the marker, it soon switches to scanning the room for the target object. Once the object has been grabbed, the agent turns straight back, remembering where it last saw the marker.
Here we demonstrate a scenario where the marker is not in view. Once the object has been lifted, the agent scans the countertop for the target marker.
As the agent approaches the object, its view becomes occluded by the top cabinet, hiding the target. Because our agent has a memory module (a GRU), it retains knowledge of the previously observed scene and grasps the object despite the occlusion.
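Below is a minimal sketch of such a recurrent policy, assuming a PyTorch GRU cell whose hidden state carries the previously observed scene across occluded frames; the sizes and layer choices are illustrative, not our architecture.

```python
import torch
import torch.nn as nn

class RecurrentPolicy(nn.Module):
    """GRU-based policy sketch: the hidden state acts as memory, letting
    the agent act on a target it saw before the cabinet occluded it."""

    def __init__(self, obs_dim: int = 128, hidden_dim: int = 256, act_dim: int = 32):
        super().__init__()
        self.encoder = nn.Linear(obs_dim, hidden_dim)
        self.gru = nn.GRUCell(hidden_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, act_dim)

    def forward(self, obs: torch.Tensor, h: torch.Tensor):
        h = self.gru(torch.relu(self.encoder(obs)), h)  # fold the new view into memory
        return self.head(h), h                          # action, updated memory state
```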
Because the model is trained to follow the hand indicators, it needs a way to determine which hand(s) to use. Here we show two large objects that the agent fails to grab using one hand, yet grabs successfully when instructed to use both.