Real-Time Simulated Avatar from Head-Mounted Sensors



Abstract

We present SimXR, a method for controlling a simulated avatar from information (headset pose and cameras) obtained from AR/VR headsets. Due to the challenging viewpoint of head-mounted cameras, the human body is often clipped out of view, making traditional image-based egocentric pose estimation difficult. On the other hand, headset poses provide valuable information about overall body motion, but lack fine-grained details about the hands and feet. To synergize headset poses with cameras, we control a humanoid to track headset movement while analyzing input images to determine body movement. When body parts are seen, the movements of the hands and feet are guided by the images; when unseen, the laws of physics guide the controller to generate plausible motion. We design an end-to-end method that does not rely on any intermediate representations and learns to directly map from images and headset poses to humanoid control signals. To train our method, we also propose a large-scale synthetic dataset created using camera configurations compatible with a commercially available VR headset (Quest 2) and show promising results on real-world captures. To demonstrate the applicability of our framework, we also test it on an AR headset with a forward-facing camera.
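To give a concrete picture of the end-to-end mapping described above, below is a minimal sketch of such a policy in PyTorch: stereo head-mounted images and the headset pose, together with the humanoid's simulated state, are encoded and mapped to joint actuation targets. The layer sizes, input dimensions, and PD-target output are illustrative assumptions, not the exact architecture used in SimXR.

```python
# Minimal sketch of an end-to-end control policy in the spirit of SimXR:
# head-mounted images + headset pose -> joint actuation targets.
# Layer sizes, names, and the PD-target output are illustrative assumptions.
import torch
import torch.nn as nn

class EgocentricControlPolicy(nn.Module):
    def __init__(self, num_joints=23, proprio_dim=69):
        super().__init__()
        # Small CNN encoder shared across the two head-mounted cameras.
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # MLP that fuses image features, headset pose, and humanoid state
        # into per-joint PD targets for the physics simulator.
        self.head = nn.Sequential(
            nn.Linear(64 * 2 + 7 + proprio_dim, 512), nn.ReLU(),
            nn.Linear(512, 512), nn.ReLU(),
            nn.Linear(512, num_joints * 3),
        )

    def forward(self, left_img, right_img, headset_pose, humanoid_state):
        # left_img, right_img: (B, 1, H, W) grayscale head-mounted views
        # headset_pose: (B, 7) headset position + orientation quaternion
        # humanoid_state: (B, proprio_dim) simulated body state (proprioception)
        feats = torch.cat([self.encoder(left_img), self.encoder(right_img),
                           headset_pose, humanoid_state], dim=-1)
        return self.head(feats)  # (B, num_joints * 3) joint actuation targets


if __name__ == "__main__":
    policy = EgocentricControlPolicy()
    action = policy(torch.randn(1, 1, 128, 128), torch.randn(1, 1, 128, 128),
                    torch.randn(1, 7), torch.randn(1, 69))
    print(action.shape)  # torch.Size([1, 69])
```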



  1. Real-World Results
  2. Synthetic Data Results
  3. Comparison with SOTA
  4. Failure Cases


Real World Sequences

In this section, we visualize SimXR's performance on data captured from real-world XR headsets.

Here is a highlight video showing some of the challenging motions that SimXR can perform, including fast punches, kicks, and tennis swings.

Quest

We record test sequences using the Quest 2 headset in two different scenes (office and conference room) with three different subjects. Here we visualize full sequences from the real-world dataset. SimXR successfully controls the humanoid to track the headset movement for more than 2 minutes. The red dot indicates the pose of the headset, which is often occluded by the humanoid's head.

Subject 1:

Subject 2:

Subject 3:

Aria

Here we visualize three test sequences from the Aria Digital Twin (ADT) dataset. While the motion in ADT is simpler, the viewing angle from the AR headset is much more challenging. From the videos, we can see that the humanoid raises its hands when they come into view. The red dot indicates the pose of the headset.

Synthetic

Here we visualize three test sequences from the synthetic dataset. Since we randomize the background and lighting at a per-frame level (see the sketch after the sequences below), our network is forced to ignore the background and focus only on body movement. The red dots indicate the pose of the headset.

Test sequence: dancing
Test sequence: gestures
Test sequence: mixing drinks
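For reference, here is a minimal sketch of what per-frame background and lighting randomization can look like when compositing synthetic renders. The function names, augmentation ranges, and PIL-based pipeline are assumptions for illustration, not the exact data-generation code.

```python
# Minimal sketch of per-frame background and lighting randomization.
# Function names and augmentation ranges are illustrative assumptions.
import random
from PIL import Image, ImageEnhance

def randomize_frame(body_render, body_mask, background_pool):
    """Composite a rendered body onto a random background with random lighting.

    body_render: PIL.Image of the rendered body (RGB)
    body_mask:   PIL.Image mask of the body pixels (mode "L")
    background_pool: list of file paths to candidate background images
    """
    # A different background is sampled for every frame.
    bg = Image.open(random.choice(background_pool)).convert("RGB")
    bg = bg.resize(body_render.size)

    # Per-frame lighting jitter on the rendered body (illustrative ranges).
    body = ImageEnhance.Brightness(body_render).enhance(random.uniform(0.6, 1.4))
    body = ImageEnhance.Color(body).enhance(random.uniform(0.7, 1.3))

    # Paste the relit body on top of the random background.
    bg.paste(body, (0, 0), body_mask)
    return bg
```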

Comparison with SOTA

UnrealEgo

In this section, we compare with the SOTA vision-based method, UnrealEgo. We test on challenging real-world data captures. Note that UnrealEgo and SimXR use the same training images and motion, and SimXR's model is an order of magnitude smaller. UnrealEgo estimates pose in the camera's coordinate system, while SimXR estimates pose in the global coordinate system. We can see that SimXR estimates much more stable and less jittery motion than UnrealEgo, demonstrating the feasibility and advantage of using simulated characters for this challenging task.

Comparison with UnrealEgo on Quest
Comparison with UnrealEgo on Aria

KinPoly-v

Compared to KinPoly-v, SimXR's end-to-end learning framework is more robust to fast and dynamic motion, which KinPoly-v struggles to handle. This is due to the challenge of estimating the accurate velocities required to drive a pretrained imitator.

Comparison with KinPoly-v on Quest
Comparison with KinPoly-v on Aria

Failure Cases

Failure cases include erroneous hand and foot movement due to challenging viewing angles and occlusion. The humanoid can also stumble and drag its feet to stay balanced. From time to time, we also observe micro-movements in the humanoid's hands due to imprecise pose reasoning. Out-of-distribution motions such as push-ups can also cause the humanoid to fall.

Real-world Data Failure Cases
Synthetic Data Failure Cases