We present a universal motion representation that encompasses a comprehensive range of motor skills for physics-based humanoid control. Due to the high dimensionality of humanoid control as well as the inherent difficulties of reinforcement learning, prior methods have focused on learning skill embeddings for a narrow range of movement styles (e.g. locomotion, game characters) from specialized motion datasets. This limited scope hampers their applicability to complex tasks. Our work closes this gap, significantly increasing the coverage of the motion representation space. To achieve this, we first learn a motion imitator that can imitate all of the human motion in a large, unstructured motion dataset. We then create our motion representation by distilling skills directly from the imitator. This is achieved using an encoder-decoder structure with a variational information bottleneck. Additionally, we jointly learn a prior conditioned on proprioception (the humanoid's own pose and velocities) to improve model expressiveness and sampling efficiency for downstream tasks. Sampling from the prior, we can generate long, stable, and diverse human motions. Using this latent space for hierarchical RL, we show that our policies solve tasks using natural and realistic human behavior. We demonstrate the effectiveness of our motion representation by solving generative tasks and motion tracking using VR controllers.
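Below is a minimal sketch, in PyTorch, of the structure described above: an encoder-decoder with a variational bottleneck, plus a prior conditioned on proprioception. The layer sizes, input dimensions, and class/variable names are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class LatentSpaceModel(nn.Module):
    def __init__(self, proprio_dim=69, goal_dim=360, latent_dim=32, action_dim=69):
        super().__init__()
        # Encoder: proprioception + imitation goal -> latent distribution parameters
        self.encoder = nn.Sequential(
            nn.Linear(proprio_dim + goal_dim, 512), nn.ReLU(),
            nn.Linear(512, 2 * latent_dim),  # mean and log-std
        )
        # Prior: proprioception alone -> latent distribution parameters
        self.prior = nn.Sequential(
            nn.Linear(proprio_dim, 512), nn.ReLU(),
            nn.Linear(512, 2 * latent_dim),
        )
        # Decoder: proprioception + latent -> low-level action (e.g. PD targets)
        self.decoder = nn.Sequential(
            nn.Linear(proprio_dim + latent_dim, 512), nn.ReLU(),
            nn.Linear(512, action_dim),
        )

    def forward(self, proprio, goal):
        mu_e, logstd_e = self.encoder(torch.cat([proprio, goal], -1)).chunk(2, -1)
        mu_p, logstd_p = self.prior(proprio).chunk(2, -1)
        z = mu_e + logstd_e.exp() * torch.randn_like(mu_e)  # reparameterized sample
        action = self.decoder(torch.cat([proprio, z], -1))
        # A KL term between the encoder posterior and the proprioception-conditioned
        # prior serves as the variational information bottleneck during distillation.
        return action, (mu_e, logstd_e), (mu_p, logstd_p)
```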
In this section, we visualize the motion imitation results from PHC+ and PULSE (distilled from PHC+) as a sanity check. PHC+ can imitate all of its training data and can recover from fail-states such as lying fallen on the ground. PULSE largely inherits these abilities through online distillation.
Here we show that we can dynamically switch between random motion generation and imitation, thanks to the fail-state recovery ability of PULSE. The video on the left shows that we begin with imitation, switch to random motion sampling, and then switch back to imitation.
In this section, we visualize 8 humanoids together using noise sampled from the prior. We also show that we can vary the style of the sampled motion by changing the standard deviation used for sampling. With a small standard deviation (the learned prior usually computes a small variance), the sampled motion is smooth and stable; in this case, the humanoid can sometimes stand still for a long time before starting to move again. With a larger standard deviation (e.g. 0.22), the motion becomes more erratic and energetic, and the humanoid falls down more often. Luckily, the humanoid can get back up by sampling the recovery skill. Notice that this behavior originates from training with PHC+, which has the ability to recover from fallen states; the get-up behavior comes entirely from random sampling from the prior.
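A hedged sketch of this random-generation process, assuming the `LatentSpaceModel` sketch above and a hypothetical `sim_step` helper that advances the physics simulator and returns the next proprioception; both are illustrative, not the released code.

```python
import torch

@torch.no_grad()
def sample_motion(model, proprio, sim_step, num_steps=300, std_override=None):
    frames = []
    for _ in range(num_steps):
        mu_p, logstd_p = model.prior(proprio).chunk(2, -1)
        # Either use the small learned std, or override it with a fixed value.
        std = logstd_p.exp() if std_override is None else torch.full_like(mu_p, std_override)
        z = mu_p + std * torch.randn_like(mu_p)            # noise sampled from the (scaled) prior
        action = model.decoder(torch.cat([proprio, z], -1))
        proprio = sim_step(action)                          # advance the simulator
        frames.append(proprio)
    return frames

# std_override=None keeps the small learned variance (smooth, stable motion);
# std_override=0.22 gives the more erratic, energetic behavior described above.
```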
We can enable inter-human collision and generate human-to-human interactions.
In this section, we compare with SOTA generative and latent space models, both kinematics-based (HuMoR) and physics-based (ASE and CALM). Compared to HuMoR, our method generates stable, long-term, and physically plausible motion; in our experiments, more than 50% of the motions generated by HuMoR (out of 200) were physically implausible. Compared to other physics-based latent spaces, our representation has more coverage and generates more natural and realistic motion, even though the training data is the same.
Here we visualize sampling behavior from our latent space during training (tasks: reach and speed). For all downstream tasks, we use a fixed standard deviation of 0.22 during training. Using our latent space as the action space for hierarchical RL, the agent samples realistic human behavior throughout training.
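A rough sketch of how a downstream task policy can act in this latent space, assuming the frozen decoder from the sketch above; the task-policy architecture, observation contents, and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class TaskPolicy(nn.Module):
    def __init__(self, proprio_dim=69, task_dim=16, latent_dim=32, fixed_std=0.22):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(proprio_dim + task_dim, 512), nn.ReLU(),
            nn.Linear(512, latent_dim),
        )
        self.fixed_std = fixed_std

    def act(self, proprio, task_obs):
        mu = self.net(torch.cat([proprio, task_obs], -1))  # latent-space mean
        z = mu + self.fixed_std * torch.randn_like(mu)     # fixed-std exploration noise
        return z  # fed to the frozen decoder to produce the low-level action
```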
In this section, we showcase the VR controller tracking task, where we track the 6DOF poses of the two hand controllers and the headset. This is a challenging task, as it requires the policy to perform free-form motion tracking to match the controllers. We show that our latent space has enough coverage of the motor skills in AMASS to solve this task and can be applied to real-world captures. The input is visualized as three red dots.
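A minimal sketch of what the tracking input might look like, under the assumption that the policy observes the headset and two hand-controller poses expressed relative to the humanoid's heading frame; the exact featurization here is an assumption for illustration.

```python
import numpy as np

def vr_tracking_obs(device_pos, device_rot, body_pos, body_rot, heading_inv):
    """device_pos: (3, 3) positions and device_rot: (3, 4) quaternions for the
    headset and two controllers; body_pos/body_rot: the corresponding simulated
    body parts (head, left hand, right hand); heading_inv: (3, 3) inverse heading rotation."""
    pos_err = device_pos - body_pos                      # translation error per device
    obs = np.concatenate([
        (heading_inv @ pos_err.T).T.reshape(-1),         # errors in the heading-local frame
        device_rot.reshape(-1), body_rot.reshape(-1),    # target and current orientations
    ])
    return obs
```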
Here we visualize tracking performance on the synthetic data (generated from AMASS) used to train the tracker. The input is visualized as three red dots.
In this section, we show results of applying our method to downstream generative tasks.
On the challenging terrain traversal task, our method demonstrates agile human behavior using only a simple trajectory-following reward (without any additional adversarial reward, as used in PACER). Applying ASE at 30 Hz can partially solve this task, though the motion can be jerky. CALM could not solve this task due to the lack of a style reward. Training from scratch produces unnatural motion.
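A hedged sketch of a simple trajectory-following reward of the kind referred to above (no adversarial term); the functional form, weight, and names are assumptions for illustration.

```python
import numpy as np

def trajectory_reward(root_xy, target_xy, k=2.0):
    # Exponentiated negative squared distance to the next trajectory waypoint.
    return float(np.exp(-k * np.sum((root_xy - target_xy) ** 2)))
```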
The red dot in the air indicates the reach target.
The red block on the ground indicates the target speed.
Here we show our attempt at using a vector-quantized (VQ) latent space. While the policy can achieve a high imitation success rate after distillation, it exhibits micro-jitters when standing still. This is a result of the controller rapidly switching between different discrete latent codes. One could increase the latent space size and the number of codes to ameliorate this behavior, but doing so could defeat the purpose of using a quantized latent space, as the discrete space becomes more and more expressive and closer to a continuous one.
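A small sketch of the vector-quantized variant discussed above: the encoder output is snapped to its nearest codebook entry, so small changes in the input can flip the selected code; the codebook size, dimensions, and names are assumptions.

```python
import torch

def quantize(z_e, codebook):
    """z_e: (B, D) encoder outputs; codebook: (K, D) learned codes."""
    dists = torch.cdist(z_e, codebook)   # (B, K) pairwise distances
    idx = dists.argmin(dim=-1)           # nearest-code index per sample
    z_q = codebook[idx]                  # quantized latent fed to the decoder
    return z_q, idx

# Rapid switching of `idx` between adjacent simulation steps is what produces
# the micro-jitter when the humanoid tries to stand still.
```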