MetaWorld-X

Hierarchical World Modeling via VLM-Orchestrated Experts for Humanoid Loco-Manipulation


*Indicates Equal Contribution

This video showcases our visualization.

Abstract

Learning natural, stable, and compositionally generalizable whole-body control policies for humanoid robots that perform locomotion and manipulation simultaneously (loco-manipulation) remains a key challenge in robotics. Existing reinforcement learning approaches often rely on a single monolithic policy, which leads to cross-skill interference and motion conflicts in high-degree-of-freedom systems. To address these issues, we propose MetaWorld-X, a hierarchical framework that decomposes complex control problems into Specialized Expert Policies (SEP). Each expert is trained under human motion priors through imitation-constrained reinforcement learning, ensuring natural and physically plausible motions. We further develop an Intelligent Routing Mechanism (IRM), supervised by a Vision-Language Model (VLM), that enables semantic-driven expert composition. Extensive experiments on Humanoid-bench demonstrate that MetaWorld-X significantly outperforms baselines in motion quality, training efficiency, and task success rate, validating the effectiveness of semantic-driven expert orchestration.
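To make the idea of expert composition concrete, here is a minimal Python sketch of one way a router could blend the outputs of several expert policies. The abstract does not specify how the IRM combines experts, so everything here is an assumption for illustration: the softmax-weighted blending, the names `compose_action`, `walk_expert`, and `reach_expert`, and the hand-written router logits, which in the actual system would presumably come from a learned router trained under VLM supervision.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical types: an expert maps an observation to an action vector.
Observation = Sequence[float]
Action = List[float]
Expert = Callable[[Observation], Action]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over router logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def compose_action(
    obs: Observation,
    experts: Sequence[Expert],
    router_logits: Sequence[float],
) -> Action:
    """Blend expert actions with softmax routing weights (assumed scheme)."""
    weights = softmax(router_logits)
    actions = [expert(obs) for expert in experts]
    dim = len(actions[0])
    # Weighted sum of the experts' actions, one action dimension at a time.
    return [sum(w * a[i] for w, a in zip(weights, actions)) for i in range(dim)]


# Stand-in experts that return fixed two-dimensional actions (illustrative only).
def walk_expert(obs: Observation) -> Action:
    return [1.0, 0.0]


def reach_expert(obs: Observation) -> Action:
    return [0.0, 1.0]


# Equal logits give equal weight to both experts.
blended = compose_action([0.0], [walk_expert, reach_expert], [0.0, 0.0])
```

With equal logits the two stand-in experts contribute equally. Raising one logit shifts the blended action toward that expert, which is the behavior a semantically informed router would exploit when a task calls mostly for one skill.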

MetaWorld-X framework

The framework of MetaWorld-X.

Motion retargeting:
SEP Implementation

The specific implementation framework of the SEP module.

AMASS Dataset

H2O Replay

Unitree H1 Replay

Imitation Learning
