VideoPhy-2: A Challenging Action-Centric Physical Commonsense Evaluation in Video Generation

2025 Research Project at UCLA NLP in collaboration with Google Research.

Project Website

Video generative models have the potential to serve as general-purpose physical-world simulators, but how well they capture real-world physics remains unclear. To address this, we introduce VideoPhy-2, a benchmark designed to assess physical commonsense in AI-generated videos. Built from 200 diverse actions and carefully curated prompts, our dataset enables human evaluation of semantic adherence (does the video follow the prompt?), physical commonsense (does the video obey real-world physics?), and grounding of physical rules. Our findings reveal significant limitations: even the top models achieve only 22% joint performance, i.e., videos rated high on both semantic adherence and physical commonsense, on the challenging tasks, and they struggle in particular with conservation laws such as mass and momentum. We also introduce VideoPhy-2-eval, an automated evaluation tool for scalable assessment. VideoPhy-2 highlights critical gaps in current models and paves the way for future improvements in physics-aware video generation.
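For concreteness, below is a minimal sketch of how a joint metric of this kind could be computed from per-video human ratings. The `VideoRating` fields, the 1-5 Likert scales, and the threshold of 4 are illustrative assumptions, not necessarily the benchmark's exact scoring protocol.

```python
from dataclasses import dataclass

@dataclass
class VideoRating:
    """Human ratings for one generated video (assumed 1-5 Likert scales)."""
    sa: int  # semantic adherence: does the video follow the prompt?
    pc: int  # physical commonsense: does the video obey real-world physics?

def joint_performance(ratings: list[VideoRating], threshold: int = 4) -> float:
    """Fraction of videos rated high on BOTH axes simultaneously.

    The cutoff of 4 (out of 5) is an assumption for illustration;
    the benchmark's exact threshold may differ.
    """
    if not ratings:
        return 0.0
    good = sum(1 for r in ratings if r.sa >= threshold and r.pc >= threshold)
    return good / len(ratings)

# Hypothetical example: 3 of these 4 videos follow the prompt (sa >= 4),
# but only 1 of those also respects physics (pc >= 4), so the joint
# performance is 1/4 = 25% -- lower than either axis alone.
ratings = [
    VideoRating(sa=5, pc=2),
    VideoRating(sa=4, pc=4),
    VideoRating(sa=4, pc=3),
    VideoRating(sa=2, pc=5),
]
print(f"Joint performance: {joint_performance(ratings):.0%}")  # -> 25%
```

The point of a joint metric is that it penalizes models that trade one axis for the other: a video that matches the prompt but violates physics (or vice versa) does not count as a success.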