MetaVQA

A Benchmark for Embodied Scene Understanding of Vision-Language Models