
GameplayQA: A Benchmarking Framework for Decision-Dense POV-Synced Multi-Video Understanding of 3D Virtual Agents

Yunzhe Wang Runhui Xu Kexin Zheng Tianyi Zhang Jayavibhav Niranjan Kogundi Soham Hans Volkan Ustun

Abstract

Multimodal LLMs are increasingly deployed as perceptual backbones for autonomous agents in 3D environments, from robotics to virtual worlds. These applications require agents to perceive rapid state changes, attribute actions to the correct entities, and reason about concurrent multi-agent behaviors from a first-person perspective, capabilities that existing benchmarks do not adequately evaluate. We introduce GameplayQA, a framework for evaluating agentic-centric perception and reasoning through video understanding. Specifically, we densely annotate multiplayer 3D gameplay videos at 1.22 labels/second, with time-synced, concurrent captions of states, actions, and events structured around a triadic system of Self, Other Agents, and the World (a natural decomposition for multi-agent environments). From these annotations, we distill 2.4K diagnostic QA pairs organized into three levels of cognitive complexity, accompanied by a structured distractor taxonomy that enables fine-grained analysis of where models hallucinate. Evaluation of frontier MLLMs reveals a substantial gap from human performance, with common failures in temporal and cross-video grounding, agent-role attribution, and handling the decision density of the game. We hope GameplayQA stimulates future research at the intersection of embodied AI, agentic perception, and world modeling.

One-sentence Summary

Researchers from the University of Southern California introduce GameplayQA, a benchmark that densely annotates multiplayer 3D gameplay to evaluate multimodal LLMs on agentic perception. The framework employs a triadic Self-Other-World system to expose critical failures in temporal grounding and agent attribution that prior benchmarks miss.

Key Contributions

  • The paper introduces GameplayQA, a framework for evaluating agentic-centric perception and reasoning that densely annotates multiplayer 3D gameplay videos at 1.22 labels per second with time-synced captions structured around a triadic system of Self, Other Agents, and the World.
  • This work presents 2.4K diagnostic QA pairs organized into three levels of cognitive complexity and a structured distractor taxonomy designed to enable fine-grained analysis of where models hallucinate.
  • Experiments demonstrate a substantial performance gap between frontier MLLMs and human benchmarks, revealing specific failures in temporal grounding, agent-role attribution, and handling the high decision density of the game environment.

Introduction

Multimodal LLMs are increasingly deployed as perceptual backbones for autonomous agents in 3D environments such as robotics and virtual worlds, where they must track rapid state changes and reason about concurrent multi-agent behaviors from a first-person perspective. Existing benchmarks fail to evaluate these capabilities because they rely on slow-paced, passive observations that lack the high-frequency decision loops and dense embodiment required for real-world agency. To address this gap, the authors introduce GameplayQA, a framework that densely annotates multiplayer 3D gameplay videos with time-synced captions structured around a triadic system of Self, Other Agents, and the World. They refine these annotations into 2.4K diagnostic QA pairs organized by cognitive complexity, together with a structured distractor taxonomy that pinpoints specific failure modes such as temporal grounding errors and agent-role attribution mistakes.

Dataset

  • Dataset Composition and Sources The authors introduce GameplayQA, a benchmark derived from synchronized gameplay footage of 9 commercial multiplayer games spanning diverse genres. Data sources include YouTube, Twitch streams, and existing datasets. For multi-POV scenarios, the team identified streamer groups playing in the same match and manually aligned their individual recordings to create temporally synchronized video sets.

  • Key Details for Each Subset The final benchmark consists of 2,365 high-quality QA pairs generated from 2,709 true labels and 1,586 distractor labels across 2,219.41 seconds of footage. The data is organized around a Self-Other-World entity decomposition covering six primitive types: Self-Action, Self-State, Other-Action, Other-State, World-Object, and World-Event. Questions are categorized into three cognitive levels: Level 1 for basic perception, Level 2 for temporal reasoning, and Level 3 for cross-video understanding.
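The Self-Other-World decomposition above can be sketched as a small label schema. This is an illustrative guess at what a time-synced label record might look like; the field names, `TimedLabel` class, and `decision_density` helper are assumptions, not the authors' actual data format:

```python
from dataclasses import dataclass

# The six primitive entity types from the Self-Other-World decomposition.
ENTITY_TYPES = {
    "Self-Action", "Self-State",
    "Other-Action", "Other-State",
    "World-Object", "World-Event",
}

@dataclass
class TimedLabel:
    """Hypothetical time-synced annotation on one POV recording."""
    video_id: str      # which POV recording the label belongs to
    start: float       # onset in seconds within the video
    end: float         # offset in seconds
    entity_type: str   # one of the six primitive types above
    caption: str       # short natural-language description

    def __post_init__(self):
        assert self.entity_type in ENTITY_TYPES, self.entity_type
        assert 0 <= self.start <= self.end

def decision_density(labels, duration_s):
    """Labels per second over a clip (the paper reports ~1.22)."""
    return len(labels) / duration_s
```

Concurrent labels simply overlap in time on the same track, which is what makes the timeline "multi-track" rather than a flat sequence of captions.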

  • Data Usage and Generation Strategy The authors employ a combinatorial template-based algorithm to generate the dataset, initially producing 399,214 candidate pairs before downsampling to 4,000 to ensure balanced category coverage. This process systematically combines verified labels across five dimensions including video count, entity type, and distractor type. The resulting benchmark is used to evaluate MLLMs on fine-grained hallucination analysis, with distractors specifically designed to diagnose failures in lexical, temporal, role, or cross-video reasoning.
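The combinatorial generation and balanced downsampling described above can be sketched as a cross-product over dimension values followed by per-category capping. The dimension names and values below are invented stand-ins for the paper's five dimensions, and the cap is arbitrary; this only illustrates the mechanism:

```python
import itertools
import random
from collections import defaultdict

# Placeholder dimension values; the real generator crosses five dimensions
# (video count, entity type, distractor type, context target, question form).
DIMENSIONS = {
    "video_count": [1, 2],
    "entity_type": ["Self-Action", "Other-State", "World-Event"],
    "distractor_type": ["lexical", "temporal", "role", "cross-video"],
}

def generate_candidates():
    """Yield every combination of dimension values as a candidate spec."""
    keys = list(DIMENSIONS)
    for combo in itertools.product(*(DIMENSIONS[k] for k in keys)):
        yield dict(zip(keys, combo))

def balanced_downsample(candidates, per_category, key="distractor_type", seed=0):
    """Keep at most `per_category` candidates per value of `key`."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for cand in candidates:
        buckets[cand[key]].append(cand)
    kept = []
    for items in buckets.values():
        rng.shuffle(items)
        kept.extend(items[:per_category])
    return kept
```

In the paper's pipeline the analogous step shrinks 399,214 candidates to 4,000 while keeping category coverage balanced; here the same capping logic operates on a toy grid.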

  • Processing and Annotation Workflow The pipeline utilizes a dense multi-track timeline captioning approach with a decision density of approximately 1.22 labels per second. Annotation follows a two-stage human-in-the-loop workflow where Gemini-3-Pro generates initial candidates that graduate student annotators verify and refine. The process includes a language prior filtering step where questions solvable without video input are removed, followed by human evaluation to resolve ambiguities and ensure semantic accuracy.
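The language-prior filtering step mentioned above amounts to discarding any question a blind (text-only) model can answer reliably. A minimal sketch, assuming a `text_only_answer` callable that stands in for querying an LLM without video input:

```python
def language_prior_filter(qa_pairs, text_only_answer, n_trials=3):
    """Drop QA pairs that a text-only model answers correctly every time.

    `text_only_answer(question, options)` is a hypothetical stand-in for
    an LLM queried with no video input. A pair survives only if the blind
    model fails at least once, so the remaining questions require genuine
    visual grounding rather than language priors.
    """
    kept = []
    for qa in qa_pairs:
        correct = sum(
            text_only_answer(qa["question"], qa["options"]) == qa["answer"]
            for _ in range(n_trials)
        )
        if correct < n_trials:  # not reliably solvable from text alone
            kept.append(qa)
    return kept
```

Human evaluation then handles what no automatic filter can: resolving ambiguous wording and verifying that the answer key matches the footage.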

Method

The authors establish a structured methodology for gameplay question answering, beginning with a detailed question taxonomy. Refer to the taxonomy diagram, which outlines the five orthogonal dimensions that define each question: Number of Videos, Context Target, Entity Types, Distractors, and Question Form. This hierarchical structure allows systematic generation of diverse query types, ranging from simple identification to complex temporal localization.

Central to the entity definition is the Self-Other-World perspective. As shown in the figure below, this framework categorizes entities into three groups: the player (Self), teammates or NPCs (Other), and the game environment (World). This tripartite division ensures that questions cover a broad spectrum of gameplay interactions and observations.

The dataset construction employs a combinatorial QA generation pipeline. Refer to the generation pipeline diagram, which details the process from temporal event annotation to final question templating. The system extracts semantic segments such as Self Actions and World Objects, then applies template codes to formulate multiple-choice questions. To maintain data integrity, a rigorous quality-assurance phase is applied: an automated language-priors filter eliminates questions solvable without visual input, and human evaluation validates the answer key and timeline accuracy. For model evaluation, the authors use an LLM-as-a-judge to parse and extract the selected option from each model's raw output, ensuring consistent scoring across benchmarked models.
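The option-extraction step can be approximated with a rule-based parser. The paper uses an LLM judge for this; the regex fallback below is an assumption about the kind of normalization such a judge performs, not the authors' implementation:

```python
import re

def extract_choice(raw_output, options="ABCD"):
    """Pull the selected option letter out of free-form model output.

    A lightweight rule-based stand-in for the LLM-as-a-judge parsing
    step; a judge model would handle phrasings this regex misses.
    """
    # Prefer explicit phrasings like "Answer: B" or "answer (B)".
    m = re.search(r"[Aa]nswer\s*[:\-]?\s*\(?([%s])\)?" % options, raw_output)
    if m:
        return m.group(1)
    # Otherwise take the first standalone option letter, if any.
    m = re.search(r"\b([%s])\b" % options, raw_output)
    return m.group(1) if m else None
```

Returning `None` for unparseable outputs lets the harness count them as incorrect rather than silently guessing, which keeps scoring consistent across models with different answer styles.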

Experiment

  • Evaluation of 16 open-source and proprietary MLLMs on GameplayQA validates a three-level cognitive hierarchy where performance consistently degrades from basic perception to temporal reasoning and cross-video understanding.
  • Experiments identify Occurrence Count and Cross-Video Ordering as critical bottlenecks, confirming that current architectures struggle with sustained temporal attention and aligning events across multiple perspectives.
  • Error analysis reveals that models handle static visual inputs better than temporal or cross-video distractors, with performance dropping significantly in fast-paced, decision-dense environments and when tracking other agents.
  • Ablation studies demonstrate that genuine visual grounding is essential for task success, while temporal ordering is specifically critical for higher-level reasoning tasks but less so for basic perception.
  • Cross-domain transfer experiments on autonomous driving and human collaboration datasets confirm that the benchmark framework generalizes to real-world spatiotemporal tasks while preserving relative difficulty rankings and model performance trends.
