Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs

1University of Wisconsin-Madison, 2Meta, 3UIUC
CVPR 2025
Teaser image

(Right) Our VideoMindPalace represents video data as a layered, topologically structured graph, where nodes capture spatial concepts (e.g., objects, activity zones, rooms) and edges encode spatiotemporal relationships, layout relations, and human-object interactions. The graph can be serialized to JSON and used as input to text-only LLMs. (Center) VideoTree extracts query-relevant information by organizing videos as tree structures, with deeper branches capturing finer, query-specific details; a captioner then generates video descriptions from this structure, enabling the LLM to reason over long videos. (Left) Caption-based approaches such as LLoVi process videos in temporal order: visual captioners sequentially generate textual descriptions within each temporal sliding window, which the LLM then aggregates for reasoning.

Abstract

Long-form video understanding with Large Vision Language Models is challenged by the need to analyze temporally dispersed yet spatially concentrated key moments within limited context windows. In this work, we introduce VideoMindPalace, a new framework inspired by the “Mind Palace” memory technique, which organizes critical video moments into a topologically structured semantic graph. VideoMindPalace organizes key information through (i) hand-object tracking and interaction, (ii) clustered activity zones representing specific areas of recurring activities, and (iii) environment layout mapping, allowing natural language parsing by LLMs to provide grounded insights on spatio-temporal and 3D context. In addition, we propose the Video MindPalace Benchmark (VMB) to assess human-like reasoning, including spatial localization, temporal reasoning, and layout-aware sequential understanding. Evaluated on VMB and established video QA datasets, including EgoSchema, NExT-QA, IntentQA, and the Active Memories Benchmark, VideoMindPalace demonstrates notable gains in spatio-temporal coherence and human-aligned reasoning, advancing long-form video analysis capabilities in VLMs.

Method

Method diagram
Overview of our VideoMindPalace framework. 1) VideoMindPalace is a three-layered graph with nodes representing spatial concepts (e.g., objects, zones, rooms) and edges capturing spatiotemporal relationships. Layer 1 - Human and Object: nodes represent the human and detected objects, with edges denoting spatiotemporal connections and interactions. Layer 2 - Activity Zones: nodes represent specific activity zones, with edges showing 3D spatial relationships. Layer 3 - Scene Layout: nodes represent rooms, with edges encoding relative distances. 2) The graph can be serialized to JSON and used as input to text-only LLMs. The model’s responses are grounded in the physical scene, enabling it to identify locations, locate items of interest, and understand the topological structure of the space.
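
To make the JSON serialization concrete, below is a minimal Python sketch of what a three-layer graph of this kind might look like. The field names (id, label, relation, time, etc.) and values are illustrative assumptions, not the exact schema used in the paper.

import json

# A minimal sketch (not the paper's actual schema) of a three-layer
# VideoMindPalace-style graph. All field names and values are illustrative.
mind_palace_graph = {
    "layer1_human_and_object": {
        "nodes": [
            {"id": "person_0", "label": "camera wearer"},
            {"id": "obj_3", "label": "mug", "time": [12.0, 18.5]},  # visible interval (s)
        ],
        "edges": [
            # Spatiotemporal interaction between the human and an object.
            {"src": "person_0", "dst": "obj_3", "relation": "holds", "time": [13.2, 17.9]},
        ],
    },
    "layer2_activity_zones": {
        "nodes": [
            {"id": "zone_1", "label": "coffee-making area"},
            {"id": "zone_2", "label": "sink area"},
        ],
        "edges": [
            # 3D spatial relationship between activity zones.
            {"src": "zone_1", "dst": "zone_2", "relation": "adjacent_to"},
        ],
    },
    "layer3_scene_layout": {
        "nodes": [
            {"id": "room_0", "label": "kitchen"},
            {"id": "room_1", "label": "living room"},
        ],
        "edges": [
            # Relative distance between rooms (meters, illustrative).
            {"src": "room_0", "dst": "room_1", "relation": "relative_distance", "value": 4.2},
        ],
    },
}

# Serialize so the whole structure can be placed into a text-only LLM prompt.
graph_json = json.dumps(mind_palace_graph, indent=2)
print(graph_json)

In this form, the nodes and edges stay compact and directly queryable as text, which is what lets the LLM ground its answers in specific graph elements rather than raw frames.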

Video MindPalace Benchmark

Qualitative results figure
Qualitative results of VideoMindPalace on the VMB benchmark, with an example for each question type. To explore how VideoMindPalace successfully answers these questions, we prompt GPT-4 to identify the specific parts of the graph that provide sufficient information to answer each question accurately.
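
As a rough illustration of how such a graph can be paired with a question for a text-only LLM, here is a hypothetical prompt-construction helper in Python. The build_prompt name and the instruction wording are assumptions; the page only states that the JSON graph and the query are given to GPT-4, and the actual API call is omitted.

import json

def build_prompt(graph: dict, question: str) -> str:
    """Assemble a hypothetical prompt pairing the serialized graph with a question.

    The instruction wording is an assumption; the page only states that the JSON
    graph plus the query are given to a text-only LLM (here, GPT-4).
    """
    return (
        "You are given a structured graph of a video in JSON format. "
        "Layer 1 covers humans and objects, Layer 2 activity zones, Layer 3 the room layout.\n\n"
        f"Graph:\n{json.dumps(graph, indent=2)}\n\n"
        f"Question: {question}\n"
        "Answer using only information grounded in the graph, "
        "and name the node and edge ids you relied on."
    )

# Toy usage; any graph with the layered structure sketched above would work.
toy_graph = {
    "layer1_human_and_object": {
        "nodes": [{"id": "obj_3", "label": "mug"}],
        "edges": [{"src": "person_0", "dst": "obj_3", "relation": "places_on_counter"}],
    }
}
prompt = build_prompt(toy_graph, "Where did I leave the mug?")
# `prompt` would then be sent to a text-only LLM such as GPT-4.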

Main Results

Results table
Performance on standard video QA benchmarks, our Video MindPalace Benchmark (VMB), and the Active Memories Benchmark.

Demo

Demo figure
Graph representation of a 30-second video produced by our VideoMindPalace, which organizes information across three semantic layers.

Ablations

Image 1
Comparison of performance between temporal window segmentation and location-based clustering across different video lengths (Short, Medium, and Long).
Image 2
Ablation of each component in our pipeline on the EgoSchema and VMB benchmarks, obtained by replacing tools with weaker or stronger alternatives.
Image 3
The table compares token counts for representing videos of various lengths. LLoVi uses a sliding-window approach, resulting in significantly higher token counts as video length increases, whereas our graph achieves a more concise representation, particularly for longer videos.

BibTeX

@article{huang2025building,
  title={Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs},
  author={Huang, Zeyi and Ji, Yuyang and Wang, Xiaofang and Mehta, Nikhil and Xiao, Tong and Lee, Donghyun and Vanvalkenburgh, Sigmund and Zha, Shengxin and Lai, Bolin and Yu, Licheng and others},
  journal={arXiv preprint arXiv:2501.04336},
  year={2025}
}