Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs

1University of Wisconsin-Madison, 2Meta, 3UIUC
CVPR 2025
Teaser image

(Right) Our VideoMindPalace represents video data as a layered, topologically structured graph, where nodes capture spatial concepts (e.g., objects, activity zones, rooms) and edges encode spatiotemporal relationships, layout relations, and human-object interactions. The graph can be serialized to JSON and used as input to text-only LLMs. (Center) VideoTree extracts query-relevant information by organizing videos as tree structures, with deeper branches capturing finer, query-specific details; a captioner then generates video descriptions from this structure, enabling the LLM to reason over long videos. (Left) Caption-based approaches such as LLoVi process videos in temporal order: visual captioners sequentially generate textual descriptions within each temporal sliding window, which the LLM then aggregates for reasoning.

Abstract

Long-form video understanding with Large Vision Language Models is challenged by the need to analyze temporally dispersed yet spatially concentrated key moments within limited context windows. In this work, we introduce VideoMindPalace, a new framework inspired by the “Mind Palace” memory technique, which organizes critical video moments into a topologically structured semantic graph. VideoMindPalace organizes key information through (i) hand-object tracking and interaction, (ii) clustered activity zones representing specific areas of recurring activities, and (iii) environment layout mapping, allowing natural language parsing by LLMs to provide grounded insights on spatio-temporal and 3D context. In addition, we propose the Video MindPalace Benchmark (VMB) to assess human-like reasoning, including spatial localization, temporal reasoning, and layout-aware sequential understanding. Evaluated on VMB and established video QA datasets, including EgoSchema, NExT-QA, IntentQA, and the Active Memories Benchmark, VideoMindPalace demonstrates notable gains in spatio-temporal coherence and human-aligned reasoning, advancing long-form video analysis capabilities in VLMs.

Method

Method diagram
Overview of our VideoMindPalace framework. 1) VideoMindPalace is a three-layered graph with nodes representing spatial concepts (e.g., objects, zones, rooms) and edges capturing spatiotemporal relationships. Layer 1 - Human and Object: nodes represent the human and detected objects, with edges denoting spatiotemporal connections and interactions. Layer 2 - Activity Zones: nodes represent specific activity zones, with edges showing 3D spatial relationships. Layer 3 - Scene Layout: nodes represent rooms, with edges encoding relative distances. 2) The graph can be serialized to JSON and used as input to text-only LLMs. The model’s responses are grounded in the physical scene, enabling it to identify locations, locate items of interest, and understand the topological structure of the space.
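
To make the JSON serialization concrete, below is a minimal Python sketch of what a three-layer graph of this kind might look like. The field names (id, label, relation, time, etc.) and values are illustrative assumptions, not the exact schema used in the paper.

import json

# A minimal sketch (not the paper's actual schema) of a three-layer
# VideoMindPalace-style graph. All field names and values are illustrative.
mind_palace_graph = {
    "layer1_human_and_object": {
        "nodes": [
            {"id": "person_0", "label": "camera wearer"},
            {"id": "obj_3", "label": "mug", "time": [12.0, 18.5]},  # visible interval (s)
        ],
        "edges": [
            # Spatiotemporal interaction between the human and an object.
            {"src": "person_0", "dst": "obj_3", "relation": "holds", "time": [13.2, 17.9]},
        ],
    },
    "layer2_activity_zones": {
        "nodes": [
            {"id": "zone_1", "label": "coffee-making area"},
            {"id": "zone_2", "label": "sink area"},
        ],
        "edges": [
            # 3D spatial relationship between activity zones.
            {"src": "zone_1", "dst": "zone_2", "relation": "adjacent_to"},
        ],
    },
    "layer3_scene_layout": {
        "nodes": [
            {"id": "room_0", "label": "kitchen"},
            {"id": "room_1", "label": "living room"},
        ],
        "edges": [
            # Relative distance between rooms (meters, illustrative).
            {"src": "room_0", "dst": "room_1", "relation": "relative_distance", "value": 4.2},
        ],
    },
}

# Serialize so the whole structure can be placed into a text-only LLM prompt.
graph_json = json.dumps(mind_palace_graph, indent=2)
print(graph_json)

In this form, the nodes and edges stay compact and directly queryable as text, which is what lets the LLM ground its answers in specific graph elements rather than raw frames.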

Video MindPalace Benchmark

Qualitative results figure
Qualitative results of VideoMindPalace on the VMB benchmark, with an example for each question type. To explore how VideoMindPalace successfully answers these questions, we prompt GPT-4 to identify the specific parts of the graph that provide sufficient information to answer each question accurately.
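
As a rough illustration of how such a graph can be paired with a question for a text-only LLM, here is a hypothetical prompt-construction helper in Python. The build_prompt name and the instruction wording are assumptions; the page only states that the JSON graph and the query are given to GPT-4, and the actual API call is omitted.

import json

def build_prompt(graph: dict, question: str) -> str:
    """Assemble a hypothetical prompt pairing the serialized graph with a question.

    The instruction wording is an assumption; the page only states that the JSON
    graph plus the query are given to a text-only LLM (here, GPT-4).
    """
    return (
        "You are given a structured graph of a video in JSON format. "
        "Layer 1 covers humans and objects, Layer 2 activity zones, Layer 3 the room layout.\n\n"
        f"Graph:\n{json.dumps(graph, indent=2)}\n\n"
        f"Question: {question}\n"
        "Answer using only information grounded in the graph, "
        "and name the node and edge ids you relied on."
    )

# Toy usage; any graph with the layered structure sketched above would work.
toy_graph = {
    "layer1_human_and_object": {
        "nodes": [{"id": "obj_3", "label": "mug"}],
        "edges": [{"src": "person_0", "dst": "obj_3", "relation": "places_on_counter"}],
    }
}
prompt = build_prompt(toy_graph, "Where did I leave the mug?")
# `prompt` would then be sent to a text-only LLM such as GPT-4.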

Main Results

Results table
Performance on standard video QA benchmarks, our Video MindPalace Benchmark (VMB), and the Active Memories Benchmark.

Demo

Demo figure
Graph representation of a 30-second video produced by our VideoMindPalace, which organizes information across three semantic layers.

Ablations

Image 1
Comparison of performance between temporal window segmentation and location-based clustering across different video lengths (Short, Medium, and Long).
Image 2
Ablation of each component in our pipeline on the EgoSchema and VMB benchmarks, obtained by replacing tools with weaker or stronger alternatives.
Image 3
The table compares token counts for representing videos of various lengths. LLoVi uses a sliding-window approach, resulting in significantly higher token counts as video length increases, whereas our graph achieves a more concise representation, particularly for longer videos.

BibTeX

@article{huang2025building,
  title={Building a Mind Palace: Structuring Environment-Grounded Semantic Graphs for Effective Long Video Analysis with LLMs},
  author={Huang, Zeyi and Ji, Yuyang and Wang, Xiaofang and Mehta, Nikhil and Xiao, Tong and Lee, Donghyun and Vanvalkenburgh, Sigmund and Zha, Shengxin and Lai, Bolin and Yu, Licheng and others},
  journal={arXiv preprint arXiv:2501.04336},
  year={2025}
}