MovieChat+: Elevating Zero-Shot Long Video Understanding to New Heights

Synced
Published in SyncedReview · 3 min read · May 1, 2024


In recent advancements, the fusion of video foundation models and large language models has emerged as a promising avenue for constructing robust video understanding systems, transcending the constraints of predefined vision tasks. However, while these methods exhibit commendable performance on shorter videos, they encounter significant hurdles when confronted with longer video sequences. The escalating computational complexity and memory demands inherent in sustaining long-term temporal connections pose formidable challenges.

In a new paper MovieChat+: Question-aware Sparse Memory for Long Video Question Answering, a pioneering research group introduces MovieChat, a novel framework tailored to accommodate extensive video durations exceeding 10,000 frames. This innovative system achieves unprecedented performance in deciphering prolonged video content.

The team outlines their pivotal contributions as follows:

  1. Introduction of MovieChat: MovieChat represents the inaugural framework expressly crafted to support the analysis of protracted videos, leveraging pre-trained Multimodal Large Language Models (MLLMs) and employing a zero-shot, training-free memory consolidation mechanism.
  2. Enhancement with MovieChat+: Building upon the foundation of MovieChat, the upgraded version, MovieChat+, refines memory efficiency by introducing a vision-question matching-based memory consolidation technique. This enhancement not only eclipses the performance of the initial iteration but also surpasses prior state-of-the-art results in both short and long video question-answering tasks.
  3. Launch of MovieChat-1K Benchmark: The research group releases the pioneering long-video understanding benchmark, MovieChat-1K, now featuring 2,000 temporal labels, an expansion over its precursor. Rigorous quantitative assessments and comprehensive case studies substantiate the framework's strengths in both understanding capability and inference cost.

MovieChat employs a sliding window mechanism to extract video features, subsequently encoding them into token representations. These tokens are sequentially integrated into the short-term memory frame by frame. Upon reaching the predetermined threshold, the earliest tokens are amalgamated and consolidated into the long-term memory.
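The article does not spell out how the consolidation step works, but the overall flow can be sketched in a few lines of Python. The snippet below assumes a simple strategy of repeatedly averaging the most similar adjacent frame features until the short-term buffer shrinks to a target size; the function name `consolidate`, the buffer capacities, and the merge-by-averaging rule are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def consolidate(frames: torch.Tensor, target_len: int) -> torch.Tensor:
    """Greedily merge the most similar adjacent frame tokens until only
    target_len frames remain. Illustrative sketch, not the authors' code."""
    frames = list(frames)
    while len(frames) > target_len:
        # cosine similarity of every adjacent pair of frame features
        sims = torch.stack([
            F.cosine_similarity(frames[i], frames[i + 1], dim=0)
            for i in range(len(frames) - 1)
        ])
        i = int(sims.argmax())
        # replace the most similar adjacent pair with their average
        frames[i:i + 2] = [(frames[i] + frames[i + 1]) / 2]
    return torch.stack(frames)

# Sliding-window usage: push per-frame features into short-term memory,
# then consolidate into long-term memory once the threshold is reached.
short_term, long_term = [], []
SHORT_CAP, MERGE_TO = 16, 4               # hypothetical sizes
for feat in torch.randn(64, 768):         # stand-in for streamed frame features
    short_term.append(feat)
    if len(short_term) >= SHORT_CAP:
        long_term.append(consolidate(torch.stack(short_term), MERGE_TO))
        short_term = []
```

Merging adjacent rather than arbitrary frames keeps the compressed memory roughly in temporal order while bounding its size, which is what allows the pipeline to stream videos of 10,000+ frames without the memory footprint growing with video length.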

The proposed methodology incorporates two distinctive inference modes: the global mode, relying exclusively on the long-term memory, and the breakpoint mode, which incorporates the current short-term memory alongside the long-term memory, facilitating focused video comprehension at specific temporal junctures. Following projection, the video representation interfaces with a large language model to engage with users effectively.
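The two inference modes differ only in which memory is handed to the language model. A minimal sketch of that switch, assuming the long-term memory is a list of consolidated token tensors and the short-term memory a list of per-frame features as in the previous snippet (names and signatures are illustrative, not from the repository):

```python
import torch

def build_video_context(long_term, short_term, mode="global"):
    """Assemble the memory tokens handed to the projection layer and LLM.
    Hedged sketch; names and signatures are illustrative."""
    if mode == "global":
        # global mode: answer from the consolidated long-term memory only
        memory = torch.cat(long_term, dim=0)
    elif mode == "breakpoint":
        # breakpoint mode: also include the current short-term memory,
        # for questions about a specific moment in the video
        memory = torch.cat(long_term + [torch.stack(short_term)], dim=0)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return memory  # then: projector(memory) -> tokens prepended to the LLM prompt
```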

Furthermore, the team introduces MovieChat+, wherein they refine the vision-question matching-based memory consolidation mechanism to more effectively align predictions of visual language models with relevant visual cues.
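The article does not detail how the matching works, but the gist, scoring frames against the question and spending the limited memory budget on the relevant ones, can be sketched as follows. The helper `question_aware_filter`, the shared embedding space, and the `keep_ratio` threshold are assumptions made for illustration:

```python
import torch
import torch.nn.functional as F

def question_aware_filter(frames: torch.Tensor,
                          question_emb: torch.Tensor,
                          keep_ratio: float = 0.5) -> torch.Tensor:
    """Drop the frames least relevant to the question before consolidation.
    Illustrative sketch of the question-aware idea, assuming the question has
    been embedded into the same space as the frame features."""
    scores = F.cosine_similarity(frames, question_emb.unsqueeze(0), dim=-1)
    k = max(1, int(keep_ratio * frames.size(0)))
    keep = scores.topk(k).indices.sort().values   # preserve temporal order
    return frames[keep]
```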

MovieChat represents a significant breakthrough in tackling the challenges associated with analyzing extended video sequences, achieving state-of-the-art performance in long video comprehension. It outperforms existing systems that are limited to videos with far fewer frames, signaling a paradigm shift in video understanding technology.

The code is available on the project’s GitHub. The paper MovieChat+: Question-aware Sparse Memory for Long Video Question Answering is on arXiv.

Author: Hecate He | Editor: Chain Zhang

We know you don’t want to miss any news or research breakthroughs. Subscribe to our popular newsletter Synced Global AI Weekly to get weekly AI updates.
