Improving Video Highlight Detection via Unsupervised Learning and Test-Time Adaptation
Date
2025-06-24
Authors
ORCID
0000-0001-8246-632X
Type
Thesis
Degree Level
Masters
Abstract
With the exponential surge of video content driven by the ubiquity of digital cameras and ever-expanding social networks, there is a pressing need for efficient automated methods to access, manage, and curate video data. Consequently, video highlight detection has emerged as an active research area in the computer vision community. Its objective is to automatically extract the key segments of a long input video and assemble them into a short highlight video containing the most important and exciting moments, giving viewers quick access to a video's most significant content while streamlining curation and improving engagement.

In this thesis, we tackle two major limitations of current video highlight detection approaches. First, most existing methods depend heavily on expensive, manually annotated frame-level highlight labels for supervised training. To remove this bottleneck, we focus on unsupervised video highlight detection, eliminating the reliance on costly human annotations. We propose a novel unsupervised audio-visual highlight detection framework that exploits recurring patterns inherent in both the audio and visual modalities of videos as self-supervisory signals for training the highlight detection model.

Second, existing models often generalize poorly to unseen test videos: because they rely on generic highlight detectors trained on fixed datasets, they perform suboptimally under domain shifts and miss video-specific characteristics not captured during training. To address this, we introduce test-time adaptation (TTA) for video highlight detection. Specifically, we propose Highlight-TTA, a TTA framework that uses a self-supervised auxiliary task, called cross-modality hallucinations, within a meta-auxiliary training scheme to improve both adaptation and highlight detection performance on unseen test videos.

Extensive experiments and ablation studies on benchmark video highlight detection datasets demonstrate the effectiveness of our unsupervised learning approach and the proposed TTA framework. We believe this thesis has the potential to broadly influence research on annotation-efficient and more generalizable techniques applicable to a wider spectrum of video understanding tasks, including video moment localization, video captioning, and activity recognition.
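To make the test-time adaptation idea concrete, the following is a minimal sketch in PyTorch. It is an assumption-laden illustration, not the thesis's Highlight-TTA implementation: the toy model, the visual-to-audio "hallucination" head standing in for the cross-modality hallucinations task, and all hyperparameters are hypothetical, and the meta-auxiliary training scheme that makes such adaptation effective is omitted entirely.

```python
# Minimal test-time adaptation (TTA) sketch. Every module name and
# hyperparameter here is an illustrative assumption, not the thesis's method.
import copy
import torch
import torch.nn as nn

class HighlightModel(nn.Module):
    """Toy model: a shared encoder feeds a highlight-scoring head and a
    self-supervised 'hallucination' head that predicts a segment's audio
    features from its visual features."""
    def __init__(self, vis_dim=512, aud_dim=128, hid=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(vis_dim, hid), nn.ReLU())
        self.score_head = nn.Linear(hid, 1)          # per-segment highlight score
        self.halluc_head = nn.Linear(hid, aud_dim)   # visual -> audio prediction

    def forward(self, vis_feats):
        h = self.encoder(vis_feats)
        return self.score_head(h).squeeze(-1), self.halluc_head(h)

def adapt_and_score(model, vis_feats, aud_feats, steps=5, lr=1e-4):
    """Adapt a copy of the generic model to one test video by minimizing the
    auxiliary hallucination loss (no highlight labels needed), then score it."""
    adapted = copy.deepcopy(model)  # leave the generic model untouched
    opt = torch.optim.SGD(adapted.parameters(), lr=lr)
    for _ in range(steps):
        _, aud_pred = adapted(vis_feats)
        aux_loss = nn.functional.mse_loss(aud_pred, aud_feats)
        opt.zero_grad()
        aux_loss.backward()  # updates the shared encoder via the auxiliary task
        opt.step()
    with torch.no_grad():
        scores, _ = adapted(vis_feats)
    return scores  # one highlight score per video segment

# Usage: one test video as 40 segments with precomputed 512-d visual
# and 128-d audio features (random stand-ins here).
model = HighlightModel()
vis, aud = torch.randn(40, 512), torch.randn(40, 128)
print(adapt_and_score(model, vis, aud).shape)  # torch.Size([40])
```

The point the sketch captures is that the auxiliary loss is computed from the test video's own audio and visual features, so the model can specialize to each unseen video at inference time without any highlight annotations.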
Keywords
Video highlight detection, Unsupervised learning, Test-time adaptation
Degree
Master of Science (M.Sc.)
Department
Computer Science
Program
Computer Science