Improving Video Highlight Detection via Unsupervised Learning and Test-Time Adaptation

Date

2025-06-24

ORCID

0000-0001-8246-632X

Type

Thesis

Degree Level

Masters

Abstract

With the exponential surge of video content driven by the ubiquity of digital cameras and ever-expanding social networks, there is a pressing need for efficient automated methods to access, manage, and curate video data. Consequently, video highlight detection has emerged as an active research area in the computer vision community. Its objective is to automatically extract key segments from a long input video and assemble them into a short highlight video containing the most important and exciting moments. This technology enhances user engagement and streamlines content curation by providing quick access to a video's most significant moments. In this thesis, we tackle two major limitations of current video highlight detection approaches to advance the state of the art.

First, most existing methods depend heavily on expensive, manually annotated frame-level highlight labels for supervised training. To overcome this bottleneck, we focus on unsupervised video highlight detection, eliminating the reliance on costly human annotations. We propose a novel unsupervised audio-visual highlight detection framework that exploits recurring patterns inherent in both the audio and visual modalities of videos as self-supervisory signals to train the highlight detection model.

Second, existing models often generalize poorly to unseen test videos because they rely on generic highlight detection models trained on fixed datasets; domain shifts and video-specific characteristics not captured during training lead to suboptimal performance. To address this, we introduce test-time adaptation (TTA) for video highlight detection. Specifically, we propose Highlight-TTA, a TTA framework that uses a self-supervised auxiliary task, called cross-modality hallucinations, within a meta-auxiliary training scheme to improve both adaptation and highlight detection performance on unseen test videos.

Extensive experiments and ablation studies on benchmark video highlight detection datasets demonstrate the effectiveness of our unsupervised learning approach and the proposed TTA framework. We believe this thesis can broadly influence research on annotation-efficient and more generalizable techniques applicable to a wider spectrum of video understanding tasks, including video moment localization, video captioning, and activity recognition.
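
To make the test-time adaptation idea concrete, below is a minimal PyTorch sketch of adaptation driven by a cross-modality hallucination auxiliary loss: each modality's embedding is predicted ("hallucinated") from the other, and this self-supervised loss is minimized on a single unlabeled test video before scoring its segments. All module names, feature dimensions, and hyperparameters here are illustrative assumptions, and the meta-auxiliary training stage is omitted; this is a sketch of the general mechanism, not the thesis implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HighlightModel(nn.Module):
    """Toy audio-visual highlight scorer with hallucination heads (hypothetical)."""
    def __init__(self, vis_dim=2048, aud_dim=128, dim=256):
        super().__init__()
        self.visual_enc = nn.Linear(vis_dim, dim)  # per-segment visual features
        self.audio_enc = nn.Linear(aud_dim, dim)   # per-segment audio features
        # Hallucination heads: predict one modality's embedding from the other.
        self.v2a = nn.Linear(dim, dim)
        self.a2v = nn.Linear(dim, dim)
        self.scorer = nn.Linear(2 * dim, 1)        # per-segment highlight score

    def forward(self, vis, aud):
        v, a = self.visual_enc(vis), self.audio_enc(aud)
        scores = self.scorer(torch.cat([v, a], dim=-1)).squeeze(-1)
        # Self-supervised auxiliary loss: requires no highlight labels.
        aux = F.mse_loss(self.v2a(v), a.detach()) + F.mse_loss(self.a2v(a), v.detach())
        return scores, aux

def test_time_adapt(model, vis, aud, steps=5, lr=1e-4):
    """Adapt on one unlabeled test video using only the auxiliary loss."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(steps):
        _, aux = model(vis, aud)
        opt.zero_grad()
        aux.backward()
        opt.step()
    with torch.no_grad():
        scores, _ = model(vis, aud)
    return scores  # higher = more likely a highlight segment

In such a setup, segments whose adapted scores exceed a threshold would form the highlight; the role of meta-auxiliary training, as described in the abstract, is to ensure that a few steps on the auxiliary loss reliably improve the primary highlight detection task rather than drift away from it.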

Keywords

Video highlight detection, Unsupervised learning, Test-time adaptation

Degree

Master of Science (M.Sc.)

Department

Computer Science

Program

Computer Science
