Cali-NCE: Boosting Cross-modal Video Representation Learning with Calibrated Alignment

Nanxuan Zhao*, Jianbo Jiao, Weidi Xie, Dahua Lin

*Corresponding author for this work

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

With large-scale video-text datasets being collected, learning general visual-textual representations has gained increasing attention. Recent methods assume that the alt-text description naturally conveys the meaning and context of the video (i.e., that the two are semantically well aligned). This assumption is unlikely to hold for Internet data, and violating it can harm the quality of the learned visual-textual representation. To address this challenge, we first revisit three mainstream approaches: correspondence modeling, contrastive learning, and predictive coding, demonstrating that a simple co-training strategy with these methods leads to a clear improvement in performance. To further explore the complementary nature of different training strategies, we propose a simple yet effective joint training framework that factorizes the total objective into conditional ones, termed Cali-NCE. Our method first estimates confidence scores that measure the correspondence between videos and their text descriptions; these scores are then used to calibrate the sample weightings during contrastive training. Through extensive experiments, we show that the proposed approach achieves state-of-the-art performance on multiple downstream tasks: text-to-video retrieval, video action recognition, and video retrieval.
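The calibration idea in the abstract, confidence scores reweighting samples in a contrastive loss, can be illustrated with a minimal NumPy sketch. This is not the authors' formulation: the symmetric InfoNCE form, the temperature value, the normalization of weights to sum to one, and the function names `per_pair_info_nce` and `calibrated_loss` are all assumptions made for illustration. In particular, the confidence scores are taken as given here, whereas the paper estimates them as part of training.

```python
import numpy as np


def _log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def per_pair_info_nce(video_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for each (video_i, text_i) pair in a batch.

    Assumption: row i of video_emb and row i of text_emb form the
    positive pair; all other rows in the batch act as negatives.
    """
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature  # (batch, batch) cosine similarities
    idx = np.arange(len(v))
    video_to_text = -_log_softmax(logits, axis=1)[idx, idx]
    text_to_video = -_log_softmax(logits, axis=0)[idx, idx]
    return 0.5 * (video_to_text + text_to_video)


def calibrated_loss(video_emb, text_emb, confidence, temperature=0.07):
    """Weight each pair's loss by its (hypothetical) confidence score.

    Assumption: confidence is non-negative with a positive sum, and is
    normalized so the weights sum to one.
    """
    confidence = np.asarray(confidence, dtype=float)
    weights = confidence / confidence.sum()
    losses = per_pair_info_nce(video_emb, text_emb, temperature)
    return float((weights * losses).sum())
```

With uniform confidence this reduces to the ordinary batch-mean InfoNCE loss; lowering a pair's confidence reduces how much that pair, presumed poorly aligned, contributes as a positive, although it still serves as a negative for the other pairs.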

Original language: English
Title of host publication: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Publisher: IEEE
Pages: 6317-6327
Number of pages: 11
ISBN (Electronic): 9798350302493
ISBN (Print): 9798350302509
Publication status: Published - 14 Aug 2023
Event: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition - Vancouver, Canada
Duration: 18 Jun 2023 to 22 Jun 2023

Publication series

Name: IEEE Computer Society Conference on Computer Vision and Pattern Recognition workshops
Publisher: IEEE
ISSN (Print): 2160-7508
ISSN (Electronic): 2160-7516

Conference

Conference: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Abbreviated title: CVPR 2023
Country/Territory: Canada
City: Vancouver
Period: 18/06/23 to 22/06/23

Bibliographical note

Publisher Copyright:
© 2023 IEEE.

Keywords

  • Training
  • Weight measurement
  • Representation learning
  • Semantics
  • Training data
  • Predictive models
  • Predictive coding

ASJC Scopus subject areas

  • Computer Vision and Pattern Recognition
  • Electrical and Electronic Engineering
