Cali-NCE: Boosting Cross-modal Video Representation Learning with Calibrated Alignment

Nanxuan Zhao; Jianbo Jiao; Weidi Xie; Dahua Lin

doi:10.1109/CVPRW59228.2023.00672

Nanxuan Zhao^*, Jianbo Jiao, Weidi Xie, Dahua Lin

^*Corresponding author for this work

Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

As large-scale video-text datasets are collected, learning general visual-textual representations has gained increasing attention. Recent methods assume that the alt-text description naturally conveys the meaning and context of the video (i.e., that the two are semantically well aligned), but this assumption is unlikely to hold for Internet data, which potentially harms the quality of the learned visual-textual representation. To address this challenge, we first revisit three mainstream approaches: correspondence modeling, contrastive learning, and predictive coding, demonstrating that a simple co-training strategy with these methods leads to a clear improvement in performance. To further explore the complementary nature of different training strategies, we propose a simple yet effective joint training framework that factorizes the total objective into conditional ones, termed Cali-NCE. Our method first estimates confidence scores that measure the correspondence between video and text descriptions; these scores are later used to calibrate the sample weightings during contrastive training. Through extensive experiments, we show that the proposed approach achieves state-of-the-art performance on multiple downstream tasks: text-to-video retrieval, video action recognition, and video retrieval.
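The calibration idea in the abstract, where per-pair confidence scores reweight samples in the contrastive objective, can be illustrated with a minimal sketch. This is not the authors' formulation: the abstract does not give the loss, so the function name `weighted_info_nce`, the temperature value, the normalisation by the sum of weights, and the toy similarity matrix are all assumptions made for illustration.

```python
import math

def weighted_info_nce(sim, confidence, temperature=0.07):
    """Confidence-weighted InfoNCE over a batch of video-text pairs.

    sim[i][j]     : similarity between video i and text j (hypothetical input).
    confidence[i] : assumed score in [0, 1] for how well pair (i, i) corresponds.
    """
    total, weight_sum = 0.0, 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        loss_i = log_denom - logits[i]  # -log softmax of the positive pair
        total += confidence[i] * loss_i
        weight_sum += confidence[i]
    # Assumed normalisation: average over the total confidence mass.
    return total / weight_sum if weight_sum > 0 else 0.0

# Toy batch: pair 0 is well aligned, pair 1 is a noisy (misaligned) caption.
sim = [[0.9, 0.1],
       [0.8, 0.1]]
uniform = weighted_info_nce(sim, [1.0, 1.0])     # every pair trusted equally
calibrated = weighted_info_nce(sim, [1.0, 0.1])  # noisy pair down-weighted
```

In this toy batch, down-weighting the misaligned pair reduces its contribution to the loss, which is the intuition behind calibrating sample weights with correspondence confidence.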

Original language	English
Title of host publication	2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)
Publisher	IEEE
Pages	6317-6327
Number of pages	11
ISBN (Electronic)	9798350302493
ISBN (Print)	9798350302509
DOIs	https://doi.org/10.1109/CVPRW59228.2023.00672
Publication status	Published - 14 Aug 2023
Event	2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition - Vancouver, Canada; Duration: 18 Jun 2023 → 22 Jun 2023

Publication series

Name	IEEE Computer Society Conference on Computer Vision and Pattern Recognition workshops
Publisher	IEEE
ISSN (Print)	2160-7508
ISSN (Electronic)	2160-7516

Conference

Conference	2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition
Abbreviated title	CVPR 2023
Country/Territory	Canada
City	Vancouver
Period	18/06/23 → 22/06/23

Bibliographical note

Publisher Copyright:
© 2023 IEEE.

Keywords

Training
Weight measurement
Representation learning
Semantics
Training data
Predictive models
Predictive coding

ASJC Scopus subject areas

Computer Vision and Pattern Recognition
Electrical and Electronic Engineering

Access to Document

10.1109/CVPRW59228.2023.00672

Cite this

Zhao, N., Jiao, J., Xie, W., & Lin, D. (2023). Cali-NCE: Boosting Cross-modal Video Representation Learning with Calibrated Alignment. In 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW) (pp. 6317-6327). Article 10209009 (IEEE Computer Society Conference on Computer Vision and Pattern Recognition workshops). IEEE. https://doi.org/10.1109/CVPRW59228.2023.00672

@inproceedings{83180d206f6d470fa1c7bcd56b182e8e,

title = "Cali-NCE: Boosting Cross-modal Video Representation Learning with Calibrated Alignment",

abstract = "With the large-scale video-text datasets being collected, learning general visual-textual representation has gained increasing attention. While recent methods are designed with the assumption that the alt-text description naturally conveys the meaning and context of the video in semantics (i.e. well aligned with each other), it is unlikely to be satisfied for the Internet data, which potentially harms the quality of the learned visual-textual representation. To address this challenge, we first revisit three mainstream approaches: correspondence modeling, contrastive learning and predictive coding, demonstrating that a simple co-training strategy with these methods leads to a clear improvement in performance. To further explore the complementary nature of different training strategies, we propose a simple yet effective joint training framework that factorizes the total objective into conditional ones, termed as Cali-NCE. Our method first estimates confidence scores for measuring the correspondence between video and text descriptions, and the scores are later used to calibrate the sample weightings during contrastive training. Through extensive experiments, we show that the proposed approach achieves state-of-the-art performance on multiple downstream tasks: text-to-video retrieval, video action recognition, and video retrieval.",

keywords = "Training, Weight measurement, Representation learning, Semantics, Training data, Predictive models, Predictive coding",

author = "Nanxuan Zhao and Jianbo Jiao and Weidi Xie and Dahua Lin",

note = "Publisher Copyright: {\textcopyright} 2023 IEEE.; 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023 ; Conference date: 18-06-2023 Through 22-06-2023",

year = "2023",

month = aug,

day = "14",

doi = "10.1109/CVPRW59228.2023.00672",

language = "English",

isbn = "9798350302509",

series = "IEEE Computer Society Conference on Computer Vision and Pattern Recognition workshops",

publisher = "IEEE",

pages = "6317--6327",

booktitle = "2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)",

}

Zhao, N, Jiao, J, Xie, W & Lin, D 2023, Cali-NCE: Boosting Cross-modal Video Representation Learning with Calibrated Alignment. in 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)., 10209009, IEEE Computer Society Conference on Computer Vision and Pattern Recognition workshops, IEEE, pp. 6317-6327, 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, British Columbia, Canada, 18/06/23. https://doi.org/10.1109/CVPRW59228.2023.00672

Cali-NCE: Boosting Cross-modal Video Representation Learning with Calibrated Alignment. / Zhao, Nanxuan; Jiao, Jianbo; Xie, Weidi et al.
2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW). IEEE, 2023. p. 6317-6327 10209009 (IEEE Computer Society Conference on Computer Vision and Pattern Recognition workshops).

TY - GEN

T1 - Cali-NCE

T2 - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition

AU - Zhao, Nanxuan

AU - Jiao, Jianbo

AU - Xie, Weidi

AU - Lin, Dahua

PY - 2023/8/14

Y1 - 2023/8/14

N2 - With the large-scale video-text datasets being collected, learning general visual-textual representation has gained increasing attention. While recent methods are designed with the assumption that the alt-text description naturally conveys the meaning and context of the video in semantics (i.e. well aligned with each other), it is unlikely to be satisfied for the Internet data, which potentially harms the quality of the learned visual-textual representation. To address this challenge, we first revisit three mainstream approaches: correspondence modeling, contrastive learning and predictive coding, demonstrating that a simple co-training strategy with these methods leads to a clear improvement in performance. To further explore the complementary nature of different training strategies, we propose a simple yet effective joint training framework that factorizes the total objective into conditional ones, termed as Cali-NCE. Our method first estimates confidence scores for measuring the correspondence between video and text descriptions, and the scores are later used to calibrate the sample weightings during contrastive training. Through extensive experiments, we show that the proposed approach achieves state-of-the-art performance on multiple downstream tasks: text-to-video retrieval, video action recognition, and video retrieval.

AB - With the large-scale video-text datasets being collected, learning general visual-textual representation has gained increasing attention. While recent methods are designed with the assumption that the alt-text description naturally conveys the meaning and context of the video in semantics (i.e. well aligned with each other), it is unlikely to be satisfied for the Internet data, which potentially harms the quality of the learned visual-textual representation. To address this challenge, we first revisit three mainstream approaches: correspondence modeling, contrastive learning and predictive coding, demonstrating that a simple co-training strategy with these methods leads to a clear improvement in performance. To further explore the complementary nature of different training strategies, we propose a simple yet effective joint training framework that factorizes the total objective into conditional ones, termed as Cali-NCE. Our method first estimates confidence scores for measuring the correspondence between video and text descriptions, and the scores are later used to calibrate the sample weightings during contrastive training. Through extensive experiments, we show that the proposed approach achieves state-of-the-art performance on multiple downstream tasks: text-to-video retrieval, video action recognition, and video retrieval.

KW - Training

KW - Weight measurement

KW - Representation learning

KW - Semantics

KW - Training data

KW - Predictive models

KW - Predictive coding

UR - http://www.scopus.com/inward/record.url?scp=85170823913&partnerID=8YFLogxK

U2 - 10.1109/CVPRW59228.2023.00672

DO - 10.1109/CVPRW59228.2023.00672

M3 - Conference contribution

AN - SCOPUS:85170823913

SN - 9798350302509

T3 - IEEE Computer Society Conference on Computer Vision and Pattern Recognition workshops

SP - 6317

EP - 6327

BT - 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)

PB - IEEE

Y2 - 18 June 2023 through 22 June 2023

ER -
