Revitalizing CNN attentions via transformers in self-supervised visual representation learning

Chongjian Ge, Youwei Liang, Yibing Song, Jianbo Jiao, Jue Wang, Ping Luo

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Abstract

Studies on self-supervised visual representation learning (SSL) improve encoder backbones to discriminate training samples without labels. While CNN encoders trained via SSL achieve recognition performance comparable to those trained via supervised learning, their network attention is under-explored for further improvement. Motivated by transformers, which exploit visual attention effectively in recognition scenarios, we propose a CNN Attention REvitalization (CARE) framework to train attentive CNN encoders guided by transformers in SSL. The proposed CARE framework consists of a CNN stream (C-stream) and a transformer stream (T-stream), where each stream contains two branches. C-stream follows an existing SSL framework with two CNN encoders, two projectors, and a predictor. T-stream contains two transformers, two projectors, and a predictor. T-stream connects to the CNN encoders and runs in parallel with the remaining C-stream. During training, we perform SSL in both streams simultaneously and use the T-stream output to supervise C-stream. The features from the CNN encoders are modulated in T-stream for visual attention enhancement and become suitable for the SSL scenario. We use these modulated features to supervise C-stream for learning attentive CNN encoders. In this way, we revitalize CNN attention by using transformers as guidance. Experiments on several standard visual recognition benchmarks, including image classification, object detection, and semantic segmentation, show that the proposed CARE framework improves CNN encoder backbones to state-of-the-art performance.
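To make the two-stream design described in the abstract concrete, the following is a minimal PyTorch sketch of a CARE-style setup. It is illustrative only: the class and helper names (CARESketch, mlp_head, neg_cosine), the transformer configuration, the head sizes, and the loss weighting are assumptions of this sketch, not the authors' released implementation.

```python
# Minimal PyTorch sketch of a CARE-style two-stream setup (illustrative only).
# Module sizes, the transformer configuration, and the loss weighting below
# are assumptions of this sketch, not the authors' exact settings.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet50


def mlp_head(in_dim, hidden_dim=4096, out_dim=256):
    # Two-layer MLP used for projectors and predictors (BYOL-style head).
    return nn.Sequential(
        nn.Linear(in_dim, hidden_dim),
        nn.BatchNorm1d(hidden_dim),
        nn.ReLU(inplace=True),
        nn.Linear(hidden_dim, out_dim),
    )


def neg_cosine(pred, target):
    # Negative cosine similarity; the target side carries no gradient.
    return -F.cosine_similarity(pred, target.detach(), dim=-1).mean()


class CARESketch(nn.Module):
    def __init__(self, feat_dim=2048, proj_dim=256, momentum=0.99):
        super().__init__()
        self.momentum = momentum
        # CNN trunk up to the final feature map of shape (B, feat_dim, H, W).
        self.encoder = nn.Sequential(*list(resnet50().children())[:-2])
        # C-stream heads: projector + predictor on pooled CNN features.
        self.c_proj = mlp_head(feat_dim, out_dim=proj_dim)
        self.c_pred = mlp_head(proj_dim, out_dim=proj_dim)
        # T-stream: transformer over spatial tokens, then its own heads.
        layer = nn.TransformerEncoderLayer(d_model=feat_dim, nhead=8, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=2)
        self.t_proj = mlp_head(feat_dim, out_dim=proj_dim)
        self.t_pred = mlp_head(proj_dim, out_dim=proj_dim)
        # Momentum (target) copies; updated by EMA, not by gradients.
        self.m_encoder = copy.deepcopy(self.encoder)
        self.m_c_proj = copy.deepcopy(self.c_proj)
        self.m_transformer = copy.deepcopy(self.transformer)
        self.m_t_proj = copy.deepcopy(self.t_proj)
        for m in (self.m_encoder, self.m_c_proj, self.m_transformer, self.m_t_proj):
            for p in m.parameters():
                p.requires_grad = False

    @staticmethod
    def _pool(fmap):
        # Global-average-pool a (B, C, H, W) feature map to (B, C).
        return fmap.flatten(2).mean(-1)

    @staticmethod
    def _tokens(fmap):
        # Flatten spatial positions into a token sequence of shape (B, H*W, C).
        return fmap.flatten(2).transpose(1, 2)

    def _online(self, x):
        fmap = self.encoder(x)
        c = self.c_pred(self.c_proj(self._pool(fmap)))
        t = self.t_pred(self.t_proj(self.transformer(self._tokens(fmap)).mean(1)))
        return c, t

    @torch.no_grad()
    def _target(self, x):
        fmap = self.m_encoder(x)
        c = self.m_c_proj(self._pool(fmap))
        t = self.m_t_proj(self.m_transformer(self._tokens(fmap)).mean(1))
        return c, t

    def forward(self, view1, view2, beta=1.0):
        c1, t1 = self._online(view1)
        c2, t2 = self._online(view2)
        c1_tgt, t1_tgt = self._target(view1)
        c2_tgt, t2_tgt = self._target(view2)
        # Symmetric SSL losses within each stream.
        loss_c = neg_cosine(c1, c2_tgt) + neg_cosine(c2, c1_tgt)
        loss_t = neg_cosine(t1, t2_tgt) + neg_cosine(t2, t1_tgt)
        # Attention supervision: T-stream target outputs guide C-stream predictions.
        loss_guide = neg_cosine(c1, t2_tgt) + neg_cosine(c2, t1_tgt)
        return loss_c + loss_t + beta * loss_guide

    @torch.no_grad()
    def ema_update(self):
        # Exponential-moving-average update of the target branches.
        pairs = [(self.encoder, self.m_encoder), (self.c_proj, self.m_c_proj),
                 (self.transformer, self.m_transformer), (self.t_proj, self.m_t_proj)]
        for online, target in pairs:
            for po, pt in zip(online.parameters(), target.parameters()):
                pt.mul_(self.momentum).add_(po.detach(), alpha=1 - self.momentum)
```

In a BYOL-style training loop one would call the model on two augmented views, back-propagate the returned loss, and then call ema_update() after each optimizer step to refresh the momentum branches; beta, the weight on the guidance loss, is an assumed hyperparameter of this sketch.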
Original language: English
Title of host publication: Advances in Neural Information Processing Systems 34 (NeurIPS 2021)
Editors: Marc'Aurelio Ranzato, Alina Beygelzimer, Yann Dauphin, Percy S. Liang, Jenn Wortman Vaughan
Publisher: NeurIPS
Pages: 4193-4206
Number of pages: 14
ISBN (Print): 9781713845393
Publication status: Published - 11 Oct 2021
Event: Thirty-fifth Conference on Neural Information Processing Systems - Virtual
Duration: 6 Dec 2021 - 14 Dec 2021

Publication series

Name: Advances in Neural Information Processing Systems
Volume: 34
ISSN (Print): 1049-5258

Conference

Conference: Thirty-fifth Conference on Neural Information Processing Systems
Abbreviated title: NeurIPS 2021
Period: 6/12/21 - 14/12/21

Bibliographical note

Funding Information:
Acknowledgement. This work is supported by the CCF-Tencent Open Fund, the General Research Fund of Hong Kong No. 27208720, and the EPSRC Programme Grant Visual AI EP/T028572/1.

Publisher Copyright:
© 2021 Neural information processing systems foundation. All rights reserved.

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing
