Trans6D: transformer-based 6D object pose estimation and refinement

Zhongqun Zhang; Wei Chen; Linfang Zheng; Ales Leonardis; Hyung Jin Chang

doi:10.1007/978-3-031-25085-9_7

Zhongqun Zhang^*, Wei Chen, Linfang Zheng, Ales Leonardis, Hyung Jin Chang

^*Corresponding author for this work

Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Estimating 6D object pose from a monocular RGB image remains challenging due to factors such as texture-less objects and occlusion. Although convolutional neural network (CNN)-based methods have made remarkable progress, they are not efficient at capturing global dependencies and often suffer from information loss due to downsampling operations. To extract robust feature representations, we propose a Transformer-based 6D object pose estimation approach (Trans6D). Specifically, we first build two strong Transformer-based baselines and compare their performance: pure Transformers following ViT (Trans6D-pure) and hybrid Transformers integrating CNNs with Transformers (Trans6D-hybrid). Furthermore, we propose two novel modules to make Trans6D-pure more accurate and robust: (i) a patch-aware feature fusion module, which decreases the number of tokens without information loss via shifted windows, cross-attention, and token pooling operations, and is used to predict dense 2D-3D correspondence maps; and (ii) a pure Transformer-based pose refinement module (Trans6D+), which refines the estimated poses iteratively. Extensive experiments show that the proposed approach achieves state-of-the-art performance on two datasets.
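
To make the terms in the abstract concrete, the following is a minimal sketch of what a 6D pose and a set of 2D-3D correspondences are. It is not the Trans6D network or the authors' code: the pinhole camera model, the intrinsics, the pose values, the model points, and all function names are assumptions chosen only for illustration. A 6D pose is a rotation (3 degrees of freedom) plus a translation (3 degrees of freedom); a 2D-3D correspondence pairs an image pixel with the object-frame 3D point that projects to it under that pose.

```python
import math

def rotation_z(angle_rad):
    """3x3 rotation about the camera z-axis (an illustrative choice of rotation)."""
    c, s = math.cos(angle_rad), math.sin(angle_rad)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]

def transform(R, t, p):
    """Map an object-frame point p into the camera frame: R @ p + t."""
    return [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]

def project(K, p_cam):
    """Pinhole projection of a camera-frame point to pixel coordinates."""
    fx, fy, cx, cy = K
    x, y, z = p_cam
    return (fx * x / z + cx, fy * y / z + cy)

def correspondences(R, t, K, model_points):
    """Pair each 3D model point with the pixel it projects to under pose (R, t)."""
    return [(project(K, transform(R, t, p)), p) for p in model_points]

# Hypothetical values, not taken from the paper.
K = (500.0, 500.0, 320.0, 240.0)   # fx, fy, cx, cy
R = rotation_z(math.pi / 2)        # rotational part of the pose
t = [0.0, 0.0, 2.0]                # translational part of the pose
model_points = [[0.1, 0.0, 0.0], [0.0, 0.1, 0.0], [0.0, 0.0, 0.1]]

pairs = correspondences(R, t, K, model_points)
for pixel, point in pairs:
    print(pixel, point)
```

Pose estimation runs in the opposite direction: given predicted pixel-to-point pairs, recover the rotation and translation. The abstract does not state which solver Trans6D uses for that step, so none is shown here.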

Original language	English
Title of host publication	Computer Vision – ECCV 2022 Workshops
Editors	Leonid Karlinsky, Tomer Michaeli, Ko Nishino
Place of Publication	Cham
Publisher	Springer
Pages	112–128
Number of pages	17
Edition	1
ISBN (Electronic)	9783031250859
ISBN (Print)	9783031250842
DOIs	https://doi.org/10.1007/978-3-031-25085-9_7
Publication status	Published - 12 Feb 2023
Event	7th International Workshop on Recovering 6D Object Pose - Tel-Aviv, Israel Duration: 23 Oct 2022 → 23 Oct 2022

Publication series

Name	Lecture Notes in Computer Science
Publisher	Springer
Volume	13808
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Workshop

Workshop	7th International Workshop on Recovering 6D Object Pose
Country/Territory	Israel
City	Tel-Aviv
Period	23/10/22 → 23/10/22

Keywords

6D object pose estimation
Transformer

Access to Document

10.1007/978-3-031-25085-9_7

This version of the contribution has been accepted for publication, after peer review (when applicable) but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. The Version of Record is available online at: http://dx.doi.org/10.1007/978-3-031-25085-9_7. Use of this Accepted Version is subject to the publisher’s Accepted Manuscript terms of use https://www.springernature.com/gp/open-research/policies/accepted-manuscript-terms
Accepted author manuscript, 1.75 MB
Licence: Other (please specify with Rights Statement)

BURG: Benchmarks for UndeRstanding Grasping
Leonardis, A. & Sridharan, M.
Engineering & Physical Science Research Council
1/11/19 → 31/07/23
Project: Research Councils

Cite this

@inproceedings{a14a6095af384d7998b8aa81deb1563d,
  title = "Trans6D: transformer-based 6D object pose estimation and refinement",
  abstract = "Estimating 6D object pose from a monocular RGB image remains challenging due to factors such as texture-less objects and occlusion. Although convolutional neural network (CNN)-based methods have made remarkable progress, they are not efficient at capturing global dependencies and often suffer from information loss due to downsampling operations. To extract robust feature representations, we propose a Transformer-based 6D object pose estimation approach (Trans6D). Specifically, we first build two strong Transformer-based baselines and compare their performance: pure Transformers following ViT (Trans6D-pure) and hybrid Transformers integrating CNNs with Transformers (Trans6D-hybrid). Furthermore, we propose two novel modules to make Trans6D-pure more accurate and robust: (i) a patch-aware feature fusion module, which decreases the number of tokens without information loss via shifted windows, cross-attention, and token pooling operations, and is used to predict dense 2D-3D correspondence maps; and (ii) a pure Transformer-based pose refinement module (Trans6D+), which refines the estimated poses iteratively. Extensive experiments show that the proposed approach achieves state-of-the-art performance on two datasets.",
  keywords = "6D object pose estimation, Transformer",
  author = "Zhongqun Zhang and Wei Chen and Linfang Zheng and Ales Leonardis and Chang, {Hyung Jin}",
  year = "2023",
  month = feb,
  day = "12",
  doi = "10.1007/978-3-031-25085-9_7",
  language = "English",
  isbn = "9783031250842",
  series = "Lecture Notes in Computer Science",
  publisher = "Springer",
  pages = "112--128",
  editor = "Leonid Karlinsky and Tomer Michaeli and Ko Nishino",
  booktitle = "Computer Vision – ECCV 2022 Workshops",
  edition = "1",
  note = "7th International Workshop on Recovering 6D Object Pose ; Conference date: 23-10-2022 Through 23-10-2022",
}

Zhang, Z, Chen, W, Zheng, L, Leonardis, A & Chang, HJ 2023, Trans6D: transformer-based 6D object pose estimation and refinement. in L Karlinsky, T Michaeli & K Nishino (eds), Computer Vision – ECCV 2022 Workshops. 1 edn, Lecture Notes in Computer Science, vol. 13808, Springer, Cham, pp. 112–128, 7th International Workshop on Recovering 6D Object Pose, Tel-Aviv, Israel, 23/10/22. https://doi.org/10.1007/978-3-031-25085-9_7

Trans6D: transformer-based 6D object pose estimation and refinement. / Zhang, Zhongqun; Chen, Wei; Zheng, Linfang et al.
Computer Vision – ECCV 2022 Workshops. ed. / Leonid Karlinsky; Tomer Michaeli; Ko Nishino. 1. ed. Cham: Springer, 2023. p. 112–128 (Lecture Notes in Computer Science; Vol. 13808).

TY - GEN
T1 - Trans6D
T2 - 7th International Workshop on Recovering 6D Object Pose
AU - Zhang, Zhongqun
AU - Chen, Wei
AU - Zheng, Linfang
AU - Leonardis, Ales
AU - Chang, Hyung Jin
PY - 2023/2/12
Y1 - 2023/2/12
N2 - Estimating 6D object pose from a monocular RGB image remains challenging due to factors such as texture-less objects and occlusion. Although convolutional neural network (CNN)-based methods have made remarkable progress, they are not efficient at capturing global dependencies and often suffer from information loss due to downsampling operations. To extract robust feature representations, we propose a Transformer-based 6D object pose estimation approach (Trans6D). Specifically, we first build two strong Transformer-based baselines and compare their performance: pure Transformers following ViT (Trans6D-pure) and hybrid Transformers integrating CNNs with Transformers (Trans6D-hybrid). Furthermore, we propose two novel modules to make Trans6D-pure more accurate and robust: (i) a patch-aware feature fusion module, which decreases the number of tokens without information loss via shifted windows, cross-attention, and token pooling operations, and is used to predict dense 2D-3D correspondence maps; and (ii) a pure Transformer-based pose refinement module (Trans6D+), which refines the estimated poses iteratively. Extensive experiments show that the proposed approach achieves state-of-the-art performance on two datasets.
KW - 6D object pose estimation
KW - Transformer
U2 - 10.1007/978-3-031-25085-9_7
DO - 10.1007/978-3-031-25085-9_7
M3 - Conference contribution
SN - 9783031250842
T3 - Lecture Notes in Computer Science
SP - 112
EP - 128
BT - Computer Vision – ECCV 2022 Workshops
A2 - Karlinsky, Leonid
A2 - Michaeli, Tomer
A2 - Nishino, Ko
PB - Springer
CY - Cham
Y2 - 23 October 2022 through 23 October 2022
ER -