Global-local motion transformer for unsupervised skeleton-based action learning

Boeun Kim; Hyung Jin Chang; Jungho Kim; Jin-Young Choi

doi:10.1007/978-3-031-19772-7_13

Boeun Kim^*, Hyung Jin Chang, Jungho Kim, Jin-Young Choi

^*Corresponding author for this work

Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We propose a new transformer model for unsupervised learning of skeleton motion sequences. The existing transformer model used for unsupervised skeleton-based action learning learns the instantaneous velocity of each joint from adjacent frames, without global motion information. As a result, it has difficulty learning attention globally over whole-body motions and over temporally distant joints. In addition, person-to-person interactions are not considered in that model. To address whole-body motion, long-range temporal dynamics, and person-to-person interactions, we design a global and local attention mechanism in which global body motions and local joint motions attend to each other. We also propose a novel pretraining strategy, multi-interval pose displacement prediction, to learn both global and local attention over diverse time ranges. The proposed model successfully learns the local dynamics of the joints and captures global context from the motion sequences. Our model outperforms state-of-the-art models by notable margins on the representative benchmarks. Code is available at https://github.com/Boeun-Kim/GL-Transformer.
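
As an illustration only, the idea of multi-interval pose displacement can be sketched as computing, for several frame intervals, how far the body as a whole and each joint have moved. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name, the interval values, the use of joint 0 as a root joint for "global" motion, and the clamping of early frames to frame 0 are all hypothetical choices made here for the example.

```python
from typing import Dict, List, Tuple

Joint = Tuple[float, float, float]
Frame = List[Joint]  # one pose: a list of 3D joint positions


def sub(a: Joint, b: Joint) -> Joint:
    """Component-wise difference a - b."""
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def multi_interval_displacements(
    frames: List[Frame],
    intervals: Tuple[int, ...] = (1, 5, 10),  # hypothetical interval values
    root: int = 0,                            # hypothetical root-joint index
) -> Dict[int, List[Tuple[Joint, List[Joint]]]]:
    """For each interval n, return per-frame (global, local) displacements.

    global: movement of the root joint between frame t - n and frame t.
    local:  movement of every joint measured relative to the root joint.
    Frames with t < n are compared against frame 0 (an assumption made
    here so that every frame has a target).
    """
    targets: Dict[int, List[Tuple[Joint, List[Joint]]]] = {}
    for n in intervals:
        per_frame = []
        for t, cur in enumerate(frames):
            prev = frames[max(t - n, 0)]
            global_disp = sub(cur[root], prev[root])
            local_disp = [
                sub(sub(j_cur, cur[root]), sub(j_prev, prev[root]))
                for j_cur, j_prev in zip(cur, prev)
            ]
            per_frame.append((global_disp, local_disp))
        targets[n] = per_frame
    return targets


if __name__ == "__main__":
    # Toy sequence: the root slides along x while a second joint rises along y.
    toy = [[(float(t), 0.0, 0.0), (float(t), 2.0 * t, 0.0)] for t in range(4)]
    result = multi_interval_displacements(toy, intervals=(1, 2))
    print(result[2][3])
```

With an interval of 1 the targets reduce to the adjacent-frame velocity that the abstract attributes to the earlier model; larger intervals expose longer-range motion.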

Original language	English
Title of host publication	Computer Vision – ECCV 2022
Subtitle of host publication	17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IV
Editors	Shai Avidan, Gabriel Brostow, Moustapha Cissé, Giovanni Maria Farinella, Tal Hassner
Publisher	Springer
Pages	209–225
Number of pages	17
Edition	1
ISBN (Electronic)	9783031197727
ISBN (Print)	9783031197710
DOIs	https://doi.org/10.1007/978-3-031-19772-7_13
Publication status	Published - 28 Oct 2022
Event	17th European Conference on Computer Vision (ECCV 2022) - Tel Aviv, Israel Duration: 24 Oct 2022 → 28 Oct 2022

Publication series

Name	Lecture Notes in Computer Science
Publisher	Springer
Volume	13664
ISSN (Print)	0302-9743
ISSN (Electronic)	1611-3349

Conference

Conference	17th European Conference on Computer Vision (ECCV 2022)
Abbreviated title	ECCV 2022
Country/Territory	Israel
City	Tel Aviv
Period	24/10/22 → 28/10/22

Access to Document

10.1007/978-3-031-19772-7_13

KimB2022Global
This version of the contribution has been accepted for publication, after peer review (when applicable) but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections. The Version of Record is available online at: http://dx.doi.org/10.1007/978-3-031-19772-7_13. Use of this Accepted Version is subject to the publisher’s Accepted Manuscript terms of use: https://www.springernature.com/gp/open-research/policies/accepted-manuscript-terms
Accepted author manuscript, 1.24 MB. Licence: Other (please specify with Rights Statement)

Cite this

Kim, B., Chang, H. J., Kim, J., & Choi, J.-Y. (2022). Global-local motion transformer for unsupervised skeleton-based action learning. In S. Avidan, G. Brostow, M. Cissé, G. M. Farinella, & T. Hassner (Eds.), Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IV (1 ed., pp. 209–225). (Lecture Notes in Computer Science; Vol. 13664). Springer. https://doi.org/10.1007/978-3-031-19772-7_13

Kim, Boeun ; Chang, Hyung Jin ; Kim, Jungho et al. / Global-local motion transformer for unsupervised skeleton-based action learning. Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IV. editor / Shai Avidan ; Gabriel Brostow ; Moustapha Cissé ; Giovanni Maria Farinella ; Tal Hassner. 1. ed. Springer, 2022. pp. 209–225 (Lecture Notes in Computer Science).

@inproceedings{6005dcaf45f841679a7dd48a4806a941,

title = "Global-local motion transformer for unsupervised skeleton-based action learning",

abstract = "We propose a new transformer model for unsupervised learning of skeleton motion sequences. The existing transformer model used for unsupervised skeleton-based action learning learns the instantaneous velocity of each joint from adjacent frames, without global motion information. As a result, it has difficulty learning attention globally over whole-body motions and over temporally distant joints. In addition, person-to-person interactions are not considered in that model. To address whole-body motion, long-range temporal dynamics, and person-to-person interactions, we design a global and local attention mechanism in which global body motions and local joint motions attend to each other. We also propose a novel pretraining strategy, multi-interval pose displacement prediction, to learn both global and local attention over diverse time ranges. The proposed model successfully learns the local dynamics of the joints and captures global context from the motion sequences. Our model outperforms state-of-the-art models by notable margins on the representative benchmarks. Code is available at https://github.com/Boeun-Kim/GL-Transformer.",

author = "Boeun Kim and Chang, {Hyung Jin} and Jungho Kim and Jin-Young Choi",

year = "2022",

month = oct,

day = "28",

doi = "10.1007/978-3-031-19772-7_13",

language = "English",

isbn = "9783031197710",

series = "Lecture Notes in Computer Science",

publisher = "Springer",

pages = "209–225",

editor = "Shai Avidan and Gabriel Brostow and Moustapha Ciss{\'e} and Farinella, {Giovanni Maria} and Tal Hassner",

booktitle = "Computer Vision – ECCV 2022",

edition = "1",

note = "17th European Conference on Computer Vision (ECCV 2022), ECCV 2022 ; Conference date: 24-10-2022 Through 28-10-2022",

}

Kim, B, Chang, HJ, Kim, J & Choi, J-Y 2022, Global-local motion transformer for unsupervised skeleton-based action learning. in S Avidan, G Brostow, M Cissé, GM Farinella & T Hassner (eds), Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IV. 1 edn, Lecture Notes in Computer Science, vol. 13664, Springer, pp. 209–225, 17th European Conference on Computer Vision (ECCV 2022), Tel Aviv, Israel, 24/10/22. https://doi.org/10.1007/978-3-031-19772-7_13

Global-local motion transformer for unsupervised skeleton-based action learning. / Kim, Boeun; Chang, Hyung Jin; Kim, Jungho et al.
Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IV. ed. / Shai Avidan; Gabriel Brostow; Moustapha Cissé; Giovanni Maria Farinella; Tal Hassner. 1. ed. Springer, 2022. p. 209–225 (Lecture Notes in Computer Science; Vol. 13664).

TY - GEN

T1 - Global-local motion transformer for unsupervised skeleton-based action learning

AU - Kim, Boeun

AU - Chang, Hyung Jin

AU - Kim, Jungho

AU - Choi, Jin-Young

PY - 2022/10/28

Y1 - 2022/10/28

N2 - We propose a new transformer model for unsupervised learning of skeleton motion sequences. The existing transformer model used for unsupervised skeleton-based action learning learns the instantaneous velocity of each joint from adjacent frames, without global motion information. As a result, it has difficulty learning attention globally over whole-body motions and over temporally distant joints. In addition, person-to-person interactions are not considered in that model. To address whole-body motion, long-range temporal dynamics, and person-to-person interactions, we design a global and local attention mechanism in which global body motions and local joint motions attend to each other. We also propose a novel pretraining strategy, multi-interval pose displacement prediction, to learn both global and local attention over diverse time ranges. The proposed model successfully learns the local dynamics of the joints and captures global context from the motion sequences. Our model outperforms state-of-the-art models by notable margins on the representative benchmarks. Code is available at https://github.com/Boeun-Kim/GL-Transformer.

AB - We propose a new transformer model for unsupervised learning of skeleton motion sequences. The existing transformer model used for unsupervised skeleton-based action learning learns the instantaneous velocity of each joint from adjacent frames, without global motion information. As a result, it has difficulty learning attention globally over whole-body motions and over temporally distant joints. In addition, person-to-person interactions are not considered in that model. To address whole-body motion, long-range temporal dynamics, and person-to-person interactions, we design a global and local attention mechanism in which global body motions and local joint motions attend to each other. We also propose a novel pretraining strategy, multi-interval pose displacement prediction, to learn both global and local attention over diverse time ranges. The proposed model successfully learns the local dynamics of the joints and captures global context from the motion sequences. Our model outperforms state-of-the-art models by notable margins on the representative benchmarks. Code is available at https://github.com/Boeun-Kim/GL-Transformer.

U2 - 10.1007/978-3-031-19772-7_13

DO - 10.1007/978-3-031-19772-7_13

M3 - Conference contribution

SN - 9783031197710

T3 - Lecture Notes in Computer Science

SP - 209

EP - 225

BT - Computer Vision – ECCV 2022

A2 - Avidan, Shai

A2 - Brostow, Gabriel

A2 - Cissé, Moustapha

A2 - Farinella, Giovanni Maria

A2 - Hassner, Tal

PB - Springer

T2 - 17th European Conference on Computer Vision (ECCV 2022)

Y2 - 24 October 2022 through 28 October 2022

ER -

Kim B, Chang HJ, Kim J, Choi JY. Global-local motion transformer for unsupervised skeleton-based action learning. In Avidan S, Brostow G, Cissé M, Farinella GM, Hassner T, editors, Computer Vision – ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IV. 1 ed. Springer. 2022. p. 209–225. (Lecture Notes in Computer Science). doi: 10.1007/978-3-031-19772-7_13