Mask-guided deformation adaptive network for human parsing

Aihua Mao*, Yuan Liang, Jianbo Jiao, Yongtuo Liu, Shengfeng He

*Corresponding author for this work

Research output: Contribution to journal › Article › peer-review

Abstract

Due to the challenges of densely compacted body parts, nonrigid clothing items, and severe overlap in crowd scenes, human parsing needs to focus more on multilevel feature representations than general scene parsing tasks do. Based on this observation, we propose to introduce the auxiliary tasks of human mask and edge detection to facilitate human parsing. Different from human parsing, which exploits the discriminative features of each category, human mask and edge detection emphasizes the boundaries of semantic parsing regions and the difference between foreground humans and background clutter, which benefits the parsing predictions of crowd scenes and small human parts. Specifically, we extract human mask and edge labels from the human parsing annotations and train a shared encoder with three independent decoders for the three mutually beneficial tasks. Furthermore, the decoder feature maps of the human mask prediction branch are exploited as attention maps, indicating human regions to facilitate the decoding process of human parsing and human edge detection. In addition to these auxiliary tasks, we alleviate the problem of deformed clothing items under various human poses by tracking the deformation patterns with deformable convolution. Extensive experiments show that the proposed method achieves superior performance against state-of-the-art methods on both single and multiple human parsing datasets. Code and trained models are available at https://github.com/ViktorLiang/MGDAN.
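The abstract states that human mask and edge labels are extracted from the parsing annotations. As a minimal illustration of what that step might look like, the sketch below derives both label maps from a small parsing label map in plain Python. It rests on assumptions not stated in the abstract: label 0 is taken to be background, and an edge pixel is taken to be any pixel whose label differs from its right or lower neighbour. The function names are hypothetical, and the authors' released code may use a different procedure (for example, a different edge width).

```python
# Hypothetical sketch, not the authors' implementation.
# Assumption: label 0 is background, every other label is a human part.


def mask_from_parsing(parsing):
    """Foreground mask: 1 wherever a non-background label is present."""
    return [[1 if label != 0 else 0 for label in row] for row in parsing]


def edges_from_parsing(parsing):
    """Edge map: 1 on both sides of any label change between neighbours."""
    height, width = len(parsing), len(parsing[0])
    edges = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # Compare with the right and the lower neighbour only; marking
            # both pixels of a differing pair covers all 4-connected changes.
            for dy, dx in ((0, 1), (1, 0)):
                ny, nx = y + dy, x + dx
                if ny < height and nx < width and parsing[y][x] != parsing[ny][nx]:
                    edges[y][x] = 1
                    edges[ny][nx] = 1
    return edges


# Toy annotation: two part labels (1 and 2) surrounded by background (0).
parsing = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 2, 2, 0],
    [0, 0, 0, 0],
]
mask = mask_from_parsing(parsing)
edges = edges_from_parsing(parsing)
```

In this toy case the mask marks the central 2×2 block, while the edge map also marks the background pixels adjacent to it, since both sides of each label change are flagged.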

Original language: English
Article number: 11
Number of pages: 20
Journal: ACM Transactions on Multimedia Computing, Communications and Applications
Volume: 18
Issue number: 1
DOIs: https://doi.org/10.1145/3467889
Publication status: Published - 14 Mar 2022

Bibliographical note

Funding Information:
A. Mao and Y. Liang contributed equally to this research. This project is supported by the National Natural Science Foundation of China under Grant No. 61972162; Guangdong International Science and Technology Cooperation Project (No. 2021A0505030009); Guangdong Natural Science Foundation (No. 2019A1515010833, 2021A1515012625); Guangzhou Basic and Applied Research Project (No. 202102021074); the Fundamental Research Funds for the Central Universities (No. 2020ZYGXZR089); the Social Science Research Base of Guangdong Province-Research Center of Network Civilization in New Era of SCUT; and the CCF-Tencent Open Research fund under Grant No. CCF-Tencent RAGR20190112.

Authors’ addresses: A. Mao, Y. Liang, Y. Liu, and S. He (corresponding author), School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006, China; emails: ahmao@scut.edu.cn, yuanliang07@gmail.com, csmanlyt@mail.scut.edu.cn, hesfe@scut.edu.cn; J. Jiao, Department of Engineering Science, University of Oxford, Parks Road, Oxford, OX1 3PJ, United Kingdom; email: jianbo@robots.ox.ac.uk.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

© 2022 Association for Computing Machinery. 1551-6857/2022/03-ART11 $15.00 https://doi.org/10.1145/3467889

Publisher Copyright:
© 2022 Association for Computing Machinery.

Keywords

  • deformable convolution
  • human parsing
  • multi-task learning

ASJC Scopus subject areas

  • Hardware and Architecture
  • Computer Networks and Communications
