Classifying Arabic Crisis Tweets using Data Selection and Pre-trained Language Models

Alaa Alharbi, Mark Lee

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

User-generated Social Media (SM) content has been explored as a valuable and accessible source of data about crises to enhance situational awareness and support humanitarian response efforts. However, the timely extraction of crisis-related SM messages is challenging, as it involves processing large quantities of noisy data in real time. Supervised machine learning methods have been successfully applied to this task, but such approaches require human-labelled data, which are unlikely to be available for novel and emerging crises. Supervised machine learning algorithms trained on labelled data from past events usually do not perform well when classifying a new disaster, because the data vary across events. Using BERT embeddings, we propose and investigate an instance distance-based data selection approach for domain adaptation to improve classifiers’ performance under domain shift. The K-nearest neighbours algorithm selects a subset of multi-event training data that is most similar to the target event. Results show that fine-tuning a BERT model on a selected subset of data to classify crisis tweets outperforms a model that has been fine-tuned on all available source data. We demonstrate that our approach generally works better than the self-training adaptation method. Combining self-training with our proposed classifier does not enhance performance.
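
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_source_subset`, the use of cosine similarity, the value of `k`, and the rule of keeping every source instance that is a neighbour of at least one target instance are all assumptions made here for illustration. The input arrays stand in for BERT sentence embeddings, which would be computed separately.

```python
import numpy as np


def select_source_subset(source_emb, target_emb, k=3):
    """Return sorted indices of source rows that are among the k nearest
    neighbours (by cosine similarity) of at least one target row.

    source_emb: (n_source, dim) array, stand-in for embeddings of labelled
                tweets from past events (assumed precomputed).
    target_emb: (n_target, dim) array, stand-in for embeddings of unlabelled
                tweets from the new event.
    """
    source_emb = np.asarray(source_emb, dtype=float)
    target_emb = np.asarray(target_emb, dtype=float)

    # L2-normalise rows so that a dot product equals cosine similarity.
    src = source_emb / np.linalg.norm(source_emb, axis=1, keepdims=True)
    tgt = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)

    # Similarity of every target instance to every source instance.
    sims = tgt @ src.T  # shape: (n_target, n_source)

    # For each target row, take the indices of its k most similar source rows.
    k = min(k, src.shape[0])
    nearest = np.argsort(-sims, axis=1)[:, :k]

    # The union of all neighbour indices is the selected training subset.
    return np.unique(nearest)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = rng.normal(size=(100, 16))  # toy stand-in for source embeddings
    target = rng.normal(size=(10, 16))   # toy stand-in for target embeddings
    selected = select_source_subset(source, target, k=5)
    print(f"selected {selected.size} of {source.shape[0]} source instances")
```

In this sketch, the returned indices would pick out the labelled source tweets on which a classifier is then fine-tuned, in place of the full multi-event training set.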

Original language: English
Title of host publication: Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection
Editors: Hend Al-Khalifa, Tamer Elsayed, Hamdy Mubarak, Abdulmohsen Al-Thubaity, Walid Magdy, Kareem Darwish
Publisher: European Language Resources Association (ELRA)
Pages: 71-78
Number of pages: 8
ISBN (Electronic): 9791095546757
Publication status: Published - Jun 2022
Event: 5th Workshop Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection, OSACT 2022 - Marseille, France
Duration: 20 Jun 2022 → 25 Jun 2022

Publication series

Name: International Conference on Language Resources and Evaluation (2022)

Conference

Conference: 5th Workshop Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection, OSACT 2022
Country/Territory: France
City: Marseille
Period: 20/06/22 → 25/06/22

Bibliographical note

Publisher Copyright:
© European Language Resources Association (ELRA).

Keywords

  • BERT
  • Crisis Detection
  • Data Selection
  • Domain Adaptation
  • Self-training

ASJC Scopus subject areas

  • Language and Linguistics
  • Education
  • Library and Information Sciences
  • Linguistics and Language
