Classifying Arabic Crisis Tweets using Data Selection and Pre-trained Language Models

Alaa Alharbi; Mark Lee

Computer Science

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

User-generated Social Media (SM) content has been explored as a valuable and accessible source of data about crises to enhance situational awareness and support humanitarian response efforts. However, the timely extraction of crisis-related SM messages is challenging, as it involves processing large quantities of noisy data in real time. Supervised machine learning methods have been successfully applied to this task, but such approaches require human-labelled data, which are unlikely to be available for novel and emerging crises. Supervised machine learning algorithms trained on labelled data from past events often perform poorly when classifying a new disaster because the data vary across events. Using BERT embeddings, we propose and investigate an instance distance-based data selection approach for domain adaptation to improve classifiers’ performance under a domain shift. The K-nearest neighbours algorithm selects the subset of multi-event training data that is most similar to the target event. Results show that fine-tuning a BERT model on the selected subset of data to classify crisis tweets outperforms a model fine-tuned on all available source data. We show that our approach generally works better than the self-training adaptation method. Combining self-training with our proposed classifier does not enhance the performance.
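The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cosine distance, the value of k, the rule of taking the union of each target instance's neighbours, and the function names `cosine_distance` and `select_source_subset` are all assumptions made for illustration, and the toy 2-d vectors stand in for BERT embeddings.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 1.0  # assumption: treat a zero vector as dissimilar
    return 1.0 - dot / (norm_u * norm_v)


def select_source_subset(source_vecs, target_vecs, k):
    """Keep, for each target vector, the indices of its k nearest source vectors.

    Returns the sorted union of those indices (hypothetical selection rule).
    """
    selected = set()
    for t in target_vecs:
        ranked = sorted(
            range(len(source_vecs)),
            key=lambda i: cosine_distance(source_vecs[i], t),
        )
        selected.update(ranked[:k])
    return sorted(selected)


# Toy 2-d vectors standing in for tweet embeddings (hypothetical data).
source = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
target = [[1.0, 0.05], [0.8, 0.2]]

subset = select_source_subset(source, target, k=2)
print(subset)  # indices of the source instances kept for fine-tuning
```

In a real pipeline the selected indices would pick the labelled source tweets used to fine-tune the classifier; how the embeddings are pooled from BERT is left open here.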

Original language	English
Title of host publication	Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection
Editors	Hend Al-Khalifa, Tamer Elsayed, Hamdy Mubarak, Abdulmohsen Al-Thubaity, Walid Magdy, Kareem Darwish
Publisher	European Language Resources Association (ELRA)
Pages	71-78
Number of pages	8
ISBN (Electronic)	9791095546757
Publication status	Published - Jun 2022
Event	5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection, OSACT 2022 - Marseille, France. Duration: 20 Jun 2022 → 25 Jun 2022

Publication series

Name	International Conference on Language Resources and Evaluation (2022)

Conference

Conference	5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection, OSACT 2022
Country/Territory	France
City	Marseille
Period	20/06/22 → 25/06/22

Bibliographical note

Publisher Copyright:
© European Language Resources Association (ELRA).

Keywords

BERT
Crisis Detection
Data Selection
Domain Adaptation
Self-training

ASJC Scopus subject areas

Language and Linguistics
Education
Library and Information Sciences
Linguistics and Language

Access to Document

https://aclanthology.org/2022.osact-1.8/
Licence: Creative Commons: Attribution-NonCommercial (CC BY-NC)

Cite this

Alharbi, A., & Lee, M. (2022). Classifying Arabic Crisis Tweets using Data Selection and Pre-trained Language Models. In H. Al-Khalifa, T. Elsayed, H. Mubarak, A. Al-Thubaity, W. Magdy, & K. Darwish (Eds.), Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection (pp. 71-78). (International Conference on Language Resources and Evaluation (2022)). European Language Resources Association (ELRA). https://aclanthology.org/2022.osact-1.8/

@inproceedings{69bd93a6767040b6a87c18cd19edd5a4,

title = "Classifying Arabic Crisis Tweets using Data Selection and Pre-trained Language Models",

keywords = "BERT, Crisis Detection, Data Selection, Domain Adaptation, Self-training",

author = "Alaa Alharbi and Mark Lee",

note = "Publisher Copyright: {\textcopyright} European Language Resources Association (ELRA).; 5th Workshop Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection, OSACT 2022 ; Conference date: 20-06-2022 Through 25-06-2022",

year = "2022",

month = jun,

language = "English",

series = "International Conference on Language Resources and Evaluation (2022)",

publisher = "European Language Resources Association (ELRA)",

pages = "71--78",

editor = "Hend Al-Khalifa and Tamer Elsayed and Hamdy Mubarak and Abdulmohsen Al-Thubaity and Walid Magdy and Kareem Darwish",

booktitle = "Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection",

}


TY - GEN

T1 - Classifying Arabic Crisis Tweets using Data Selection and Pre-trained Language Models

AU - Alharbi, Alaa

AU - Lee, Mark

N1 - Publisher Copyright: © European Language Resources Association (ELRA).

PY - 2022/6

Y1 - 2022/6

KW - BERT

KW - Crisis Detection

KW - Data Selection

KW - Domain Adaptation

KW - Self-training

UR - http://www.scopus.com/inward/record.url?scp=85145876299&partnerID=8YFLogxK

M3 - Conference contribution

AN - SCOPUS:85145876299

T3 - International Conference on Language Resources and Evaluation (2022)

SP - 71

EP - 78

BT - Proceedings of the 5th Workshop on Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection

A2 - Al-Khalifa, Hend

A2 - Elsayed, Tamer

A2 - Mubarak, Hamdy

A2 - Al-Thubaity, Abdulmohsen

A2 - Magdy, Walid

A2 - Darwish, Kareem

PB - European Language Resources Association (ELRA)

T2 - 5th Workshop Open-Source Arabic Corpora and Processing Tools with Shared Tasks on Qur'an QA and Fine-Grained Hate Speech Detection, OSACT 2022

Y2 - 20 June 2022 through 25 June 2022

ER -
