Combining Character and Word Embeddings for the Detection of Offensive Language in Arabic

Abdullah Alharbi, Mark Lee

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Twitter and other social media platforms let users share their ideas via short posts. While the easy exchange of ideas has value, these microblogs can be exploited by people who want to spread hatred, and such individuals can share negative views about an individual, race, or group with millions of people at the click of a button. There is thus an urgent need for a method that can automatically identify hate speech and offensive language. To contribute to this effort, a shared task on detecting offensive language in Arabic was held during the OSACT4 workshop. A key challenge was the distinctive language used on social media, which gives rise to the out-of-vocabulary (OOV) problem. The use of different Arabic dialects exacerbates this problem. To deal with OOV words, we built a character-level embedding model trained on a large, carefully collected dataset. Embeddings at this level can handle OOV words effectively because they learn vectors for character n-grams, that is, parts of words. The proposed systems were ranked 7th and 8th for Subtasks A and B, respectively.
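
The idea that character n-gram vectors can cover OOV words can be illustrated with a minimal sketch. It is not the authors' implementation: the boundary markers, the hashing of n-grams into a fixed table, the averaging step, and every name and constant below (`char_ngrams`, `build_table`, `word_vector`, `DIM`, `BUCKETS`) are assumptions made for illustration, and random vectors stand in for trained ones.

```python
import random
import zlib

DIM = 8         # embedding size (hypothetical value)
BUCKETS = 1000  # number of hash buckets for n-grams (hypothetical value)


def char_ngrams(word, n_min=3, n_max=5):
    """Return the character n-grams of `word`, wrapped in boundary markers."""
    token = f"<{word}>"
    return [
        token[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(token) - n + 1)
    ]


def build_table(seed=0):
    """Create a table of random vectors standing in for trained n-gram vectors."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(BUCKETS)]


def word_vector(word, table):
    """Average the vectors of a word's n-grams; works for unseen words too."""
    grams = char_ngrams(word)
    if not grams:
        # No n-gram is long enough (e.g. the empty string): fall back to zeros.
        return [0.0] * DIM
    # Map each n-gram to a bucket with a deterministic hash.
    rows = [table[zlib.crc32(g.encode("utf-8")) % BUCKETS] for g in grams]
    # Column-wise mean of the selected rows.
    return [sum(col) / len(rows) for col in zip(*rows)]


if __name__ == "__main__":
    table = build_table()
    print(char_ngrams("abc", 3, 3))
    print(len(word_vector("unseenword", table)))
```

Because a vector is assembled from sub-word pieces, a dialectal spelling that never appeared as a whole word can still receive a representation, provided it shares n-grams with words seen in training.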
Original language: English
Title of host publication: Proceedings of the 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection
Subtitle of host publication: at LREC 2020 - Language Resources and Evaluation Conference
Editors: Hend Al-Khalifa, Walid Magdy, Kareem Darwish, Tamer Elsayed, Hamdy Mubarak
Publisher: European Language Resources Association (ELRA)
Pages: 91-96
ISBN (Print): 9791095546511
Publication status: Published - 12 May 2020
Event: 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection (OSACT4 2020) - Marseille, France
Duration: 11 May 2020 - 16 May 2020

Conference

Conference: 4th Workshop on Open-Source Arabic Corpora and Processing Tools, with a Shared Task on Offensive Language Detection (OSACT4 2020)
Country/Territory: France
City: Marseille
Period: 11/05/20 - 16/05/20

Keywords

  • character-level embeddings
  • word-level embeddings
  • Arabic offensive language detection
