Code-Mixed Probes Show How Pre-Trained Models Generalise On Code-Switched Text

Frances A. Laureano De Leon, Harish Tayyar Madabushi, Mark Lee

Research output: Working paper/Preprint › Preprint


Abstract

Code-switching is a prevalent linguistic phenomenon in which multilingual individuals seamlessly alternate between languages. Despite its widespread use online and recent research trends in this area, work on code-switching presents unique challenges, primarily stemming from the scarcity of labelled data and other available resources. In this study, we investigate how pre-trained language models (PLMs) handle code-switched (CS) text along three dimensions: a) the ability of PLMs to detect code-switched text, b) variations in the structural information that PLMs utilise to capture code-switched text, and c) the consistency of semantic information representation in code-switched text. To conduct a systematic and controlled evaluation of the language models in question, we create a novel dataset of well-formed, naturalistic code-switched text along with parallel translations into the source languages. Our findings reveal that pre-trained language models are effective in generalising to code-switched text, shedding light on the ability of these models to generalise their representations to CS corpora. We release all our code and data, including the novel corpus, at https://github.com/francesita/code-mixed-probes.
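
The semantic-consistency probe described in the abstract can be illustrated with a minimal sketch: embed a code-switched sentence and its parallel translations with a multilingual PLM and compare the resulting embeddings. This is not the authors' released code (see the repository linked above); the model name bert-base-multilingual-cased, the mean-pooling strategy, and the Spanish-English example sentences are assumptions made for illustration only.

# Hedged sketch of a semantic-consistency comparison; not the authors' released code.
# Model, pooling strategy, and example sentences are illustrative assumptions.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "bert-base-multilingual-cased"  # assumed multilingual PLM
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModel.from_pretrained(MODEL_NAME)
model.eval()

def sentence_embedding(text: str) -> torch.Tensor:
    """Mean-pool the final hidden layer over non-padding tokens."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        outputs = model(**inputs)
    hidden = outputs.last_hidden_state              # (1, seq_len, hidden_dim)
    mask = inputs["attention_mask"].unsqueeze(-1)   # (1, seq_len, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# Hypothetical Spanish-English pair; the released corpus pairs each
# code-switched sentence with translations into the source languages.
code_switched = "I went to the tienda to buy some pan."
english = "I went to the store to buy some bread."
spanish = "Fui a la tienda a comprar pan."

cs_vec = sentence_embedding(code_switched)
for label, sentence in [("English", english), ("Spanish", spanish)]:
    sim = torch.nn.functional.cosine_similarity(cs_vec, sentence_embedding(sentence)).item()
    print(f"cosine(code-switched, {label}) = {sim:.3f}")

Consistently high similarity between the code-switched sentence and both of its translations would be in line with the abstract's finding that PLMs generalise their representations to code-switched text.
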
Original language: English
Publisher: arXiv
Pages: 1-13
Number of pages: 13
DOIs
Publication status: Published - 7 Mar 2024

Bibliographical note

Accepted for publication at the Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). Data and code available at https://github.com/francesita/code-mixed-probes

Keywords

  • cs.CL
