Tackling Social Value Tasks with Multilingual NLP
Date
2022-12-23
Type
Thesis
Degree Level
Masters
Abstract
In recent years, deep learning applications have shown promise in tackling social value tasks such as hate speech and misinformation detection in social media. Neural networks provide an efficient automated solution that has largely replaced hand-engineered systems. Existing studies that build resources, e.g. datasets, models, and NLP solutions, have achieved strong performance. However, most of these systems are limited to English, neglecting the bulk of hateful and misleading content generated in other languages, particularly so-called low-resource languages (e.g. Turkish), for which little labeled or unlabeled data is available for training machine learning models. This limitation stems from the lack of large labeled or unlabeled corpora and of manually crafted linguistic resources sufficient for building NLP systems in these languages.
In this thesis, we set out to explore solutions for low-resource languages that mitigate the language gap in NLP systems for social value tasks. The thesis studies two tasks. First, we show that developing an automated classifier that captures hate speech and its nuances in a low-resource language variety with limited data is extremely challenging. To tackle this, we propose HateMAML, a model-agnostic meta-learning framework that effectively performs hate speech detection in low-resource languages. The proposed method uses a self-supervision strategy to overcome data scarcity and produces a better pre-trained model for fast adaptation to an unseen target language. Second, the thesis addresses research gaps in rumour detection by modifying the standard Transformer and building on a multilingual pre-trained language model to perform rumour detection in multiple languages. Specifically, our proposed model MUSCAT prioritizes the source claims in multilingual conversation threads with co-attention transformers. Both methods can be seen as efficient transfer learning techniques that mitigate the problems of training models on small data.
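For readers unfamiliar with the meta-learning recipe HateMAML builds on, the standard MAML pattern is an inner loop that adapts a copy of the model on a small per-language "support" set, and an outer loop that updates the shared initialization so it adapts quickly. The following is a minimal sketch under stated assumptions: a PyTorch (>= 2.0) classifier and torch.func.functional_call; the helper names and task format are illustrative, not code from the thesis.

    # Generic MAML inner/outer loop sketch (not the thesis implementation).
    import torch
    import torch.nn.functional as F

    def inner_adapt(model, support_x, support_y, inner_lr=1e-3, steps=1):
        # Clone the parameters and take a few gradient steps on one
        # language's small support set (the fast-adaptation phase).
        params = {n: p.clone() for n, p in model.named_parameters()}
        for _ in range(steps):
            logits = torch.func.functional_call(model, params, (support_x,))
            loss = F.cross_entropy(logits, support_y)
            grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)
            params = {n: p - inner_lr * g for (n, p), g in zip(params.items(), grads)}
        return params

    def meta_train_step(model, meta_opt, tasks, inner_lr=1e-3):
        # Outer loop: the meta-objective is each task-adapted model's loss
        # on held-out "query" data, so the shared initialization learns
        # to adapt fast to new languages.
        meta_opt.zero_grad()
        meta_loss = 0.0
        for support_x, support_y, query_x, query_y in tasks:  # one task per language
            adapted = inner_adapt(model, support_x, support_y, inner_lr)
            logits = torch.func.functional_call(model, adapted, (query_x,))
            meta_loss = meta_loss + F.cross_entropy(logits, query_y)
        (meta_loss / len(tasks)).backward()
        meta_opt.step()

HateMAML additionally uses self-supervision to supply training signal for the unlabeled target language; that component is omitted from the sketch. Likewise, the co-attention idea in MUSCAT, letting the source claim attend over the conversation thread and vice versa, can be sketched generically; again, dimensions and names are assumptions, not the thesis code.

    import torch.nn as nn

    class CoAttentionBlock(nn.Module):
        # Generic co-attention: the claim attends over the thread and vice versa.
        def __init__(self, dim=768, heads=8):
            super().__init__()
            self.claim_to_thread = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.thread_to_claim = nn.MultiheadAttention(dim, heads, batch_first=True)

        def forward(self, claim, thread):
            # claim:  (B, Lc, dim) source-claim token states from a multilingual encoder
            # thread: (B, Lt, dim) conversation-thread token states
            claim_ctx, _ = self.claim_to_thread(claim, thread, thread)
            thread_ctx, _ = self.thread_to_claim(thread, claim, claim)
            return claim_ctx, thread_ctx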
The findings yield accurate and efficient transfer learning models for low-resource languages. The results show that our proposed approaches outperform state-of-the-art baselines in the cross-domain multilingual transfer setting. We also conduct ablation studies to analyze the characteristics of the proposed solutions and provide an empirical analysis outlining the challenges of collecting data for, and performing, detection tasks in multiple languages.
Keywords
multilingual nlp, social nlp, deep learning, hate speech detection, rumor detection, transformers
Degree
Master of Science (M.Sc.)
Department
Computer Science
Program
Computer Science