Country-Level Arabic Dialect Classification
How Indiana University researchers used massive transformer models to identify 18 distinct Arabic dialects in tweets.
Task: 18 Countries
Datasets: NADI + MADAR
Best Result: 70.22 F1 Score
The Problem Space
Arabic is spoken in over 20 countries, yet its regional variations (dialects) differ significantly from Modern Standard Arabic (MSA). Identifying these dialects in "noisy" social media text is crucial for sentiment analysis and machine translation.
"Tweets are often a mix of MSA and dialectal markers, making country-level classification a high-complexity challenge."
Noisy Data
Code Switching
18 Dialects
Neural Adv.
Diacritics Removal: Eliminates punctuation and pronunciation marks, including the vowel marks (tashkeel) that are often inconsistent in informal writing.
Hamza Normalization: Standardizes glottal stops and Alef variations by mapping the multiple forms of Alef (أ, إ, آ) to a single base form (ا).
Kashida Removal: Removes the letter-elongation character (ـ), the horizontal "stretching" line used in Arabic calligraphy for aesthetic purposes.
Lam-Alif Handling: Splits special combined characters into their base constituent letters.
Cleaning & Correction: Applies basic regex-based cleanup of non-Arabic characters, punctuation, and common spelling errors.
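The normalization steps above can be sketched in a few lines of Python. This is an illustrative approximation, not the authors' code: the function name `normalize`, the exact Unicode ranges, and the choice to replace every non-Arabic character with a space are all assumptions, and the spelling-correction step is omitted because the source does not describe its rules.

```python
import re

# Assumed character sets; the original pipeline may use different ranges.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")   # tashkeel marks
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")  # آ, أ, إ
KASHIDA = "\u0640"                                   # tatweel (ـ)
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")       # anything but Arabic letters

# Lam-alef ligatures (Arabic Presentation Forms-B) and their constituent letters.
LAM_ALEF = {
    "\uFEF5": "\u0644\u0622", "\uFEF6": "\u0644\u0622",  # lam + alef with madda
    "\uFEF7": "\u0644\u0623", "\uFEF8": "\u0644\u0623",  # lam + alef with hamza above
    "\uFEF9": "\u0644\u0625", "\uFEFA": "\u0644\u0625",  # lam + alef with hamza below
    "\uFEFB": "\u0644\u0627", "\uFEFC": "\u0644\u0627",  # lam + plain alef
}


def normalize(text: str) -> str:
    """Apply the five cleanup steps in order (hypothetical helper)."""
    # 1. Split lam-alef ligatures so later steps see ordinary letters.
    for ligature, letters in LAM_ALEF.items():
        text = text.replace(ligature, letters)
    # 2. Remove diacritics.
    text = DIACRITICS.sub("", text)
    # 3. Map alef variants to bare alef (ا).
    text = ALEF_VARIANTS.sub("\u0627", text)
    # 4. Remove kashida.
    text = text.replace(KASHIDA, "")
    # 5. Replace non-Arabic characters with spaces, then collapse whitespace.
    text = NON_ARABIC.sub(" ", text)
    return " ".join(text.split())
```

Splitting the ligatures first means a combined form such as lam with hamza-bearing alef still passes through the alef normalization in step 3.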
READY FOR MODELING
Dialect Labels (18 Countries)
Models & Training
AraBERTv2-Large
Largest model tested (370M parameters). Best for context capture.
STATE OF THE ART
AraBERTv2-Base
Balanced efficiency and performance. Similar results to Large.
EFFICIENT CHOICE
CAMeLBERT-Mix
Pre-trained on DID and MADAR datasets specifically.
DIALECT FOCUSED
XGBoost: Tree-based gradient boosting
Naive Bayes: Probabilistic classifier
Random Forest: Ensemble bagging
SVC: Support vector machines
Fine-Tuning Setup
The authors used 5-fold cross-validation and ensembled the models trained on the five folds. Transformers were trained with:
- Learning Rate: 2e-5
- Batch Size: 32
- Max Length: 128
- Optimizer: AdamW
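As a rough sketch of how this setup fits together, the snippet below records the listed hyperparameters in a config object, generates 5-fold splits, and combines fold models by averaging their class probabilities. The names `FineTuneConfig`, `kfold_indices`, and `ensemble_predict` are hypothetical, and probability averaging is an assumption: the source does not say how the fold models were combined.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class FineTuneConfig:
    """Hyperparameters as listed above; the field names are illustrative."""
    learning_rate: float = 2e-5
    batch_size: int = 32
    max_length: int = 128
    optimizer: str = "AdamW"
    n_folds: int = 5


def kfold_indices(n_samples, n_folds, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        val_idx = folds[k]
        train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train_idx, val_idx


def ensemble_predict(fold_probs):
    """Average per-class probabilities across fold models and return the argmax.

    Assumption: averaging is one common way to ensemble fold models;
    the source does not confirm it is the method used.
    """
    n_models = len(fold_probs)
    n_classes = len(fold_probs[0])
    averaged = [sum(p[c] for p in fold_probs) / n_models for c in range(n_classes)]
    return max(range(n_classes), key=averaged.__getitem__)
```

In practice each fold's model would be a fine-tuned transformer producing one probability per country; here `fold_probs` is simply a list of such probability lists for a single tweet.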
Benchmark Leaderboard
The battle between Traditional ML and Neural Transformers
| Model Name | Macro F1 | Accuracy |
|---|---|---|
| AraBERTv2-Large (top performer on test set: 70.22) | 0.710 | 0.710 |
| CAMeLBERT-Mix | 0.710 | 0.710 |
| AraBERTv2-Base | 0.700 | 0.700 |
| XGBoost | 0.520 | 0.520 |
| Naive Bayes | 0.430 | 0.410 |
| AdaBoost | 0.180 | 0.240 |
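Macro F1 and accuracy can diverge, as the Naive Bayes and AdaBoost rows show, because macro F1 weights every country equally regardless of how many tweets it has. The following is a minimal reference sketch of both metrics in plain Python; the two-letter labels in the usage note are toy values, not taken from the datasets.

```python
def accuracy(y_true, y_pred):
    """Fraction of examples whose predicted label matches the gold label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in gold or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # A class with no true positives, false positives, or false negatives scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For example, with gold labels `["EG", "EG", "EG", "SA"]` and a model that always predicts `"EG"`, accuracy is 0.75 while macro F1 is only about 0.43, because the missed class contributes an F1 of zero.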
Experiment: Does Preprocessing Always Help?
Result: Significant Improvement (+4%)
Result: Performance Drop (-2%)
Scientific Takeaways
The Challenge: Lexical Overlap
A major hurdle is how frequently Modern Standard Arabic (MSA) appears in tweets from all 18 countries. Because writers code-switch between MSA and their dialect, much of the vocabulary is shared, which makes it hard for models to find features unique to just one country.
Future: Multi-Task Learning
The authors suggest training on both country and region levels simultaneously to leverage regional commonalities while refining country-specific nuances.
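One way to picture this proposal is a joint objective that sums a country-level loss and a region-level loss for each tweet. The sketch below is purely illustrative: the `COUNTRY_TO_REGION` mapping, the `joint_loss` function, and the weighting parameter `country_weight` are hypothetical, since the source gives no details of the multi-task formulation.

```python
import math

# Hypothetical country-to-region grouping, for illustration only.
COUNTRY_TO_REGION = {
    "Egypt": "Nile Basin",
    "Sudan": "Nile Basin",
    "Morocco": "Maghreb",
    "Algeria": "Maghreb",
}


def cross_entropy(probs, true_label):
    """Negative log-probability assigned to the correct label."""
    return -math.log(probs[true_label])


def joint_loss(country_probs, region_probs, true_country, country_weight=0.5):
    """Weighted sum of country-level and region-level cross-entropy.

    The region label is derived from the country label, so a prediction in
    the wrong country but the right region is penalized less on the region term.
    """
    true_region = COUNTRY_TO_REGION[true_country]
    country_term = cross_entropy(country_probs, true_country)
    region_term = cross_entropy(region_probs, true_region)
    return country_weight * country_term + (1 - country_weight) * region_term
```

In a real model the two probability dictionaries would come from two classification heads sharing one transformer encoder, so gradients from the region task would also shape the shared representation.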
Ensemble Methods
Combining AraBERT and CAMeLBERT outputs to stabilize scores.
Domain Adaptation
Further pre-training on Twitter-specific Arabic dialects.
Segmental Stemming
Testing if reducing words to roots helps bridge lexical gaps.