IUEXIST: Multilingual Pre-trained Language Models for Sexism Detection on Twitter
Indiana University, Bloomington, IN, USA
What Is Sexism Detection?
Sexism detection is a natural language processing task that aims to automatically identify discriminatory, demeaning, or harmful content targeting people based on sex or gender. On social platforms like Twitter, such content ranges from subtle stereotyping to overt harassment.
This paper presents IUEXIST, a system submitted to the EXIST 2023 Shared Task 1 at CLEF — a multilingual binary classification challenge covering English and Spanish tweets.
Why Is Twitter Hard?
- Very short texts — fewer signals per token
- Non-standard language, abbreviations, emojis
- Sarcasm, irony, and implicit references
- Cross-lingual content (English + Spanish)
- Annotator subjectivity and disagreement
Dataset Composition
The core dataset is the EXIST 2023 training set. The team also incorporated additional data from the EXIST 2021 and EXIST 2022 shared tasks to expand coverage.
| Source | Language | Count |
|---|---|---|
| EXIST 2023 | Spanish | 3,660 |
| EXIST 2023 | English | 3,260 |
| EXIST 2021/2022 | Both (extra) | ~2,040 |
| Total (final training) | Both | 8,960 |
Class Distribution
Annotation Pipeline
The team developed a five-step cleaning pipeline. Notably, hashtag text was kept after an initial experiment showed that removing hashtags discards useful signal.
Original Tweet
RT @john: Women shouldn't be in tech! 😡 Check this out: https://t.co/abc123 #womenintech #STEM
After Pre-processing
USER Women shouldnt be in tech enraged face Check this out URL womenintech STEM
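The five steps are not enumerated here, so the following is only a minimal sketch of a pipeline that would produce output of this shape. The step order, the regular expressions, and the one-entry `EMOJI_NAMES` table are assumptions made for illustration, not the team's implementation.

```python
import re

# Hypothetical one-entry emoji lexicon (U+1F621); a real pipeline would
# need a full emoji-to-text resource.
EMOJI_NAMES = {"\U0001F621": "enraged face"}


def preprocess(tweet: str) -> str:
    """Clean a tweet in five assumed steps, keeping hashtag text."""
    text = re.sub(r"^RT\s+", "", tweet)           # 1. drop the retweet marker
    text = re.sub(r"@\w+:?", "USER", text)        # 2. mask user mentions
    text = re.sub(r"https?://\S+", "URL", text)   # 3. mask links
    for emoji_char, name in EMOJI_NAMES.items():  # 4. spell out emojis
        text = text.replace(emoji_char, f" {name} ")
    text = re.sub(r"[^\w\s]", "", text)           # 5. strip punctuation, including '#'
    return " ".join(text.split())                 # collapse repeated whitespace


print(preprocess("RT @john: Women shouldn't be in tech! \U0001F621 "
                 "https://t.co/abc123 #womenintech #STEM"))
# USER Women shouldnt be in tech enraged face URL womenintech STEM
```

Because step 5 removes only the `#` character, the hashtag words themselves survive, which matches the decision to keep hashtags.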
Traditional Machine Learning Classifiers
All traditional classifiers use TF-IDF or count-based feature representations, implemented with scikit-learn.
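To make the TF-IDF representation concrete, here is a small plain-Python sketch. It is a stand-in for illustration only: the whitespace tokenizer, the smoothed idf form, and the L2 normalization are assumptions chosen to resemble common defaults, and the function name `tfidf` is hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf(docs: List[str]) -> List[Dict[str, float]]:
    """Return one L2-normalized TF-IDF weight dictionary per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed inverse document frequency (assumed variant).
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1.0 for t, d in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


vectors = tfidf(["USER women in tech", "USER men in tech"])
# "women" occurs in one document, "in" in both, so "women" gets more weight.
print(vectors[0]["women"] > vectors[0]["in"])
```

Terms shared by every tweet (such as the `USER` placeholder) receive the lowest weight, while terms that distinguish one tweet from another are emphasized.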
XGBoost Best Hyperparameters
| Hyperparameter | Value |
|---|---|
| Max Depth | 128 |
| Learning Rate | 0.1 |
| Estimators | 200 |
| Random Seed | 47 |
| Eval Metric | logloss |
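As a configuration sketch, the table above could be written as keyword arguments for the `xgboost` scikit-learn wrapper. The mapping from table labels to parameter names (for example, "Estimators" to `n_estimators`) is an assumption, and the constant name `XGB_PARAMS` is hypothetical.

```python
# Assumed mapping of the reported hyperparameters to XGBClassifier arguments.
XGB_PARAMS = {
    "max_depth": 128,
    "learning_rate": 0.1,
    "n_estimators": 200,
    "random_state": 47,
    "eval_metric": "logloss",
}

# With the xgboost package installed, these would typically be passed as:
#   from xgboost import XGBClassifier
#   model = XGBClassifier(**XGB_PARAMS)
```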
Transformer-based Models (via HuggingFace)
All transformers are fine-tuned using HuggingFace AutoTrain. Input lengths are optimized per model.
Transformer Classification Flow
raw text → subword tokens → encoder layers → representation → linear layer → predicted label (e.g., Non-Sexist)
Ensemble: Four Transformers + XGBoost Meta-Learner
The ensemble combines predictions from four transformer models, using XGBoost as a stacking meta-learner. Training data is the extended set (EXIST 2023 + 2021/2022).
[Diagram: four transformer base learners (two base-size, two large-size) feeding an XGBoost meta-learner]
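The core of stacking is that each base model's predicted probability becomes one input feature for the meta-learner. The sketch below shows only that feature-building step, under the assumption that each transformer outputs one probability of the sexist class per tweet. The function name `stack_features` and the example values are hypothetical.

```python
from typing import List, Sequence


def stack_features(base_probs: Sequence[Sequence[float]]) -> List[List[float]]:
    """Turn per-model probability lists into one feature row per tweet."""
    n_tweets = len(base_probs[0])
    if any(len(probs) != n_tweets for probs in base_probs):
        raise ValueError("every base model must score the same tweets")
    # Transpose: rows become tweets, columns become base models.
    return [list(row) for row in zip(*base_probs)]


# Illustrative probabilities from four base models for two tweets.
meta_X = stack_features([
    [0.9, 0.2],
    [0.8, 0.4],
    [0.7, 0.1],
    [0.6, 0.3],
])
print(meta_X)  # [[0.9, 0.8, 0.7, 0.6], [0.2, 0.4, 0.1, 0.3]]

# With the xgboost package installed, a meta-learner could then be fit on
# these rows and the gold labels, e.g. XGBClassifier().fit(meta_X, labels).
```

To avoid leaking training labels into the meta-learner, stacked setups usually compute the base-model probabilities on held-out folds; whether the team did so is not stated here.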
Official Submissions (Table 2)
IUEXIST_1: XLM-RoBERTa Large (single model, official 2023 data only).
IUEXIST_2: Ensemble of 4 transformers + XGBoost (extended data).
Metric: ICM (Information Contrast Measure). Higher = better.
HARD-HARD (HH) = hard labels for both training and evaluation.
SOFT-SOFT (SS) = probabilistic (soft) labels throughout.
| Language | Model | HH Rank | HH ICM | HH F1 | SS Rank | SS ICM | Notes |
|---|---|---|---|---|---|---|---|
| All | IUEXIST_1 | 16 | 0.5313 | 0.7734 | 9 | 0.7115 | Best in SOFT |
| All | IUEXIST_2 | 15 | 0.5341 | 0.7717 | 17 | 0.6141 | Best in HARD |
| English | IUEXIST_1 | 19 | 0.5225 | 0.7509 | 9 | 0.6802 | Best in SOFT |
| English | IUEXIST_2 | 24 | 0.5059 | 0.7419 | 19 | 0.3893 | Weaker in EN |
| Spanish | IUEXIST_1 | 16 | 0.5294 | 0.7907 | 14 | 0.7076 | |
| Spanish | IUEXIST_2 | 13 | 0.5460 | 0.7942 | 12 | 0.7479 | Best in ES |
Development Set Results (Table 3)
Two groups: Original (no pre-processing) vs. Pre-processed.
Ensemble Variations (Table 4)
How much does each factor contribute? Comparing single model vs. ensemble vs. ensemble with extra data.
[Chart: single model (2023 data only) vs. ensemble (2023 data only) vs. ensemble (+ 2021/22 data)]
Pre-processing is not universally beneficial. Its effect depends heavily on the model architecture.
Many examples in the dataset are open to interpretation depending on cultural and personal context. The boundaries of "sexist" are inherently subjective.
Tweets that use irony or humor to discuss sexism can be misclassified — especially when the surface form appears non-sexist but the intent is harmful.
With 6 annotators per tweet, disagreement is inevitable. The 3–3 tie-breaking rule introduces a systematic bias toward the sexist class.
Adding more labeled training data from prior years improved performance far more than moving from a single transformer to a four-model ensemble.
The ensemble (IUEXIST_2) is better for hard label prediction; the single XLM-R model (IUEXIST_1) is much better for soft/probabilistic evaluation (rank 9 vs. rank 17).
No single pre-processing strategy works for all models. Transformer models pre-trained on social media data do best with raw, unprocessed tweet text.
More training data helps, but can distort class distributions. Future work should investigate data augmentation strategies that balance coverage and class balance.
Language evolves. A model trained on 2021 tweets may struggle with 2024 slang. Long-term evaluation across time is needed to measure performance degradation.
Rather than collapsing annotations to a single hard label, future systems could model annotator disagreement directly using soft labels or multi-annotator learning.
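A minimal sketch of what training against soft labels could look like: the six annotator votes become a probability distribution, and a prediction is scored with cross-entropy against that distribution. The `"YES"`/`"NO"` vote encoding and both function names are assumptions made for illustration.

```python
import math
from typing import Dict, Sequence


def soft_label(votes: Sequence[str]) -> Dict[str, float]:
    """Convert annotator votes ("YES"/"NO") into a probability distribution."""
    if not votes:
        raise ValueError("at least one annotation is required")
    p_yes = sum(vote == "YES" for vote in votes) / len(votes)
    return {"YES": p_yes, "NO": 1.0 - p_yes}


def soft_cross_entropy(target: Dict[str, float],
                       predicted: Dict[str, float],
                       eps: float = 1e-12) -> float:
    """Cross-entropy of a predicted distribution against a soft target."""
    return -sum(target[c] * math.log(max(predicted[c], eps)) for c in target)


target = soft_label(["YES", "YES", "YES", "YES", "NO", "NO"])
calibrated = {"YES": 0.65, "NO": 0.35}
overconfident = {"YES": 0.99, "NO": 0.01}
# The calibrated prediction is penalized less than the overconfident one.
print(soft_cross_entropy(target, calibrated) < soft_cross_entropy(target, overconfident))
```

Under this objective a 4-to-2 split is treated differently from a unanimous vote, so disagreement among annotators is kept as signal instead of being discarded.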