CLEF 2023 · EXIST Shared Task 1

IUEXIST: Multilingual Pre-trained Language Models for Sexism Detection on Twitter

Yash A. Hatekar · Muhammad S. Abdo · Snigdha Khanna · Sandra Kübler

Indiana University, Bloomington, IN, USA

Keywords: Sexism Detection · Transformers · Deep Learning · Pre-processing · Multilingual · Twitter · NLP
01 Introduction & Motivation

What Is Sexism Detection?

Sexism detection is a natural language processing task that aims to automatically identify discriminatory, demeaning, or harmful content targeting people based on sex or gender. On social platforms like Twitter, such content ranges from subtle stereotyping to overt harassment.

This paper presents IUEXIST, a system submitted to the EXIST 2023 Shared Task 1 at CLEF — a multilingual binary classification challenge covering English and Spanish tweets.

Task definition: A tweet is labeled sexist if it (i) is itself sexist, (ii) describes a sexist situation, or (iii) criticizes a sexist behavior.

Why Is Twitter Hard?

  • Very short texts — fewer signals per token
  • Non-standard language, abbreviations, emojis
  • Sarcasm, irony, and implicit references
  • Cross-lingual content (English + Spanish)
  • Annotator subjectivity and disagreement

Example Tweets

● Sexist
"Call me sexist but it just feels wrong that women are reffing the NBA — like go ref the WNBA."
Rationale: Implicitly diminishes women's legitimacy in a professional sports role, expressing that they don't belong in a male-dominated space.

● Sexist (Spanish)
"Esta gringa sigue llorando por el gamergate, que coincidencia que tenga pronombres en su perfil"
English: "This gringa keeps crying about Gamergate, what a coincidence that she has pronouns in her profile"
Rationale: Dismisses a woman's concern about harassment in gaming (Gamergate) by mocking the pronouns in her profile, trivializing harassment experiences.

✓ Non-Sexist
"Even if you get embarrassed and blush, you can still confront hard things. #KeepMoving"
Rationale: General motivational content with no gender-based discrimination or stereotyping. The hashtag provides further positive context.

✓ Non-Sexist (Spanish)
"Los políticos acostumbran a hablarle al pueblo como si fueran una manada de estúpidos pero la manada no hacemos nada por contradecirlos."
English: "Politicians are used to talking to the people as if they were a herd of idiots, but we, the herd, do nothing to contradict them."
Rationale: Political criticism targeting politicians broadly, with no reference to gender or sex-based discrimination.
02 Data & Annotation

Dataset Composition

The core dataset is the EXIST 2023 training set. The team also incorporated additional data from the EXIST 2021 and EXIST 2022 shared tasks to expand coverage.

Source | Language | Count
EXIST 2023 | Spanish | 3,660
EXIST 2023 | English | 3,260
EXIST 2021/2022 | Both (extra) | ~2,040
Total (final training) | | 8,960

Class Distribution

Sexist: 5,593 (62.4%)
Non-Sexist: 3,367 (37.6%)

Annotation Pipeline

Raw Tweets → 6 Human Annotators → Majority Vote → Final Label
⚖️ Tie-breaking Rule: When three annotators say "sexist" and three say "non-sexist" (a 3–3 tie), the tweet is labeled sexist. This design choice partly explains the class imbalance: the dataset contains more sexist than non-sexist examples.
🌐 How tweets were collected: Over 400 popular expressions and terms commonly used to undermine women's roles in society (in English and Spanish) were used as search terms for collecting the corpus.
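The majority vote with the 3–3 tie-breaking rule can be sketched as follows. This is a minimal illustration, not the organizers' code: the function name `aggregate` and the label strings are hypothetical, and the soft label is assumed here to be the fraction of annotators who voted "sexist".

```python
def aggregate(votes):
    """Collapse one tweet's annotator votes into (hard_label, soft_label).

    votes: list of "sexist" / "non-sexist" strings, one per annotator.
    The soft label (share of "sexist" votes) is an assumption made for
    illustration; the hard label follows the tie-breaking rule above.
    """
    n_sexist = sum(1 for v in votes if v == "sexist")
    n_other = len(votes) - n_sexist
    soft = n_sexist / len(votes)
    # ">=" implements the rule that a tie is resolved as "sexist".
    hard = "sexist" if n_sexist >= n_other else "non-sexist"
    return hard, soft


tie = ["sexist"] * 3 + ["non-sexist"] * 3
minority = ["sexist"] * 2 + ["non-sexist"] * 4
print(aggregate(tie))       # ('sexist', 0.5)
print(aggregate(minority))
```

Because ties resolve to the positive class, any tweet with at least half of the votes marked "sexist" ends up in the sexist class.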
03 Pre-processing Pipeline

The team developed a five-step cleaning pipeline. Hashtag text was kept, since an initial experiment showed that removing hashtags destroys signal.

1. Replace URLs: All hyperlinks are replaced with the token URL, removing noise while preserving tweet structure.
2. Remove Retweet Marker: The prefix RT (retweet) is removed; it carries no semantic meaning for sexism classification.
3. Anonymize Usernames: @mentions are replaced with USER to remove personally identifiable references and reduce sparsity.
4. Emoji → Text Conversion: Emojis are converted to their textual descriptions using the Python emoji library. Example: 😡 → :enraged_face:
5. Remove Non-alphanumeric Characters: All special characters are stripped except apostrophes (') and spaces, preserving contractions.

Original Tweet

RT @john: Women shouldn't be in tech! 😡 Check this out: https://t.co/abc123 #womenintech #STEM

After Pre-processing

USER Women shouldn't be in tech enraged face Check this out URL womenintech STEM
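A minimal sketch of the five steps, applied literally (apostrophes and all words are kept). It is not the authors' code. To stay self-contained it uses a tiny hypothetical lookup table, `EMOJI_NAMES`, in place of the `emoji` library; with that package installed, `emoji.demojize` could be passed as the `demojize` argument instead. The regular expressions are assumptions about how each step might be implemented.

```python
import re

# Hypothetical stand-in for the emoji library's name table (illustration only).
EMOJI_NAMES = {"😡": ":enraged_face:"}


def demojize_stub(text):
    """Replace each known emoji with its textual name, padded with spaces."""
    for char, name in EMOJI_NAMES.items():
        text = text.replace(char, f" {name} ")
    return text


def preprocess(tweet, demojize=demojize_stub):
    text = re.sub(r"https?://\S+", " URL ", tweet)   # 1. replace URLs
    text = re.sub(r"^\s*RT\b", " ", text)            # 2. drop the retweet prefix
    text = re.sub(r"@\w+", " USER ", text)           # 3. anonymize usernames
    text = demojize(text)                            # 4. emoji -> text
    # 5. keep letters (including accented ones), digits, apostrophes, spaces;
    #    underscores count as word characters in \w, so strip them explicitly.
    text = re.sub(r"[^\w' ]|_", " ", text)
    return " ".join(text.split())                    # collapse extra whitespace


tweet = ("RT @john: Women shouldn't be in tech! 😡 "
         "Check this out: https://t.co/abc123 #womenintech #STEM")
print(preprocess(tweet))
# USER Women shouldn't be in tech enraged face Check this out URL womenintech STEM
```

Step 5 removes the "#" character but leaves the hashtag text in place, which matches the decision to keep hashtags.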

⚠️ Why Hashtags Were Kept: A tweet like "#Catcalling is #Harassment. It's Not a Compliment." is labeled sexist by annotators. If hashtags are stripped, all meaningful content disappears. Hashtags encode critical topical signals and must be preserved.
04 Models & Training

Traditional Machine Learning Classifiers

All traditional classifiers use TF-IDF or count-based feature representations, implemented with scikit-learn.

  • Multinomial Naive Bayes: default parameters, bag-of-words
  • Support Vector Machine: default parameters, linear kernel
  • XGBoost: tuned via 5-fold GridSearchCV

XGBoost Best Hyperparameters

Parameter | Value
Max Depth | 128
Learning Rate | 0.1
Estimators | 200
Random Seed | 47
Eval Metric | logloss
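As a configuration sketch, the values above could be written as keyword arguments for XGBoost's scikit-learn wrapper. The mapping from the table labels to the argument names (`max_depth`, `learning_rate`, `n_estimators`, `random_state`, `eval_metric`) is an assumption; the source gives only the labels and values.

```python
# Best hyperparameters from the table, keyed by the assumed XGBClassifier
# argument names (the table itself only gives human-readable labels).
XGB_PARAMS = {
    "max_depth": 128,
    "learning_rate": 0.1,
    "n_estimators": 200,
    "random_state": 47,
    "eval_metric": "logloss",
}

# With xgboost installed, the classifier could then be built as:
#   from xgboost import XGBClassifier
#   clf = XGBClassifier(**XGB_PARAMS)
for name, value in XGB_PARAMS.items():
    print(f"{name} = {value}")
```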

Transformer-based Models (via HuggingFace)

All transformers are fine-tuned using HuggingFace AutoTrain. Input lengths are optimized per model.

  • DistilBERT: 128 tokens
  • RoBERTa base: 128 tokens
  • XLM-RoBERTa base: 95 tokens (optimal)
  • XLM-RoBERTa large: 128 tokens
  • TwHIN-BERT base: 128 tokens, Twitter-trained
  • TwHIN-BERT large: 128 tokens, Twitter-trained

Transformer Classification Flow

Tweet (raw text) → Tokenizer (subword tokens) → Transformer (encoder layers) → [CLS] (representation) → Classifier Head (linear layer) → Sexist / Non-Sexist
Why TwHIN-BERT? TwHIN-BERT is a multilingual language model pre-trained on a large corpus of tweets, with an additional social objective derived from the Twitter Heterogeneous Information Network (TwHIN). This makes it well suited to capturing social and linguistic patterns on Twitter.

Ensemble: Four Transformers + XGBoost Meta-Learner

The ensemble combines predictions from four transformer models, using XGBoost as a stacking meta-learner. Training data is the extended set (EXIST 2023 + 2021/2022).

XLM-RoBERTa base · XLM-RoBERTa large · TwHIN-BERT base · TwHIN-BERT large
↓ softmax outputs (probabilities) ↓
XGBoost Meta-Learner (stacking)
↓
SEXIST / NON-SEXIST
Why ensemble? Each model specializes differently — XLM-RoBERTa excels at multilingual semantics, while TwHIN-BERT captures Twitter-specific social patterns. Combining them via a learned meta-classifier can capture complementary signals.
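The input to the stacking meta-learner can be illustrated with a short sketch that turns each base model's output into probabilities and concatenates them into one feature row per tweet. The model keys, the logit values, and the choice to keep both class probabilities per model are assumptions for illustration. The meta-learner itself (XGBoost in the described system) is not included.

```python
import math

# Hypothetical identifiers for the four base models, in a fixed order.
BASE_MODELS = ["xlmr_base", "xlmr_large", "twhin_base", "twhin_large"]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def meta_features(logits_by_model):
    """Build one meta-learner row: [p_non_sexist, p_sexist] per base model."""
    row = []
    for name in BASE_MODELS:
        row.extend(softmax(logits_by_model[name]))
    return row


# Made-up logits for a single tweet: [non-sexist, sexist] per model.
example = {
    "xlmr_base": [0.2, 1.1],
    "xlmr_large": [-0.4, 0.9],
    "twhin_base": [0.7, 0.3],
    "twhin_large": [-1.0, 1.5],
}
row = meta_features(example)
print(len(row))  # 8
```

Rows like this, one per training tweet, together with the gold labels, form the training set for the stacking classifier.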
05 Results & Comparisons

Official Submissions (Table 2)

IUEXIST_1: XLM-RoBERTa Large (single model, official 2023 data only).
IUEXIST_2: Ensemble of 4 transformers + XGBoost (extended data).

Metric: ICM (Information Contrast Measure). Higher = better.

HARD-HARD = hard labels for both training and evaluation  |  SOFT-SOFT = probabilistic / soft labels throughout

Language | Model | HH Rank | HH ICM | HH F1 | SS Rank | SS ICM | Notes
All | IUEXIST_1 | 16 | 0.5313 | 0.7734 | 9 | 0.7115 | Best in SOFT
All | IUEXIST_2 | 15 | 0.5341 | 0.7717 | 17 | 0.6141 | Best in HARD
English | IUEXIST_1 | 19 | 0.5225 | 0.7509 | 9 | 0.6802 | Best in SOFT
English | IUEXIST_2 | 24 | 0.5059 | 0.7419 | 19 | 0.3893 | Weaker in EN
Spanish | IUEXIST_1 | 16 | 0.5294 | 0.7907 | 14 | 0.7076 |
Spanish | IUEXIST_2 | 13 | 0.5460 | 0.7942 | 12 | 0.7479 | Best in ES
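Key takeaway: On HARD-HARD, IUEXIST_2 (ensemble) has the higher overall ICM (0.5341 vs. 0.5313), driven by Spanish (+0.017 ICM), although IUEXIST_1 is ahead on English. On SOFT-SOFT, IUEXIST_1 (single XLM-R large) is clearly stronger overall, ranking 9th vs. 17th for the ensemble, even though the ensemble scores higher on Spanish.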

Development Set Results (Table 3)

Two groups: Original (no pre-processing) vs. Pre-processed.

Without Pre-processing

Model | ICM | F1+ | macro F1 | Notes
Multinomial NB | 0.1354 | 0.6785 | 0.6729 |
SVM | 0.2738 | 0.7108 | 0.7180 |
XGBoost | 0.3220 | 0.7273 | 0.7332 |
DistilBERT | 0.3822 | 0.7427 | 0.7522 |
RoBERTa base | 0.4233 | 0.7479 | 0.7650 |
TwHIN base | 0.5130 | 0.7856 | 0.7934 |
TwHIN large | 0.5377 | 0.7884 | 0.8014 |
XLM-R large | 0.5547 | 0.7965 | 0.8067 |
XLM-R base ★ | 0.5716 | 0.8025 | 0.8120 | Best single model
Ensemble 🏆 | 0.5873 | 0.8054 | 0.8171 | Best overall

Ensemble Variations (Table 4)

How much does each factor contribute? Comparing single model vs. ensemble vs. ensemble with extra data.

Configuration | ICM
XLM-R Large (2023 data only) | 0.5547
Ensemble (2023 data only) | 0.5634
Ensemble (+ 2021/22 data) | 0.5873
💡 Key insight: Going from a single model to an ensemble (same data) adds +0.0087 ICM. Adding the extra training data then adds another +0.0239 ICM, about 2.7 times the gain from ensembling alone. In this comparison, more data helps more than more models.
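The arithmetic behind this comparison can be checked directly from the three ICM values reported above. The dictionary keys are hypothetical labels for the three configurations.

```python
# Development-set ICM for the three configurations in Table 4.
icm = {
    "xlmr_large_2023": 0.5547,
    "ensemble_2023": 0.5634,
    "ensemble_extended": 0.5873,
}

# Gain from ensembling on the same data, then from adding 2021/22 data.
gain_ensemble = round(icm["ensemble_2023"] - icm["xlmr_large_2023"], 4)
gain_data = round(icm["ensemble_extended"] - icm["ensemble_2023"], 4)
ratio = round(gain_data / gain_ensemble, 2)

print(gain_ensemble)  # 0.0087
print(gain_data)      # 0.0239
print(ratio)          # 2.75
```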
06 Pre-processing Impact

Pre-processing is not universally beneficial. Its effect depends heavily on the model architecture.

Model | Original ICM | Pre-processed ICM | Change
Multinomial Naive Bayes | 0.1354 | 0.1446 | ▲ +0.0092
RoBERTa base | 0.4233 | 0.4833 | ▲ +0.0600
SVM | 0.2738 | 0.2377 | ▼ −0.0361
XGBoost | 0.3220 | 0.2638 | ▼ −0.0582
XLM-RoBERTa base | 0.5716 | 0.5167 | ▼ −0.0549
⚠️ Architecture matters: The effect does not split cleanly between traditional and transformer models. Naive Bayes (+0.0092) and RoBERTa base (+0.0600) benefit from pre-processing, while SVM, XGBoost, and XLM-RoBERTa base all lose ICM. A plausible explanation for XLM-RoBERTa is that models pre-trained on raw multilingual web text or tweets (such as XLM-RoBERTa and TwHIN-BERT) already handle emojis, URLs, and non-standard language, so stripping these features removes useful signal.
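A short sketch that recomputes the per-model change in ICM from the reported scores and groups the models by the sign of the change. The variable names are illustrative.

```python
# (original ICM, pre-processed ICM) per model, as reported above.
scores = {
    "Multinomial NB": (0.1354, 0.1446),
    "RoBERTa base": (0.4233, 0.4833),
    "SVM": (0.2738, 0.2377),
    "XGBoost": (0.3220, 0.2638),
    "XLM-RoBERTa base": (0.5716, 0.5167),
}

# Change in ICM caused by pre-processing (positive = pre-processing helps).
deltas = {model: round(pre - orig, 4) for model, (orig, pre) in scores.items()}
helped = sorted(m for m, d in deltas.items() if d > 0)
hurt = sorted(m for m, d in deltas.items() if d < 0)

print(helped)  # ['Multinomial NB', 'RoBERTa base']
print(hurt)    # ['SVM', 'XGBoost', 'XLM-RoBERTa base']
```

Both groups contain a traditional classifier and a transformer, which is why the effect is better described as model-dependent than as architecture-family-dependent.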
07 Discussion & Future Work
⚠️ Challenge: Ambiguous Definitions
Many examples in the dataset are open to interpretation depending on cultural and personal context. The boundaries of "sexist" are inherently subjective.
⚠️ Challenge: Sarcasm & Irony
Tweets that use irony or humor to discuss sexism can be misclassified, especially when the surface form appears non-sexist but the intent is harmful.
⚠️ Challenge: Annotator Disagreement
With 6 annotators per tweet, disagreement is inevitable. The 3–3 tie-breaking rule introduces a systematic bias toward the sexist class.
💡 Insight: Data Quantity > Model Complexity
Adding labeled training data from prior years improved ICM by more than twice as much as moving from a single transformer to a four-model ensemble (+0.0239 vs. +0.0087).
💡 Insight: HARD vs. SOFT Trade-off
The ensemble (IUEXIST_2) is slightly better overall for hard label prediction; the single XLM-R large model (IUEXIST_1) is much better overall for soft/probabilistic evaluation (rank 9 vs. rank 17).
💡 Insight: Pre-processing Is Model-Dependent
No single pre-processing strategy works for all models. XLM-RoBERTa base did best with raw, unprocessed tweet text, whereas RoBERTa base improved with pre-processing.
🔬 Future Work: More Data, Carefully
More training data helps, but can distort class distributions. Future work should investigate data augmentation strategies that balance coverage and class balance.
🔬 Future Work: Temporal Drift
Language evolves. A model trained on 2021 tweets may struggle with 2024 slang. Long-term evaluation across time is needed to measure performance degradation.
🔬 Future Work: Handling Disagreement
Rather than collapsing annotations to a single hard label, future systems could model annotator disagreement directly using soft labels or multi-annotator learning.
08 References & Credits