IUEXIST: Sexism Detection on Twitter — Interactive Paper Summary

01 Introduction & Motivation

What Is Sexism Detection?

Sexism detection is a natural language processing task that aims to automatically identify discriminatory, demeaning, or harmful content targeting people based on sex or gender. On social platforms like Twitter, such content ranges from subtle stereotyping to overt harassment.

This paper presents IUEXIST, a system submitted to the EXIST 2023 Shared Task 1 at CLEF — a multilingual binary classification challenge covering English and Spanish tweets.

Task definition: A tweet is labeled sexist if it (i) is itself sexist, (ii) describes a sexist situation, or (iii) criticizes a sexist behavior.

Why Is Twitter Hard?

Very short texts: few signals per tweet
Non-standard language, abbreviations, emojis
Sarcasm, irony, and implicit references
Cross-lingual content (English + Spanish)
Annotator subjectivity and disagreement

Example Tweets

Each example tweet below is followed by its annotation rationale.

● Sexist

"Call me sexist but it just feels wrong that women are reffing the NBA — like go ref the WNBA."

Rationale: Implicitly diminishes women's legitimacy in a professional sports role, expressing that they don't belong in a male-dominated space.

● Sexist (Spanish)

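"Esta gringa sigue llorando por el gamergate, que coincidencia que tenga pronombres en su perfil" (English: "This gringa keeps crying about Gamergate, what a coincidence that she has pronouns in her profile")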

Rationale: Dismisses a woman's concern about harassment in gaming (Gamergate) by mocking her gender identity, trivializing harassment experiences.

✓ Non-Sexist

"Even if you get embarrassed and blush, you can still confront hard things. #KeepMoving"

Rationale: General motivational content with no gender-based discrimination or stereotyping. The hashtag provides further positive context.

✓ Non-Sexist (Spanish)

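"Los políticos acostumbran a hablarle al pueblo como si fueran una manada de estúpidos pero la manada no hacemos nada por contradecirlos." (English: "Politicians tend to talk to the people as if they were a herd of idiots, but we, the herd, do nothing to contradict them.")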

Rationale: Political criticism targeting politicians broadly, with no reference to gender or sex-based discrimination.

02 Data & Annotation

Dataset Composition

The core dataset is the EXIST 2023 training set. The team also incorporated additional data from the EXIST 2021 and EXIST 2022 shared tasks to expand coverage.

Source	Language	Count
EXIST 2023	Spanish	3,660
EXIST 2023	English	3,260
EXIST 2021/2022	Both (extra)	~2,040
Total (final training)		8,960

Class Distribution

Sexist: 5,593 (62.4%)
Non-Sexist: 3,367 (37.6%)

Annotation Pipeline

Raw Tweets → 6 Human Annotators → Majority Vote → Final Label

⚖️ Tie-breaking Rule: When three annotators say "sexist" and three say "non-sexist" (a 3–3 tie), the tweet is labeled sexist. This design choice contributes to the class imbalance: the dataset contains more sexist than non-sexist examples.
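
The labeling rule can be sketched in a few lines of Python. This is a minimal illustration, assuming each tweet comes with its annotator votes encoded as the strings "sexist" and "non-sexist"; the function name `aggregate_label` is hypothetical and not taken from the shared task's tooling.

```python
from collections import Counter


def aggregate_label(votes):
    """Collapse annotator votes into one hard label; ties go to 'sexist'."""
    counts = Counter(votes)
    # ">=" implements the tie-breaking rule: a 3-3 split counts as sexist.
    if counts["sexist"] >= counts["non-sexist"]:
        return "sexist"
    return "non-sexist"


print(aggregate_label(["sexist"] * 3 + ["non-sexist"] * 3))  # sexist
print(aggregate_label(["sexist"] * 2 + ["non-sexist"] * 4))  # non-sexist
```

Because every tie resolves the same way, the rule can only add examples to the sexist class, never to the non-sexist one.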

🌐 How tweets were collected: Over 400 popular expressions and terms commonly used to undermine women's roles in society (in English and Spanish) were used as search terms for collecting the corpus.

03 Pre-processing Pipeline

The team developed a five-step cleaning pipeline. Notably, hashtags were kept after an initial experiment showed removing them destroys signal.

1. Replace URLs: All hyperlinks are replaced with the token URL, removing noise while preserving tweet structure.
2. Remove Retweet Marker: The prefix RT (retweet) is removed because it carries no semantic meaning for sexism classification.
3. Anonymize Usernames: @mentions are replaced with USER to remove personally identifiable references and reduce sparsity.
4. Emoji → Text Conversion: Emojis are converted to their textual descriptions using the Python emoji library. Example: 😡 → :enraged_face:
5. Remove Non-alphanumeric Characters: All special characters are stripped except apostrophes (') and spaces, preserving contractions.

Original Tweet

RT @john: Women shouldn't be in tech! 😡 Check this out: https://t.co/abc123 #womenintech #STEM

After Pre-processing

USER Women shouldn't be in tech enraged face Check this out URL womenintech STEM
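
The five steps can be approximated with regular expressions. This is a sketch under stated assumptions, not the team's code: the patterns, the name `preprocess`, and the choice to turn underscores in emoji names into spaces are illustrative. `emoji.demojize` comes from the third-party emoji package; when that package is missing, step 4 is skipped and emojis are simply dropped by step 5.

```python
import re

try:
    import emoji  # third-party package: pip install emoji
except ImportError:  # keep the sketch runnable without the dependency
    emoji = None

URL_RE = re.compile(r"https?://\S+")
RT_RE = re.compile(r"^\s*RT\b\s*")
USER_RE = re.compile(r"@\w+")
NON_ALNUM_RE = re.compile(r"[^\w\s']")  # keeps letters, digits, spaces, apostrophes


def preprocess(tweet: str) -> str:
    text = URL_RE.sub("URL", tweet)    # 1. replace URLs
    text = RT_RE.sub("", text)         # 2. remove the retweet marker
    text = USER_RE.sub("USER", text)   # 3. anonymize usernames
    if emoji is not None:              # 4. emoji -> text description
        text = emoji.demojize(text, delimiters=(" ", " "))
    text = text.replace("_", " ")      # assumption: split emoji names into words
    text = NON_ALNUM_RE.sub(" ", text) # 5. strip remaining special characters
    return " ".join(text.split())      # collapse repeated whitespace


tweet = "RT @john: Women shouldn't be in tech! Check this out: https://t.co/abc123 #womenintech #STEM"
print(preprocess(tweet))
# USER Women shouldn't be in tech Check this out URL womenintech STEM
```

Because `\w` is Unicode-aware in Python 3, accented Spanish letters survive step 5, and hashtags lose only the `#` symbol while keeping their text.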

⚠️ Why Hashtags Were Kept: A tweet like "#Catcalling is #Harassment. It's Not a Compliment." is labeled sexist by annotators. If hashtags are stripped, all meaningful content disappears. Hashtags encode critical topical signals and must be preserved.

04 Models & Training

Traditional Machine Learning Classifiers

All traditional classifiers use TF-IDF or count-based feature representations, implemented with scikit-learn.

Multinomial Naive Bayes: default parameters, bag-of-words
Support Vector Machine: default parameters, linear kernel
XGBoost: tuned via 5-fold GridSearchCV

XGBoost Best Hyperparameters

Max Depth	128
Learning Rate	0.1
Estimators	200
Random Seed	47
Eval Metric	logloss
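
A baseline setup along these lines could look as follows in scikit-learn. This is a hedged sketch: which vectorizer was paired with which classifier is an assumption, the toy tweets are invented, and the grid searched by GridSearchCV is not given in this summary, so only the reported best values are reproduced. The XGBoost pipeline is built only if the xgboost package is installed.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Best hyperparameters from the table above, using xgboost.XGBClassifier keyword names.
XGB_BEST_PARAMS = {
    "max_depth": 128,
    "learning_rate": 0.1,
    "n_estimators": 200,
    "random_state": 47,
    "eval_metric": "logloss",
}


def build_baselines():
    """Return untrained text-classification pipelines keyed by a short name."""
    models = {
        "nb": make_pipeline(CountVectorizer(), MultinomialNB()),
        "svm": make_pipeline(TfidfVectorizer(), SVC(kernel="linear")),
    }
    try:
        from xgboost import XGBClassifier  # optional third-party dependency
    except ImportError:
        return models
    models["xgb"] = make_pipeline(TfidfVectorizer(), XGBClassifier(**XGB_BEST_PARAMS))
    return models


# Toy data invented for illustration only (1 = sexist, 0 = non-sexist).
texts = ["women cannot drive", "nice weather today",
         "girls are bad at math", "the match starts at noon"]
labels = [1, 0, 1, 0]

models = build_baselines()
models["nb"].fit(texts, labels)
predictions = models["nb"].predict(["women are bad at math"])
```

Wrapping the vectorizer and classifier in one pipeline keeps feature extraction inside each cross-validation fold, which matters when the same object is later passed to a grid search.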

Transformer-based Models (via HuggingFace)

All transformers are fine-tuned using HuggingFace AutoTrain. Input lengths are optimized per model.

Model	Max input length	Notes
DistilBERT	128 tokens	
RoBERTa base	128 tokens	
XLM-RoBERTa base	95 tokens	optimal
XLM-RoBERTa large	128 tokens	
TwHIN-BERT base	128 tokens	Twitter-trained
TwHIN-BERT large	128 tokens	Twitter-trained

Transformer Classification Flow

Tweet (raw text) → Tokenizer (subword tokens) → Transformer (encoder layers) → [CLS] (representation) → Classifier Head (linear layer) → Sexist / Non-Sexist
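
The last two stages of this flow, a linear layer over the [CLS] representation followed by a softmax, can be illustrated without any deep learning library. The sketch below uses a made-up 4-dimensional vector and hand-picked weights purely for illustration; in a real model the vector has hundreds of dimensions and the weights are learned during fine-tuning.

```python
import math


def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(cls_vector, weights, bias, labels=("non-sexist", "sexist")):
    """Apply a linear layer to the [CLS] vector and pick the likeliest label."""
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    best = max(range(len(labels)), key=probs.__getitem__)
    return labels[best], probs


# Invented values: one weight row per output class.
cls_vector = [0.5, -1.0, 0.25, 2.0]
weights = [[0.1, 0.2, -0.3, -0.5],
           [-0.1, -0.2, 0.3, 0.5]]
bias = [0.0, 0.0]

label, probs = classify(cls_vector, weights, bias)
print(label)  # sexist
```

The probabilities returned here are the same kind of softmax outputs that a stacking ensemble can consume as features.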

Why TwHIN-BERT? TwHIN-BERT is a multilingual BERT variant pre-trained on billions of tweets with an additional social objective derived from the Twitter Heterogeneous Information Network (TwHIN), making it especially well-suited for capturing social and linguistic patterns on Twitter.

Ensemble: Four Transformers + XGBoost Meta-Learner

The ensemble combines predictions from four transformer models, using XGBoost as a stacking meta-learner. Training data is the extended set (EXIST 2023 + 2021/2022).

XLM-RoBERTa base, XLM-RoBERTa large, TwHIN-BERT base, TwHIN-BERT large → softmax outputs (probabilities) → XGBoost Meta-Learner (stacking) → SEXIST / NON-SEXIST

Why ensemble? Each model specializes differently — XLM-RoBERTa excels at multilingual semantics, while TwHIN-BERT captures Twitter-specific social patterns. Combining them via a learned meta-classifier can capture complementary signals.
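
Stacking of this kind boils down to concatenating each model's class probabilities into one feature row per tweet and training the meta-learner on those rows. The sketch below shows that step with invented probabilities. The names `stack_features` and `fit_meta_learner` are hypothetical, the meta-learner settings are assumptions, and `fit_meta_learner` needs the third-party xgboost and numpy packages, so it is defined but not called here.

```python
def stack_features(per_model_probs):
    """Build one feature row per tweet from several models' softmax outputs.

    per_model_probs[m][i] is [p_non_sexist, p_sexist] from model m for tweet i.
    """
    n_tweets = len(per_model_probs[0])
    if any(len(model) != n_tweets for model in per_model_probs):
        raise ValueError("every model must score the same tweets")
    return [
        [p for model in per_model_probs for p in model[i]]
        for i in range(n_tweets)
    ]


def fit_meta_learner(features, labels):
    """Train an XGBoost classifier on the stacked probabilities (assumed settings)."""
    import numpy as np                 # imported lazily: optional dependencies
    from xgboost import XGBClassifier
    meta = XGBClassifier(eval_metric="logloss")
    meta.fit(np.asarray(features), np.asarray(labels))
    return meta


# Invented probabilities for two tweets from four base models.
xlmr_base = [[0.2, 0.8], [0.7, 0.3]]
xlmr_large = [[0.1, 0.9], [0.6, 0.4]]
twhin_base = [[0.3, 0.7], [0.8, 0.2]]
twhin_large = [[0.4, 0.6], [0.9, 0.1]]

features = stack_features([xlmr_base, xlmr_large, twhin_base, twhin_large])
print(len(features), len(features[0]))  # 2 8
```

A common precaution in stacking is to train the meta-learner on predictions for tweets the base models did not see during fine-tuning; this summary does not say how the team handled that split.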

05 Results & Comparisons

Official Submissions (Table 2)

IUEXIST_1: XLM-RoBERTa Large (single model, official 2023 data only).
IUEXIST_2: Ensemble of 4 transformers + XGBoost (extended data).

Metric: ICM (Information Contrast Measure). Higher = better.

HARD-HARD (HH) = hard labels for both training and evaluation | SOFT-SOFT (SS) = probabilistic / soft labels throughout

Language	Model	HH Rank	HH ICM	HH F1	SS Rank	SS ICM	Notes
All	IUEXIST_1	16	0.5313	0.7734	9	0.7115	Best in SOFT
All	IUEXIST_2	15	0.5341	0.7717	17	0.6141	Best in HARD
English	IUEXIST_1	19	0.5225	0.7509	9	0.6802	Best in SOFT
English	IUEXIST_2	24	0.5059	0.7419	19	0.3893	Weaker in EN
Spanish	IUEXIST_1	16	0.5294	0.7907	14	0.7076
Spanish	IUEXIST_2	13	0.5460	0.7942	12	0.7479	Best in ES

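Key takeaway: IUEXIST_2 (ensemble) has the higher HARD-HARD ICM overall and in Spanish (+0.017 ICM), although IUEXIST_1 is ahead in English. In SOFT-SOFT, IUEXIST_1 (single XLM-R large) leads overall (rank 9 vs. rank 17 for the ensemble) and in English, while the ensemble is again stronger in Spanish.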

Development Set Results (Table 3)

ICM on the development set, shown for models trained on the original tweets (no pre-processing). Pre-processed results are compared in the Pre-processing Impact section.

Without Pre-processing

Model	ICM	Notes
Multinomial NB	0.1354	
SVM	0.2738	
XGBoost	0.3220	
DistilBERT	0.3822	
RoBERTa base	0.4233	
TwHIN base	0.5130	
TwHIN large	0.5377	
XLM-R large	0.5547	
XLM-R base	0.5716	★ best single model
Ensemble	0.5873	🏆 best overall

Ensemble Variations (Table 4)

How much does each factor contribute? Comparing single model vs. ensemble vs. ensemble with extra data.

Configuration	Training data	ICM
XLM-R Large	2023 data only	0.5547
Ensemble	2023 data only	0.5634
Ensemble	2023 + 2021/22 data	0.5873

💡 Key insight: Going from a single model to an ensemble (same data) adds +0.0087 ICM. Adding the extra training data then adds another +0.0239 ICM — nearly three times the gain from ensembling alone. More data beats more models.
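
The two gains and their ratio follow directly from the three ICM scores above. The short check below only reproduces that arithmetic; the dictionary keys are illustrative names.

```python
icm = {
    "single_2023": 0.5547,        # XLM-R Large, 2023 data only
    "ensemble_2023": 0.5634,      # ensemble, 2023 data only
    "ensemble_extended": 0.5873,  # ensemble, 2023 + 2021/22 data
}

ensemble_gain = round(icm["ensemble_2023"] - icm["single_2023"], 4)
data_gain = round(icm["ensemble_extended"] - icm["ensemble_2023"], 4)
ratio = round(data_gain / ensemble_gain, 2)

print(ensemble_gain, data_gain, ratio)  # 0.0087 0.0239 2.75
```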

06 Pre-processing Impact

Pre-processing is not universally beneficial. Its effect depends heavily on the model architecture.

Model	Original	Pre-processed	Change
Multinomial Naive Bayes	0.1354	0.1446	▲ +0.0092
RoBERTa base	0.4233	0.4833	▲ +0.0600
SVM	0.2738	0.2377	▼ −0.0361
XGBoost	0.3220	0.2638	▼ −0.0582
XLM-RoBERTa base	0.5716	0.5167	▼ −0.0549

⚠️ Architecture matters: In these experiments, Multinomial Naive Bayes and RoBERTa base benefit from pre-processing, while SVM, XGBoost, and XLM-RoBERTa base lose performance, so the effect does not split cleanly between traditional and transformer models. A plausible explanation for the multilingual transformers (XLM-RoBERTa and TwHIN-BERT) is that their subword vocabularies and pre-training data already cover emojis, URLs, and non-standard language, so stripping these features removes useful signal.
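
Grouping the models by the sign of the change makes the pattern easier to see. The snippet below recomputes the differences from the scores reported in this section; nothing beyond those ten numbers is assumed.

```python
# (original ICM, pre-processed ICM) per model, as reported in this section.
scores = {
    "Multinomial NB": (0.1354, 0.1446),
    "RoBERTa base": (0.4233, 0.4833),
    "SVM": (0.2738, 0.2377),
    "XGBoost": (0.3220, 0.2638),
    "XLM-RoBERTa base": (0.5716, 0.5167),
}

deltas = {
    model: round(after - before, 4)
    for model, (before, after) in scores.items()
}
helped = sorted(m for m, d in deltas.items() if d > 0)
hurt = sorted(m for m, d in deltas.items() if d < 0)

print(helped)  # ['Multinomial NB', 'RoBERTa base']
print(hurt)    # ['SVM', 'XGBoost', 'XLM-RoBERTa base']
```

Each group contains both a traditional classifier and a transformer, which is why the choice is better made per model than per model family.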

07 Discussion & Future Work

⚠️ Challenge: Ambiguous Definitions
Many examples in the dataset are open to interpretation depending on cultural and personal context. The boundaries of "sexist" are inherently subjective.

⚠️ Challenge: Sarcasm & Irony
Tweets that use irony or humor to discuss sexism can be misclassified, especially when the surface form appears non-sexist but the intent is harmful.

⚠️ Challenge: Annotator Disagreement
With 6 annotators per tweet, disagreement is inevitable. The 3–3 tie-breaking rule introduces a systematic bias toward the sexist class.

💡 Insight: Data Quantity > Model Complexity
Adding more labeled training data from prior years improved performance far more than moving from a single transformer to a four-model ensemble.

💡 Insight: HARD vs. SOFT Trade-off
Overall, the ensemble (IUEXIST_2) is better for hard label prediction; the single XLM-R model (IUEXIST_1) is much better for soft/probabilistic evaluation (rank 9 vs. rank 17).

💡 Insight: Pre-processing Is Model-Dependent
No single pre-processing strategy works for all models. In these experiments, the best single model, XLM-RoBERTa base, did best with raw, unprocessed tweet text.

🔬 Future Work: More Data, Carefully
More training data helps, but can distort class distributions. Future work should investigate data augmentation strategies that balance coverage and class balance.

🔬 Future Work: Temporal Drift
Language evolves. A model trained on 2021 tweets may struggle with 2024 slang. Long-term evaluation across time is needed to measure performance degradation.

🔬 Future Work: Handling Disagreement
Rather than collapsing annotations to a single hard label, future systems could model annotator disagreement directly using soft labels or multi-annotator learning.
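
One simple way to keep disagreement in the training signal is to use the fraction of annotators who voted "sexist" as a soft target and to score predictions with cross-entropy against it. The sketch below assumes that vote-fraction definition, which may differ from how the shared task derives its official soft labels; the function names are hypothetical.

```python
import math


def soft_label(votes, positive="sexist"):
    """Fraction of annotators who chose the positive class."""
    if not votes:
        raise ValueError("at least one vote is required")
    return sum(v == positive for v in votes) / len(votes)


def soft_cross_entropy(predicted, target):
    """Cross-entropy between a predicted distribution and a soft target."""
    return -sum(t * math.log(q) for q, t in zip(predicted, target) if t > 0)


votes = ["sexist"] * 3 + ["non-sexist"] * 3
p_sexist = soft_label(votes)         # 0.5 for a 3-3 split
target = [1.0 - p_sexist, p_sexist]  # [p_non_sexist, p_sexist]

loss = soft_cross_entropy([0.5, 0.5], target)
print(p_sexist, round(loss, 4))  # 0.5 0.6931
```

Under this view a 3–3 split becomes a target of 0.5 rather than a confident "sexist" label, so the model is not pushed toward certainty on tweets the annotators could not agree on.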

08 References & Credits

References

Plaza et al. (2023). Overview of EXIST 2023 – Learning with Disagreement for Sexism Identification and Characterization. CLEF 2023, Thessaloniki, Greece.
Conneau et al. (2020). Unsupervised cross-lingual representation learning at scale. ACL 2020.
Zhang et al. (2022). TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations. arXiv:2209.07562.
Kirk et al. (2023). SemEval-2023 Task 10: Explainable Detection of Online Sexism. arXiv:2303.04222.
Chen & Guestrin (2016). XGBoost: A Scalable Tree Boosting System. KDD 2016.
Sanh et al. (2019). DistilBERT: A distilled version of BERT. arXiv:1910.01108.
Liu et al. (2019). RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692.
Amigó & Delgado (2022). Evaluating extreme hierarchical multi-label classification. ACL 2022.
Rodríguez-Sánchez et al. (2021). Overview of EXIST 2021. PLN 67.
Rodríguez-Sánchez et al. (2022). Overview of EXIST 2022. PLN 69.

Credits

This work was completed as a course project in a machine learning class at Indiana University, Bloomington. The IUEXIST team's submission to EXIST 2023 achieved rank 9 out of 70 in the SOFT-SOFT evaluation (overall, all languages).

Authors: Yash A. Hatekar · Muhammad S. Abdo · Snigdha Khanna · Sandra Kübler (Indiana University). Shared task hosted at CLEF 2023, Thessaloniki, Greece.

IUEXIST: Multilingual Pre-trained Language Models for Sexism Detection on Twitter
