Research Projects

Building interpretable, linguistically grounded NLP systems for Arabic and multilingual settings, from corpus construction to mechanistic analysis of transformer models.

Active Research

Mechanistic Interpretability for Arabic Speech

Active

Applying mechanistic interpretability to speech transformers to detect linguistic nativeness in Arabic. Probing internal circuits for phonological and morphological patterns to distinguish native from non-native Arabic speech. Presented at NLP @ Michigan Day 2026.

Mechanistic Interp. Speech Transformers Arabic

Multilingual NLI & Label Drift

Active

A large collaborative project (10 co-authors) examining how natural language inference (NLI) label distributions drift when inference data is machine-translated across languages. Explores the reliability of cross-lingual transfer in Arabic and multilingual settings. Presented at MSLD 2026 and ASAL39.

NLI Machine Translation Multilingual

Alzheimer's Disease Knowledge Graph

Active

Constructing an ontology-based knowledge graph for Alzheimer's disease research using retrieval-augmented generation (RAG) pipelines. Entities and relations are extracted from biomedical literature and structured into a queryable semantic graph for clinical decision support. Presented at MSLD 2026.

Knowledge Graph RAG Biomedical NLP

Past Projects

AMWAL: NER for Arabic Financial News

Published

Built and evaluated a Named Entity Recognition system for Arabic financial news. Developed the AMWAL dataset with annotation of financial entities (organizations, persons, locations, monetary values) in Modern Standard Arabic news text.

NER Financial NLP Arabic Paper ↗

Hoosiers Arabic Ellipsis Corpus (HAEC)

Published

Created the first comprehensive Arabic ellipsis dataset (658 annotated sentences) using CQL queries on the ArTenTen corpus. Annotated ellipsis types include gapping, stripping, and sluicing. Benchmarked traditional ML (Random Forest, 63% accuracy) against LLMs (Gemini 2.5 Pro, 95.6% accuracy). Published by John Benjamins.

Ellipsis Corpus LLMs Paper ↗

IUNADI: Arabic Dialect Classification

Published

System submission to the NADI 2023 shared task on country-level Arabic dialect identification in tweets. Applied multilingual pre-trained models to classify tweets across 21 country-level dialect classes.

Dialect ID Twitter Shared Task Paper ↗

IUEXIST: Sexism Detection (EXIST2023)

Published

Multilingual system for sexism detection on Twitter using pre-trained language models. Submitted to the EXIST 2023 shared task at CLEF. Evaluated on Spanish and English with cross-lingual transfer.

Hate Speech Multilingual Twitter Paper ↗

COVID-19 Arabic Twitter Discourse

Published

Two large-scale studies of Arabic Twitter discourse during the COVID-19 pandemic. With a combined citation count of 96, they are among the most cited work in Arabic social media NLP during 2020–2022.

Social Media Sentiment Arabic Papers ↗

Depression Narratives Corpus

Published

Corpus-based appraisal and judgment analysis of clinical narratives from mental health forums, examining syntactic and semantic patterns in the language of patients with major depression and bipolar depression.

Clinical NLP Appraisal Corpus Papers ↗

Research Tools & Resources

Open tools and datasets developed as part of ongoing research.

🔭
Rasid — Arabic Twitter Corpus

A 900M+ word Arabic Twitter corpus organized by year, month, and week. Supports word frequency analysis, collocation search, wildcards, and geographic visualization. Includes a Modern Standard Arabic layer drawn from opinion articles.

900M+ words 10 years Twitter
📝
RogueTeX — Web LaTeX Editor

A web-based LaTeX editor supporting multi-file projects, cloud synchronization via Supabase, and compilation on Hugging Face Spaces. Supports .tex, .bib, and image assets. Includes templates for academic papers and APA style. Open to the public.

Open Source LaTeX Cloud Sync GitHub ↗
📊
Hoosiers Arabic Ellipsis Corpus (HAEC)

658 annotated sentences covering Arabic ellipsis types (gapping, stripping, sluicing), extracted from the ArTenTen corpus using CQL. The first comprehensive Arabic ellipsis dataset. Available on GitHub with code for ML benchmarking.

658 sentences Ellipsis MSA GitHub ↗
🌐
ASAL39 Conference Website

Designed and deployed the full conference website for the 39th Annual Symposium on Arabic Linguistics at Indiana University Bloomington. Managed registration, scheduling, and attendee communications through the platform.

Conference Web Dev ASAL39