Building interpretable, linguistically grounded NLP systems for Arabic and multilingual settings, from corpus construction to mechanistic analysis of transformer models.
Applying mechanistic interpretability to speech transformers to detect linguistic nativeness in Arabic. Probing internal circuits for phonological and morphological patterns to distinguish native from non-native Arabic speech. Presented at NLP @ Michigan Day 2026.
A large collaborative project (10 co-authors) examining how NLI label distributions drift when inference data undergoes machine translation across languages. Explores cross-lingual transfer reliability in Arabic and multilingual settings. Presented at MSLD 2026 and ASAL39.
Constructing an ontology-based knowledge graph for Alzheimer's disease research using RAG pipelines. Entities and relations are extracted from biomedical literature and structured into a queryable semantic graph for clinical decision support. Presented at MSLD 2026.
Built and evaluated a Named Entity Recognition system for Arabic financial news. Developed the AMWAL dataset with annotation of financial entities (organizations, persons, locations, monetary values) in Modern Standard Arabic news text.
Created the first comprehensive Arabic ellipsis dataset (658 annotated sentences) using CQL queries on ArTenTen. Ellipsis types include gapping, stripping, and sluicing. Benchmarked traditional ML (Random Forest, 63% accuracy) and LLMs (Gemini 2.5 Pro, 95.6% accuracy). Published by John Benjamins.
System submission to the NADI 2023 shared task on country-level Arabic dialect identification in tweets. Applied multilingual pre-trained models to classify tweets into 21 country-level Arabic dialects.
Multilingual system for sexism detection on Twitter using pre-trained language models. Submission to the EXIST 2023 shared task at CLEF. Evaluated on Spanish and English with cross-lingual transfer.
Two large-scale studies of Arabic Twitter discourse during the COVID-19 pandemic. With a combined citation count of 96, they are among the most cited work in Arabic social media NLP during 2020–2022.
Corpus-based appraisal and judgment analysis of clinical narratives from mental health forums, examining syntactic and semantic patterns in the language of patients with major depression and bipolar depression.
Open tools and datasets developed as part of ongoing research.
A 900M+ word Arabic Twitter corpus organized year-by-year, month-by-month, and week-by-week. Supports word frequency analysis, collocation search, wildcards, and geographic visualization. Includes a Modern Standard Arabic layer from opinion articles.
A web-based LaTeX editor supporting multi-file projects, cloud synchronization via Supabase, and compilation on Hugging Face Spaces. Supports .tex, .bib, and image assets. Includes templates for academic papers and APA style. Open to the public.
658 annotated sentences of Arabic ellipsis types (gapping, stripping, sluicing) extracted from the ArTenTen corpus using CQL. First comprehensive Arabic ellipsis dataset. Available on GitHub with code for ML benchmarking.
Designed and deployed the full conference website for the 39th Annual Symposium on Arabic Linguistics at Indiana University Bloomington. Managed registration, scheduling, and attendee communications through the platform.