AMWAL: Named Entity Recognition for Arabic Financial News
The first domain-specific NER system for Arabic financial text: built from 26K articles, grounded in the FIBO ontology, and reaching 95.97 F1 across 20 entity types.
Indiana University Bloomington
Why Arabic Financial NER?
Financial news drives markets — predicting stock movements, measuring sentiment, informing investor decisions. Yet the overwhelming majority of NLP tools for financial text are English-only, leaving an enormous gap for Arabic, a language with 400M+ speakers and major financial centers from Morocco to the Gulf.
Even within Arabic NLP, existing NER systems are generic — built to recognize people, organizations, and countries. No domain-specific system existed for the financial domain until AMWAL.
What Makes Financial Arabic NER Hard?
The "Manufacturing" Problem — Orthographic Explosion
A single English word like "manufacturing" can produce many unpredictable Arabic transliterations, each a valid but distinct orthographic form.
The "Nissan" Ambiguity
The same spelling carries different meanings depending on context: نيسان is both the car maker Nissan and the month of April. This is a key source of CORPORATION entity errors.
Three Sources, 23 Years
Source Breakdown
Pre-processing Steps
Normalization reduces orthographic variation and prevents the model from treating variant spellings of the same word as different tokens:
① Remove All Diacritics
Strips vowel marks (harakat) so الزَّواج → الزواج
② Normalize Hamza
All hamza variants → canonical form. Prevents إ / أ / آ / ا from counting as distinct tokens
③ Remove Kashida (Tatweel)
Strips decorative letter elongation: مبـلغ → مبلغ
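The three steps above can be sketched in Python as follows. This is a minimal illustration, not AMWAL's actual pre-processing code: the function name `normalize_arabic` is hypothetical, the exact diacritic range is an assumption, and mapping the hamza-carrying alef forms to bare alef (ا) is one common choice for the "canonical form", which the text does not specify.

```python
import re

# Assumption: "diacritics" means the harakat range U+064B..U+0652
# plus the superscript alef U+0670.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

# Assumption: the canonical form is bare alef (U+0627); the variants are
# alef with madda (U+0622), hamza above (U+0623), hamza below (U+0625).
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")
BARE_ALEF = "\u0627"

# Kashida / tatweel elongation character.
TATWEEL = "\u0640"


def normalize_arabic(text: str) -> str:
    """Apply the three normalization steps in order."""
    text = DIACRITICS.sub("", text)              # 1. remove diacritics
    text = ALEF_VARIANTS.sub(BARE_ALEF, text)    # 2. normalize hamza on alef
    text = text.replace(TATWEEL, "")             # 3. remove kashida
    return text


if __name__ == "__main__":
    for word in ["الزَّواج", "إيطاليا", "مبـلغ"]:
        print(word, "->", normalize_arabic(word))
```

The steps are independent character-level substitutions, so their order does not affect the result; they are kept separate here only to mirror the list above.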
Entity Count Distribution — Figure 1
The Two-Query Extraction Strategy
Two corpus queries, run in TXM (Textometry), extracted entities that arrive already labeled, because the query structure itself supplies the annotation context:
Query Pattern 1 — Hypernym–Hyponym
[Hypernym such as Hyponym]
Query Pattern 2 — Coordination
[Hyponym X and Hyponym Y]
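The two patterns can be illustrated with a small Python sketch over pre-tokenized text. It is not the TXM query used for AMWAL: the hypernym lexicon, the cue words, the single-token hyponyms, and the assumption that the conjunction و has already been split off as its own token are all simplifications made for this example.

```python
from typing import Dict, List, Tuple

# Hypothetical lexicon mapping a plural hypernym to an entity label.
HYPERNYM_LABELS: Dict[str, str] = {
    "بنوك": "BANK",          # "banks"
    "شركات": "CORPORATION",  # "companies"
}
SUCH_AS = "مثل"  # cue word for Pattern 1 ("such as")
AND = "و"        # cue word for Pattern 2 ("and"), assumed pre-split


def extract_labeled(tokens: List[str]) -> List[Tuple[str, str]]:
    """Return (hyponym, label) pairs found by the two patterns."""
    found: List[Tuple[str, str]] = []
    i = 0
    while i < len(tokens) - 2:
        label = HYPERNYM_LABELS.get(tokens[i])
        if label is not None and tokens[i + 1] == SUCH_AS:
            # Pattern 1: [Hypernym such as Hyponym]
            j = i + 2
            found.append((tokens[j], label))
            # Pattern 2: [Hyponym X and Hyponym Y] inherits X's label
            while j + 2 < len(tokens) and tokens[j + 1] == AND:
                j += 2
                found.append((tokens[j], label))
            i = j + 1
        else:
            i += 1
    return found


if __name__ == "__main__":
    sentence = "تراجعت أسهم بنوك مثل سامبا و الراجحي"
    print(extract_labeled(sentence.split()))
```

Because the label comes from the hypernym in the query match, every extracted token is annotated at extraction time, which is the sense in which the entities are "inherently labeled".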
Full Pipeline
Training Configuration
Figure 2: System-Level Comparison
Cross-Language Financial NER Context
These cross-language comparisons are contextual only, not direct benchmarks (different languages and corpora).
Per-Entity Performance
The table below shows precision (P), recall (R), and F1 for all 20 entities across the three systems.
| Entity | AMWAL P | AMWAL R | AMWAL F1 | CamelBERT P | CamelBERT R | CamelBERT F1 | Wojood P | Wojood R | Wojood F1 |
|---|---|---|---|---|---|---|---|---|---|
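For readers unfamiliar with the P, R, and F1 columns, the sketch below shows how such scores are commonly computed for NER. It assumes exact-match, span-level scoring, which may differ from the evaluation protocol actually used for AMWAL; the function name and the example spans are hypothetical.

```python
from typing import Set, Tuple

Span = Tuple[int, int, str]  # (start token, end token, entity label)


def precision_recall_f1(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match span scoring: a prediction counts only if the
    boundaries and the label both agree with a gold span."""
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical spans: the second prediction has the right boundaries
    # but the wrong label, so it counts as an error.
    gold = {(0, 2, "BANK"), (5, 6, "PERSON")}
    pred = {(0, 2, "BANK"), (5, 6, "CORPORATION")}
    print(precision_recall_f1(gold, pred))
```

Under this scoring, a label confusion such as PERSON versus CORPORATION lowers both precision and recall at once, which is why ambiguous categories tend to show the lowest F1.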
Figure 3: AMWAL vs CamelBERT — F1 by Entity
Points above the diagonal are entities where AMWAL scores higher than CamelBERT. Financial-domain entities (BANK, METRIC, STOCK EXCHANGE) show the largest gains.
CORPORATION — Lowest F1 (81)
Company ↔ Product overlap
يوروميد للصناعات الطبية
Euromed for Medical Industries — contains "Medical Industries" which can read as PRODUCT OR SERVICE
Company ↔ Nationality overlap
ويند إيطاليا
Wind Italy — contains "Italy" (NATIONALITY)
Company ↔ Month ambiguity
نيسان ← نيسان
Nissan (car brand) vs. April (month) — identical spelling in Arabic
PERSON — F1 (80)
Name ↔ Nationality overlap
السويدي
"The Swedish" — could be a person's nisba surname OR a nationality label
High-Confidence Entity Types
Where Baselines Completely Fail — AMWAL's Biggest Margins
AMWAL is the first NER system specifically designed and trained for the Arabic financial domain. Its 20-category schema derived from FIBO is more comprehensive than any prior Arabic NER system.
AMWAL's 95.97 F1, against CamelBERT's 91.00 and Wojood's 80.00 (gains of 4.97 and 15.97 points), indicates that domain-specific training yields substantial gains over general-purpose systems, even when the baseline uses the same backbone (AraBERT).
AMWAL uniquely handles FINANCIAL MARKET, STOCK EXCHANGE, and GOVERNMENT ENTITY — all scoring 0 on both baselines. Domain specificity is the sole reason for coverage.
At 95.97 F1, AMWAL outperforms financial NER systems in Chinese (92), French (73), Turkish (~80), and German (~88) — despite those languages having far more NLP resources than Arabic.
AMWAL is trained on Modern Standard Arabic from formal newspapers. It will not generalize to dialectal Arabic, social media text, or informal financial blogs without fine-tuning.
Corporation, Person, and Nationality categories are the most error-prone due to inherent ambiguity in Arabic naming conventions and transliterated entity names.
High performance partly reflects strong tagging of entities seen during training. Generalization to genuinely novel entity mentions (zero-shot) remains a challenge shared with all NER systems.
Restructure the flat 20-type schema into FIBO-aligned hierarchies (e.g., BANK as a subtype of FINANCIAL INSTITUTION), enabling more nuanced representation.
The stated ultimate goal: extend from entity recognition to relation extraction, building a full Arabic financial knowledge graph (KG) to serve investors, regulators, and intelligence analysts.
Expand training set with more ambiguous/overlapping category examples and apply augmentation to improve robustness on Corporation and Person edge cases.
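The hierarchical schema proposed above could be represented as a simple child-to-parent map. The sketch below is illustrative only: the sole relation taken from the text is BANK as a subtype of FINANCIAL INSTITUTION, and the names `PARENT`, `ancestors`, and `is_a` are hypothetical.

```python
from typing import Dict, List, Optional

# Child -> parent. Only the BANK -> FINANCIAL INSTITUTION link comes
# from the text; a full FIBO-aligned hierarchy would add more entries.
PARENT: Dict[str, Optional[str]] = {
    "FINANCIAL INSTITUTION": None,
    "BANK": "FINANCIAL INSTITUTION",
}


def ancestors(label: str) -> List[str]:
    """Return the label followed by each of its supertypes, nearest first."""
    chain = [label]
    while PARENT.get(chain[-1]) is not None:
        chain.append(PARENT[chain[-1]])
    return chain


def is_a(label: str, candidate: str) -> bool:
    """True if `label` equals `candidate` or is one of its subtypes."""
    return candidate in ancestors(label)


if __name__ == "__main__":
    print(ancestors("BANK"))
    print(is_a("BANK", "FINANCIAL INSTITUTION"))
```

With such a map, a prediction of FINANCIAL INSTITUTION for a gold BANK span could be scored as a near miss rather than a full error, which is one way a hierarchy allows more nuanced representation.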
https://github.com/Muhsabrys/AMWAL/