AMWAL — Interactive Paper Summary

01 Introduction & Motivation

Why Arabic Financial NER?

Financial news drives markets: it is used to predict stock movements, measure sentiment, and inform investor decisions. Yet the overwhelming majority of NLP tools for financial text are English-only, leaving an enormous gap for Arabic, a language with 400M+ speakers and major financial centers from Morocco to the Gulf.

Even within Arabic NLP, existing NER systems are generic — built to recognize people, organizations, and countries. No domain-specific system existed for the financial domain until AMWAL.

The gap AMWAL fills: The first NER system specifically designed to extract financial entities from Arabic financial news, trained on a domain corpus, grounded in an international financial ontology (FIBO), and evaluated against state-of-the-art general Arabic NER systems.

What Makes Financial Arabic NER Hard?

Orthographic Variation: Diacritics, hamza, and kashida create multiple spellings of the same token.

Transliteration Chaos: Foreign company names have unpredictable, non-standardized Arabic renderings.

Category Ambiguity: Company names overlap with months, nationalities, and product names.

The "Manufacturing" Problem — Orthographic Explosion

A single English word like "manufacturing" can surface in Arabic either as a translation or as one of several unpredictable transliterations, each a valid but distinct orthographic form:

تصنيع (Arabic translation)
مانيوفاكتشرنج (transliteration v1)
مانيوفاكتشرنغ (transliteration v2)
مانيوفكتشرنج (transliteration v3)
صناعة (alternative translation)

Implication: Rule-based systems that enumerate surface forms fail catastrophically. AMWAL adopts a corpus-driven lexical approach that captures entities as they naturally occur, regardless of which orthographic variant is used.

The "Nissan" Ambiguity

As a corporation: نيسان = Nissan (automotive brand)

In Levantine Arabic: نيسان = April (month name)

Identical spelling, different meaning depending on context. A key source of CORPORATION entity errors.

02 Corpus Construction

Three Sources, 23 Years

Total articles collected: 26,231
Tokens in raw corpus: 9.8M
Time span covered: 2000–2023
Annotated financial entities: 17,185

Source Breakdown

Almal News — 11,012 articles (42%)

Al-Sharq — 8,106 articles (31%)

Aljazeera (Business) — 2,627 articles (10%)

Other / unclassified — 4,486 (17%)
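
As a quick consistency check on the breakdown above, the following sketch recomputes the total and the rounded percentage shares from the listed article counts. The dictionary keys simply reuse the source names as printed; nothing else is assumed.

```python
# Article counts per source, copied from the breakdown above.
counts = {
    "Almal News": 11_012,
    "Al-Sharq": 8_106,
    "Aljazeera (Business)": 2_627,
    "Other / unclassified": 4_486,
}

total = sum(counts.values())

# Share of each source as a whole-number percentage of the total.
shares = {name: round(100 * n / total) for name, n in counts.items()}

print(total)
for name, pct in shares.items():
    print(f"{name}: {pct}%")
```

The counts sum to the 26,231 articles reported for the corpus, and the rounded shares agree with the percentages listed.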

Pre-processing Steps

Normalization reduces orthographic variation and prevents the model from treating the same token as different words:

① Remove All Diacritics

Strips vowel marks (harakat) so الزَّواج → الزواج

② Normalize Hamza

All hamza variants → canonical form. Prevents إ / أ / آ / ا from counting as distinct tokens

③ Remove Kashida (Tatweel)

Strips decorative letter elongation: مبـلغ → مبلغ

These normalizations follow Hatekar & Abdo (2023) for consistency across the lab's Arabic NLP pipeline.
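
The three steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name `normalize` is hypothetical, the choice of bare alef (ا) as the canonical form is an assumption, and hamza on other carriers (ؤ, ئ) or standalone hamza (ء) is deliberately left untouched.

```python
import re

# Step 1: Arabic diacritics (harakat), U+064B to U+0652, plus superscript alef U+0670.
DIACRITICS = re.compile(r"[\u064B-\u0652\u0670]")

# Step 2: alef variants carrying hamza or madda: U+0622, U+0623, U+0625.
# Assumption: bare alef (U+0627) is used as the canonical form.
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")

# Step 3: kashida / tatweel, the decorative elongation character.
TATWEEL = "\u0640"


def normalize(text: str) -> str:
    """Apply the three normalization steps in order."""
    text = DIACRITICS.sub("", text)
    text = ALEF_VARIANTS.sub("\u0627", text)
    return text.replace(TATWEEL, "")


if __name__ == "__main__":
    for sample in ["الزَّواج", "مبـلغ", "إسلام"]:
        print(sample, "->", normalize(sample))
```

Applying the same function to both training text and any text tagged later keeps the token inventory consistent between the two.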

03 The 20 Entity Types

Ontology-grounded selection: Entities were not chosen arbitrarily — they derive from the Financial Industry Business Ontology (FIBO), supplemented with domain-relevant additions: BANK, GEOPOLITICAL, METRIC, STOCK EXCHANGE, MEDIA, and FINANCIAL MARKET.

Entity Count Distribution — Figure 1

04 Semi-Automated Annotation & Training

The Two-Query Extraction Strategy

Two corpus queries, run in TXM (Textometry), extract entities that are inherently labeled: the query structure itself supplies the annotation context.

Query Pattern 1 — Hypernym–Hyponym

[Hypernym such as Hyponym]

بنوك مثل البنك الإسلامي الفلسطيني

"Banks such as Palestine Islamic Bank" → BANK labeled automatically

أدوات مالية مثل الأسهم

"Financial instruments such as stocks" → FINANCIAL INSTRUMENT labeled

Query Pattern 2 — Coordination

[Hyponym X and Hyponym Y]

بنك القاهرة وبنك الإسكندرية

"Cairo Bank and Alexandria Bank" → both labeled BANK

منتجات البترول والكيماويات

"Petroleum and chemical products" → PRODUCT OR SERVICE

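The two query patterns above can be approximated with plain regular expressions. The sketch below is an illustration only: it does not reproduce the TXM queries used for AMWAL, the seed lexicons `HYPERNYMS` and `HEAD_NOUNS` are hypothetical, and the hyponym span is assumed to run to the next punctuation mark, which real text would need manual review to confirm.

```python
import re

# Hypothetical seed lexicons mapping Arabic cue words to entity labels.
HYPERNYMS = {"بنوك": "BANK", "أدوات مالية": "FINANCIAL INSTRUMENT"}
HEAD_NOUNS = {"بنك": "BANK"}


def hypernym_matches(text):
    """Pattern 1: '<hypernym> مثل <hyponym>' ('<hypernym> such as <hyponym>')."""
    found = []
    for hypernym, label in HYPERNYMS.items():
        # Capture everything after 'مثل' up to the next punctuation mark.
        pattern = re.escape(hypernym) + r"\s+مثل\s+([^،.؛]+)"
        for m in re.finditer(pattern, text):
            found.append((m.group(1).strip(), label))
    return found


def coordination_matches(text):
    """Pattern 2: '<head> X و<head> Y' ('<head> X and <head> Y')."""
    found = []
    for head, label in HEAD_NOUNS.items():
        h = re.escape(head)
        # Both conjuncts start with the same head noun; the second carries the prefix 'و'.
        pattern = rf"({h}\s+\S+)\s+و({h}\s+\S+)"
        for m in re.finditer(pattern, text):
            found.extend([(m.group(1), label), (m.group(2), label)])
    return found


if __name__ == "__main__":
    print(hypernym_matches("بنوك مثل البنك الإسلامي الفلسطيني"))
    print(coordination_matches("بنك القاهرة وبنك الإسكندرية"))
```

In both functions the label comes from the cue word rather than from the entity itself, which is what makes the extracted spans "inherently labeled".
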
Full Pipeline

① Collect Corpus: 26K articles from 3 newspapers (2000–2023)
② Preprocess: remove diacritics, normalize hamza, remove kashida
③ FIBO Entity Selection: 20 entity types aligned to the financial ontology
④ TXM Queries: Hypernym–Hyponym + coordination patterns
⑤ Frequency Analysis: top-10 unigrams + bigrams as search seeds
⑥ Manual Review: 17,185 entities verified for accuracy
⑦ Train AraBERT + spaCy: 80/20 split, 20K steps, dropout 0.1
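
The frequency-analysis step can be illustrated with a small n-gram counter. This is a sketch under assumptions: the helper name `top_ngrams` is hypothetical, tokens are taken to be whitespace-separated and already normalized, and how the resulting seeds feed back into the queries is not specified here.

```python
from collections import Counter


def top_ngrams(tokens, n, k=10):
    """Return the k most frequent n-grams as (ngram, count) pairs."""
    # zip over n shifted views of the token list yields consecutive n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(g) for g in grams).most_common(k)


if __name__ == "__main__":
    tokens = "بنك القاهرة وبنك الاسكندرية وبنك القاهرة".split()
    print(top_ngrams(tokens, 1))  # unigram seeds
    print(top_ngrams(tokens, 2))  # bigram seeds
```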

Training Configuration

Model: Large AraBERT
Batch size: 50
Dropout: 0.1
Max steps: 20,000
Early stop: patience=1600
Train/Test: 80% / 20%
Train files: 20,984
Test files: 5,247
Hardware: 1× GPU / 64GB

Why file-level split? Splitting at the article level (not randomly at the token level) ensures no overlapping context exists between training and test sets — preventing information leakage.
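
A file-level split can be sketched as follows. The function name `split_by_file`, the fixed random seed, and the shuffling strategy are assumptions for illustration; the source only states that the split was made per article at an 80/20 ratio.

```python
import random


def split_by_file(file_ids, train_fraction=0.8, seed=0):
    """Assign whole files to train or test so no article is shared between them."""
    ids = sorted(file_ids)              # deterministic starting order
    random.Random(seed).shuffle(ids)    # assumed: a seeded shuffle
    cut = int(len(ids) * train_fraction)
    return ids[:cut], ids[cut:]


if __name__ == "__main__":
    train, test = split_by_file(range(26_231))
    print(len(train), len(test))
```

With 26,231 files and an 80% training fraction, this yields 20,984 training files and 5,247 test files, matching the counts listed under Training Configuration.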

Frequency threshold: Only entities occurring ≥5 times in query results were retained, filtering noise while preserving genuine financial entities.
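
The frequency threshold amounts to a simple count filter over the query results. In the sketch below, the function name `filter_by_frequency` and the representation of candidates as `(surface form, label)` pairs are assumptions.

```python
from collections import Counter


def filter_by_frequency(candidates, min_count=5):
    """Keep only candidates that occur at least min_count times."""
    counts = Counter(candidates)
    return {cand: n for cand, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    hits = [("بنك القاهرة", "BANK")] * 5 + [("بنك نادر", "BANK")] * 4
    print(filter_by_frequency(hits))
```

Only the first candidate survives in this toy example, since the second appears four times and falls below the threshold of five.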

05 Results

Overall result: AMWAL achieves precision 96.08, recall 95.87, and F1 95.97, outperforming CamelBERT (F1 91.00), Wojood (F1 80.00), and every cross-language financial NER system cited in the paper.
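
For reference, F1 is the harmonic mean of precision and recall. The short sketch below applies that standard formula to the reported precision and recall; whether the paper's overall figure was computed this way or by averaging per-entity scores is not stated here, so treat this as a consistency check only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(round(f1_score(96.08, 95.87), 2))  # 95.97
```

The result agrees with the reported F1 of 95.97.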

Figure 2: System-Level Comparison

Cross-Language Financial NER Context

These cross-language comparisons are contextual only, not direct benchmarks (different languages and corpora).

Per-Entity Performance

The table below reports precision (P), recall (R), and F1 across all 20 entities for all three systems. Green = AMWAL wins; blue = competitive; orange = lower; gray = zero.

Entity | AMWAL (P, R, F1) | CamelBERT (P, R, F1) | Wojood (P, R, F1)

Figure 3: AMWAL vs CamelBERT — F1 by Entity

Points above the diagonal = AMWAL wins. Financial-domain entities (BANK, METRIC, STOCK EXCHANGE) show the largest gains.

06 Error Analysis

CORPORATION — Lowest F1 (81)

Root cause: Company names semantically overlap with other entity types — products, services, nationalities, and even time references. The model cannot always resolve this ambiguity without broader discourse context.

Company ↔ Product overlap

يوروميد للصناعات الطبية
Euromed for Medical Industries — contains "Medical Industries" which can read as PRODUCT OR SERVICE

Company ↔ Nationality overlap

ويند إيطاليا
Wind Italy — contains "Italy" (NATIONALITY)

Company ↔ Month ambiguity

نيسان ← نيسان
Nissan (car brand) vs. April (month) — identical spelling in Arabic

PERSON — F1 (80)

Root cause: Arabic personal names sometimes include embedded nationality adjectives (nisba forms), blurring the PERSON / NATIONALITY boundary.

Name ↔ Nationality overlap

السويدي
"The Swedish" — could be a person's nisba surname OR a nationality label

High-Confidence Entity Types

CURRENCY: 99
TIME: 99
EVENT: 98
STOCK EXCHANGE: 98
FIN. INSTRUMENT: 97
COUNTRY: 97

Where Baselines Completely Fail — AMWAL's Biggest Margins

07 Key Findings, Limitations & Future Work

Finding 1: First Arabic Financial NER
AMWAL is the first NER system specifically designed and trained for the Arabic financial domain. Its 20-category schema, derived from FIBO, is more comprehensive than that of any prior Arabic NER system.

Finding 2: Domain Beats Generality
AMWAL's 95.97 F1 vs. CamelBERT's 91.00 and Wojood's 80.00 indicates that domain-specific training yields significant gains over general-purpose systems, even when the baseline uses the same backbone (AraBERT).

Finding 3: New Entities = Big Wins
AMWAL uniquely handles FINANCIAL MARKET, STOCK EXCHANGE, and GOVERNMENT ENTITY, all of which score 0 on both baselines. Coverage of these types comes entirely from the domain-specific schema and training data.

Finding 4: Cross-Language SOTA
At 95.97 F1, AMWAL outperforms financial NER systems in Chinese (92), French (73), Turkish (~80), and German (~88), despite those languages having far more NLP resources than Arabic.

Limitation 1: MSA Only
AMWAL is trained on Modern Standard Arabic from formal newspapers. It is not expected to generalize to dialectal Arabic, social media text, or informal financial blogs without fine-tuning.

Limitation 2: Category Overlap Errors
The CORPORATION, PERSON, and NATIONALITY categories are the most error-prone due to inherent ambiguity in Arabic naming conventions and transliterated entity names.

Limitation 3: Seen-Data Bias
High performance partly reflects strong tagging of entities seen during training. Generalization to genuinely novel entity mentions (zero-shot) remains a challenge shared with all NER systems.

Future Work 1: Hierarchical Entity Schema
Restructure the flat 20-type schema into FIBO-aligned hierarchies (e.g., BANK as a subtype of FINANCIAL INSTITUTION), enabling a more nuanced representation.

Future Work 2: Arabic Financial Knowledge Graph
The stated ultimate goal is to extend from entity recognition to relation extraction, building a full Arabic financial knowledge graph (KG) to serve investors, regulators, and intelligence analysts.

Future Work 3: Data Augmentation
Expand the training set with more ambiguous and overlapping category examples, and apply augmentation to improve robustness on CORPORATION and PERSON edge cases.

Open source: AMWAL's best model, training files, and test files are available on GitHub: https://github.com/Muhsabrys/AMWAL/

AMWAL: Named Entity Recognition for Arabic Financial News
