A custom spaCy pipeline for Hebrew text, featuring a transformer-based multitask NER model that recognizes 16 entity types, including `GPE`, `PER`, `LOC`, and `ORG`.
To use hebspacy, install both the package and a model, preferably in a virtual environment:
# Create conda env (optional)
conda create --name hebspacy python=3.8
conda activate hebspacy
# Install hebspacy
pip install hebspacy
# Download and install the model (see the available models below)
pip install </path/to/download>
Model | Description | Install URL |
---|---|---|
he_ner_news_trf | A full spaCy pipeline for Hebrew text including a multitask NER model trained against the BMC and NEMO corpora. Read more here. | Download |
import spacy
nlp = spacy.load("he_ner_news_trf")
text = """מרגלית דהן
מספר זהות 11278904-5
2/12/2001
ביקור חוזר מ18.11.2001
במסגרת בירור פלפיטציות ואי סבילות למאמצים,מנורגיות קשות ע"ר שרירנים- ביצעה מעבדה שהדגימה:
המוגלובין 9, מיקרוציטי היפוכרומטי עם RDW 19,
פריטין 10, סטורציית טרנספרין 8%.
מבחינת עומס נגיפי HIV- undetectable ומקפידה על HAART
"""
doc = nlp(text)
for entity in doc.ents:
    print(f"{entity.text} \t {entity.label_}: {entity._.confidence_score:.4f} ({entity.start_char},{entity.end_char})")
>>> מרגלית דהן PERS: 0.9999 (0,10)
>>> 2/12/2001 DATE: 0.9897 (33,42)
>>> מ18.11.2001 DATE: 0.8282 (54,65)
>>> 8% PERCENT: 0.9932 (230,232)
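Because every entity carries a confidence score, low-confidence predictions can be dropped before downstream use. The following is a minimal sketch that works on plain tuples copied from the sample output above, so it runs without the model installed. The `Entity` tuple, the `filter_by_confidence` helper, and the 0.9 threshold are illustrative assumptions, not part of hebspacy.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    """Plain stand-in for a spaCy span (hypothetical, for illustration only)."""
    text: str
    label: str
    score: float
    start_char: int
    end_char: int


def filter_by_confidence(entities: List[Entity], threshold: float = 0.9) -> List[Entity]:
    """Keep only the entities whose confidence score reaches the threshold."""
    return [e for e in entities if e.score >= threshold]


# Values copied from the sample output above.
sample = [
    Entity("מרגלית דהן", "PERS", 0.9999, 0, 10),
    Entity("2/12/2001", "DATE", 0.9897, 33, 42),
    Entity("מ18.11.2001", "DATE", 0.8282, 54, 65),
    Entity("8%", "PERCENT", 0.9932, 230, 232),
]

confident = filter_by_confidence(sample, threshold=0.9)
for e in confident:
    print(f"{e.text}\t{e.label}: {e.score:.4f}")
```

With a real pipeline, the same filter could presumably be applied to `doc.ents` by reading `entity._.confidence_score` instead of the tuple field.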
'he_ner_news_trf' is a multitask model built from AlephBERT and two NER-focused heads, each trained on a different NER-annotated Hebrew corpus:
- NEMO corpus - annotations of the Hebrew Treebank (Haaretz newspaper) for the widely used OntoNotes entity categories: `GPE` (geo-political entity), `PER` (person), `LOC` (location), `ORG` (organization), `FAC` (facility), `EVE` (event), `WOA` (work-of-art), `ANG` (language), `DUC` (product).
- BMC corpus - annotations of articles from Israeli newspapers and websites (Haaretz newspaper, Maariv newspaper, Channel 7) for the common entity categories: `PERS` (person), `LOC` (location), `ORG` (organization), `DATE` (date), `TIME` (time), `MONEY` (money), `PERCENT` (percent), `MISC__AFF` (misc affiliation), `MISC__ENT` (misc entity), `MISC_EVENT` (misc event).
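The two corpora share some labels (`LOC`, `ORG`) and name the person category differently (`PER` in NEMO, `PERS` in BMC). The sketch below counts the distinct labels across both heads. Treating `PERS` as an alias of `PER` is an assumption made here for illustration; it is one plausible way to arrive at the 16 entity types mentioned in the introduction, and it is not a statement about how the pipeline handles the labels internally.

```python
# Label sets as listed for each corpus above.
NEMO_LABELS = {"GPE", "PER", "LOC", "ORG", "FAC", "EVE", "WOA", "ANG", "DUC"}
BMC_LABELS = {
    "PERS", "LOC", "ORG", "DATE", "TIME", "MONEY", "PERCENT",
    "MISC__AFF", "MISC__ENT", "MISC_EVENT",
}

# Assumption: both person labels denote the same entity type.
ALIASES = {"PERS": "PER"}


def normalize(label: str) -> str:
    """Map a head-specific label to a shared name (hypothetical helper)."""
    return ALIASES.get(label, label)


shared = NEMO_LABELS & BMC_LABELS             # labels spelled identically in both corpora
distinct_raw = NEMO_LABELS | BMC_LABELS       # distinct label strings
distinct_types = {normalize(label) for label in distinct_raw}

print(sorted(shared))        # ['LOC', 'ORG']
print(len(distinct_raw))     # 17
print(len(distinct_types))   # 16
```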
The model was developed and trained using the Hugging Face and PyTorch libraries, and was later integrated into a spaCy pipeline.
The output model was split into three weight files: the transformer embeddings, the BMC head, and the NEMO head.
The components were each packaged in a separate pipe and integrated into the custom pipeline.
Furthermore, a custom NER head consolidation pipe was added last to resolve conflicting or overlapping predictions from the two heads and to set the `Doc.ents` property.
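The exact consolidation rule is not described here. To illustrate the general idea, the following sketch resolves overlaps greedily by confidence: candidates from both heads are visited from highest to lowest score, and a candidate is kept only if it does not overlap a span that was already kept. The `Span` tuple, the `consolidate` function, and the NEMO score in the example are hypothetical; the actual pipe may use a different strategy.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Character-offset span predicted by one NER head (illustrative type)."""
    start: int
    end: int
    label: str
    score: float
    head: str


def overlaps(a: Span, b: Span) -> bool:
    """True if the half-open ranges [start, end) intersect."""
    return a.start < b.end and b.start < a.end


def consolidate(nemo: List[Span], bmc: List[Span]) -> List[Span]:
    """Greedy merge: prefer higher scores, drop anything overlapping a kept span."""
    kept: List[Span] = []
    for candidate in sorted(nemo + bmc, key=lambda s: (-s.score, s.start)):
        if not any(overlaps(candidate, k) for k in kept):
            kept.append(candidate)
    return sorted(kept, key=lambda s: s.start)


# Made-up predictions: both heads tag the same person name.
nemo_spans = [Span(0, 10, "PER", 0.97, "nemo")]
bmc_spans = [
    Span(0, 10, "PERS", 0.9999, "bmc"),
    Span(33, 42, "DATE", 0.9897, "bmc"),
]

merged = consolidate(nemo_spans, bmc_spans)
for span in merged:
    print(span)
```

A greedy strategy is simple and guarantees a non-overlapping result, which is what spaCy requires of `Doc.ents`; it does not attempt to find the globally best set of spans.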
To access the entities recognized by each NER head, use the `Doc._.<ner_head>` property (e.g., `doc._.nemo_ents` and `doc._.bmc_ents`).
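A small helper can collect the per-head results side by side. This sketch assumes that each `<head>_ents` attribute is an iterable of span-like objects exposing `text` and `label_`, which is not confirmed here. The `ents_by_head` function is hypothetical, and a stand-in object replaces the real `Doc` so the example runs without the model.

```python
from types import SimpleNamespace


def ents_by_head(doc, heads=("nemo", "bmc")):
    """Return {head: [(text, label), ...]} read from doc._.<head>_ents."""
    return {
        head: [(ent.text, ent.label_) for ent in getattr(doc._, f"{head}_ents")]
        for head in heads
    }


# Stand-in for a processed Doc; with the real pipeline this would be
# doc = nlp(text) after nlp = spacy.load("he_ner_news_trf").
fake_doc = SimpleNamespace(
    _=SimpleNamespace(
        nemo_ents=[SimpleNamespace(text="מרגלית דהן", label_="PER")],
        bmc_ents=[
            SimpleNamespace(text="מרגלית דהן", label_="PERS"),
            SimpleNamespace(text="2/12/2001", label_="DATE"),
        ],
    )
)

per_head = ents_by_head(fake_doc)
for head, ents in per_head.items():
    print(head, ents)
```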
You are welcome to contribute to the hebspacy project and introduce new features or models.
Kindly follow the pipeline codebase instructions and the model training and packaging guidelines.
HebSpaCy is an open-source project developed by 8400 The Health Network.