spacy_crfsuite: CRF tagger for spaCy.

Sequence tagging with spaCy and crfsuite.

Copied from Rasa NLU.

✨ Features

Simple but tough-to-beat CRF entity tagger (via sklearn-crfsuite)
spaCy NER component
Command-line interface for training and evaluation, plus an example notebook
CoNLL, JSON and Markdown annotations
Pre-trained NER component

⏳ Installation

pip install spacy_crfsuite

🚀 Quickstart

Usage as a spaCy 3.0 pipeline component

import spacy
from spacy.language import Language

from spacy_crfsuite import CRFEntityExtractor, CRFExtractor

@Language.factory("ner-crf")
def create_my_component(nlp, name):
    crf_extractor = CRFExtractor().from_disk("spacy_crfsuite_conll03_sm.bz2")
    return CRFEntityExtractor(nlp, crf_extractor=crf_extractor)


nlp = spacy.load("en_core_web_sm", disable=["ner"])
nlp.add_pipe("ner-crf")

doc = nlp(
    "George Walker Bush (born July 6, 1946) is an American politician and businessman "
    "who served as the 43rd president of the United States from 2001 to 2009.")

for ent in doc.ents:
    print(ent, "-", ent.label_)

# Output:
# George Walker Bush - PER
# American - MISC
# United States - LOC

Pre-trained models

You can download a pre-trained model.

Dataset	F1	📥 Download
CoNLL03	82%	spacy_crfsuite_conll03_sm.bz2

Train your own model

Let's train a simple model for a restaurant search bot, using Markdown annotations and the command line. You can also try this notebook.
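
For readers unfamiliar with Markdown annotations, here is a minimal sketch of how one annotated line could be turned into plain text plus character offsets. It assumes the Rasa NLU inline convention `[entity text](label)`; the function name `parse_annotated_line` and the sample sentence are hypothetical, and the parser actually used by spacy_crfsuite may differ.

```python
import re

# Assumed convention: "[entity text](label)", as in Rasa NLU Markdown training data.
ENTITY_PATTERN = re.compile(r"\[(?P<value>[^\]]+)\]\((?P<entity>[^)]+)\)")


def parse_annotated_line(line):
    """Strip inline markup; return plain text plus character-offset entities."""
    text_parts = []
    entities = []
    cursor = 0      # position in the annotated line
    plain_len = 0   # length of the plain text built so far
    for match in ENTITY_PATTERN.finditer(line):
        before = line[cursor:match.start()]
        text_parts.append(before)
        plain_len += len(before)
        value = match.group("value")
        entities.append({
            "start": plain_len,
            "end": plain_len + len(value),
            "value": value,
            "entity": match.group("entity"),
        })
        text_parts.append(value)
        plain_len += len(value)
        cursor = match.end()
    text_parts.append(line[cursor:])
    return {"text": "".join(text_parts), "entities": entities}


# Hypothetical annotated sentence, for illustration only.
example = parse_annotated_line("show [mexican](cuisine) restaurants up [north](location)")
print(example["text"])
for ent in example["entities"]:
    print(ent)
```

The offsets are computed against the stripped text, so they stay valid once the brackets and labels are removed.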

We start by training a model and saving it to disk.

$ python -m spacy_crfsuite.train examples/restaurent_search.md -c examples/default-config.json -o model/ -lm en_core_web_sm
ℹ Loading config from disk
✔ Successfully loaded config from file.
examples/default-config.json
ℹ Loading training examples.
✔ Successfully loaded 15 training examples from file.
examples/restaurent_search.md
ℹ Using spaCy model: en_core_web_sm
ℹ Training entity tagger with CRF.
ℹ Saving model to disk
✔ Successfully saved model to file.
model/model.pkl

We can also evaluate on a dev set to get the F1 score and a classification report. Below we reuse the training examples as the dev set, so the scores are optimistic.

$ python -m spacy_crfsuite.eval examples/restaurent_search.md -m model/model.pkl -lm en_core_web_sm
ℹ Loading model from file
model/model.pkl
✔ Successfully loaded CRF tagger
<spacy_crfsuite.crf_extractor.CRFExtractor object at 0x126e5f438>
ℹ Loading dev dataset from file
examples/example.md
✔ Successfully loaded 15 dev examples.
ℹ Using spaCy model: en_core_web_sm
⚠ f1 score: 1.0
              precision    recall  f1-score   support

   B-cuisine      1.000     1.000     1.000         2
   I-cuisine      1.000     1.000     1.000         1
   L-cuisine      1.000     1.000     1.000         2
   U-cuisine      1.000     1.000     1.000         5
  U-location      1.000     1.000     1.000         7

   micro avg      1.000     1.000     1.000        17
   macro avg      1.000     1.000     1.000        17
weighted avg      1.000     1.000     1.000        17
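
The row labels in the report follow the BILOU tagging scheme: B (begin), I (inside) and L (last) mark the tokens of a multi-token entity, U marks a single-token entity, and O marks tokens outside any entity. The sketch below shows one way such token-level tags can be grouped back into entity spans. The helper `bilou_to_spans` and the sample tokens are hypothetical and are not part of the spacy_crfsuite API.

```python
def bilou_to_spans(tokens, tags):
    """Group BILOU-tagged tokens into (label, text) entity spans."""
    spans = []
    current_label = None
    current_tokens = []
    for token, tag in zip(tokens, tags):
        prefix, _, label = tag.partition("-")
        if prefix == "U":
            # Single-token entity.
            spans.append((label, token))
            current_label, current_tokens = None, []
        elif prefix == "B":
            # Start of a multi-token entity.
            current_label, current_tokens = label, [token]
        elif prefix in ("I", "L") and label == current_label:
            current_tokens.append(token)
            if prefix == "L":
                # Last token closes the entity.
                spans.append((label, " ".join(current_tokens)))
                current_label, current_tokens = None, []
        else:
            # "O" or an inconsistent tag: drop any partial span.
            current_label, current_tokens = None, []
    return spans


# Hypothetical tokens and tags, for illustration only.
tokens = ["show", "north", "indian", "food", "in", "centre"]
tags = ["O", "B-cuisine", "L-cuisine", "O", "O", "U-location"]
print(bilou_to_spans(tokens, tags))
```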

Now we can use the tagger in a spaCy pipeline!

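import spacy
from spacy.language import Language

from spacy_crfsuite import CRFEntityExtractor

# spaCy 3 adds components by registered factory name, not by instance.
@Language.factory("crf_ner")
def create_crf_ner(nlp, name):
    return CRFEntityExtractor(nlp).from_disk("model/model.pkl")

nlp = spacy.load('en_core_web_sm')
nlp.add_pipe("crf_ner")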

doc = nlp("show mexican restaurents up north")
for ent in doc.ents:
    print(ent.text, "--", ent.label_)

# Output:
# mexican -- cuisine
# north -- location

Alternatively, use the tagger as a standalone component.

from spacy_crfsuite import CRFExtractor
from spacy_crfsuite.tokenizer import SpacyTokenizer

crf_extractor = CRFExtractor().from_disk("model/model.pkl")
tokenizer = SpacyTokenizer()

example = {"text": "show mexican restaurents up north"}
tokenizer.tokenize(example, attribute="text")
crf_extractor.process(example)

# Output:
# [{'start': 5,
#   'end': 12,
#   'value': 'mexican',
#   'entity': 'cuisine',
#   'confidence': 0.5823148506311286},
#  {'start': 28,
#   'end': 33,
#   'value': 'north',
#   'entity': 'location',
#   'confidence': 0.8863076478494413}]
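
Because each extracted entity carries a confidence value, a caller may want to discard low-confidence predictions. The following is a minimal sketch that works on the list of dicts shown above; the helper `filter_by_confidence` and the 0.6 threshold are assumptions made for illustration, not part of the library.

```python
# Entities copied from the output shown above.
entities = [
    {"start": 5, "end": 12, "value": "mexican", "entity": "cuisine",
     "confidence": 0.5823148506311286},
    {"start": 28, "end": 33, "value": "north", "entity": "location",
     "confidence": 0.8863076478494413},
]


def filter_by_confidence(entities, threshold):
    """Keep only entities whose confidence is at least `threshold`."""
    return [ent for ent in entities if ent["confidence"] >= threshold]


# The threshold is an arbitrary, hypothetical choice.
confident = filter_by_confidence(entities, threshold=0.6)
print([ent["value"] for ent in confident])  # prints ['north']
```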

We can also take a look at what the model learned.

Use the .explain() method to inspect the learned transition and feature weights behind the model's decisions.

print(crf_extractor.explain())

# Output:
#
# Most likely transitions:
# O          -> O          1.637338
# B-cuisine  -> I-cuisine  1.373766
# U-cuisine  -> O          1.306077
# I-cuisine  -> L-cuisine  0.915989
# O          -> U-location 0.751463
# B-cuisine  -> L-cuisine  0.698893
# O          -> U-cuisine  0.480360
# U-location -> U-cuisine  0.403487
# O          -> B-cuisine  0.261450
# L-cuisine  -> O          0.182695
# 
# Positive features:
# 1.976502 O          0:bias:bias
# 1.957180 U-location -1:low:the
# 1.216547 B-cuisine  -1:low:for
# 1.153924 U-location 0:prefix5:centr
# 1.153924 U-location 0:prefix2:ce
# 1.110536 U-location 0:digit
# 1.058294 U-cuisine  0:prefix5:chine
# 1.058294 U-cuisine  0:prefix2:ch
# 1.051457 U-cuisine  0:suffix2:an
# 0.999976 U-cuisine  -1:low:me
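
To give a feel for how to read the transition weights, here is a rough sketch that sums them along two candidate tag paths for the same three tokens. It uses only weights listed in the output above and deliberately ignores the per-token feature weights and the normalization a linear-chain CRF applies, so it illustrates the scoring idea rather than the library's actual inference. The helper `transition_score` is hypothetical.

```python
# Transition weights copied from the explain() output above.
transitions = {
    ("O", "U-cuisine"): 0.480360,
    ("U-cuisine", "O"): 1.306077,
    ("O", "B-cuisine"): 0.261450,
    ("B-cuisine", "L-cuisine"): 0.698893,
}


def transition_score(path, weights):
    """Sum the transition weights over consecutive tag pairs in `path`."""
    return sum(weights[pair] for pair in zip(path, path[1:]))


path_a = ["O", "U-cuisine", "O"]          # a single-token cuisine entity
path_b = ["O", "B-cuisine", "L-cuisine"]  # a two-token cuisine entity

score_a = transition_score(path_a, transitions)
score_b = transition_score(path_b, transitions)
print(round(score_a, 6), round(score_b, 6))
```

On transitions alone the first path scores higher. In a full CRF the feature weights of each token are added to these sums, and they can outweigh the transition preference.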

Note: You can also access the crf_extractor directly from the pipeline with nlp.get_pipe("crf_ner").crf_extractor, replacing "crf_ner" with the name the component was added under.

Development

Set up virtualenv

$ pipenv sync -d

Run unit tests

$ pipenv run pytest

Run black (code formatting)

$ pipenv run black spacy_crfsuite/ --config=pyproject.toml
