Lemmatisation of contraction 've in English fails #12920

chrisjbryant · 2023-08-17T08:07:09Z

Small bug, but 've is currently not getting lemmatised as have in spaCy 3.6. Other contractions seem unaffected.

>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> a = "I can't believe they've not been in touch."
>>> b = nlp(a)
>>> for tok in b:
...     print(tok.text, tok.lemma_)
... 
I I
ca can
n't not
believe believe
they they
've 've
not not
been be
in in
touch touch


svlandeg · 2023-08-21T15:00:31Z

Hi, thanks for the report! That does look like a bug.

In the more recent trained pipelines, the attribute_ruler takes care of these particular exceptions. You can inspect them by printing nlp.get_pipe("attribute_ruler").patterns if you're interested.

For instance, for 're, the pipeline does have this correct:

{'patterns': [[{'TAG': 'VBP', 'LOWER': {'IN': ['are', "'re"]}}]], 'attrs': {'LEMMA': 'be', 'POS': 'AUX', 'MORPH': 'Mood=Ind|Tense=Pres|VerbForm=Fin'}, 'index': 0}

But for 've, the LEMMA is missing:

{'patterns': [[{'TAG': 'VBP', 'LOWER': {'IN': ['have', "'ve"]}}]], 'attrs': {'POS': 'AUX', 'MORPH': 'Mood=Ind|Tense=Pres|VerbForm=Fin'}, 'index': 0}
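To spot this kind of gap without reading the full pattern list by eye, a small filter can help. The following is a minimal sketch, not part of spaCy: the helper names (`mentions_form`, `entries_for_form`, `entries_missing_lemma`) are hypothetical, and it assumes each entry has the `patterns` / `attrs` shape shown in the two examples above.

```python
# Two entries copied from the attribute_ruler output quoted above.
SAMPLE_PATTERNS = [
    {
        "patterns": [[{"TAG": "VBP", "LOWER": {"IN": ["are", "'re"]}}]],
        "attrs": {"LEMMA": "be", "POS": "AUX", "MORPH": "Mood=Ind|Tense=Pres|VerbForm=Fin"},
        "index": 0,
    },
    {
        "patterns": [[{"TAG": "VBP", "LOWER": {"IN": ["have", "'ve"]}}]],
        "attrs": {"POS": "AUX", "MORPH": "Mood=Ind|Tense=Pres|VerbForm=Fin"},
        "index": 0,
    },
]


def mentions_form(token_spec, form):
    """True if a single token spec matches `form` through its LOWER key."""
    lower = token_spec.get("LOWER")
    if isinstance(lower, dict):
        # Set-style spec such as {"IN": ["have", "'ve"]}
        return form in lower.get("IN", [])
    return lower == form


def entries_for_form(patterns, form):
    """Return the entries whose token patterns reference `form`."""
    return [
        entry
        for entry in patterns
        if any(
            mentions_form(spec, form)
            for token_pattern in entry["patterns"]
            for spec in token_pattern
        )
    ]


def entries_missing_lemma(patterns, form):
    """Return the entries for `form` that do not set a LEMMA attribute."""
    return [e for e in entries_for_form(patterns, form) if "LEMMA" not in e["attrs"]]


for form in ("'re", "'ve"):
    print(form, len(entries_missing_lemma(SAMPLE_PATTERNS, form)))
```

With a loaded pipeline, the same helpers can be pointed at nlp.get_pipe("attribute_ruler").patterns instead of SAMPLE_PATTERNS.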

The good news is that you can fix this in your pipeline by writing to the attribute_ruler's patterns directly, e.g.:

import spacy

nlp = spacy.load("en_core_web_lg")
ruler = nlp.get_pipe("attribute_ruler")

pattern = [{'TAG': 'VBP', 'LOWER': {'IN': ['have', "'ve"]}}]
attrs = {'POS': 'AUX', 'MORPH': 'Mood=Ind|Tense=Pres|VerbForm=Fin', 'LEMMA': 'have'}
ruler.add(patterns=[pattern], attrs=attrs, index=0)

Now, any time 've is tagged as VBP in a sentence, its lemma should be have, as in your example sentence:

I I
ca can
n't not
believe believe
they they
've have
not not
been be
in in
touch touch
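If you want to check the result programmatically rather than by reading the printed list, something like the following could work. It is only a sketch: `contraction_lemmas` and `contraction_lemmas_from_model` are hypothetical helpers, and the stand-in tokens are hand-built from the output above so the helper can be tried without loading a model.

```python
from types import SimpleNamespace


def contraction_lemmas(doc):
    """Map the text of each apostrophe-bearing token to its lemma."""
    return {tok.text: tok.lemma_ for tok in doc if "'" in tok.text}


def contraction_lemmas_from_model(nlp, text):
    """Same check on a real pipeline; `nlp` is any loaded spaCy Language object."""
    return contraction_lemmas(nlp(text))


# Stand-in tokens mirroring part of the output above (not produced by a model here).
fixed_doc = [
    SimpleNamespace(text=text, lemma_=lemma)
    for text, lemma in [("ca", "can"), ("n't", "not"), ("they", "they"), ("'ve", "have")]
]
print(contraction_lemmas(fixed_doc))
```

Keep in mind that the rule only fires when 've is tagged as VBP, so the lemma may still come out as 've wherever the tagger assigns a different tag.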

We'll also have a look at updating this for the next version of our models!

adrianeboyd · 2023-10-06T08:46:31Z

This should be fixed in the v3.7.x models.
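For code that has to run against both older and newer trained pipelines, the workaround above could be applied conditionally. This is a sketch under assumptions: `needs_have_workaround` and `load_with_workaround` are hypothetical names, the version string is assumed to follow a major.minor.patch layout, and the 3.7 cut-off is taken from the comment above rather than checked per model.

```python
def needs_have_workaround(model_version):
    """True for trained pipeline versions older than 3.7.

    Assumes a 'major.minor.patch' style version string.
    """
    major, minor = (int(part) for part in model_version.split(".")[:2])
    return (major, minor) < (3, 7)


def load_with_workaround(name="en_core_web_sm"):
    """Load a pipeline and add the 've rule from this thread if it looks necessary."""
    import spacy  # imported lazily so the version helper works without spaCy installed

    nlp = spacy.load(name)
    # nlp.meta["version"] is the version of the trained pipeline package.
    if needs_have_workaround(nlp.meta["version"]):
        ruler = nlp.get_pipe("attribute_ruler")
        pattern = [{"TAG": "VBP", "LOWER": {"IN": ["have", "'ve"]}}]
        attrs = {"POS": "AUX", "MORPH": "Mood=Ind|Tense=Pres|VerbForm=Fin", "LEMMA": "have"}
        ruler.add(patterns=[pattern], attrs=attrs, index=0)
    return nlp


print(needs_have_workaround("3.6.0"), needs_have_workaround("3.7.1"))
```

Adding the rule to a pipeline that already sets the lemma should be harmless, so the version check is mainly there to keep the patch from lingering once it is no longer needed.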

github-actions · 2023-11-06T00:02:20Z

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

rmitsch added the feat / lemmatizer (Feature: Rule-based and lookup lemmatization) and lang / en (English language data and models) labels Aug 17, 2023

svlandeg added the bug (Bugs and behaviour differing from documentation) label Aug 21, 2023

adrianeboyd closed this as completed Oct 6, 2023

github-actions bot locked as resolved and limited conversation to collaborators Nov 6, 2023
