Lemmatisation of contraction 've in English fails #12920

Closed
chrisjbryant opened this issue Aug 17, 2023 · 3 comments
Labels
bug Bugs and behaviour differing from documentation feat / lemmatizer Feature: Rule-based and lookup lemmatization lang / en English language data and models

Comments

@chrisjbryant

Small bug, but 've is currently not lemmatised as "have" in spaCy 3.6. Other contractions seem unaffected.

>>> import spacy
>>> nlp = spacy.load("en_core_web_sm")
>>> a = "I can't believe they've not been in touch."
>>> b = nlp(a)
>>> for tok in b:
...     print(tok.text, tok.lemma_)
... 
I I
ca can
n't not
believe believe
they they
've 've
not not
been be
in in
touch touch
@rmitsch rmitsch added feat / lemmatizer Feature: Rule-based and lookup lemmatization lang / en English language data and models labels Aug 17, 2023
@svlandeg svlandeg added the bug Bugs and behaviour differing from documentation label Aug 21, 2023
@svlandeg
Member

Hi, thanks for the report! That does look like a bug.

In the more recent trained pipelines, the attribute_ruler takes care of these particular exceptions. You can have a look into them by printing nlp.get_pipe("attribute_ruler").patterns if you're interested.

For instance, for 're, the pipeline does have this correct:

{'patterns': [[{'TAG': 'VBP', 'LOWER': {'IN': ['are', "'re"]}}]], 'attrs': {'LEMMA': 'be', 'POS': 'AUX', 'MORPH': 'Mood=Ind|Tense=Pres|VerbForm=Fin'}, 'index': 0}

But for 've, the LEMMA is missing:

{'patterns': [[{'TAG': 'VBP', 'LOWER': {'IN': ['have', "'ve"]}}]], 'attrs': {'POS': 'AUX', 'MORPH': 'Mood=Ind|Tense=Pres|VerbForm=Fin'}, 'index': 0}
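To find rules like this without reading the whole list by eye, a small helper can scan the pattern dicts. This is only a sketch: `rules_missing_lemma` is a hypothetical name, it assumes the rule layout shown above (`patterns`, `attrs`, `index`), and it only looks at `LOWER` constraints, so rules keyed on other attributes are not covered. The sample data is the two rules quoted above; in a real session you would pass `nlp.get_pipe("attribute_ruler").patterns` instead.

```python
def rules_missing_lemma(patterns, form):
    """Return rules whose LOWER constraint mentions `form` but whose attrs set no LEMMA.

    Hypothetical helper: `patterns` is a list of dicts shaped like the
    output of nlp.get_pipe("attribute_ruler").patterns.
    """
    hits = []
    for rule in patterns:
        mentions_form = False
        for token_pattern in rule["patterns"]:
            for token_spec in token_pattern:
                lower = token_spec.get("LOWER")
                if isinstance(lower, dict):
                    # Set-style constraint, e.g. {"IN": ["have", "'ve"]}
                    candidates = lower.get("IN", [])
                elif lower is None:
                    candidates = []
                else:
                    # Plain string constraint, e.g. "LOWER": "'ve"
                    candidates = [lower]
                if form in candidates:
                    mentions_form = True
        if mentions_form and "LEMMA" not in rule["attrs"]:
            hits.append(rule)
    return hits


# The two rules quoted in this comment, used as sample data.
sample_patterns = [
    {
        "patterns": [[{"TAG": "VBP", "LOWER": {"IN": ["are", "'re"]}}]],
        "attrs": {"LEMMA": "be", "POS": "AUX",
                  "MORPH": "Mood=Ind|Tense=Pres|VerbForm=Fin"},
        "index": 0,
    },
    {
        "patterns": [[{"TAG": "VBP", "LOWER": {"IN": ["have", "'ve"]}}]],
        "attrs": {"POS": "AUX", "MORPH": "Mood=Ind|Tense=Pres|VerbForm=Fin"},
        "index": 0,
    },
]

missing_for_ve = rules_missing_lemma(sample_patterns, "'ve")
missing_for_re = rules_missing_lemma(sample_patterns, "'re")
print(len(missing_for_ve), len(missing_for_re))
```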

The good news is that you can fix this in your own pipeline by adding a rule to the attribute_ruler directly, e.g.:

import spacy

nlp = spacy.load("en_core_web_lg")
ruler = nlp.get_pipe("attribute_ruler")

# Same match pattern and attrs as the shipped rule, plus the missing LEMMA
pattern = [{'TAG': 'VBP', 'LOWER': {'IN': ['have', "'ve"]}}]
attrs = {'POS': 'AUX', 'MORPH': 'Mood=Ind|Tense=Pres|VerbForm=Fin', 'LEMMA': 'have'}
ruler.add(patterns=[pattern], attrs=attrs, index=0)

Now, any time 've is tagged as VBP in a sentence, its lemma should be have, as in your example sentence:

I I
ca can
n't not
believe believe
they they
've have
not not
been be
in in
touch touch
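If you want to check the rule in isolation, without downloading a trained pipeline, something like the following should work. It is a sketch under a few assumptions: it uses a blank English pipeline with only an `attribute_ruler`, and the tags are supplied by hand (there is no tagger here), so it only shows that the rule sets the lemma once a token is tagged `VBP`. It says nothing about how the real tagger labels `'ve`.

```python
import spacy
from spacy.tokens import Doc

# Blank pipeline with just an attribute_ruler; no trained model needed.
nlp = spacy.blank("en")
ruler = nlp.add_pipe("attribute_ruler")

# Same rule as in the workaround: match 've / have tagged VBP, set LEMMA.
pattern = [{"TAG": "VBP", "LOWER": {"IN": ["have", "'ve"]}}]
attrs = {"POS": "AUX", "MORPH": "Mood=Ind|Tense=Pres|VerbForm=Fin", "LEMMA": "have"}
ruler.add(patterns=[pattern], attrs=attrs, index=0)

# Hand-written tags stand in for the tagger's output (an assumption).
words = ["they", "'ve", "not", "been", "in", "touch"]
tags = ["PRP", "VBP", "RB", "VBN", "IN", "NN"]
doc = Doc(nlp.vocab, words=words, tags=tags)

# Apply the ruler component directly to the hand-built Doc.
doc = ruler(doc)
ve_token = doc[1]
print(ve_token.text, ve_token.lemma_, ve_token.pos_)
```

Tokens that no rule matches keep an empty lemma in this setup, because the blank pipeline has no lemmatizer.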

We'll also have a look at updating this for the next version of our models!

@adrianeboyd
Contributor

This should be fixed in the v3.7.x models.
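If the same code has to run against both older and newer models, one option is to apply the workaround only when the loaded model predates the fix. A minimal sketch, assuming the fix landed in the 3.7 model line as stated above; `model_predates_fix` is a hypothetical helper, and the version string would come from `nlp.meta["version"]`.

```python
def model_predates_fix(model_version, fixed_in=(3, 7)):
    """Return True if a model version string such as "3.6.0" is older than `fixed_in`.

    Hypothetical helper; the (3, 7) threshold is taken from the comment above,
    not from a changelog.
    """
    major, minor = (int(part) for part in model_version.split(".")[:2])
    return (major, minor) < fixed_in


# In a real session: model_predates_fix(nlp.meta["version"])
needs_patch_36 = model_predates_fix("3.6.0")
needs_patch_37 = model_predates_fix("3.7.1")
print(needs_patch_36, needs_patch_37)
```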

Contributor

github-actions bot commented Nov 6, 2023

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 6, 2023