Rule-based lemmatization in Spanish #4265

pablodms · 2019-09-09T16:58:38Z

Replaces lookup-table-based lemmatization in Spanish with rule-based lemmatization built from the latest Wiktionary ES dump.

Description

The parsing process is explained in the following Jupyter Notebook. The notebook parses a downloaded dump file from Wiktionary ES, extracts metadata for each Spanish term, and finally writes "lemma_exc.json", which maps inflected forms to their root terms for adjectives, adverbs, nouns, verbs, pronouns, and determiners.
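
As a rough illustration of that final step, here is a minimal sketch. It assumes the parser yields (form, POS, lemma) triples; that triple format, the function name `build_lemma_exc`, and the nested `{pos: {form: [lemmas]}}` layout are assumptions for illustration and are not taken from the notebook.

```python
import json
from collections import defaultdict


def build_lemma_exc(triples):
    """Group (form, pos, lemma) triples into {pos: {form: [lemmas]}}."""
    table = defaultdict(lambda: defaultdict(list))
    for form, pos, lemma in triples:
        lemmas = table[pos][form.lower()]
        if lemma not in lemmas:  # keep each lemma once, in first-seen order
            lemmas.append(lemma)
    # Return plain dicts so the result compares and prints predictably.
    return {pos: dict(forms) for pos, forms in table.items()}


# Hypothetical sample entries; real ones would come from the Wiktionary ES dump.
sample = [
    ("casas", "noun", "casa"),
    ("fui", "verb", "ir"),
    ("fui", "verb", "ser"),
    ("buenas", "adj", "bueno"),
]
lemma_exc = build_lemma_exc(sample)
# ensure_ascii=False keeps accented Spanish characters readable in the file.
lemma_exc_json = json.dumps(lemma_exc, ensure_ascii=False, indent=2, sort_keys=True)
```

Keeping a list of lemmas per form lets ambiguous forms such as "fui" (from "ir" or "ser") carry both candidates.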

The resulting lemma_exc.json, together with a simple lemma_index.json and lemma_rules.json, has been added to spacy/lang/es/lemmatizer. The __init__.py file in spacy/lang/es has been modified to load these new files.

I have also modified the lemmatizer in spaCy to add support for the DET and ADV POS tags, which is needed to lemmatize determiners and adverbs correctly in Spanish.
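
To make the interplay of the three files concrete, here is a simplified sketch of index/exceptions/rules lemmatization. It is not the code from this PR nor spaCy's actual implementation; the function name `lemmatize`, the lowercase POS keys, and the toy tables are all assumptions for illustration.

```python
def lemmatize(string, pos, index, exceptions, rules):
    """Return candidate lemmas for `string` given a coarse POS key."""
    string = string.lower()
    # 1. Irregular forms are answered straight from the exceptions table.
    if string in exceptions.get(pos, {}):
        return list(exceptions[pos][string])
    # 2. Otherwise try each (old_suffix, new_suffix) rule for this POS.
    forms = []
    for old, new in rules.get(pos, []):
        if old and string.endswith(old):
            candidate = string[: len(string) - len(old)] + new
            # Accept only candidates that the index knows as lemmas.
            if candidate in index.get(pos, set()) and candidate not in forms:
                forms.append(candidate)
    # 3. Fall back to the surface form when nothing matched.
    return forms or [string]


# Toy tables; the "det" and "adv" keys stand in for the new DET and ADV support.
INDEX = {"noun": {"casa"}, "verb": {"cantar"}, "adv": {"rápidamente"}, "det": {"el"}}
EXC = {"det": {"los": ["el"], "las": ["el"]}, "verb": {"fui": ["ir", "ser"]}}
RULES = {"noun": [("s", ""), ("es", "")], "verb": [("amos", "ar"), ("é", "ar")]}
```

With these toy tables, a determiner such as "los" is resolved through the exceptions table, while a regular verb form such as "cantamos" is resolved through a suffix rule checked against the index.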

This should address issue #2668 for Spanish.

The changes have been tested against the sentences in #2710 and produce more coherent lemmatization.

Types of change

Enhancement

Checklist

I have submitted the spaCy Contributor Agreement.
I ran the tests, and all new and existing tests passed. -> I got 59 failures on both master and the modified branch.
My changes don't require a change to the documentation, or if they do, I've added all required information.

pablodms · 2019-09-09T17:25:50Z

I misspelled the "pron" tag as "pronoun"; the latest commit fixes this issue.

honnibal · 2019-09-12T15:21:28Z

Thanks for this!

Unfortunately I think there's a problem here: I think the Wiktionary data is CC-BY SA, which I don't think is compatible with MIT? I'm not 100% sure but if it's not compatible it would mean we can't accept this :(.

pablodms · 2019-09-12T17:04:04Z

Yes, you are right, I just checked it out:

"ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original." (or a compatible one).

The MIT license is not compatible with CC-BY SA 3.0, so you can't accept the work I have done :(.

Sorry for the inconvenience.

honnibal · 2019-09-15T20:15:30Z

@pablodms Sucks :(. You can always make a package that people can use as an extension?

sontek · 2019-09-29T20:54:53Z

Has there been any progress towards splitting this out into its own extension? I'm having a significant number of issues with Spanish and this seems like it might help. Here are the problems I'm having:

#4341
#4340
#4255
#4254
#4253

But it seems like a lot of them would be addressed by both #3052 and your original idea for fixing #2668.

sontek · 2019-09-29T21:14:08Z

Also, @pablodms Looks like you had some good stuff in here outside of the wiktionary stuff that might be worth merging? Do you want to send a separate PR for some of your rules and the implementation of DET and ADV?

pablodms · 2019-09-30T11:49:22Z

Hello @honnibal and @sontek,

Sorry for the delay,

These days I have been thinking about how to integrate Wiktionary data without breaking the license terms. I believe that if I only put the code to download, parse, and finally generate the lemmatizer files in the spaCy project, I will not be breaking the license, since spaCy will not be distributing code based on Wiktionary data, only code that is able to download this data and use it if the user requests it. The end user will then be able to configure and use the Spanish lemmatizer locally with Wiktionary data.

Some issues with this idea are the time (1-2 minutes) and disk space (around 1 GB) needed to parse the dump file, and the local privileges required to download, create, and copy the files the lemmatizer needs.
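
A minimal sketch of this generate-on-request idea follows. The helper name `ensure_lemma_files` and the `build` callable are hypothetical; the real download and parse step is replaced by a stand-in so the sketch runs offline, and the three file names are the ones mentioned in the PR description.

```python
import tempfile
from pathlib import Path


def ensure_lemma_files(data_dir, build):
    """Create the lemmatizer files in `data_dir` only if they are missing.

    `build` is a caller-supplied callable (hypothetical) that downloads and
    parses the dump and returns {filename: file_contents}.
    """
    data_dir = Path(data_dir)
    expected = ["lemma_exc.json", "lemma_index.json", "lemma_rules.json"]
    if all((data_dir / name).exists() for name in expected):
        return False  # already generated, so nothing is downloaded
    data_dir.mkdir(parents=True, exist_ok=True)
    for name, contents in build().items():
        (data_dir / name).write_text(contents, encoding="utf8")
    return True


# Stand-in for the real download + parse step, so the sketch runs offline.
def fake_build():
    names = ("lemma_exc.json", "lemma_index.json", "lemma_rules.json")
    return {name: "{}" for name in names}


with tempfile.TemporaryDirectory() as tmp:
    first_run = ensure_lemma_files(Path(tmp) / "es", fake_build)
    second_run = ensure_lemma_files(Path(tmp) / "es", fake_build)
```

Because the files are only written when the user triggers the call, nothing derived from the dump would ship with the library itself; the slow step is also paid once, since later calls find the files and return early.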

What do you think @honnibal? Should I make a new pull request following this strategy?

sontek · 2019-10-22T05:40:04Z

@honnibal Any chance we can consider re-opening this PR and getting it merged? I just noticed that spacy already relies on this same dataset for other languages:

spaCy/spacy/lang/el/get_pos_from_wiktionary.py

Line 5 in e0cf479

def get_pos_from_wiktionary():

spaCy/spacy/lang/sv/morph_rules.py

Line 7 in 146dc27

# Used the table of pronouns at https://sv.wiktionary.org/wiki/deras

versae · 2019-11-20T13:37:05Z

+1 to merge this!

I've been running into similar issues and would love to have this merged, or at least living in its own package. Moreover, if GPLv2 or LGPL-LR were compatible, UDLexicons 0.2 could be used, which includes a couple of Spanish lexicons: UDLex_Spanish-Apertium (324925 words, GNU GPL v2) and UDLex_Spanish-Leffe (843426 words, LGPL-LR).

On the other hand, our project spacy-affixes (a spaCy Pipeline to split clitics) uses Freeling rules and delegates the download of the data to the user so the license is honoured.

pablodms · 2019-11-21T09:55:11Z

Hello @versae,

Since I think this request has been abandoned, I will try to develop a separate package following the idea of your project: implementing the lemmatizer as a spaCy pipeline stage. I will let you know if I make any progress.

Thanks for your suggestions.

versae · 2019-11-21T10:13:27Z

That's great news, @pablodms. A new pipeline would come in handy and if you need a hand with that just let me know. We could join forces!

pablodms · 2019-11-21T17:14:55Z

I have developed a basic version and uploaded it to pip, @versae. The repository has simple instructions for deploying the package. I have no experience in developing Python packages (in fact, this is my first attempt), so any help would be greatly appreciated. Your project has been a great inspiration, by the way.

In addition, downloading (~65MB compressed file) and parsing (~900MB decompressed file) dump files is currently SLOW, so it can take several minutes.
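
One way to read a dump of this size without decompressing it to disk or loading it whole is to stream it. The sketch below is only an illustration: it assumes the dump is bz2-compressed XML with `page`, `title`, and `text` elements (the usual MediaWiki export layout), and the function name `iter_pages` is hypothetical. Whether this would reduce the runtime mentioned above is untested.

```python
import bz2
import io
import xml.etree.ElementTree as ET


def iter_pages(fileobj):
    """Yield (title, wikitext) pairs from a bz2-compressed XML dump stream."""
    with bz2.open(fileobj, "rb") as stream:
        title, text = None, None
        for _event, elem in ET.iterparse(stream, events=("end",)):
            tag = elem.tag.rsplit("}", 1)[-1]  # drop any XML namespace prefix
            if tag == "title":
                title = elem.text
            elif tag == "text":
                text = elem.text or ""
            elif tag == "page":
                yield title, text
                title, text = None, None
                elem.clear()  # drop the finished page's content before moving on


# Tiny in-memory stand-in for the real dump file.
SAMPLE = (
    "<mediawiki><page><title>casa</title><revision><text>{{ES}}</text></revision></page>"
    "<page><title>perro</title><revision><text>{{ES}}</text></revision></page></mediawiki>"
)
pages = list(iter_pages(io.BytesIO(bz2.compress(SAMPLE.encode("utf8")))))
```

In real use, `fileobj` would be the path to the downloaded dump, and each yielded page would be handed to the term parser one at a time.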

versae · 2019-11-26T11:55:55Z

That's awesome, @pablodms! Thanks for releasing it in such a short time :) Maybe in the future we could add the lemmas from UDLexicons too. I have a parser for the format, so it would not be a lot of work.

Also wondering whether it'd be possible to merge this into spaCy, @honnibal?

pablodms · 2019-11-28T10:47:10Z

I can, @versae, add you as a contributor to the spanish-lemmatizer project so you can include your parser and your experience if you wish.

pablodms added 2 commits September 9, 2019 17:44

Exclude lists for Spanish and new POS TAGS in lemmatizer

40f54e8

Fix: added pronouns support

f6012ca

ines added the enhancement (Feature requests and improvements), feat / lemmatizer (Feature: Rule-based and lookup lemmatization), and lang / es (Spanish language data and models) labels Sep 9, 2019

Fix: added pronouns suppor

f0b86e3

pablodms changed the title ~~Exclude lists for Spanish and new POS TAGS in lemmatizer~~ Rule-based lemmatization in Spanish Sep 9, 2019

ines and others added 6 commits September 10, 2019 20:04

Merge branch 'master' into pr/4265

988d721

Fixed wrong lemmatized values

a7d34ee

Merge branch 'feature/spanish-rule-based-lemmatization' of https://github.com/pablodms/spaCy into feature/spanish-rule-based-lemmatization

f1e0c78

Isolated cases fixed and changes in website api annotation documentation.

cfc7118

Included unambiguous rules for regular verbs and reduced exceptions file size using these rules

189374d

Manual hardcoded list of determinants

badd9a4

honnibal closed this Sep 15, 2019
