Rule-based lemmatization in Spanish #4265

Closed
wants to merge 9 commits into from

Conversation

@pablodms commented Sep 9, 2019

Replaces lookup-table-based lemmatization in Spanish with rule-based lemmatization built from the latest Wiktionary ES dump.

Description

The parsing process is explained in the following Jupyter Notebook. The notebook parses a downloaded Wiktionary ES dump file, extracts metadata for each Spanish term, and finally writes "lemma_exc.json", which maps inflected forms to their root terms for adjectives, adverbs, nouns, verbs, pronouns and determiners.
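As a rough illustration of that last step, here is a minimal sketch of how parsed entries could be grouped into a "lemma_exc.json"-style mapping. The function name `build_lemma_exc`, the (form, lemma, pos) triple format and the sample entries are assumptions made for this example; the notebook's actual code may differ.

```python
import json
from collections import defaultdict


def build_lemma_exc(entries):
    """Group (form, lemma, pos) triples into {pos: {form: [lemmas]}}."""
    table = defaultdict(lambda: defaultdict(list))
    for form, lemma, pos in entries:
        lemmas = table[pos][form.lower()]
        # A form may map to several lemmas; keep each one only once.
        if lemma not in lemmas:
            lemmas.append(lemma)
    return {pos: dict(forms) for pos, forms in table.items()}


# Hypothetical triples, as a Wiktionary parser might emit them.
entries = [
    ("casas", "casa", "noun"),
    ("fui", "ir", "verb"),
    ("fui", "ser", "verb"),
    ("las", "el", "det"),
]

lemma_exc = build_lemma_exc(entries)
# ensure_ascii=False keeps accented Spanish characters readable in the file.
lemma_exc_json = json.dumps(lemma_exc, ensure_ascii=False, indent=2, sort_keys=True)
```

Keeping a list of lemmas per form leaves room for ambiguous forms such as "fui", which belongs to both "ir" and "ser".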

Then, the generated lemma_exc.json, together with simple lemma_index.json and lemma_rules.json files, was added to spacy/lang/es/lemmatizer. The __init__.py file in spacy/lang/es has been modified to load these new files.

I have also modified the lemmatizer in spaCy to support the DET and ADV POS tags, which are needed to lemmatize determiners and adverbs correctly in Spanish.
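To make the role of the three files concrete, here is a simplified sketch of how index, exception and rule tables are typically consulted by a rule-based lemmatizer. It is not spaCy's actual implementation: the function `lemmatize` and the toy tables are assumptions for illustration only. It also shows why DET and ADV need explicit support: a tag without its own entry in the tables simply falls through to the surface form.

```python
def lemmatize(string, pos, index, exceptions, rules):
    """Return candidate lemmas for `string` given its coarse POS tag."""
    pos = pos.lower()
    string = string.lower()
    # 1. Irregular forms listed in the exceptions table win outright.
    exc = exceptions.get(pos, {}).get(string)
    if exc:
        return list(exc)
    # 2. Try each suffix rewrite; keep only results found in the index.
    known = set(index.get(pos, []))
    candidates = []
    for old, new in rules.get(pos, []):
        if old and string.endswith(old):
            form = string[: len(string) - len(old)] + new
            if form in known and form not in candidates:
                candidates.append(form)
    # 3. Fall back to the lowercased surface form.
    return candidates or [string]


# Toy tables (hypothetical data, far smaller than the real files).
index = {"noun": ["casa"], "adv": ["rápidamente"], "det": ["el"]}
exceptions = {"det": {"las": ["el"], "los": ["el"]}}
rules = {"noun": [["s", ""], ["es", ""]]}

noun_lemmas = lemmatize("casas", "NOUN", index, exceptions, rules)
det_lemmas = lemmatize("Las", "DET", index, exceptions, rules)
adv_lemmas = lemmatize("rápidamente", "ADV", index, exceptions, rules)
```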

This should address issue #2668 for Spanish.

The changes have been tested against the sentences listed in #2710 and produce more coherent lemmatization.

Types of change

Enhancement

Checklist

  • I have submitted the spaCy Contributor Agreement.
  • I ran the tests, and all new and existing tests passed. -> I got 59 failures on both master and the modified branch.
  • My changes don't require a change to the documentation, or if they do, I've added all required information.

@ines ines added enhancement Feature requests and improvements feat / lemmatizer Feature: Rule-based and lookup lemmatization lang / es Spanish language data and models labels Sep 9, 2019
@pablodms
Author

pablodms commented Sep 9, 2019

I had misspelled the "pron" tag as "pronoun"; the latest commit fixes this.

@pablodms changed the title from "Exclude lists for Spanish and new POS TAGS in lemmatizer" to "Rule-based lemmatization in Spanish" Sep 9, 2019
@honnibal
Member

Thanks for this!

Unfortunately, I think there's a problem here: the Wiktionary data is CC BY-SA, which I don't think is compatible with MIT. I'm not 100% sure, but if it's not compatible, it would mean we can't accept this :(.

@pablodms
Author

Yes, you are right, I just checked it out:

"ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original." (or a compatible one).

The MIT license is not compatible with CC BY-SA 3.0, so you can't accept the work I have done :(.

Sorry for the inconvenience.

@honnibal
Member

@pablodms Sucks :(. You can always make a package that people can use as an extension?

@honnibal honnibal closed this Sep 15, 2019
@sontek

sontek commented Sep 29, 2019

Has there been any progress towards splitting this out into its own extension? I'm having a significant number of issues with Spanish, and this seems like it might help. Here are the problems I'm having:

#4341
#4340
#4255
#4254
#4253

But it seems like a lot of them would be addressed by both #3052 and your original idea for fixing #2668.

@sontek

sontek commented Sep 29, 2019

Also, @pablodms, it looks like you had some good stuff in here outside of the Wiktionary data that might be worth merging. Do you want to send a separate PR for some of your rules and the implementation of DET and ADV?

@pablodms
Author

Hello @honnibal and @sontek,

Sorry for the delay,

Lately I have been thinking about how to integrate the Wiktionary data without breaking the license terms. I believe that if I only add to the spaCy project the code that downloads and parses the dump and generates the lemmatizer files, I will not be breaking the license: spaCy would not be distributing anything derived from Wiktionary data, only code that can download this data and use it when the user requests it. The end user would then be able to configure and use the Spanish lemmatizer with Wiktionary data locally.

Some issues with this idea are the time (1-2 minutes) and disk space (around 1 GB) needed to parse the dump file, and the local privileges needed to download, create and copy the lemmatizer files.
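A minimal sketch of the download-on-request step described above, under stated assumptions: the helper name `ensure_dump` and the cache layout are hypothetical, and the dump URL follows the usual dumps.wikimedia.org naming pattern but should be checked before use. The `fetch` parameter exists only so the download can be swapped out, for example in tests.

```python
import os
import urllib.request

# Assumed dump location; verify the current path on dumps.wikimedia.org.
DUMP_URL = (
    "https://dumps.wikimedia.org/eswiktionary/latest/"
    "eswiktionary-latest-pages-articles.xml.bz2"
)


def ensure_dump(cache_dir, url=DUMP_URL, fetch=urllib.request.urlretrieve):
    """Download the dump into cache_dir unless it is already there."""
    os.makedirs(cache_dir, exist_ok=True)
    target = os.path.join(cache_dir, os.path.basename(url))
    if not os.path.exists(target):
        partial = target + ".part"
        fetch(url, partial)          # may take a while for a large dump
        os.replace(partial, target)  # only expose a completed download
    return target
```

Writing into a user-chosen cache directory, rather than into the installed package, would also sidestep part of the privileges problem.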

What do you think @honnibal? Should I make a new pull request following this strategy?

@sontek

sontek commented Oct 22, 2019

@honnibal Any chance we can consider re-opening this PR and getting it merged? I just noticed that spaCy already relies on this same dataset for other languages:

def get_pos_from_wiktionary():

# Used the table of pronouns at https://sv.wiktionary.org/wiki/deras

@versae

versae commented Nov 20, 2019

+1 to merge this!

I've been running into similar issues and would love to have this merged, or at least living in its own package. Moreover, if GPLv2 or LGPL-LR were compatible, UDLexicons 0.2 could be used, which includes a couple of Spanish lexicons: UDLex_Spanish-Apertium (324925 words, GNU GPL v2) and UDLex_Spanish-Leffe (843426 words, LGPL-LR).

On the other hand, our project spacy-affixes (a spaCy Pipeline to split clitics) uses Freeling rules and delegates the download of the data to the user so the license is honoured.

@pablodms
Author

Hello @versae,

Since I think this request has been abandoned, I will try to develop a separate package following the idea of your project: implementing the lemmatizer as a spaCy pipeline stage. I will let you know if I make any progress.

Thanks for your suggestions.
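For readers unfamiliar with the pipeline-stage approach mentioned above, here is a minimal sketch. The class name `SpanishLemmatizerComponent` and the table format are assumptions for this example, not the API of any released package. The component only relies on tokens exposing `text`, `pos_` and a writable `lemma_`, which spaCy tokens do.

```python
class SpanishLemmatizerComponent:
    """Pipeline stage that overwrites token lemmas from a lookup table."""

    name = "spanish_lemmatizer"  # hypothetical component name

    def __init__(self, table):
        # table: {pos: {form: lemma}}, e.g. built from a parsed dump
        self.table = table

    def __call__(self, doc):
        for token in doc:
            forms = self.table.get(token.pos_.lower(), {})
            lemma = forms.get(token.text.lower())
            if lemma is not None:
                token.lemma_ = lemma  # leave spaCy's lemma if no match
        return doc
```

In spaCy 2.x, a callable like this could be registered with `nlp.add_pipe(component, after="tagger")` so that POS tags are available when it runs; later spaCy versions register components differently.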

@versae

versae commented Nov 21, 2019

That's great news, @pablodms. A new pipeline would come in handy and if you need a hand with that just let me know. We could join forces!

@pablodms
Author

I have developed a basic version and uploaded it to PyPI, @versae. The repository contains simple instructions for deploying the package. I have no experience in developing Python packages (in fact, this is my first attempt), so any help would be greatly appreciated. Your project has been a great inspiration, by the way.

In addition, downloading (~65MB compressed file) and parsing (~900MB decompressed file) dump files is currently SLOW, so it can take several minutes.
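One common way to handle a dump of this size is to stream it instead of decompressing and loading it whole. The sketch below is an assumption about how that could look, not the package's actual parser: `iter_pages` and `local` are hypothetical helpers, and streaming mainly bounds memory use rather than guaranteeing a speed-up.

```python
import bz2
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace from a tag name."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(dump_path):
    """Yield (title, wikitext) pairs from a .xml.bz2 MediaWiki dump."""
    with bz2.open(dump_path, "rb") as fh:
        title, text = None, None
        for _, elem in ET.iterparse(fh, events=("end",)):
            name = local(elem.tag)
            if name == "title":
                title = elem.text
            elif name == "text":
                text = elem.text or ""
            elif name == "page":
                yield title, text
                title, text = None, None
                elem.clear()  # free the finished page's subtree
```

Because the file is read through `bz2.open`, the ~900MB decompressed dump never has to be written to disk.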

@versae

versae commented Nov 26, 2019

That's awesome, @pablodms! Thanks for releasing it in such a short time :) Maybe in the future we could add the lemmas from UDLexicons too. I have a parser for the format, so it would not be a lot of work.

Also wondering whether it'd be possible to merge this into spaCy, @honnibal?

@pablodms
Author

I can, @versae, add you as a contributor to the spanish-lemmatizer project so you can include your parser and your experience if you wish.

5 participants