Rule-based lemmatization in Spanish #4265
Conversation
I had misspelled the "pron" tag as "pronoun"; the latest commit fixes this issue.
Thanks for this! Unfortunately I think there's a problem here: I think the Wiktionary data is CC BY-SA, which I don't think is compatible with MIT? I'm not 100% sure, but if it's not compatible it would mean we can't accept this :(.
Yes, you are right, I just checked: the MIT license is not compatible with CC BY-SA 3.0, so you can't accept the work I have done :(. Sorry for the inconvenience.
@pablodms Sucks :(. You can always make a package that people can use as an extension?
Also, @pablodms, it looks like you had some good stuff in here outside of the Wiktionary data that might be worth merging. Do you want to send a separate PR with some of your rules and the implementation for DET and ADV?
Sorry for the delay. These days I have been thinking about how to integrate the Wiktionary data without breaking the license terms. I believe that if I only put the code to download, parse, and generate the lemmatizer files into the spaCy project, I will not be breaking the license, since spaCy would not be distributing content derived from Wiktionary data, only code that downloads this data and uses it when requested by the user. The final user would then be able to locally build and use the Spanish lemmatizer with Wiktionary data. Some issues with this idea: the time (1-2 minutes) and disk space (around 1 GB) needed to parse the dump file, and the local privileges required to download, create, and copy the files the lemmatizer needs. What do you think @honnibal? Should I make a new pull request following this strategy?
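A rough sketch of this download-on-demand idea, using only the Python standard library. The dump URL follows Wikimedia's usual naming scheme and should be verified before relying on it; the actual wikitext parsing that extracts lemma mappings is elided:

```python
import bz2
import urllib.request
import xml.etree.ElementTree as ET

DUMP_URL = (
    "https://dumps.wikimedia.org/eswiktionary/latest/"
    "eswiktionary-latest-pages-articles.xml.bz2"
)

def iter_pages(url=DUMP_URL):
    """Stream (title, wikitext) pairs from the compressed dump."""
    with urllib.request.urlopen(url) as resp:
        stream = bz2.BZ2File(resp)  # decompress on the fly, no temp file
        title, text = None, None
        for _, elem in ET.iterparse(stream):
            tag = elem.tag.rsplit("}", 1)[-1]  # strip the XML namespace
            if tag == "title":
                title = elem.text
            elif tag == "text":
                text = elem.text
            elif tag == "page":
                yield title, text
                elem.clear()  # keep memory bounded on a ~1 GB dump
```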
@honnibal Any chance we can consider re-opening this PR and getting it merged? I just noticed that spaCy already relies on this same dataset for other languages: see spacy/lang/sv/morph_rules.py (line 7, commit 146dc27).
+1 to merging this! I've been running into similar issues and would love to have this merged, or at least living in its own package. Moreover, if GPLv2 or LGPL-LR were compatible, UDLexicons 0.2 could be used, which includes a couple of Spanish lexicons: UDLex_Spanish-Apertium (324,925 words, GNU GPL v2) and UDLex_Spanish-Leffe (843,426 words, LGPL-LR). On the other hand, our project spacy-affixes (a spaCy pipeline to split clitics) uses FreeLing rules and delegates the download of the data to the user, so the license is honoured.
Hello @versae. Since I think this pull request has been abandoned, I will try to develop a separate package following the idea of your project: implementing the lemmatizer as a spaCy pipeline stage. I will report back if I make any progress. Thanks for your suggestions.
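For illustration, a minimal sketch of the lemmatizer-as-pipeline-stage idea under the spaCy v2.x API; the class name, file name, and exception format here are placeholders, not the actual package:

```python
import json
import spacy

class SpanishLemmatizer(object):
    name = "spanish_lemmatizer"

    def __init__(self, exc_path):
        # lemma_exc.json: POS -> inflected form -> [candidate lemmas]
        with open(exc_path, encoding="utf8") as f:
            self.exc = json.load(f)

    def __call__(self, doc):
        for token in doc:
            candidates = self.exc.get(token.pos_.lower(), {}).get(token.lower_)
            if candidates:
                token.lemma_ = candidates[0]  # overwrite the default lemma
        return doc

nlp = spacy.load("es_core_news_sm")
nlp.add_pipe(SpanishLemmatizer("lemma_exc.json"), after="tagger")
```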
That's great news, @pablodms. A new pipeline would come in handy, and if you need a hand with that, just let me know. We could join forces!
I have developed a basic version and published it on PyPI, @versae. The repository contains simple instructions for installing the package. I have no experience developing Python packages (in fact, this is my first attempt), so any help would be greatly appreciated. Your project has been a great inspiration, by the way. In addition, downloading (~65 MB compressed file) and parsing (~900 MB decompressed file) the dump files is currently SLOW, so it can take several minutes.
@versae, I can add you as a contributor to the spanish-lemmatizer project so you can bring in your parser and your experience, if you wish.
Replaces the lookup-table-based lemmatization in Spanish with rule-based lemmatization using the latest Wiktionary ES dump.
Description
The parsing process is explained in the following Jupyter notebook. The notebook parses a downloaded dump file from Wiktionary ES, extracts metadata for each Spanish-language term, and finally writes "lemma_exc.json", which contains mappings from inflected forms to root terms for adjectives, adverbs, nouns, verbs, pronouns, and determiners.
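For reference, a hypothetical excerpt showing the shape such a lemma_exc.json could take, following the POS-keyed exceptions format spaCy v2 uses elsewhere; the entries are illustrative, not taken from the generated file:

```python
# POS -> inflected form -> list of candidate lemmas.
# Exceptions are consulted before suffix rules are applied.
LEMMA_EXC = {
    "noun": {"casas": ["casa"], "luces": ["luz"]},
    "verb": {"fui": ["ser", "ir"], "cantaba": ["cantar"]},
    "det": {"unas": ["uno"], "esta": ["este"]},
    "adv": {"bien": ["bien"]},
}
assert LEMMA_EXC["verb"]["fui"] == ["ser", "ir"]  # ambiguous forms keep all lemmas
```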
Then the new lemma_exc.json, together with simple lemma_index.json and lemma_rules.json files, has been added to spacy/lang/es/lemmatizer. The __init__.py file in spacy/lang/es has been modified to load these new files.
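A minimal sketch of how these three files could be wired into a lemmatizer, assuming the spaCy v2.1-era Lemmatizer API and the JSON files sitting in the working directory (paths are illustrative):

```python
import json
from spacy.lemmatizer import Lemmatizer

def load_json(path):
    with open(path, encoding="utf8") as f:
        return json.load(f)

# Index, exceptions, and rules mirror the three files described above.
lemmatizer = Lemmatizer(
    index=load_json("lemma_index.json"),
    exceptions=load_json("lemma_exc.json"),
    rules=load_json("lemma_rules.json"),
)
```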
Also, I have modified the lemmatizer in spaCy to add support for the DET and ADV POS tags, which is needed for correct lemmatization of determiners and adverbs in Spanish.
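Continuing the sketch above, illustrative calls showing the intended behavior once DET and ADV are supported; the outputs are the lemmas one would expect with this PR's change applied, not guaranteed results:

```python
# In spaCy v2.x the lemmatizer is called with a lower-case universal POS
# name. Without this change, "det" and "adv" fell through unlemmatized.
print(lemmatizer("casas", "noun"))  # expected: ["casa"]
print(lemmatizer("unas", "det"))    # expected: ["uno"]
```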
With these changes, issue #2668 for the Spanish language should be addressed.
The changes have been tested against the sentences in #2710 and produce more coherent lemmatization.
Types of change
Enhancement
Checklist