Update Extended OMW #183
base: gh-pages
Hi,
please do not redistribute the Persian and Chinese data, because of the
quality issues. We asked you not to in #171, and you agreed not to, so I
am surprised to see them here.
…On Thu, Feb 17, 2022 at 9:37 PM Eric Kafe ***@***.***> wrote:
This PR updates the "extended_omw" package with additional wordnets from
the "wns" folder in the recent OMW 1.4 source release (retrieved from
https://github.com/omwn/omw-data/archive/refs/tags/v1.4.zip).
In particular, this PR adds Persian ("fas") and an alternative Chinese
wordnet ("qcn") which are included in NLTK's "omw" package, but were left
out of omw-1.4 because of quality concerns (cf. discussions at #171).
Everything in this PR was just copied verbatim from the upstream source
release. As a consequence, all folders now include LICENSE and citation.bib
files, so that the standard citation() and license() functions return
appropriate information about the languages covered in extended_omw.
Sample use, assuming nltk/nltk#2946:
import nltk
from nltk.corpus import wordnet as wn
print(f"Loaded Wordnet v. {wn.get_version()} with {len(wn.langs())} languages from OMW-1.4")
Loaded Wordnet v. 3.0 with 32 languages from OMW-1.4
wn.add_exomw()
print(f"Loaded {len(wn.langs())} languages in total with Extended OMW")
Loaded 1194 languages in total with Extended OMW
ss=wn.synset('example.n.01')
print(ss.lemma_names(lang="cmn"))
['事例', '例', '例子', '例证']
print(ss.lemma_names(lang="cmn_wikt"))
['例子', '例', '榜样', '例证']
print(ss.lemma_names(lang="qcn"))
['例子', '比方']
Commit Summary
- 6b5278f Update Extended OMW
- 1b32607 Merge remote-tracking branch 'upstream/gh-pages' into exomw
File Changes
(2 files <https://github.com/nltk/nltk_data/pull/183/files>)
- *M* packages/corpora/extended_omw.xml (4) <https://github.com/nltk/nltk_data/pull/183/files#diff-1ff55a3a09ddefb685985be6dde025c09c57bdc761cbada38264748a6c37d252>
- *M* packages/corpora/extended_omw.zip (0) <https://github.com/nltk/nltk_data/pull/183/files#diff-1045d31f057dc43656ead660c2f4619ae9d29416746a2e2cf848fa1c45504774>
Patch Links:
- https://github.com/nltk/nltk_data/pull/183.patch
- https://github.com/nltk/nltk_data/pull/183.diff
--
Francis Bond <https://fcbond.github.io/>
Division of Linguistics and Multilingual Studies
Nanyang Technological University
@fcbond: Of course I will remove these two languages if you insist.
According to #171 (comment), "Native speakers of Farsi and Mandarin have pointed out that these two resources have some quality issues". It would be interesting to hear more about the severity of the alleged issues. And wouldn't the same argument apply to all wordnets? In particular, many quality issues have been reported in Princeton WordNet, and issues are also often raised in OEWN. Discussing the issues openly is a way to eventually solve them...
Two languages ('fas' and 'qcn') were retracted, since @fcbond clearly does not allow their redistribution, cf. #183 (comment). The big wordnetwiktionaryalignments-2013-02-19.tsv file is not included, since there is no handler for it. So now, the proposed update consists of the addition of citation.bib files in the wikt and cldr folders, and 3 updated wiktionary wordnets, with the following numbers of lemmas: 2567 wn-wikt-als.tab (Tosk Albanian)
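For anyone who wants to reproduce lemma counts like the one above, here is a minimal sketch. It assumes the tab files follow the usual OMW layout ('#' comment lines, then tab-separated rows of synset id, a `<lang>:lemma` label, and the lemma). The function name `count_lemmas` and the sample rows are hypothetical, and it is not known whether the figures in this thread count rows or distinct strings, so the sketch returns both.

```python
import io
from typing import Iterable, Tuple


def count_lemmas(lines: Iterable[str], lang: str) -> Tuple[int, int]:
    """Return (lemma rows, distinct lemma strings) for one tab file.

    Assumed layout: '#' comment lines, then tab-separated rows of
    synset id, '<lang>:lemma' label, and the lemma itself.
    """
    rows = 0
    distinct = set()
    for line in lines:
        if line.startswith("#"):
            continue  # header / comment line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # malformed or empty row
        label, lemma = fields[1], fields[-1].strip()
        if label == f"{lang}:lemma" and lemma:
            rows += 1
            distinct.add(lemma)
    return rows, len(distinct)


# Synthetic rows for illustration only, not real data from wn-wikt-als.tab.
sample = io.StringIO(
    "# Wiktionary\tals\thttp://example.org\tCC BY-SA\n"
    "00000001-n\tals:lemma\tword_a\n"
    "00000002-n\tals:lemma\tword_b\n"
    "00000003-n\tals:lemma\tword_a\n"
    "00000003-n\tals:def\tsome definition\n"
)
print(count_lemmas(sample, "als"))
```

With a real file, the same function could be called on `open(path, encoding="utf-8")`.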
@ekaf: sorry for the delay. I don't want to blow away the existing zipfile with a new one, but to replace individual files. Would you please help me out with a list of the required files? Is it:
I'm confused because you say: "3 new wiktionary wordnets", but those 3 files already exist. Also, I see a new top-level |
@stevenbird, yes, your list is accurate. The top-level citation.bib refers to the whole OMW project and should be added as well.
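One possible way to replace individual files without discarding the rest of the existing archive is sketched below. Python's `zipfile` module cannot overwrite a member in place, so the sketch copies the archive and swaps in the new content as it goes. The helper name `replace_members` and the member paths in the example are hypothetical and do not reflect the actual layout of extended_omw.zip.

```python
import io
import zipfile
from typing import BinaryIO, Dict


def replace_members(src: BinaryIO, replacements: Dict[str, bytes]) -> io.BytesIO:
    """Copy a zip archive, swapping in new content for the named members.

    Members not named in `replacements` are copied unchanged; names that
    are absent from the source archive are appended at the end.
    """
    out = io.BytesIO()
    with zipfile.ZipFile(src) as zin, zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as zout:
        seen = set()
        for info in zin.infolist():
            data = replacements.get(info.filename)
            if data is None:
                data = zin.read(info.filename)  # keep the original content
            zout.writestr(info, data)  # reuses the member's name and timestamp
            seen.add(info.filename)
        for name, data in replacements.items():
            if name not in seen:
                zout.writestr(name, data)  # new file, e.g. a top-level citation.bib
    out.seek(0)
    return out


# Illustration with an in-memory archive and made-up member names.
old = io.BytesIO()
with zipfile.ZipFile(old, "w") as z:
    z.writestr("pkg/wikt/wn-wikt-als.tab", b"old rows")
    z.writestr("pkg/README", b"unchanged")

new = replace_members(old, {
    "pkg/wikt/wn-wikt-als.tab": b"new rows",
    "pkg/citation.bib": b"@misc{placeholder}",
})
with zipfile.ZipFile(new) as z:
    print(sorted(z.namelist()))
```

Writing to a new buffer (or a temporary file) and renaming it afterwards leaves the existing zip untouched until the copy has succeeded.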
Hi @stevenbird, thanks for your interest :) Yes, this package is a drop-in update of @ExplorerFreda's original package. I think it is ok, except that there is now a newer webpage URL (https://omwn.org) to include in extended_omw.xml. The topmost README file might also benefit from some editing.
What do you think @ExplorerFreda? This PR is old, so the need for an updated package may not be acute.
This PR updates the "extended_omw" package with additional wordnets from the "wns" folder in the recent OMW 1.4 source release (retrieved from https://github.com/omwn/omw-data/archive/refs/tags/v1.4.zip).
In particular, this PR corrects large numbers of errors in the Tosk Albanian ('als'), Standard Arabic ('arb') and Castilian ('spa') wiktionary wordnets in the 'wikt' folder.
Added at first, but then retracted following #183 (comment): Persian ("fas") and an alternative Chinese wordnet ("qcn"), which are included in NLTK's "omw" package but were left out of omw-1.4 because of quality concerns (cf. discussions at #171).
Everything in this PR was just copied verbatim from the upstream source release. As a consequence, all folders now include LICENSE and citation.bib files, so that the standard citation() and license() functions return appropriate information about the languages covered in extended_omw.
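A quick way to check the claim that every folder carries its metadata is sketched below. It is only an illustration under assumptions: the helper `missing_metadata` is hypothetical, it treats any folder containing `.tab` files as a wordnet folder, and the member paths in the example are made up rather than taken from the real package.

```python
import io
import posixpath
import zipfile
from typing import BinaryIO, Dict, Set

REQUIRED = {"LICENSE", "citation.bib"}


def missing_metadata(archive: BinaryIO) -> Dict[str, Set[str]]:
    """Map each folder that holds .tab files to the required files it lacks."""
    with zipfile.ZipFile(archive) as zf:
        names = [n for n in zf.namelist() if not n.endswith("/")]
    present: Dict[str, Set[str]] = {}
    for name in names:
        folder, base = posixpath.split(name)
        present.setdefault(folder, set()).add(base)
    return {
        folder: REQUIRED - files
        for folder, files in present.items()
        if any(f.endswith(".tab") for f in files) and not REQUIRED <= files
    }


# Illustration with an in-memory archive and made-up member names.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("pkg/wikt/wn-wikt-als.tab", b"")
    z.writestr("pkg/wikt/LICENSE", b"")
    z.writestr("pkg/wikt/citation.bib", b"")
    z.writestr("pkg/cldr/wn-cldr-xx.tab", b"")
    z.writestr("pkg/cldr/LICENSE", b"")
print(missing_metadata(buf))
```

An empty result would mean that every folder with tab files also has both metadata files.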
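Sample use, assuming nltk/nltk#2946:
import nltk
from nltk.corpus import wordnet as wn
print(f"Loaded Wordnet v. {wn.get_version()} with {len(wn.langs())} languages from OMW-1.4")
Loaded Wordnet v. 3.0 with 32 languages from OMW-1.4
wn.add_exomw()
print(f"Loaded {len(wn.langs())} languages in total with Extended OMW")
Loaded 1192 languages in total with Extended OMW
ss=wn.synset('example.n.01')
print(ss.lemma_names(lang="cmn"))
['事例', '例', '例子', '例证']
print(ss.lemma_names(lang="cmn_wikt"))
['例子', '例', '榜样', '例证']
Retracted:
print(ss.lemma_names(lang="qcn"))
['例子', '比方']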