Fix bugs in translation & add unique forms check #551
Conversation
Thank you for the pull request! The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!

Maintainer checklist
src/scribe_data/cli/get.py (Outdated)

    elif wikidata_dump is not None:
        if not wikidata_dump:
Just wondering for myself - is this simply a check for whether wikidata_dump is an empty string ""? Did I get that right?
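For context, that is exactly the distinction argparse makes when an option is declared with nargs="?" and const="": omitting the flag yields None, while passing the flag without a path yields the empty string. A minimal sketch of that behavior follows; the flag name is an assumption for illustration, not necessarily the exact Scribe-Data option.

    import argparse

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--wikidata-dump",  # assumed flag name for illustration
        nargs="?",          # the option may appear with or without a value
        const="",           # option present but no path given -> empty string
        default=None,       # option absent entirely -> None
    )

    # Flag passed without a path: the value is "" and `not wikidata_dump` is True.
    assert parser.parse_args(["--wikidata-dump"]).wikidata_dump == ""

    # Flag not passed at all: the value is None, so the
    # `elif wikidata_dump is not None` branch is skipped entirely.
    assert parser.parse_args([]).wikidata_dump is None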
src/scribe_data/cli/main.py (Outdated)

    help="Path to a local Wikidata lexemes dump for running with '--all'.",
    nargs="?",
    const="",
    help="Path to a local Wikidata lexemes dump. Uses default directory if no path provided.",
This made me wonder: should we specify in the help message what the default path/directory would be? For this and other default directories as well?
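If we do want to surface the default, one option is interpolating the configured path into the help string, roughly as below. DEFAULT_DUMP_EXPORT_DIR is a hypothetical constant name and path, not necessarily what Scribe-Data actually uses.

    import argparse
    from pathlib import Path

    # Hypothetical default location; the actual Scribe-Data constant and path may differ.
    DEFAULT_DUMP_EXPORT_DIR = Path("~/scribe_data_wikidata_dumps").expanduser()

    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--wikidata-dump",
        nargs="?",
        const="",
        help=(
            "Path to a local Wikidata lexemes dump. "
            f"Uses '{DEFAULT_DUMP_EXPORT_DIR}' if no path is provided."
        ),
    )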
    if all_bool and wikidata_dump:
        language = "all"

    if data_type is None:
        data_type = "all"

    if language is None:
        language = "all"
nit - a quick suggestion: the in-source docs here and elsewhere in the repo mention the following:

    all_bool : boolean
        Whether all languages and data types should be listed.

This seems a tad misleading, since setting a specific language and/or data_type will only process all of the language and/or data_type specified. Perhaps something with the below slight adjustment might make sense?

    all_bool : boolean
        Whether all languages and data types should be listed, unless otherwise specified.
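To make that "unless otherwise specified" behavior concrete, here is a small sketch of the fallback logic from the diff above; the function name and signature are illustrative only, not the actual get.py internals.

    def resolve_targets(language, data_type, all_bool):
        """Fill in 'all' only for whichever of language/data_type was left unspecified."""
        if all_bool:
            if data_type is None:
                data_type = "all"
            if language is None:
                language = "all"
        return language, data_type


    # A specific language combined with --all still processes only that language:
    assert resolve_targets("English", None, all_bool=True) == ("English", "all")

    # Nothing specified means everything is processed:
    assert resolve_targets(None, None, all_bool=True) == ("all", "all")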
    # def print_unique_forms(unique_forms):
    #     """
    #     Pretty print unique grammatical feature sets.
    #     """
    #     for lang, lang_data in unique_forms.items():
    #         print(f"\nLanguage: {lang}")
    #         for category, features_list in lang_data.items():
    #             print(f"  Category: {category}")
    #             print(f"  Total unique feature sets: {len(features_list)}")
    #             print("  Feature Sets:")
    #             for i, feature_set in enumerate(features_list, 1):
    #                 # Convert QIDs to a more readable format.
    #                 readable_features = [f"Q{qid}" for qid in feature_set]
    #                 print(f"    {i}. {readable_features}")

    # print_unique_forms(processor.unique_forms)
    # print(processor.unique_forms)
Was this for testing?
This is designed to obtain unique forms for specific languages and/or data types. If needed, we can access this feature through a CLI command.
Instead of splitting into multiple PRs, I thought it would be better to merge everything into a single PR. So the last commit includes changes from #542 and #513, along with the suggestion Will mentioned. When the workflows run, they automatically create new branches as needed and push changes from those branches.

CC: @andrewtavis, @wkyoshida
Note that the work here will also be used to close #271, #302 and #444 once we run the query generation workflow 🚀🚀 Really amazing to have all of this done already, @axif0! 😊

Next step from here would be working on #548. Do you think we need functionality in the CLI to export the data contract? I'm thinking about how to do this effectively 🤔 On Scribe-Server we'd run the dump-based extraction process and then get the new data, we'd test the contract to see that all fields are valid, and then we'd want a way to export the contract as well, right? Maybe testing the contract would also be another issue, but #548 could export the data_contracts directory so that we'd also be able to send those with the data. So something like:

    # Update all data and then...
    scribe-data check-contracts   # success, all contracts are valid given the current data, as all of the columns in contract values exist within the data
    scribe-data export-contracts  # what we can do now - just move them to the root or an output directory

Specifically, I don't think we should be expecting that people will get the contracts from the Scribe-Data copy they have installed, so they should be exportable. Let me know what you think on the above!

@wkyoshida: Let us know if you'd have time for mapping out further ideas for Scribe-Server. @axif0 does have Golang experience and interest in working on things, and with how things are progressing we'll likely be done with Scribe-Data v5.0 by the end of the month or early February. That gives us roughly a month that we could work on Scribe-Server :) :)
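As a rough sketch of what those two commands could do under the hood: the contract file layout (flat {field: column_name} JSON files in a data_contracts directory) and the helper names below are assumptions, not the settled #548 design.

    import json
    import shutil
    from pathlib import Path


    def check_contracts(contracts_dir: Path, exported_columns: dict[str, set[str]]) -> bool:
        """
        Return True if every column named in each contract exists in the exported data.

        exported_columns maps a contract name (e.g. 'nouns') to the set of columns
        actually present in the exported data for that data type.
        """
        all_valid = True
        for contract_path in contracts_dir.glob("*.json"):
            contract = json.loads(contract_path.read_text(encoding="utf-8"))
            required = set(contract.values())  # assumes a flat {field: column_name} layout
            missing = required - exported_columns.get(contract_path.stem, set())
            if missing:
                print(f"{contract_path.stem}: missing columns {sorted(missing)}")
                all_valid = False

        return all_valid


    def export_contracts(contracts_dir: Path, output_dir: Path) -> None:
        """Copy the data_contracts directory to an output location so it can ship with the data."""
        shutil.copytree(contracts_dir, output_dir / contracts_dir.name, dirs_exist_ok=True)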
Let's do it! Sometime this weekend? We can figure out a time on Matrix 👍
Great, @wkyoshida! Just checked in in the Mentors + Mentees room :) @axif0, I'll be looking into this more tomorrow and Thursday. Thanks for your patience 🙏
Thanks for the amazing work here, @axif0! Really great to see this progress for Scribe-Data. Processing all of Wikidata for the get and total commands in a matter of minutes is so great to see on so many levels 🚀🚀
Contributor checklist

- The changes have been tested by running the pytest command as directed in the testing section of the contributing guide

Description

- If the user chooses to download a new version, the system will download the latest version.
- The unique forms check uses id as key. (Apologies for the confusion earlier; I initially thought it was based on words and that each one was unique.)

CC: @andrewtavis, @wkyoshida
Please let me know if there’s anything that needs to be done as we discussed.
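For reference, the id-keyed uniqueness idea described above could look roughly like the sketch below. It is a simplified pass over the Wikidata lexemes dump structure (lexeme id, lexicalCategory, and forms with grammaticalFeatures), not the actual processor code from this PR.

    from collections import defaultdict


    def collect_unique_forms(lexemes: list[dict]) -> dict[str, set[tuple[str, ...]]]:
        """Track unique grammatical feature sets per lexical category, keyed on lexeme id."""
        unique_forms: dict[str, set[tuple[str, ...]]] = defaultdict(set)
        seen_ids: set[str] = set()

        for lexeme in lexemes:
            if lexeme["id"] in seen_ids:  # key on the lexeme id, not the word itself
                continue
            seen_ids.add(lexeme["id"])

            for form in lexeme.get("forms", []):
                features = tuple(sorted(form.get("grammaticalFeatures", [])))
                unique_forms[lexeme["lexicalCategory"]].add(features)

        return dict(unique_forms)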
Related issues

- get --data-type (-dt) -a functionality #522
- get (g) CLI functionality to be run using a Wikidata lexemes dump #521