Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bugs in for translation & added unique forms check #551

Merged
merged 16 commits into from
Jan 25, 2025

Conversation

axif0
Copy link
Collaborator

@axif0 axif0 commented Jan 14, 2025

Contributor checklist


Description

  • Update .gitignore for ignoring dump and mediawiki dump.
  • Fix overwriting issue as discuses.
  • Fix option if user wants to download new version, as if there is already a downloaded file if user thinks its outdated, he gives download new version and system will download latest version.
  • Added save file functionality for parse_mediaWiki.py
  • Added test in test_get.py for QID based search.
  • Added new functionality for finding unique forms.
  • Match dump exported forms Json output with Query based json output. And also modified translation by selecting id as key. (Apologies for the confusion earlier; I initially thought it was based on words and that each one was unique.)

CC: @andrewtavis, @wkyoshida
Please let me know if there’s anything that needs to be done as we discussed.

Related issue

Copy link

github-actions bot commented Jan 14, 2025

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!

Maintainer checklist

  • The linting and formatting workflow within the PR checks do not indicate new errors in the files changed

  • The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

@axif0 axif0 changed the title Fix bugs in for translation & added unique forms Fix bugs in for translation & added unique forms check Jan 15, 2025
Comment on lines 200 to 201
elif wikidata_dump is not None:
if not wikidata_dump:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just wondering for myself -
but this is simply just a check for if wikidata_dump is an empty string ""? Did I get that right?

help="Path to a local Wikidata lexemes dump for running with '--all'.",
nargs="?",
const="",
help="Path to a local Wikidata lexemes dump. Uses default directory if no path provided.",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This here made me wonder..
Should we specify in the help message what the default path/directory would be? For this and other default directories as well?

Comment on lines 369 to +373
if all_bool and wikidata_dump:
language = "all"
if data_type is None:
data_type = "all"
if language is None:
language = "all"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit - a quick suggestion:

The in-source docs here and elsewhere in the repo mention the following:

    all_bool : boolean
        Whether all languages and data types should be listed.

This seems a tad misleading since setting a specific language and/or data_type will process only for all for that language and/or data_type specified. Perhaps something with the below slight adjustment might make sense?

    all_bool : boolean
        Whether all languages and data types should be listed, unless otherwise specified.

Comment on lines 623 to 640

# def print_unique_forms(unique_forms):
# """
# Pretty print unique grammatical feature sets
# """
# for lang, lang_data in unique_forms.items():
# print(f"\nLanguage: {lang}")
# for category, features_list in lang_data.items():
# print(f" Category: {category}")
# print(f" Total unique feature sets: {len(features_list)}")
# print(" Feature Sets:")
# for i, feature_set in enumerate(features_list, 1):
# # Convert QIDs to a more readable format
# readable_features = [f"Q{qid}" for qid in feature_set]
# print(f" {i}. {readable_features}")

# print_unique_forms(processor.unique_forms)
# print(processor.unique_forms)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was this for testing?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is designed to obtain unique forms for specific languages and/or data types. If needed, we can access this feature through a CLI command.

@axif0
Copy link
Collaborator Author

axif0 commented Jan 19, 2025

Instead of splitting into multiple PRs, I thought it would be better to merge everything into a single PR. So, the last commit includes changes from #542 and #513, along with the suggestion Will mentioned.

When the workflows run, they automatically create new branches as needed. From those branches, they push changes using peter-evans/create-pull-request@v5. And make PR accordingly.

CC: @andrewtavis, @wkyoshida

@andrewtavis
Copy link
Member

Note that the work here will also be used to close #271, #302 and #444 once we run the query generation workflow 🚀🚀

Really amazing to have all of this done already, @axif0! 😊 Next step from here would be working on #548. Do you think we need a functionality in the CLI to export the data contract? I'm thinking about how to do this effectively 🤔 On Scribe-Server we'd run the dump based extraction process and then get the new data, we'd test the contract to see that all fields are valid, and then we'd want a way to export the contract as well, right? Maybe testing the contract would also be another issue, but #548 could export the data_contracts directory so that we'd also be able to send those with the data?

So something like:

# Update all data and then...
scribe-data check-contracts  # success, all contracts are valid given the current data as all of the columns in contract values exist within the data
scribe-data export-contracts  # what we can do now - just move them to the root or an output directory

Specifically I don't think we should be expecting that people will get the contracts from the Scribe-Data copy they have installed, so they should be exportable. Let me know what you think on the above!

@wkyoshida: Let us know if you'd have time for mapping out further ideas for Scribe-Server. @axif0 does have Golang experience and interest in working on things, and with how things are progressing we'll likely be done with Scribe-Data v5.0 by the end of the month or early February. That gives us roughly a month that we could work on Scribe-Server :) :)

@wkyoshida
Copy link
Member

Let us know if you'd have time for mapping out further ideas for Scribe-Server

Let's do it! Sometime this weekend? We can figure out a time on matrix 👍

@andrewtavis
Copy link
Member

Great, @wkyoshida! Just checked in in Mentors + Mentees :) @axif0, I'll be looking into this more tomorrow and Thursday. Thanks for your patience 🙏

Copy link
Member

@andrewtavis andrewtavis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the amazing work here, @axif0! Really great to see this amazing progress for Scribe-Data. Processing all of Wikidata for get and total commands in a matter of minutes is so great to see on so many levels 🚀🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants