Fix bugs in for translation & added unique forms check #551

axif0 · 2025-01-14T18:13:40Z

Contributor checklist

This pull request is on a separate branch and not the main branch
I have tested my code with the pytest command as directed in the testing section of the contributing guide

Description

Update .gitignore to ignore the dump and the MediaWiki dump.
Fix the overwriting issue as discussed.
Fix the option for downloading a new version: if a downloaded file already exists and the user thinks it is outdated, they can choose to download a new version and the system will fetch the latest one.
Add save-file functionality to parse_mediaWiki.py.
Add a test in test_get.py for QID-based search.
Add new functionality for finding unique forms.
Match the JSON output of dump-exported forms with the query-based JSON output, and modify translations to use the ID as the key. (Apologies for the confusion earlier; I initially thought it was based on words and that each one was unique.)

CC: @andrewtavis, @wkyoshida
Please let me know if there’s anything that needs to be done as we discussed.

Related issue

github-actions · 2025-01-14T18:14:07Z

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!

Maintainer checklist

The linting and formatting workflow within the PR checks do not indicate new errors in the files changed
The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

wkyoshida · 2025-01-18T01:49:00Z

src/scribe_data/cli/get.py

+    elif wikidata_dump is not None:
+        if not wikidata_dump:


Just wondering for myself -
but this is simply just a check for if wikidata_dump is an empty string ""? Did I get that right?

wkyoshida · 2025-01-18T01:49:05Z

src/scribe_data/cli/main.py

-        help="Path to a local Wikidata lexemes dump for running with '--all'.",
+        nargs="?",
+        const="",
+        help="Path to a local Wikidata lexemes dump. Uses default directory if no path provided.",


This here made me wonder..
Should we specify in the help message what the default path/directory would be? For this and other default directories as well?
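
For reference, a minimal sketch of how argparse treats an option declared with nargs="?" and const="", which is what the checks in get.py rely on. The flag name, the prog name and the describe_dump_arg helper are assumptions for illustration and are not taken from the PR; only nargs, const and the help text come from the diff above.

```python
import argparse

# Hypothetical parser that mirrors only the option shown in the diff.
parser = argparse.ArgumentParser(prog="scribe-data-sketch")
parser.add_argument(
    "--wikidata-dump-path",  # assumed flag name
    dest="wikidata_dump",
    nargs="?",     # the value after the flag is optional
    const="",      # used when the flag is passed without a value
    default=None,  # used when the flag is not passed at all
    help="Path to a local Wikidata lexemes dump. Uses default directory if no path provided.",
)


def describe_dump_arg(argv):
    """Report which of the three possible states the option is in."""
    args = parser.parse_args(argv)
    if args.wikidata_dump is None:
        return "flag not passed"
    if not args.wikidata_dump:
        # Empty string: the flag was given with no path.
        return "flag passed without a path, use the default directory"
    return f"flag passed with path {args.wikidata_dump}"
```

So `wikidata_dump is not None` means the flag was given, and `not wikidata_dump` then singles out the empty-string case where no path followed it.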

wkyoshida · 2025-01-18T01:49:10Z

src/scribe_data/cli/total.py

    if all_bool and wikidata_dump:
-        language = "all"
+        if data_type is None:
+            data_type = "all"
+        if language is None:
+            language = "all"


nit - a quick suggestion:

The in-source docs here and elsewhere in the repo mention the following:

all_bool : boolean
    Whether all languages and data types should be listed.

This seems a tad misleading, since setting a specific language and/or data_type restricts the "all" processing to only the language and/or data_type specified. Perhaps the slight adjustment below might make sense?

all_bool : boolean
    Whether all languages and data types should be listed, unless otherwise specified.
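
To make the "unless otherwise specified" behavior concrete, here is a small sketch of the defaulting logic from the diff above, pulled out into a standalone function. The function name and signature are hypothetical; only the branch structure is taken from the diff.

```python
def resolve_all_targets(all_bool, wikidata_dump, language=None, data_type=None):
    """Fill in "all" only for the arguments the user left unspecified."""
    if all_bool and wikidata_dump:
        if data_type is None:
            data_type = "all"
        if language is None:
            language = "all"
    return language, data_type
```

With this shape, passing a language together with the all flag keeps that language and expands only the data type.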

wkyoshida · 2025-01-18T01:49:18Z

src/scribe_data/wiktionary/parse_dump.py

+
+    # def print_unique_forms(unique_forms):
+    #     """
+    #     Pretty print unique grammatical feature sets
+    #     """
+    #     for lang, lang_data in unique_forms.items():
+    #         print(f"\nLanguage: {lang}")
+    #         for category, features_list in lang_data.items():
+    #             print(f"  Category: {category}")
+    #             print(f"  Total unique feature sets: {len(features_list)}")
+    #             print("  Feature Sets:")
+    #             for i, feature_set in enumerate(features_list, 1):
+    #                 # Convert QIDs to a more readable format
+    #                 readable_features = [f"Q{qid}" for qid in feature_set]
+    #                 print(f"    {i}. {readable_features}")
+
+    # print_unique_forms(processor.unique_forms)
+    # print(processor.unique_forms)


Was this for testing?

axif0 · 2025-01-18

This is designed to obtain unique forms for specific languages and/or data types. If needed, we can access this feature through a CLI command.
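
As a rough illustration of what "unique forms" could mean here, the sketch below collects the distinct grammatical feature sets per language and category. The input shape, the function name and the sample QIDs are assumptions for illustration; the actual processor in parse_dump.py may store and compute this differently.

```python
from collections import defaultdict


def collect_unique_forms(lexemes):
    """Group distinct feature sets by language and category.

    Each lexeme is assumed to be a dict with "language", "category" and
    "forms", where every form is a list of grammatical feature QIDs.
    """
    seen = defaultdict(lambda: defaultdict(set))
    for lexeme in lexemes:
        for features in lexeme["forms"]:
            # Sort so the same features in a different order count once.
            seen[lexeme["language"]][lexeme["category"]].add(tuple(sorted(features)))

    return {
        lang: {
            category: [list(fs) for fs in sorted(feature_sets)]
            for category, feature_sets in categories.items()
        }
        for lang, categories in seen.items()
    }
```

The result has the same nesting that the commented-out print_unique_forms helper iterates over: language, then category, then a list of feature sets.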

axif0 · 2025-01-19T21:26:41Z

Instead of splitting into multiple PRs, I thought it would be better to merge everything into a single PR. So, the last commit includes changes from #542 and #513, along with the suggestion Will mentioned.

When the workflows run, they automatically create new branches as needed, push their changes from those branches using peter-evans/create-pull-request@v5, and open PRs accordingly.

CC: @andrewtavis, @wkyoshida

andrewtavis · 2025-01-19T21:56:31Z

Note that the work here will also be used to close #271, #302 and #444 once we run the query generation workflow 🚀🚀

Really amazing to have all of this done already, @axif0! 😊 Next step from here would be working on #548. Do you think we need a functionality in the CLI to export the data contract? I'm thinking about how to do this effectively 🤔 On Scribe-Server we'd run the dump based extraction process and then get the new data, we'd test the contract to see that all fields are valid, and then we'd want a way to export the contract as well, right? Maybe testing the contract would also be another issue, but #548 could export the data_contracts directory so that we'd also be able to send those with the data?

So something like:

# Update all data and then...
scribe-data check-contracts  # success, all contracts are valid given the current data as all of the columns in contract values exist within the data
scribe-data export-contracts  # what we can do now - just move them to the root or an output directory

Specifically I don't think we should be expecting that people will get the contracts from the Scribe-Data copy they have installed, so they should be exportable. Let me know what you think on the above!
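
One way the proposed contract check could look, as a minimal sketch: a contract is treated as valid when every column named in its values exists in the data. The flat contract shape, the function name and the sample column names are assumptions for illustration; neither check-contracts nor this helper exists in the PR.

```python
def find_missing_contract_fields(contract, data_columns):
    """Return contract values that do not appear among the data columns.

    The contract is assumed to be a flat dict whose values are column
    names expected in the exported data. An empty result means it is valid.
    """
    available = set(data_columns)
    return sorted(value for value in contract.values() if value not in available)
```

A hypothetical check-contracts command could run this per language and data type and fail if any list comes back non-empty.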

@wkyoshida: Let us know if you'd have time for mapping out further ideas for Scribe-Server. @axif0 does have Golang experience and interest in working on things, and with how things are progressing we'll likely be done with Scribe-Data v5.0 by the end of the month or early February. That gives us roughly a month that we could work on Scribe-Server :) :)

wkyoshida · 2025-01-21T00:46:20Z

"Let us know if you'd have time for mapping out further ideas for Scribe-Server"

Let's do it! Sometime this weekend? We can figure out a time on matrix 👍

andrewtavis · 2025-01-22T02:01:34Z

Great, @wkyoshida! Just checked in in Mentors + Mentees :) @axif0, I'll be looking into this more tomorrow and Thursday. Thanks for your patience 🙏

src/scribe_data/check/check_missing_forms/download_wd.py

src/scribe_data/check/check_missing_forms/get_forms.py

src/scribe_data/wiktionary/parse_dump.py

andrewtavis

Thanks for the amazing work here, @axif0! Really great to see this amazing progress for Scribe-Data. Processing all of Wikidata for get and total commands in a matter of minutes is so great to see on so many levels 🚀🚀

axif0 added 5 commits January 12, 2025 03:03

fix small bugs

0fe6864

fix small bugs

15735d3

fix small bugs

c1bec87

translation add L:id

78b82c0

fix tests and add tests for QID

192b09c

axif0 requested review from andrewtavis and wkyoshida January 14, 2025 18:13

axif0 changed the title ~~Fix bugs in for translation & added unique forms~~ Fix bugs in for translation & added unique forms check Jan 15, 2025

fix total

cfc2777

wkyoshida reviewed Jan 18, 2025

axif0 added 2 commits January 20, 2025 03:18

fix small bugs

e302a9b

Merge branch 'main' into fix_dump

69c961a

andrewtavis mentioned this pull request Jan 20, 2025

Fix ambiguous features on Swedish verbs #6

Open

2 tasks

axif0 added 4 commits January 22, 2025 04:30

Refactor parameter names and fix single langugage translation error

0457372

Merge branch 'main' of https://github.com/scribe-org/Scribe-Data into fix_dump

6a6f545

fix target_lang occored when fixing conflict

2479fe2

fix total tests

386d6a0

Merge branch 'main' into fix_dump

a532890

andrewtavis reviewed Jan 25, 2025

src/scribe_data/check/check_missing_forms/download_wd.py (resolved)

src/scribe_data/check/check_missing_forms/get_forms.py (resolved)

src/scribe_data/wiktionary/parse_dump.py (outdated, resolved)

src/scribe_data/wiktionary/parse_dump.py (outdated, resolved)

andrewtavis added 3 commits January 25, 2025 09:53

Misc formatting + simplifying code where possible

8c10962

Adding comments and minor name changes

789b178

Fixes to query all user flow and outputs and test changes

06d77b2

andrewtavis approved these changes Jan 25, 2025

andrewtavis merged commit 78e1772 into scribe-org:main Jan 25, 2025
5 checks passed

This was referenced Jan 25, 2025

Allow for get (g) CLI functionality to be ran using a Wikidata lexemes dump #521

Closed

Create GitHub action to automatically update the emoji keywords functionality #542

Closed
