Scribe-data CLI tool implementation #140

mhmohona · 2024-06-04T02:27:40Z

Contributor checklist

This pull request is on a separate branch and not the main branch

Description

list-languages (-ll)

list available lang codes
- commands:
  scribe-data ll
  
  scribe-data languages-list
list available word types per lang
- commands:
  scribe-data list-word-types -l German

language (-l) and word-type (-wt)

commands:
scribe-data query -l English -wt nouns

scribe-data query -l English -wt verbs

scribe-data query -l English -wt translated_words

Related issue

Fixes

github-actions · 2024-06-04T02:27:57Z

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!

Maintainer checklist

The commit messages for the remote branch should be checked to make sure the contributor's email is set up correctly so that they receive credit for their contribution
- The contributor's name and icon in remote commits should be the same as what appears in the PR
- If there's a mismatch, the contributor needs to make sure that the email they use for GitHub matches what they have for git config user.email in their local Scribe-Data repo
The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

mhmohona · 2024-06-04T02:31:04Z

@andrewtavis, @wkyoshida, I have worked on listing all languages first, as this task seemed easier and I wanted to take a baby step towards the complete CLI tool. Here is the output of all languages -

The file name Will suggested first, scribe-data.py, seems better now, as it would make the commands look pretty. Also, I think we need to move the CLI script to the root directory so that the commands become simpler.

andrewtavis · 2024-06-04T07:04:02Z

I think that we could keep it as cli.py, since once Scribe-Data is installed we should be able to access it directly from the command line without saying python3 .... With that we should be able to name it what we want, and then we'd of course change the CLI trigger to scribe-data as Will suggested 😊

andrewtavis · 2024-06-04T07:06:33Z

Do you want to check the linting and formatting checks for your commit, @mhmohona? No stress on the Mac build fail, sadly, but the formatting check did have some things that need to be fixed. You can see that above!

This will hopefully get easier once the new contributor has the pre-commit issue done as the linting fixes will be done for you on commit 😊

mhmohona · 2024-06-05T04:38:51Z

Here is the update for query.

python3 src/scribe_data/cli.py query -l German -wt nouns

python3 src/scribe_data/cli.py query -l German -wt verbs

python3 src/scribe_data/cli.py query -l Russian -wt translated_words

Now the question: for emoji keywords, autosuggestions, and translations files, shall I add those as well?
Secondly, is the formatting ok? Or shall I put it in a table?

andrewtavis · 2024-06-05T22:20:42Z

Hey @mhmohona 👋 Checking on this, is this getting the JSON values from the language_data_export directory, or running update_data.py given the arguments? The latter would be the planned functionality, but we can also work towards it!

Another thing, maybe you could research how to look into implementing it so that we have the CLI installed when the package is installed. So in the installation instructions we have the following (pre-commit was just added and checks your commits - definitely suggested to adopt it 😊):

pip install --upgrade pip  # make sure that pip is at the latest version
pip install -r requirements.txt  # install dependencies
pip install -e .  # install the local version of Scribe-Data
pre-commit install  # install pre-commit hooks
# pre-commit run --all-files  # lint and fix common problems in the codebase

What would be really great would be if the process of doing pip install -e . would mean that rather than:

python3 src/scribe_data/cli.py query -l German -wt verbs

we could instead do:

scribe-data query -l German -wt verbs

I'm assuming that this would be changes in setup.py or another installation setting 🤔 @wkyoshida, do you have an idea on this?
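One common way to get this behavior with setuptools is a console_scripts entry point. Below is a minimal sketch of the mapping, assuming the CLI exposes a main() function in a module importable as scribe_data.cli (both names are assumptions for illustration and would need to match the actual package layout):

```python
# Sketch of the entry-point mapping that would live in setup.py.
# "scribe_data.cli:main" is an assumed module path and function name.
ENTRY_POINTS = {
    "console_scripts": [
        # <command name> = <importable module>:<callable>
        "scribe-data = scribe_data.cli:main",
    ],
}

# In setup.py this would be passed along as:
# setup(..., entry_points=ENTRY_POINTS)
```

After pip install -e . the installer generates a scribe-data executable that imports the module and calls the named function, so python3 path/to/file.py is no longer needed.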

mhmohona · 2024-06-06T02:11:40Z

@andrewtavis, so I have updated the commands. It now looks like this -

I still need to work on the language code, so that instead of writing English, only en would work.

andrewtavis · 2024-06-06T07:11:09Z

This is great, @mhmohona! Thanks for all the hard work!

andrewtavis · 2024-06-06T07:12:11Z

Requesting review from both of us eventually, @wkyoshida :) Let us know when it's ready for a final check, @mhmohona, and we're of course happy to answer questions along the way!

wkyoshida

Hey! Awesome to see the CLI taking shape!! 🚀

Just adding some thoughts here - but I'm fine if we leave most of them for later as work on the CLI implementation continues.
For now, what if we just take a look at the two comments that I marked with a 📍 just so we get the initial command structure going so we can close off #136 ? We can leave the implementation for the commands to other PRs; that's fine with me 👍

wkyoshida · 2024-06-09T00:51:13Z

src/scribe_data/cli.py

+    query_parser.add_argument('-l', '--language', required=True, help='Language code')
+    query_parser.add_argument('-wt', '--word-type', required=True, help='Word type')


Perhaps we'd want neither --language nor --word-type to be required? That way one would have the option to get all languages or all word types at once too.

I am thinking though that perhaps we might want a check that if one does specify --word-type, --language must also be specified in that case.

EDIT: Mistyped the first time. Meant to say "required" as opposed to "optional"
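A minimal sketch of that check with argparse, assuming both options become optional and the dependency is validated after parsing. The helper names build_parser and parse_query_args are hypothetical and not taken from the PR:

```python
import argparse
from typing import List


def build_parser() -> argparse.ArgumentParser:
    """Build a parser where neither query option is required."""
    parser = argparse.ArgumentParser(prog="scribe-data")
    subparsers = parser.add_subparsers(dest="command", required=True)

    query_parser = subparsers.add_parser("query")
    query_parser.add_argument("-l", "--language", help="Language name or code")
    query_parser.add_argument("-wt", "--word-type", help="Word type")
    return parser


def parse_query_args(argv: List[str]) -> argparse.Namespace:
    """Parse argv and enforce that --word-type is only used with --language."""
    parser = build_parser()
    args = parser.parse_args(argv)

    if args.word_type and not args.language:
        # parser.error prints the usage message and exits with status 2.
        parser.error("--word-type requires --language")

    return args
```

With this shape, scribe-data query alone is accepted (both values are None), while scribe-data query -wt nouns is rejected with a usage error.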

I'd say that for some cases we might not want to include --language, but that's also because I've been reconsidering the structure for #148. We have --language and --word-type as options, so maybe we could even do the following with a list command that uses them as arguments:

scribe-data list --language  # list all languages
scribe-data list --word-type  # list all word types
scribe-data list --language German --word-type  # list all German word types
scribe-data list --language --word-type nouns  # list all languages that you can get nouns for

The functionality has generally been focussed on using --language and --word-type to subset, but maybe this could also work? I'm just kind of throwing this out :) Would this make sense @mhmohona and @wkyoshida?
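For what it's worth, argparse can express options that work both with and without a value via nargs="?". The following is only a sketch of how the four invocations above could be told apart; build_list_parser and describe are hypothetical helpers, and the returned strings are placeholders for real listing logic:

```python
import argparse
from typing import List


def build_list_parser() -> argparse.ArgumentParser:
    """Parser for a hypothetical `list` command."""
    parser = argparse.ArgumentParser(prog="scribe-data list")
    # nargs="?" lets a flag appear with or without a value;
    # const=True marks "flag given, but no value supplied".
    parser.add_argument("--language", nargs="?", const=True, default=None)
    parser.add_argument("--word-type", nargs="?", const=True, default=None)
    return parser


def describe(argv: List[str]) -> str:
    """Return a label for what the given arguments ask to list."""
    args = build_list_parser().parse_args(argv)
    language, word_type = args.language, args.word_type

    if isinstance(language, str) and word_type is True:
        return f"word types for {language}"

    if language is True and isinstance(word_type, str):
        return f"languages with {word_type}"

    if language is True and word_type is None:
        return "all languages"

    if word_type is True and language is None:
        return "all word types"

    return "unsupported combination"
```

The trade-off is that each option then carries three states (absent, bare flag, flag with value), which makes the handling code more involved than with plain subsetting options.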

Regardless, for list-word-types --language wouldn't be necessary :)

Oh my comment above was regarding the query command actually hahah, but yea, I do see what you're saying with the list* commands 🤔 I feel a bit iffy with them too.. Let's try to hash it out in #148 to what might make more sense then!

If it makes sense - I was thinking of trying to keep this PR contained more so around getting #136 out of the way. Meaning - I'm OK if we just leave list-languages and list-word-types as is for now, while we decide on what to do

src/scribe_data/cli.py

wkyoshida · 2024-06-09T00:51:34Z

src/scribe_data/cli.py

+    print("Available languages:")
+    for lang in languages:
+        print(f"- {lang}")


Using language_meta_data.json, it could perhaps also be useful to output the "iso" and "qid" attributes as well for each language, along with the name

Ya returning those would also be nice :)
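A rough sketch of what such a listing could look like, assuming each metadata entry has "language", "iso" and "qid" keys as in language_meta_data.json. The function name format_language_table is hypothetical, and column widths are derived from the data rather than hard-coded:

```python
from typing import Dict, List


def format_language_table(languages: List[Dict[str, str]]) -> str:
    """Format language metadata entries as a left-aligned text table."""
    headers = ("Language", "ISO", "QID")
    rows = [
        (lang["language"].capitalize(), lang["iso"], lang["qid"])
        for lang in languages
    ]

    # Each column is as wide as its longest cell, header included.
    table = [headers, *rows]
    widths = [max(len(row[i]) for row in table) for i in range(len(headers))]

    lines = [
        "  ".join(cell.ljust(width) for cell, width in zip(row, widths)).rstrip()
        for row in table
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    sample = [
        {"language": "italian", "iso": "it", "qid": "Q652"},
        {"language": "spanish", "iso": "es", "qid": "Q1321"},
    ]
    print(format_language_table(sample))
```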

Getting data like this -

This is so great, @mhmohona! Really set the standard for how list looks now :)

wkyoshida · 2024-06-09T00:51:36Z

src/scribe_data/cli.py

+        word_types = [wt.stem for wt in language_dir.glob('*.json')]
+        if not word_types:
+            print(f"No word types available for language '{normalized_language}'.")
+            return


The available word types could simply be info that we grab from a new attribute "word-types" or something that we add to the language_meta_data.json

Note that I moved away from this in the most recent commits as I realized that we were gonna have an issue where if we didn't update the language_metadata.json (note the rename) file, then we'd lose functionality... Let's check on this, but I think a great way to go about this would be to assure that language_data_extraction has a directory for each word type, and then from there we can just read in the structure for the language-word type combinations 😊
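As a sketch of the directory-based idea, assuming a layout with one directory per language and one subdirectory per word type (this layout and the function name word_types_by_language are assumptions, not something confirmed in this thread):

```python
from pathlib import Path
from typing import Dict, List


def word_types_by_language(extraction_dir: Path) -> Dict[str, List[str]]:
    """Map each language directory to the word-type subdirectories it contains."""
    result: Dict[str, List[str]] = {}

    for lang_dir in sorted(extraction_dir.iterdir()):
        if not lang_dir.is_dir():
            continue  # skip stray files such as READMEs

        result[lang_dir.name] = sorted(
            child.name for child in lang_dir.iterdir() if child.is_dir()
        )

    return result
```

Because the result is read from disk each time, adding a new word-type directory makes it show up without touching the metadata file.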

wkyoshida · 2024-06-09T00:51:39Z

src/scribe_data/cli.py

+        word_types = set()
+        for lang_dir in DATA_DIR.iterdir():
+            if lang_dir.is_dir():
+                word_types.update(wt.stem for wt in lang_dir.glob('*.json'))
+
+        if not word_types:
+            print("No word types available.")
+            return


I'm a bit iffy with simply outputting all the word types that any language might have support for, as that seems to be maybe not quite as useful or even misleading since some are available in some languages but not others.

Perhaps what could be good here instead is to list all languages and then all word types within each language?
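One way this grouping could look, keeping the *.json lookup from the code under review but collecting results per language instead of into a single set. The helper names collect_word_types and format_word_types are hypothetical:

```python
from pathlib import Path
from typing import Dict, List


def collect_word_types(data_dir: Path) -> Dict[str, List[str]]:
    """Collect JSON file stems per language directory instead of one flat set."""
    return {
        lang_dir.name: sorted(wt.stem for wt in lang_dir.glob("*.json"))
        for lang_dir in sorted(data_dir.iterdir())
        if lang_dir.is_dir()
    }


def format_word_types(word_types: Dict[str, List[str]]) -> str:
    """Render each language followed by an indented list of its word types."""
    lines: List[str] = []

    for language, types in word_types.items():
        lines.append(f"{language}:")
        lines.extend(f"  - {word_type}" for word_type in types)

    return "\n".join(lines)
```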

src/scribe_data/cli.py

andrewtavis · 2024-06-22T11:17:23Z

Hey @mhmohona and @wkyoshida! 👋 I'm going to give this a look right now and fix up the test errors so we can bring this in :) We've got lots in here already, and I think the path ahead will be more clear once this is done and we can do one PR per issue 😊

andrewtavis · 2024-06-22T11:52:49Z

setup.py

@@ -47,6 +47,11 @@
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://github.com/scribe-org/Scribe-Data",
+    entry_points={


Super great 😊

andrewtavis · 2024-06-22T11:53:56Z

src/scribe_data/cli/list.py

@@ -0,0 +1,153 @@
+import json


We need the license note at the top of this file, but I'm adding it to my local version of the PR now :)

andrewtavis · 2024-06-22T11:55:08Z

src/scribe_data/cli/list.py

+from pathlib import Path
+from typing import Dict, List, Union
+
+# Load language metadata from JSON file


Full line comments should have a period at the end of them as they're complete sentences :)

andrewtavis · 2024-06-22T12:04:56Z

src/scribe_data/cli/list.py

+    with METADATA_FILE.open('r', encoding='utf-8') as file:
+        return json.load(file)
+
+LANGUAGE_METADATA = load_language_metadata()


I'm going to switch these to just normal snake case as to me "screaming snake case" is for constants that are defined directly, not function results. METADATA_FILE above is ok, but generally I'd consider:

A_STRING = "string"

A_NEW_STRING = f"new {A_STRING}"

AN_INT = 5

etc

The METADATA_FILE path should not change, but then we're going to be expanding language_metadata from time to time.

Let me know if this makes sense, @mhmohona and @wkyoshida!

andrewtavis · 2024-06-22T12:06:22Z

src/scribe_data/cli/list.py

+LANGUAGE_METADATA = load_language_metadata()
+LANGUAGE_MAP = {lang['language'].lower(): lang for lang in LANGUAGE_METADATA['languages']}
+
+DATA_DIR = Path('scribe_data_json_export')


Will move this up so that all screaming snake case variables are defined together at the top of the file :)

andrewtavis · 2024-06-22T12:07:40Z

src/scribe_data/cli/list.py

+# Load language metadata from JSON file
+METADATA_FILE = Path(__file__).parent.parent / 'resources' / 'language_meta_data.json'
+
+def load_language_metadata() -> Dict:


I think we're also fine to just use the with statement directly without the function :)

andrewtavis · 2024-06-22T12:13:34Z

src/scribe_data/cli/list.py

+DATA_DIR = Path('scribe_data_json_export')
+
+def print_formatted_data(data: Union[Dict, List], word_type: str) -> None:
+    if not data:


Let's put a vertical space between the if-else statements so it's a bit easier to read them 🤓

So:

if something:
    True
# <- Space here so it's easy to see where each condition ends.
else:
    False

andrewtavis · 2024-06-22T12:14:06Z

src/scribe_data/cli/list.py

+        return
+
+    if word_type == 'autosuggestions':
+        max_key_length = max((len(key) for key in data.keys()), default=0)


The definition of max_key_length is the same for all of these, so we can move it above the first if-statement :)

andrewtavis · 2024-06-22T12:17:19Z

src/scribe_data/cli/list.py

+            print(data)
+
+def list_languages() -> None:
+    languages = [lang for lang in LANGUAGE_METADATA['languages']]


We can just do list(language_metadata["languages"]) here :)

andrewtavis · 2024-06-22T12:19:25Z

src/scribe_data/cli/list.py

+
+    # Define column widths
+    language_col_width = max(len(lang['language']) for lang in languages) + 2
+    iso_col_width = 5  # Length of "ISO" column header + padding


Lowercase "length" here, as inline comments aren't full sentences - so no period and the first word is lowercase :) Full line comments are complete sentences with a capitalized first letter and a period at the end, as mentioned above :) I'll fix these!

Also, I converted the lines here over to use the dictionary as above in case other languages we add later have longer QIDs :)

andrewtavis · 2024-06-22T12:29:43Z

src/scribe_data/cli/list.py

+    for lang in languages:
+        print(f"{lang['language'].capitalize():<{language_col_width}} {lang['iso']:<{iso_col_width}} {lang['qid']:<{qid_col_width}}")
+
+def list_word_types(language: str = None) -> None:


Quick note, @mhmohona: let's for sure include documentation strings for all functions. You can check other files to see how to write them :)

So:

def fxn():
    """
    Docstring :)
    """
    return

The best part is that we'll be able to have these auto-documented via the docstrings in the Sphinx docs :)

Note that we do need the structure of the docstrings to be consistent though as there's a parser from Numpy that will make the docs. Again, just copy the structure from other docstrings and all will be well 😊
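For illustration only, here is a docstring using the general numpydoc section layout (Parameters / Returns with dashed underlines). The exact structure used in Scribe-Data isn't shown in this thread, so the existing files remain the reference, and normalize_language is a hypothetical function:

```python
def normalize_language(language: str) -> str:
    """
    Normalize a language name for lookups.

    Parameters
    ----------
    language : str
        The language name as typed by the user.

    Returns
    -------
    str
        The name stripped of surrounding whitespace and lowercased.
    """
    return language.strip().lower()
```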

andrewtavis · 2024-06-22T12:31:19Z

src/scribe_data/cli/main.py

+"""
+Setup and commands for the Scribe-Data command line interface.
+
+.. raw:: html


This copyright notice is the one I was talking about. We should have it at the top of each Python file and change the first line for the functionality of the file :)

andrewtavis · 2024-06-22T12:34:41Z

src/scribe_data/cli/main.py

+    parser = argparse.ArgumentParser(description='Scribe-Data CLI Tool')
+    subparsers = parser.add_subparsers(dest='command', required=True)
+
+    # List command


I think that a good way to do this would be # MARK:, which will further put a section in the minimap. I've been starting to do this in my Python code recently, and it's really great!

Check out the new version of cli/main.py to see what I mean :) Here's a screenshot of my minimap in VS Code with all the MARKs showing:

andrewtavis · 2024-06-22T12:37:33Z

src/scribe_data/resources/language_meta_data.json

    },
    {
      "language": "italian",
      "iso": "it",
      "qid": "Q652",
      "remove-words": ["of", "the", "The", "and", "text", "from"],
-      "ignore-words": ["The", "ATP"]
+      "ignore-words": ["The", "ATP"],
+      "word-types": ["nouns", "verbs", "translations", "emoji_keywords", "prepositions", "autosuggestions"]


I don't think that we have the prepositions set up for all of the languages :)

Note that the above was a main justification for moving away from this, as it's 100% going to be a thing that will happen again and again as Scribe-Data develops and new word types are added to new languages. Best to just check what's actually in the language_data_extraction directories.

andrewtavis · 2024-06-22T13:14:05Z

src/scribe_data/resources/language_meta_data.json

    },
    {
      "language": "spanish",
      "iso": "es",
      "qid": "Q1321",
      "remove-words": ["of", "the", "The", "and"],
-      "ignore-words": []
+      "ignore-words": [],
+      "word-types": ["nouns", "verbs", "translations", "emoji_keywords", "prepositions", "autosuggestions"]


Also looking into this, @mhmohona, it looks like you need to install an autoformatter as this would normally be taken care of. The one that I use is ruff :)

Note that I removed the word-types keys given that we're now using the directory structure to derive what can be queried 😊

andrewtavis · 2024-06-22T13:25:42Z

src/scribe_data/cli/utils.py

+LANGUAGE_METADATA = load_language_metadata()
+LANGUAGE_MAP = {lang['language'].lower(): lang for lang in LANGUAGE_METADATA['languages']}
+
+def print_formatted_data(data: Union[Dict, List], word_type: str) -> None:


Let's keep this copy of print_formatted_data as I think it makes sense to have it in utils :)

@mhmohona, is this function needed at this point? This was what you were using for the original outputs? I didn't see print_formatted_data referenced anywhere when I did my most recent review 🤔

andrewtavis · 2024-06-22T16:39:24Z

src/scribe_data/cli/query.py

+
+def query_data(language: str = None, word_type: str = None, output_dir: Optional[str] = None, overwrite: bool = False, output_type: Optional[str] = None) -> None:
+    if not (language and word_type):
+        print("Error: You must provide both --language (-l) and --word-type (-wt) options.")


For places like this, let's raise ValueError(...) :)

This way we don't need to return afterwards as well, and the error output will be similar to what people expect with the full trace rather than just a print statement.
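A minimal sketch of that change, with the signature trimmed to the two arguments involved in the check. The return value is a placeholder assumption, since the real function goes on to run the query:

```python
from typing import Optional


def query_data(language: Optional[str] = None, word_type: Optional[str] = None) -> str:
    """Validate the query arguments, raising instead of printing on bad input."""
    if not (language and word_type):
        # Raising stops execution here, so no explicit return is needed.
        raise ValueError(
            "You must provide both --language (-l) and --word-type (-wt) options."
        )

    # Placeholder result; the actual querying logic is out of scope here.
    return f"{language}:{word_type}"
```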

andrewtavis

Thanks for all the hard work here, @mhmohona! I think this and the changes I just sent along set us up very nicely for the next couple of issues :) Lots of work still to be done, but so much progress has already been made!

Some notes here:

Please read the above comments for the feedback on style and coding conventions :)
I've added in a cli/convert.py file where all of the functionality to change a data type over to another should live
The list functionality is now based on the directory structure of Scribe-Data
- The reason for this is that we don't want to have to update the language metadata every time we implement something, as we might forget and then ship something that's not functional solely because we didn't update the metadata
- As of now this means that "emoji_keywords" and "autosuggestions" are not showing up as options for the CLI (we can add these back in later!)
The CLI commands are now fully mapped out based on what we'd discussed, and I've also updated the description and put in an epilog that tells people to come here to the GitHub or to the docs 📝
We're now returning a table for listing word types as well so that --list outputs are consistent

I'll close out what issues can now be closed and write some further notes in the others!

Thanks again for the dedication and great work! 👏

add script for all language list

179184b

mhmohona added 2 commits June 5, 2024 09:13

Merge branch 'scribe-org:main' into cli

35813a6

add query word

7c4dde3

mhmohona added 2 commits June 6, 2024 06:29

Merge branch 'scribe-org:main' into cli

4fb06a9

update the commands

4731c6d

andrewtavis requested review from wkyoshida and andrewtavis June 6, 2024 07:11

add language code

fcec4e0

andrewtavis mentioned this pull request Jun 7, 2024

Implement query functionality within CLI #143

Closed

2 tasks

This was linked to issues Jun 7, 2024

Implement query functionality within CLI #143

Closed

Implement command structure for the CLI #136

Closed

andrewtavis mentioned this pull request Jun 7, 2024

Implement the CLI list-languages and list-word-types functionality #148

Closed

2 tasks

update as per requirement in scribe-org#148

e016703

wkyoshida reviewed Jun 9, 2024


wkyoshida and others added 7 commits June 8, 2024 22:28

Merge branch 'main' into cli

820ddc9

Merge branch 'scribe-org:main' into cli

66d66f1

update cli file structure

6e6da98

rename files, fix commands for list

2487d6d

changed alias for query into q

bc6c7da

getting lang info from language_meta_data.json

c434856

show formatted data from meta file

e1e8e68

add not implemented function

d39dd29

mhmohona requested a review from wkyoshida June 19, 2024 00:39

mhmohona added 2 commits June 19, 2024 07:50

added --output-dir and --overwrite - scribe-org#144

4f63cf0

implementation of scribe-org#146

afa4eef

andrewtavis reviewed Jun 22, 2024


andrewtavis added 5 commits June 22, 2024 19:52

Update CLI structure + refactoring

6958366

Switch over word type correction + file rename

10511ae

Remove word-type keys from language metadata

78ae17e

Remove word-type description from language metadata

eb74dff

File spacing and comment formatting

a212390

andrewtavis approved these changes Jun 22, 2024


andrewtavis merged commit 11f4f94 into scribe-org:main Jun 22, 2024
2 checks passed
