Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scribe-data CLI tool implementation #140

Merged
merged 22 commits into from
Jun 22, 2024
Merged

Conversation

mhmohona
Copy link
Member

@mhmohona mhmohona commented Jun 4, 2024

Contributor checklist


Description

  1. list-languages (-ll)
  • list available lang codes

    • commands:
      scribe-data ll

      scribe-data languages-list
  • list available word types per lang

    • commands:
      scribe-data list-word-types -l German
  1. language (-l) and word-type (--wt)
  • commands:
    scribe-data query -l English -wt nouns

    scribe-data query -l English -wt verbs

    scribe-data query -l English -wt translated_words

Related issue

Fixes

Copy link

github-actions bot commented Jun 4, 2024

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you!

Maintainer checklist

  • The commit messages for the remote branch should be checked to make sure the contributor's email is set up correctly so that they receive credit for their contribution

    • The contributor's name and icon in remote commits should be the same as what appears in the PR
    • If there's a mismatch, the contributor needs to make sure that the email they use for GitHub matches what they have for git config user.email in their local Scribe-Data repo
  • The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

@mhmohona
Copy link
Member Author

mhmohona commented Jun 4, 2024

@andrewtavis, @wkyoshida, I have worked on listing all language as this task seemed easier and wanted to take baby step towards the complete cli tool. Here is the output of all languages -
image

The file name Will suggested first, which was scribe-data.py seems better now, as it would make the commands look pretty. Also I think we need to move the cli script in root directory, so that commands become simple.

@andrewtavis
Copy link
Member

I think that we could keep it as cli.py, as something that will happen is that when Scribe-Data is installed we should be able to directly access it from the command like without saying python3 .... With that we should be able to name it what we want, and then we'd of course change the cli trigger to scribe-data as Will you suggested 😊

@andrewtavis
Copy link
Member

Do you want to check the linting and formatting checks for your commit, @mhmohona? No stress on the Mac build fail, sadly, but the formatting check did have some things that need to be fixed. You can see that above!

This will hopefully get easier once the new contributor has the pre-commit issue done as the linting fixes will be done for you on commit 😊

@mhmohona
Copy link
Member Author

mhmohona commented Jun 5, 2024

Here is the update for query.

python3 src/scribe_data/cli.py query -l German -wt nouns

image

python3 src/scribe_data/cli.py query -l German -wt verbs

image

python3 src/scribe_data/cli.py query -l Russian -wt translated_words

image

Now the question, for emoji keywords, auto suggestions, and translations files - shall I add about them as well?
Secondly, is the formatting ok? Or shall I put it in table?

@andrewtavis
Copy link
Member

andrewtavis commented Jun 5, 2024

Hey @mhmohona 👋 Checking on this, is this getting the JSON values from the language_data_export directory, or running update_data.py given the arguments? The latter would be the planned functionality, but we can also works towards it!

Another thing, maybe you could research how to look into implementing it so that we have the CLI installed when the package is installed. So in the installation instructions we have the following (pre-commit was just added and checks your commits - definitely suggested to adopt it 😊):

pip install --upgrade pip  # make sure that pip is at the latest version
pip install -r requirements.txt  # install dependencies
pip install -e .  # install the local version of Scribe-Data
pre-commit install  # install pre-commit hooks
# pre-commit run --all-files  # lint and fix common problems in the codebase

What would be really great would be if the process of doing pip install -e . would mean that rather than:

python3 src/scribe_data/cli.py query -l German -wt verbs

we could instead do:

scribe-data query -l German -wt verbs

I'm assuming that this would be changes in setup.py or another installation setting 🤔 @wkyoshida, do you have an idea on this?

@mhmohona
Copy link
Member Author

mhmohona commented Jun 6, 2024

@andrewtavis,so Ihave updated the commands. It now looks like this -

image

Still needs to work on the language code, so instead of writing English,only en would work.

@andrewtavis
Copy link
Member

This is great, @mhmohona! Thanks for all the hard work!

@andrewtavis
Copy link
Member

Requesting review from both of us eventually, @wkyoshida :) Let us know when it's ready for a final check, @mhmohona, and we're of course happy to answer questions along the way!

Copy link
Member

@wkyoshida wkyoshida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey! Awesome to see the CLI taking shape!! 🚀

Just adding some thoughts here - but I'm fine if we leave most of them for later as work on the CLI implementation continues.
For now, what if we just take a look at the two comments that I marked with a 📍 just so we get the initial command structure going so we can close off #136 ? We can leave the implementation for the commands to other PRs; that's fine with me 👍

Comment on lines 178 to 179
query_parser.add_argument('-l', '--language', required=True, help='Language code')
query_parser.add_argument('-wt', '--word-type', required=True, help='Word type')
Copy link
Member

@wkyoshida wkyoshida Jun 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wouldn't we want that neither --language nor --word-type be required perhaps? So that one would have the option to get all languages or all word types at once too?

I am thinking though that perhaps we might want a check that if one does specify --word-type, --language must also be specified in that case.


EDIT: Mistyped the first time. Meant to say "required" as opposed to "optional"

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say that for some cases we might not want to include --language, but that's also because I've been reconsidering the structure for #148. We have --language and --word-type as options, so maybe we could even do the following with a list command that uses them as arguments:

scribe-data list --language  # list all languages
scribe-data list --word-type  # list all word types
scribe-data list --language German --word-type  # list all German word types
scribe-data list --language --word-type nouns  # list all languages that you can get nouns for

The functionality has generally been focussed on using --language and --word-type to subset, but maybe this could also work? I'm just kind of throwing this out :) Would this make sense @mhmohona and @wkyoshida?

Regardless, for list-word-types --language wouldn't be necessary :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh my comment above was regarding the query command actually hahah, but yea, I do see what you're saying with the list* commands 🤔 I feel a bit iffy with them too.. Let's try to hash it out in #148 to what might make more sense then!

If it makes sense - I was thinking of trying to keep this PR contained more so around getting #136 out of the way. Meaning - I'm OK if we just leave list-languages and list-word-types as is for now, while we decide on what to do

src/scribe_data/cli.py Outdated Show resolved Hide resolved
src/scribe_data/cli.py Outdated Show resolved Hide resolved
src/scribe_data/cli.py Outdated Show resolved Hide resolved
Comment on lines 57 to 59
print("Available languages:")
for lang in languages:
print(f"- {lang}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using language_meta_data.json, it could perhaps also be useful to output the "iso" and "qid" attributes as well for each language, along with the name

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ya returning those would also be nice :)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting data like this -

image

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is so great, @mhmohona! Really set the standard for how list looks now :)

Comment on lines 74 to 77
word_types = [wt.stem for wt in language_dir.glob('*.json')]
if not word_types:
print(f"No word types available for language '{normalized_language}'.")
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The available word types could simply be info that we grab from a new attribute "word-types" or something that we add to the language_meta_data.json

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I moved away from this in the most recent commits as I realized that we were gonna have an issue where if we didn't update the language_metadata.json (note the rename) file, then we'd lose functionality... Let's check on this, but I think a great way to go about this would be to assure that language_data_extraction has a directory for each word type, and then from there we can just read in the structure for the language-word type combinations 😊

Comment on lines 84 to 91
word_types = set()
for lang_dir in DATA_DIR.iterdir():
if lang_dir.is_dir():
word_types.update(wt.stem for wt in lang_dir.glob('*.json'))

if not word_types:
print("No word types available.")
return
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a bit iffy with simply outputting all the word types that any language might have support for, as that seems to be maybe not quite as useful or even misleading since some are available in some languages but not others.

Perhaps what could be good here instead is to list all languages and then all word types within each language?

src/scribe_data/cli.py Outdated Show resolved Hide resolved
@mhmohona mhmohona requested a review from wkyoshida June 19, 2024 00:39
@andrewtavis
Copy link
Member

andrewtavis commented Jun 22, 2024

Hey @mhmohona and @wkyoshida! 👋 I'm going to give this a look right now and fix up the test errors so we can bring this in :) We've got lots in here already, and I think the path ahead will be more clear once this is done and we can do one PR per issue 😊

@@ -47,6 +47,11 @@
long_description=long_description,
long_description_content_type="text/markdown",
url="https://github.com/scribe-org/Scribe-Data",
entry_points={
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super great 😊

@@ -0,0 +1,153 @@
import json
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need the license note that the top of this file, but I'm adding it to my local version of the PR now :)

from pathlib import Path
from typing import Dict, List, Union

# Load language metadata from JSON file
Copy link
Member

@andrewtavis andrewtavis Jun 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Full line comments should have a period at the end of them as they're complete sentences :)

with METADATA_FILE.open('r', encoding='utf-8') as file:
return json.load(file)

LANGUAGE_METADATA = load_language_metadata()
Copy link
Member

@andrewtavis andrewtavis Jun 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to switch these to just normal snake case as to me "screaming snake case" is for constants that are defined directly, not function results. METADATA_FILE above is ok, but generally I'd consider:

  • A_STRING = "string"
  • A_NEW_STRING = f"new {A_STRING}"
  • AN_INT = 5
  • etc

The METADATA_FILE path should not change, but then we're going to be expanding language_metadata from time to time.

Let me know if this makes sense, @mhmohona and @wkyoshida!

LANGUAGE_METADATA = load_language_metadata()
LANGUAGE_MAP = {lang['language'].lower(): lang for lang in LANGUAGE_METADATA['languages']}

DATA_DIR = Path('scribe_data_json_export')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will move this up so that all screaming snake case variables are defined together at the top of the file :)

# Load language metadata from JSON file
METADATA_FILE = Path(__file__).parent.parent / 'resources' / 'language_meta_data.json'

def load_language_metadata() -> Dict:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we're also fine to just use the with statement directly without the function :)

DATA_DIR = Path('scribe_data_json_export')

def print_formatted_data(data: Union[Dict, List], word_type: str) -> None:
if not data:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's put a vertical space between the if-else statements so it's a bit easier to read them 🤓

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So:

if something:
   True
# <- Space here so it's easy to see where each condition ends. 
else:
    False

return

if word_type == 'autosuggestions':
max_key_length = max((len(key) for key in data.keys()), default=0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The definition of max_key_length is the same for all of these, so we can move it above the first if-statement :)

print(data)

def list_languages() -> None:
languages = [lang for lang in LANGUAGE_METADATA['languages']]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can just do list(language_metadata["languages"]) here :)


# Define column widths
language_col_width = max(len(lang['language']) for lang in languages) + 2
iso_col_width = 5 # Length of "ISO" column header + padding
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lowercase length as inline comments aren't full sentence - so no period and first word lowercase :) Full line comments are complete sentences with a capitalized first letter and a period at the end, as mentioned above :) I'll fix these!

Copy link
Member

@andrewtavis andrewtavis Jun 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, I converted the lines here over to use the dictionary as above in case other languages we add later have longer QIDs :)

for lang in languages:
print(f"{lang['language'].capitalize():<{language_col_width}} {lang['iso']:<{iso_col_width}} {lang['qid']:<{qid_col_width}}")

def list_word_types(language: str = None) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick note, @mhmohona: let's for sure include documentation strings for all functions. You can check other files to see how to write them :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So:

def fxn():
    """
    Docstring :)
    """
    return

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The best part is that we'll be able to have these auto-documented via the docstrings in the Sphinx docs :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that we do need the structure of the docstrings to be consistent though as there's a parser from Numpy that will make the docs. Again, just copy the structure from other docstrings and all will be well 😊

"""
Setup and commands for the Scribe-Data command line interface.

.. raw:: html
Copy link
Member

@andrewtavis andrewtavis Jun 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This copyright notice is the one I was talking about. We should have it at the top of each Python file and change the first line for the functionality of the file :)

parser = argparse.ArgumentParser(description='Scribe-Data CLI Tool')
subparsers = parser.add_subparsers(dest='command', required=True)

# List command
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that a good way to do this would be # MARK:, which will further put a section in the minimap. I've been starting to do this in my Python code recently, and it's really great!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check out the new version of cli/main.py to see what I mean :) Here's a screenshot of my minimap in VS Code with all the MARKs showing:

Screenshot_2024-06-22 22 06 29

},
{
"language": "italian",
"iso": "it",
"qid": "Q652",
"remove-words": ["of", "the", "The", "and", "text", "from"],
"ignore-words": ["The", "ATP"]
"ignore-words": ["The", "ATP"],
"word-types": ["nouns", "verbs", "translations", "emoji_keywords", "prepositions", "autosuggestions"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that we have the prepositions set up for all of the languages :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that the above was a main justification for moving away from this, as it's 100% going to be a thing that will happen again and again as Scribe-Data develops and new word types are added to new languages. Best to just check what's actually in the language_data_extraction directories.

},
{
"language": "spanish",
"iso": "es",
"qid": "Q1321",
"remove-words": ["of", "the", "The", "and"],
"ignore-words": []
"ignore-words": [],
"word-types": ["nouns", "verbs", "translations", "emoji_keywords", "prepositions", "autosuggestions"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also looking into this, @mhmohona, it looks like you need to install an autoformatter as this would normally be taken care of. The one that I use is ruff :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Note that I removed the word-types keys given that we're now using the directory structure to derive what can be queried 😊

LANGUAGE_METADATA = load_language_metadata()
LANGUAGE_MAP = {lang['language'].lower(): lang for lang in LANGUAGE_METADATA['languages']}

def print_formatted_data(data: Union[Dict, List], word_type: str) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep this copy of print_formatted_data as I think it makes sense to have it in utils :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mhmohona, is this function needed at this point? This was what you were using for the original outputs? I didn't see print_formatted_data referenced anywhere when I did my most recent review 🤔


def query_data(language: str = None, word_type: str = None, output_dir: Optional[str] = None, overwrite: bool = False, output_type: Optional[str] = None) -> None:
if not (language and word_type):
print("Error: You must provide both --language (-l) and --word-type (-wt) options.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For places like this, let's raise ValueError(...) :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way we don't need to return afterwards as well, and the error output will be similar to what people expect with the full trace rather than just a print statement.

Copy link
Member

@andrewtavis andrewtavis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the hard work here, @mhmohona! I think this and the changes I just sent along set us up very nicely for the next couple of issues :) Lots of work still to be done, but so much progress has already been made!

Some notes here:

  • Please read the above comments for the feedback on style and coding conventions :)
  • I've added in a cli/convert.py file where all of the functionality to change a data type over to another should live
  • The list functionality is now based on the directory structure of Scribe-Data
    • Reason for this is that we don't want to need to update the language metadata every time we implement something as maybe we forget and then ship something that's not functional solely because we didn't update the metadata
    • As of now this means that "emoji_keywords" and "autosuggestions" are not showing up as options for the CLI (we can add these back in later!)
  • The CLI commands are now fully mapped out based on what we'd discussed, and I've also updated the description and put in an epilog that tells people to come here to the GitHub or to the docs 📝
  • We're now returning a table for listing word types as well so that --list outputs are consistent

I'll close out what issues can now be closed and write some further notes in the others!

Thanks again for the dedication and great work! 👏

@andrewtavis andrewtavis merged commit 11f4f94 into scribe-org:main Jun 22, 2024
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement query functionality within CLI Implement command structure for the CLI
3 participants