-
Notifications
You must be signed in to change notification settings - Fork 67
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Scribe-data CLI tool implementation #140
Conversation
Thank you for the pull request!The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :) If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. Also consider joining our bi-weekly Saturday dev syncs. It'd be great to have you! Maintainer checklist
|
@andrewtavis, @wkyoshida, I have worked on listing all language as this task seemed easier and wanted to take baby step towards the complete cli tool. Here is the output of all languages - The file name Will suggested first, which was |
I think that we could keep it as cli.py, as something that will happen is that when Scribe-Data is installed we should be able to directly access it from the command like without saying |
Do you want to check the linting and formatting checks for your commit, @mhmohona? No stress on the Mac build fail, sadly, but the formatting check did have some things that need to be fixed. You can see that above! This will hopefully get easier once the new contributor has the pre-commit issue done as the linting fixes will be done for you on commit 😊 |
Here is the update for
Now the question, for emoji keywords, auto suggestions, and translations files - shall I add about them as well? |
Hey @mhmohona 👋 Checking on this, is this getting the JSON values from the language_data_export directory, or running Another thing, maybe you could research how to look into implementing it so that we have the CLI installed when the package is installed. So in the installation instructions we have the following (pre-commit was just added and checks your commits - definitely suggested to adopt it 😊): pip install --upgrade pip # make sure that pip is at the latest version
pip install -r requirements.txt # install dependencies
pip install -e . # install the local version of Scribe-Data
pre-commit install # install pre-commit hooks
# pre-commit run --all-files # lint and fix common problems in the codebase What would be really great would be if the process of doing python3 src/scribe_data/cli.py query -l German -wt verbs we could instead do: scribe-data query -l German -wt verbs I'm assuming that this would be changes in setup.py or another installation setting 🤔 @wkyoshida, do you have an idea on this? |
@andrewtavis,so Ihave updated the commands. It now looks like this - Still needs to work on the language code, so instead of writing |
This is great, @mhmohona! Thanks for all the hard work! |
Requesting review from both of us eventually, @wkyoshida :) Let us know when it's ready for a final check, @mhmohona, and we're of course happy to answer questions along the way! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey! Awesome to see the CLI taking shape!! 🚀
Just adding some thoughts here - but I'm fine if we leave most of them for later as work on the CLI implementation continues.
For now, what if we just take a look at the two comments that I marked with a 📍 just so we get the initial command structure going so we can close off #136 ? We can leave the implementation for the commands to other PRs; that's fine with me 👍
src/scribe_data/cli.py
Outdated
query_parser.add_argument('-l', '--language', required=True, help='Language code') | ||
query_parser.add_argument('-wt', '--word-type', required=True, help='Word type') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't we want that neither --language
nor --word-type
be required perhaps? So that one would have the option to get all languages or all word types at once too?
I am thinking though that perhaps we might want a check that if one does specify --word-type
, --language
must also be specified in that case.
EDIT: Mistyped the first time. Meant to say "required" as opposed to "optional"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd say that for some cases we might not want to include --language
, but that's also because I've been reconsidering the structure for #148. We have --language
and --word-type
as options, so maybe we could even do the following with a list
command that uses them as arguments:
scribe-data list --language # list all languages
scribe-data list --word-type # list all word types
scribe-data list --language German --word-type # list all German word types
scribe-data list --language --word-type nouns # list all languages that you can get nouns for
The functionality has generally been focussed on using --language
and --word-type
to subset, but maybe this could also work? I'm just kind of throwing this out :) Would this make sense @mhmohona and @wkyoshida?
Regardless, for list-word-types
--language
wouldn't be necessary :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh my comment above was regarding the query
command actually hahah, but yea, I do see what you're saying with the list*
commands 🤔 I feel a bit iffy with them too.. Let's try to hash it out in #148 to what might make more sense then!
If it makes sense - I was thinking of trying to keep this PR contained more so around getting #136 out of the way. Meaning - I'm OK if we just leave list-languages
and list-word-types
as is for now, while we decide on what to do
src/scribe_data/cli.py
Outdated
print("Available languages:") | ||
for lang in languages: | ||
print(f"- {lang}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using language_meta_data.json
, it could perhaps also be useful to output the "iso"
and "qid"
attributes as well for each language, along with the name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ya returning those would also be nice :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is so great, @mhmohona! Really set the standard for how list
looks now :)
src/scribe_data/cli.py
Outdated
word_types = [wt.stem for wt in language_dir.glob('*.json')] | ||
if not word_types: | ||
print(f"No word types available for language '{normalized_language}'.") | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The available word types could simply be info that we grab from a new attribute "word-types"
or something that we add to the language_meta_data.json
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that I moved away from this in the most recent commits as I realized that we were gonna have an issue where if we didn't update the language_metadata.json
(note the rename) file, then we'd lose functionality... Let's check on this, but I think a great way to go about this would be to assure that language_data_extraction
has a directory for each word type, and then from there we can just read in the structure for the language-word type combinations 😊
src/scribe_data/cli.py
Outdated
word_types = set() | ||
for lang_dir in DATA_DIR.iterdir(): | ||
if lang_dir.is_dir(): | ||
word_types.update(wt.stem for wt in lang_dir.glob('*.json')) | ||
|
||
if not word_types: | ||
print("No word types available.") | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit iffy with simply outputting all the word types that any language might have support for, as that seems to be maybe not quite as useful or even misleading since some are available in some languages but not others.
Perhaps what could be good here instead is to list all languages and then all word types within each language?
Hey @mhmohona and @wkyoshida! 👋 I'm going to give this a look right now and fix up the test errors so we can bring this in :) We've got lots in here already, and I think the path ahead will be more clear once this is done and we can do one PR per issue 😊 |
@@ -47,6 +47,11 @@ | |||
long_description=long_description, | |||
long_description_content_type="text/markdown", | |||
url="https://github.com/scribe-org/Scribe-Data", | |||
entry_points={ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super great 😊
@@ -0,0 +1,153 @@ | |||
import json |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need the license note that the top of this file, but I'm adding it to my local version of the PR now :)
src/scribe_data/cli/list.py
Outdated
from pathlib import Path | ||
from typing import Dict, List, Union | ||
|
||
# Load language metadata from JSON file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Full line comments should have a period at the end of them as they're complete sentences :)
src/scribe_data/cli/list.py
Outdated
with METADATA_FILE.open('r', encoding='utf-8') as file: | ||
return json.load(file) | ||
|
||
LANGUAGE_METADATA = load_language_metadata() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm going to switch these to just normal snake case as to me "screaming snake case" is for constants that are defined directly, not function results. METADATA_FILE
above is ok, but generally I'd consider:
A_STRING = "string"
A_NEW_STRING = f"new {A_STRING}"
AN_INT = 5
- etc
The METADATA_FILE
path should not change, but then we're going to be expanding language_metadata
from time to time.
Let me know if this makes sense, @mhmohona and @wkyoshida!
src/scribe_data/cli/list.py
Outdated
LANGUAGE_METADATA = load_language_metadata() | ||
LANGUAGE_MAP = {lang['language'].lower(): lang for lang in LANGUAGE_METADATA['languages']} | ||
|
||
DATA_DIR = Path('scribe_data_json_export') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will move this up so that all screaming snake case variables are defined together at the top of the file :)
src/scribe_data/cli/list.py
Outdated
# Load language metadata from JSON file | ||
METADATA_FILE = Path(__file__).parent.parent / 'resources' / 'language_meta_data.json' | ||
|
||
def load_language_metadata() -> Dict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we're also fine to just use the with
statement directly without the function :)
src/scribe_data/cli/list.py
Outdated
DATA_DIR = Path('scribe_data_json_export') | ||
|
||
def print_formatted_data(data: Union[Dict, List], word_type: str) -> None: | ||
if not data: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's put a vertical space between the if-else statements so it's a bit easier to read them 🤓
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So:
if something:
True
# <- Space here so it's easy to see where each condition ends.
else:
False
src/scribe_data/cli/list.py
Outdated
return | ||
|
||
if word_type == 'autosuggestions': | ||
max_key_length = max((len(key) for key in data.keys()), default=0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The definition of max_key_length is the same for all of these, so we can move it above the first if-statement :)
src/scribe_data/cli/list.py
Outdated
print(data) | ||
|
||
def list_languages() -> None: | ||
languages = [lang for lang in LANGUAGE_METADATA['languages']] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can just do list(language_metadata["languages"])
here :)
src/scribe_data/cli/list.py
Outdated
|
||
# Define column widths | ||
language_col_width = max(len(lang['language']) for lang in languages) + 2 | ||
iso_col_width = 5 # Length of "ISO" column header + padding |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lowercase length
as inline comments aren't full sentence - so no period and first word lowercase :) Full line comments are complete sentences with a capitalized first letter and a period at the end, as mentioned above :) I'll fix these!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I converted the lines here over to use the dictionary as above in case other languages we add later have longer QIDs :)
for lang in languages: | ||
print(f"{lang['language'].capitalize():<{language_col_width}} {lang['iso']:<{iso_col_width}} {lang['qid']:<{qid_col_width}}") | ||
|
||
def list_word_types(language: str = None) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick note, @mhmohona: let's for sure include documentation strings for all functions. You can check other files to see how to write them :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So:
def fxn():
"""
Docstring :)
"""
return
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The best part is that we'll be able to have these auto-documented via the docstrings in the Sphinx docs :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that we do need the structure of the docstrings to be consistent though as there's a parser from Numpy that will make the docs. Again, just copy the structure from other docstrings and all will be well 😊
""" | ||
Setup and commands for the Scribe-Data command line interface. | ||
|
||
.. raw:: html |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This copyright notice is the one I was talking about. We should have it at the top of each Python file and change the first line for the functionality of the file :)
src/scribe_data/cli/main.py
Outdated
parser = argparse.ArgumentParser(description='Scribe-Data CLI Tool') | ||
subparsers = parser.add_subparsers(dest='command', required=True) | ||
|
||
# List command |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that a good way to do this would be # MARK:
, which will further put a section in the minimap. I've been starting to do this in my Python code recently, and it's really great!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check out the new version of cli/main.py to see what I mean :) Here's a screenshot of my minimap in VS Code with all the MARK
s showing:
}, | ||
{ | ||
"language": "italian", | ||
"iso": "it", | ||
"qid": "Q652", | ||
"remove-words": ["of", "the", "The", "and", "text", "from"], | ||
"ignore-words": ["The", "ATP"] | ||
"ignore-words": ["The", "ATP"], | ||
"word-types": ["nouns", "verbs", "translations", "emoji_keywords", "prepositions", "autosuggestions"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that we have the prepositions set up for all of the languages :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that the above was a main justification for moving away from this, as it's 100% going to be a thing that will happen again and again as Scribe-Data develops and new word types are added to new languages. Best to just check what's actually in the language_data_extraction
directories.
}, | ||
{ | ||
"language": "spanish", | ||
"iso": "es", | ||
"qid": "Q1321", | ||
"remove-words": ["of", "the", "The", "and"], | ||
"ignore-words": [] | ||
"ignore-words": [], | ||
"word-types": ["nouns", "verbs", "translations", "emoji_keywords", "prepositions", "autosuggestions"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note that I removed the word-types keys given that we're now using the directory structure to derive what can be queried 😊
src/scribe_data/cli/utils.py
Outdated
LANGUAGE_METADATA = load_language_metadata() | ||
LANGUAGE_MAP = {lang['language'].lower(): lang for lang in LANGUAGE_METADATA['languages']} | ||
|
||
def print_formatted_data(data: Union[Dict, List], word_type: str) -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep this copy of print_formatted_data
as I think it makes sense to have it in utils :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mhmohona, is this function needed at this point? This was what you were using for the original outputs? I didn't see print_formatted_data
referenced anywhere when I did my most recent review 🤔
src/scribe_data/cli/query.py
Outdated
|
||
def query_data(language: str = None, word_type: str = None, output_dir: Optional[str] = None, overwrite: bool = False, output_type: Optional[str] = None) -> None: | ||
if not (language and word_type): | ||
print("Error: You must provide both --language (-l) and --word-type (-wt) options.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For places like this, let's raise ValueError(...)
:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This way we don't need to return
afterwards as well, and the error output will be similar to what people expect with the full trace rather than just a print statement.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the hard work here, @mhmohona! I think this and the changes I just sent along set us up very nicely for the next couple of issues :) Lots of work still to be done, but so much progress has already been made!
Some notes here:
- Please read the above comments for the feedback on style and coding conventions :)
- I've added in a
cli/convert.py
file where all of the functionality to change a data type over to another should live - The list functionality is now based on the directory structure of Scribe-Data
- Reason for this is that we don't want to need to update the language metadata every time we implement something as maybe we forget and then ship something that's not functional solely because we didn't update the metadata
- As of now this means that "emoji_keywords" and "autosuggestions" are not showing up as options for the CLI (we can add these back in later!)
- The CLI commands are now fully mapped out based on what we'd discussed, and I've also updated the description and put in an epilog that tells people to come here to the GitHub or to the docs 📝
- We're now returning a table for listing word types as well so that
--list
outputs are consistent
I'll close out what issues can now be closed and write some further notes in the others!
Thanks again for the dedication and great work! 👏
Contributor checklist
Description
list available lang codes
scribe-data ll
scribe-data languages-list
list available word types per lang
scribe-data list-word-types -l German
scribe-data query -l English -wt nouns
scribe-data query -l English -wt verbs
scribe-data query -l English -wt translated_words
Related issue
Fixes
--output-dir
and--overwrite
functionality for a JSON file output #144.csv
and.tsv
--output-type
functionality viaconvert
#146list-languages
andlist-word-types
functionality #148