Merge branch 'main' into AK-Contributions-Emoji-Functionality

scribe-org · Oct 24, 2024 · e3e6870
2 parents 8066f2e + 52c8363
commit e3e6870
Showing 405 changed files with 6,359 additions and 4,872 deletions.
diff --git a/.github/ISSUE_TEMPLATE/documentation.yml b/.github/ISSUE_TEMPLATE/documentation.yml
@@ -0,0 +1,32 @@
+name: 📝 Documentation
+description: Suggest improvements or updates to the documentation of Scribe-Data.
+labels: ["documentation"]
+projects: ["scribe-org/1"]
+body:
+  - type: checkboxes
+    id: doc-enhancement
+    attributes:
+      label: Terms
+      options:
+        - label: I have searched all [open documentation issues](https://github.com/scribe-org/Scribe-Data/issues?q=is%3Aopen+is%3Aissue+label%3Adocumentation)
+          required: true
+        - label: I agree to follow Scribe-Data's [Code of Conduct](https://github.com/scribe-org/Scribe-Data/blob/main/.github/CODE_OF_CONDUCT.md)
+          required: true
+  - type: textarea
+    attributes:
+      label: Current Documentation
+      placeholder: |
+        Provide a brief description or link to the current documentation you want to enhance.
+    validations:
+      required: true
+  - type: textarea
+    attributes:
+      label: Suggested Enhancement
+      placeholder: |
+        Describe the improvements or changes you'd like to see in the documentation.
+    validations:
+      required: true
+  - type: markdown
+    attributes:
+      value: |
+        Thanks for helping improve our documentation!
diff --git a/.github/workflows/check_query_forms.yaml b/.github/workflows/check_query_forms.yaml
@@ -0,0 +1,46 @@
+name: Check Query Forms
+on:
+  push:
+    branches: [main]
+  pull_request:
+    branches: [main]
+    types: [opened, reopened, synchronize]
+
+jobs:
+  format_check:
+    strategy:
+      fail-fast: false
+      matrix:
+        os:
+          - ubuntu-latest
+        python-version:
+          - "3.9"
+
+    runs-on: ${{ matrix.os }}
+
+    name: Run Check Query Forms
+
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+
+      - name: Set up Python ${{ matrix.python-version }}
+        uses: actions/setup-python@v4
+        with:
+          python-version: ${{ matrix.python-version }}
+
+      - name: Add project root to PYTHONPATH
+        run: echo "PYTHONPATH=$(pwd)/src" >> $GITHUB_ENV
+
+      - name: Install dependencies
+        run: |
+          python -m pip install --upgrade pip
+          pip install -r requirements.txt
+
+      - name: Run check_query_forms.py
+        working-directory: ./src/scribe_data/check
+        run: python check_query_forms.py
+
+      - name: Post-run status
+        if: failure()
+        run: echo "Project SPARQL query forms check failed. Please fix the reported errors."
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -16,9 +16,9 @@ Emojis for the following are chosen based on [gitmoji](https://gitmoji.dev/).
 
 - Scribe-Data is now a fully functional CLI.
   - Querying Wikidata lexicographical data can be done via the `--query` command ([#159](https://github.com/scribe-org/Scribe-Data/issues/159)).
-    - The output type of queries can be in JSON, CSV, TSV and SQLite, with conversions output types also being possible ([#145](https://github.com/scribe-org/Scribe-Data/issues/145), [#146](https://github.com/scribe-org/Scribe-Data/issues/146))
-    - Output paths can be set for query results ([#144](https://github.com/scribe-org/Scribe-Data/issues/144)).
-    - The version of the CLI can be printed to the command line and the CLI can further be used to upgrade itself ([#186](https://github.com/scribe-org/Scribe-Data/issues/186), [#157 ](https://github.com/scribe-org/Scribe-Data/issues/157)).
+  - The output type of queries can be in JSON, CSV, TSV and SQLite, with conversions output types also being possible ([#145](https://github.com/scribe-org/Scribe-Data/issues/145), [#146](https://github.com/scribe-org/Scribe-Data/issues/146))
+  - Output paths can be set for query results ([#144](https://github.com/scribe-org/Scribe-Data/issues/144)).
+  - The version of the CLI can be printed to the command line and the CLI can further be used to upgrade itself ([#186](https://github.com/scribe-org/Scribe-Data/issues/186), [#157 ](https://github.com/scribe-org/Scribe-Data/issues/157)).
   - Total Wikidata lexemes for languages and data types can be derived with the `--total` command ([#147](https://github.com/scribe-org/Scribe-Data/issues/147)).
   - Commands can be used via an interactive mode with the `--interactive` command ([#158](https://github.com/scribe-org/Scribe-Data/issues/158)).
 - Articles are removed from machine translations so they're more directly useful in Scribe applications ([#96](https://github.com/scribe-org/Scribe-Data/issues/96)).

diff --git a/README.md b/README.md
@@ -41,7 +41,7 @@ Check out Scribe's [architecture diagrams](https://github.com/scribe-org/Organiz
 
 The CLI commands defined within [scribe_data/cli](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/cli) and the notebooks within the various [scribe_data](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data) directories are used to update all data for [Scribe-iOS](https://github.com/scribe-org/Scribe-iOS), with this functionality later being expanded to update [Scribe-Android](https://github.com/scribe-org/Scribe-Android) and [Scribe-Desktop](https://github.com/scribe-org/Scribe-Desktop) once they're active.
 
-The main data update process in triggers [language based SPARQL queries](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/language_data_extraction) to query language data from [Wikidata](https://www.wikidata.org/) using [SPARQLWrapper](https://github.com/RDFLib/sparqlwrapper) as a URI. The autosuggestion process derives popular words from [Wikipedia](https://www.wikipedia.org/) as well as those words that normally follow them for an effective baseline feature until natural language processing methods are employed. Functions to generate autosuggestions are ran in [gen_autosuggestions.ipynb](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/wikipedia/gen_autosuggestions.ipynb). Emojis are further sourced from [Unicode CLDR](https://github.com/unicode-org/cldr), with this process being ran via the `scribe-data get -lang LANGUAGE -dt emoji-keywords` command.
+The main data update process in triggers [language based SPARQL queries](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/wikidata/language_data_extraction) to query language data from [Wikidata](https://www.wikidata.org/) using [SPARQLWrapper](https://github.com/RDFLib/sparqlwrapper) as a URI. The autosuggestion process derives popular words from [Wikipedia](https://www.wikipedia.org/) as well as those words that normally follow them for an effective baseline feature until natural language processing methods are employed. Functions to generate autosuggestions are ran in [gen_autosuggestions.ipynb](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/wikipedia/gen_autosuggestions.ipynb). Emojis are further sourced from [Unicode CLDR](https://github.com/unicode-org/cldr), with this process being ran via the `scribe-data get -lang LANGUAGE -dt emoji-keywords` command.
 
 <a id="cli-usage"></a>
 
@@ -197,7 +197,7 @@ See the [contribution guidelines](https://github.com/scribe-org/Scribe-Data/blob
 
 # Supported Languages [`⇧`](#contents)
 
-Scribe's goal is functional, feature-rich keyboards and interfaces for all languages. Check the [language_data_extraction](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/language_data_extraction) directory for queries for currently supported languages and those that have substantial data on [Wikidata](https://www.wikidata.org/).
+Scribe's goal is functional, feature-rich keyboards and interfaces for all languages. Check the [language_data_extraction](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/wikidata/language_data_extraction) directory for queries for currently supported languages and those that have substantial data on [Wikidata](https://www.wikidata.org/).
 
 The following table shows the supported languages and the amount of data available for each on [Wikidata](https://www.wikidata.org/) and via [Unicode CLDR](https://github.com/unicode-org/cldr) for emojis:
 

diff --git a/docs/source/_static/CONTRIBUTING.rst b/docs/source/_static/CONTRIBUTING.rst
@@ -16,7 +16,7 @@ Contents
 -  `First steps as a contributor <#first-steps-as-a-contributor>`__
 -  `Learning the tech stack <#learning-the-tech-stack>`__
 -  `Development environment <#development-environment>`__
--  `Issues and projects <#issues-projects>`__
+-  `Issues and projects <#issues-and-projects>`__
 -  `Bug reports <#bug-reports>`__
 -  `Feature requests <#feature-requests>`__
 -  `Pull requests <#pull-requests>`__

diff --git a/docs/source/notes.rst b/docs/source/notes.rst
@@ -1,9 +1,9 @@
-.. mdinclude:: _static/CONTRIBUTING.rst
+.. include:: _static/CONTRIBUTING.rst
 
 License
 =======
 
 .. literalinclude:: ../../LICENSE.txt
     :language: text
 
-.. mdinclude:: ../../CHANGELOG.md
+.. include:: ../../CHANGELOG.md
diff --git a/docs/source/scribe_data/cli.rst b/docs/source/scribe_data/cli.rst
@@ -56,20 +56,22 @@ Example output:
     $ scribe-data list
 
     Language     ISO  QID
-    -----------------------
+    ==========================
     English      en   Q1860
     ...
-    -----------------------
 
     Available data types: All languages
-    -----------------------------------
+    ===================================
     adjectives
     adverbs
     emoji-keywords
     nouns
+    personal-pronouns
+    postpositions
     prepositions
+    proper-nouns
     verbs
-    -----------------------------------
+
 
 
 
@@ -78,46 +80,48 @@ Example output:
     $scribe-data list --language
 
     Language     ISO  QID
-    -----------------------
+    ==========================
     English      en   Q1860
     ...
-    -----------------------
 
 
 .. code-block:: text
 
     $scribe-data list -dt
 
     Available data types: All languages
-    -----------------------------------
+    ===================================
     adjectives
     adverbs
     emoji-keywords
     nouns
+    personal-pronouns
+    postpositions
     prepositions
+    proper-nouns
     verbs
-    -----------------------------------
 
 
 .. code-block:: text
 
     $scribe-data list -a
 
     Language     ISO  QID
-    -----------------------
+    ==========================
     English      en   Q1860
     ...
-    -----------------------
 
     Available data types: All languages
-    -----------------------------------
+    ===================================
     adjectives
     adverbs
     emoji-keywords
     nouns
+    personal-pronouns
+    postpositions
     prepositions
+    proper-nouns
     verbs
-    -----------------------------------
 
 Get Command
 ~~~~~~~~~~~
@@ -137,6 +141,7 @@ Options:
 - ``-dt, --data-type DATA_TYPE``: The data type(s) to get.
 - ``-od, --output-dir OUTPUT_DIR``: The output directory path for results.
 - ``-ot, --output-type {json,csv,tsv}``: The output file type.
+- ``-ope, --outputs-per-entry OUTPUTS_PER_ENTRY``: How many outputs should be generated per data entry.
 - ``-o, --overwrite``: Whether to overwrite existing files (default: False).
 - ``-a, --all ALL``: Get all languages and data types.
 - ``-i, --interactive``: Run in interactive mode.
@@ -257,7 +262,7 @@ Examples:
 
 .. code-block:: text
 
-    $scribe-data total -lang English -dt nouns
+    $scribe-data total -lang English -dt nouns  # verbs, adjectives, etc
     Language: English
     Data type: nouns
     Total number of lexemes: 12345
@@ -278,7 +283,4 @@ Options:
 
 - ``-f, --file FILE``: The file to convert to a new type.
 - ``-ko, --keep-original``: Whether to keep the file to be converted (default: True).
-- ``-json, --to-json TO_JSON``: Convert the file to JSON format.
-- ``-csv, --to-csv TO_CSV``: Convert the file to CSV format.
-- ``-tsv, --to-tsv TO_TSV``: Convert the file to TSV format.
-- ``-sqlite, --to-sqlite TO_SQLITE``: Convert the file to SQLite format.
+- ``-ot, --output-type {json,csv,tsv,sqlite}``: The output file type.
diff --git a/docs/source/scribe_data/index.rst b/docs/source/scribe_data/index.rst
@@ -6,7 +6,6 @@ Scribe-Data
 .. toctree::
     :maxdepth: 2
 
-    language_data_extraction/index
     load/index
     unicode/index
     wikidata/index

diff --git a/docs/source/scribe_data/wikidata/index.rst b/docs/source/scribe_data/wikidata/index.rst
@@ -7,6 +7,7 @@ wikidata/
     :maxdepth: 2
 
     check_query/index
+    language_data_extraction/index
 
 .. toctree::
     :maxdepth: 1

diff --git a/...e_data/language_data_extraction/index.rst b/...kidata/language_data_extraction/index.rst
@@ -1,7 +1,7 @@
 language_data_extraction/
 =========================
 
-`View code on Github <https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/language_data_extraction>`_
+`View code on Github <https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/wikidata/language_data_extraction>`_
 
 This directory contains all language extraction and formatting code for Scribe-Data. The structure is broken down by language, with each language sub-directory then including directories for nouns, prepositions, translations and verbs if needed. Within these data type directories are :code:`query_DATA_TYPE.sparql` SPARQL files that are ran to query Wikidata and then formatted with the given :code:`format_DATA_TYPE.py` Python files.
 

diff --git a/docs/source/scribe_data/wikidata/query_profanity.rst b/docs/source/scribe_data/wikidata/query_profanity.rst
@@ -24,8 +24,7 @@ Queries all profane words from a given language to be removed from autosuggest o
         }.
 
         FILTER EXISTS {?sense wdt:P6191 ?filter.}.
-
-        }
+    }
 
     ORDER BY
         lcase(?lemma)

diff --git a/docs/source/scribe_data/wikipedia/gen_autosuggestions.rst b/docs/source/scribe_data/wikipedia/gen_autosuggestions.rst
@@ -3,9 +3,6 @@ gen_autosuggestions.ipynb
 
 `View code on Github <https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/wikipedia/gen_autosuggestions.ipynb>`_
 
-Scribe Autosuggest Generation
------------------------------
-
 This notebook is used to run the functions found in Scribe-Data to extract, clean and load autosuggestion files into Scribe apps.
 
 Use the :code:`View code on GitHub` link above to view the notebook and explore the process!
diff --git a/src/scribe_data/check/check_project_structure.py b/src/scribe_data/check/check_project_structure.py
@@ -26,17 +26,17 @@
 
 import os
 
-from scribe_data.cli.cli_utils import (
+from scribe_data.utils import (
     LANGUAGE_DATA_EXTRACTION_DIR,
     data_type_metadata,
     language_metadata,
 )
 
 # Expected languages and data types.
-LANGUAGES = [lang.capitalize() for lang in language_metadata.keys()]
+LANGUAGES = list(language_metadata.keys())
 DATA_TYPES = data_type_metadata.keys()
 SUB_DIRECTORIES = {
-    k.capitalize(): [lang.capitalize() for lang in v["sub_languages"].keys()]
+    k: list(v["sub_languages"].keys())
     for k, v in language_metadata.items()
     if len(v.keys()) == 1 and "sub_languages" in v.keys()
 }
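
Review note: the hunk above drops the `.capitalize()` calls, so `LANGUAGES` and `SUB_DIRECTORIES` now carry the metadata keys unchanged. A minimal sketch of the effect, assuming a hypothetical `language_metadata` dict (the real one is imported from `scribe_data.utils` and its contents are not shown in this diff):

```python
# Hypothetical stand-in for scribe_data.utils.language_metadata.
# Only the shape matters here: a language either has its own fields
# or consists solely of a "sub_languages" mapping.
language_metadata = {
    "english": {"iso": "en", "qid": "Q1860"},
    "norwegian": {
        "sub_languages": {
            "bokmål": {"iso": "nb"},
            "nynorsk": {"iso": "nn"},
        }
    },
}

# Same expressions as the "+" lines in the hunk: keys are used as-is.
LANGUAGES = list(language_metadata.keys())
SUB_DIRECTORIES = {
    k: list(v["sub_languages"].keys())
    for k, v in language_metadata.items()
    # Keep only entries whose single key is "sub_languages".
    if len(v.keys()) == 1 and "sub_languages" in v.keys()
}

print(LANGUAGES)
print(SUB_DIRECTORIES)
```

Under this assumption the structure check compares against the keys exactly as they are written in the metadata, so directory names would have to match that casing rather than a capitalized form.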