Merge pull request #454 from catreedle/wikidata
move language_data_extraction under wikidata and lowercase languages
andrewtavis authored Oct 23, 2024
2 parents 399efe2 + 7e0c521 commit 180950d
Showing 397 changed files with 127 additions and 1,619 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -41,7 +41,7 @@ Check out Scribe's [architecture diagrams](https://github.com/scribe-org/Organiz

The CLI commands defined within [scribe_data/cli](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/cli) and the notebooks within the various [scribe_data](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data) directories are used to update all data for [Scribe-iOS](https://github.com/scribe-org/Scribe-iOS), with this functionality later being expanded to update [Scribe-Android](https://github.com/scribe-org/Scribe-Android) and [Scribe-Desktop](https://github.com/scribe-org/Scribe-Desktop) once they're active.

- The main data update process in triggers [language based SPARQL queries](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/language_data_extraction) to query language data from [Wikidata](https://www.wikidata.org/) using [SPARQLWrapper](https://github.com/RDFLib/sparqlwrapper) as a URI. The autosuggestion process derives popular words from [Wikipedia](https://www.wikipedia.org/) as well as those words that normally follow them for an effective baseline feature until natural language processing methods are employed. Functions to generate autosuggestions are ran in [gen_autosuggestions.ipynb](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/wikipedia/gen_autosuggestions.ipynb). Emojis are further sourced from [Unicode CLDR](https://github.com/unicode-org/cldr), with this process being ran via the `scribe-data get -lang LANGUAGE -dt emoji-keywords` command.
+ The main data update process in triggers [language based SPARQL queries](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/wikidata/language_data_extraction) to query language data from [Wikidata](https://www.wikidata.org/) using [SPARQLWrapper](https://github.com/RDFLib/sparqlwrapper) as a URI. The autosuggestion process derives popular words from [Wikipedia](https://www.wikipedia.org/) as well as those words that normally follow them for an effective baseline feature until natural language processing methods are employed. Functions to generate autosuggestions are ran in [gen_autosuggestions.ipynb](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/wikipedia/gen_autosuggestions.ipynb). Emojis are further sourced from [Unicode CLDR](https://github.com/unicode-org/cldr), with this process being ran via the `scribe-data get -lang LANGUAGE -dt emoji-keywords` command.

<a id="cli-usage"></a>

@@ -197,7 +197,7 @@ See the [contribution guidelines](https://github.com/scribe-org/Scribe-Data/blob

# Supported Languages [`⇧`](#contents)

- Scribe's goal is functional, feature-rich keyboards and interfaces for all languages. Check the [language_data_extraction](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/language_data_extraction) directory for queries for currently supported languages and those that have substantial data on [Wikidata](https://www.wikidata.org/).
+ Scribe's goal is functional, feature-rich keyboards and interfaces for all languages. Check the [language_data_extraction](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/wikidata/language_data_extraction) directory for queries for currently supported languages and those that have substantial data on [Wikidata](https://www.wikidata.org/).
The following table shows the supported languages and the amount of data available for each on [Wikidata](https://www.wikidata.org/) and via [Unicode CLDR](https://github.com/unicode-org/cldr) for emojis:
3 changes: 0 additions & 3 deletions docs/source/conf.py
@@ -40,11 +40,8 @@
"numpydoc",
"sphinx.ext.viewcode",
"sphinx.ext.imgmath",
- "nbsphinx",
]

- nbsphinx_allow_errors = True
- nbsphinx_execute = "never"
numpydoc_show_inherited_class_members = False
numpydoc_show_class_members = False

1 change: 0 additions & 1 deletion docs/source/scribe_data/index.rst
@@ -6,7 +6,6 @@ Scribe-Data
.. toctree::
:maxdepth: 2

- language_data_extraction/index
load/index
unicode/index
wikidata/index
1 change: 1 addition & 0 deletions docs/source/scribe_data/wikidata/index.rst
@@ -7,6 +7,7 @@ wikidata/
:maxdepth: 2

check_query/index
+ language_data_extraction/index

.. toctree::
:maxdepth: 1
@@ -1,7 +1,7 @@
language_data_extraction/
=========================

- `View code on Github <https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/language_data_extraction>`_
+ `View code on Github <https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/wikidata/language_data_extraction>`_

This directory contains all language extraction and formatting code for Scribe-Data. The structure is broken down by language, with each language sub-directory then including directories for nouns, prepositions, translations and verbs if needed. Within these data type directories are :code:`query_DATA_TYPE.sparql` SPARQL files that are ran to query Wikidata and then formatted with the given :code:`format_DATA_TYPE.py` Python files.
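
The layout described above, combined with this commit's move under ``wikidata/`` and the lowercased language directories, can be sketched as a small path builder. This is only an illustration: ``query_file_path`` and ``BASE_DIR`` are hypothetical names that are not part of Scribe-Data, and the sketch assumes exactly one ``query_DATA_TYPE.sparql`` file per data type directory.

```python
from pathlib import Path

# Hypothetical constant (not part of Scribe-Data): the directory location
# after this commit moved language_data_extraction under wikidata/.
BASE_DIR = Path("src/scribe_data/wikidata/language_data_extraction")


def query_file_path(language: str, data_type: str) -> Path:
    """Return the assumed location of query_DATA_TYPE.sparql for a language."""
    # Language directories are lowercased by this commit, so normalize the name.
    return BASE_DIR / language.lower() / data_type / f"query_{data_type}.sparql"


print(query_file_path("English", "nouns").as_posix())
```

Lowercasing inside the helper means callers can pass either ``English`` or ``english``. Multi-word language names and data types with more than one query file are not handled here.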

4 changes: 0 additions & 4 deletions docs/source/scribe_data/wikipedia/gen_autosuggestions.rst
@@ -5,8 +5,4 @@ gen_autosuggestions.ipynb

This notebook is used to run the functions found in Scribe-Data to extract, clean and load autosuggestion files into Scribe apps.

.. toctree::

notebook.ipynb

Use the :code:`View code on GitHub` link above to view the notebook and explore the process!
308 changes: 0 additions & 308 deletions docs/source/scribe_data/wikipedia/notebook.ipynb

This file was deleted.

1 change: 0 additions & 1 deletion requirements.txt
@@ -6,7 +6,6 @@ flax>=0.8.2
iso639-lang>=2.2.3
m2r2>=0.3.3
mwparserfromhell>=0.6
- nbsphinx>=0.9.5
numpydoc>=1.6.0
packaging>=20.9
pandas>=1.5.3