Commit 06e60a6

Retarget all processes to data export directory in root

andrewtavis committed Jun 2, 2024
1 parent 7553fd6 commit 06e60a6
Showing 67 changed files with 61 additions and 50 deletions.
6 changes: 4 additions & 2 deletions CHANGELOG.md
@@ -10,7 +10,7 @@ Scribe-Data tries to follow [semantic versioning](https://semver.org/), a MAJOR.

Emojis for the following are chosen based on [gitmoji](https://gitmoji.dev/).

## [Upcoming] Scribe-Data 3.3.0
## [Upcoming] Scribe-Data 4.0.0

### ✨ Features

@@ -29,11 +29,13 @@ Emojis for the following are chosen based on [gitmoji](https://gitmoji.dev/).
- The `_update_files` directory was renamed `update_files` as these files are used in non-internal manners now ([#57](https://github.com/scribe-org/Scribe-Data/issues/57)).
- A common function has been created to map Wikidata ids to noun genders ([#69](https://github.com/scribe-org/Scribe-Data/issues/69)).
- The project now is installed locally for development and command line usage, so usages of `sys.path` have been removed from files ([#122](https://github.com/scribe-org/Scribe-Data/issues/122)).
- The directory structure has been dramatically streamlined and includes folders for future projects where language data could come from (Wiktionary).
- The directory structure has been dramatically streamlined and includes folders for future projects where language data could come from other sources like Wiktionary ([#139](https://github.com/scribe-org/Scribe-Data/issues/139)).
- Translation files are moved to their own directory.
- The `extract_transform` directory has been removed and all files within it have been moved one level up.
- The `languages` directory has been renamed `language_data_extraction`.
- All files within `wikidata/_resources` have been moved to the `resources` directory.
- The gender and case annotations for data formatting have now been commonly defined.
- All language directory `formatted_data` files have now been moved to the `language_data_export` directory to prepare for outputs being directed to a directory outside of the package.

## Scribe-Data 3.2.2

2 changes: 1 addition & 1 deletion docs/source/scribe_data/language_data_extraction/index.rst
@@ -3,6 +3,6 @@ language_data_extraction

`View code on Github <https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/language_data_extraction>`_

This directory contains all language extraction and formatting code for Scribe-Data. The structure is broken down by language, with each language sub-directory then including directories for nouns, prepositions, translations and verbs if needed. Within these word type directories are :code:`query_WORD_TYPE.sparql` SPARQL files that are run to query Wikidata and then formatted with the given :code:`format_WORD_TYPE.py` Python files. Included in each language sub-directory is also a :code:`formatted_data` directory that includes the outputs of all word type query and formatting processes.
This directory contains all language extraction and formatting code for Scribe-Data. The structure is broken down by language, with each language sub-directory then including directories for nouns, prepositions, translations and verbs if needed. Within these word type directories are :code:`query_WORD_TYPE.sparql` SPARQL files that are run to query Wikidata and then formatted with the given :code:`format_WORD_TYPE.py` Python files.

Use the :code:`View code on GitHub` link above to view the directory and explore the process!
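The per-language layout described above can be sketched with a small helper that resolves the query and formatter files for one word type. The helper itself is illustrative and not part of the package; only the `query_WORD_TYPE.sparql` / `format_WORD_TYPE.py` naming convention comes from the docs.

```python
import os

def word_type_files(base_dir: str, language: str, word_type: str) -> dict:
    """Resolve the SPARQL query and Python formatter for one word type,
    following the language_data_extraction layout described above."""
    type_dir = os.path.join(base_dir, language, word_type)
    return {
        "query": os.path.join(type_dir, f"query_{word_type}.sparql"),
        "formatter": os.path.join(type_dir, f"format_{word_type}.py"),
    }

# Hypothetical usage for French nouns:
files = word_type_files("language_data_extraction", "French", "nouns")
```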
16 files renamed without changes.
@@ -8,6 +8,7 @@

import json
import os
import sys

from scribe_data.translation.translation_utils import (
translate_to_other_languages,
@@ -24,8 +25,10 @@

translations = {}
translated_words_path = os.path.join(
translate_script_dir, "../formatted_data/translated_words.json"
translate_script_dir,
f"{os.path.dirname(sys.path[0]).split('scribe_data')[0]}/../language_data_export/{SRC_LANG}/translated_words.json",
)

if os.path.exists(translated_words_path):
with open(translated_words_path, "r", encoding="utf-8") as file:
translations = json.load(file)
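The `os.path.dirname(sys.path[0]).split('scribe_data')[0]` expression repeated throughout this commit anchors output paths at the repository root rather than inside the package. A rough sketch of the idea, with a hypothetical helper and an assumed checkout location (not the package's actual API):

```python
import os

def language_export_path(script_dir: str, language: str, filename: str) -> str:
    # Everything before the first "scribe_data" package segment is the
    # src/ directory of the checkout; stepping up one level from it
    # reaches the repository-root language_data_export directory.
    package_root = script_dir.split("scribe_data")[0]
    return f"{package_root}/../language_data_export/{language}/{filename}"

path = language_export_path(
    "/home/user/Scribe-Data/src/scribe_data/translation",  # assumed location
    "French",
    "translated_words.json",
)
# Normalizing collapses src/../ into the repository root.
resolved = os.path.normpath(path)
```

Note that because the computed path is effectively absolute, passing it as a later argument to `os.path.join` (as the diff does) discards the preceding `translate_script_dir` component.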
@@ -21,7 +21,6 @@
file_path=file_path, language=LANGUAGE, data_type=DATA_TYPE
)


nouns_formatted = {}

for noun_vals in nouns_list:
@@ -8,6 +8,7 @@

import json
import os
import sys

from scribe_data.translation.translation_utils import (
translate_to_other_languages,
@@ -24,7 +25,8 @@

translations = {}
translated_words_path = os.path.join(
translate_script_dir, "../formatted_data/translated_words.json"
translate_script_dir,
f"{os.path.dirname(sys.path[0]).split('scribe_data')[0]}/../language_data_export/{SRC_LANG}/translated_words.json",
)
if os.path.exists(translated_words_path):
with open(translated_words_path, "r", encoding="utf-8") as file:
@@ -21,7 +21,6 @@
file_path=file_path, language=LANGUAGE, data_type=DATA_TYPE
)


nouns_formatted = {}

for noun_vals in nouns_list:
@@ -8,6 +8,7 @@

import json
import os
import sys

from scribe_data.translation.translation_utils import (
translate_to_other_languages,
@@ -24,7 +25,8 @@

translations = {}
translated_words_path = os.path.join(
translate_script_dir, "../formatted_data/translated_words.json"
translate_script_dir,
f"{os.path.dirname(sys.path[0]).split('scribe_data')[0]}/../language_data_export/{SRC_LANG}/translated_words.json",
)
if os.path.exists(translated_words_path):
with open(translated_words_path, "r", encoding="utf-8") as file:
@@ -21,7 +21,6 @@
file_path=file_path, language=LANGUAGE, data_type=DATA_TYPE
)


nouns_formatted = {}

for noun_vals in nouns_list:
@@ -8,6 +8,7 @@

import json
import os
import sys

from scribe_data.translation.translation_utils import (
translate_to_other_languages,
@@ -24,7 +25,8 @@

translations = {}
translated_words_path = os.path.join(
translate_script_dir, "../formatted_data/translated_words.json"
translate_script_dir,
f"{os.path.dirname(sys.path[0]).split('scribe_data')[0]}/../language_data_export/{SRC_LANG}/translated_words.json",
)
if os.path.exists(translated_words_path):
with open(translated_words_path, "r", encoding="utf-8") as file:
@@ -21,7 +21,6 @@
file_path=file_path, language=LANGUAGE, data_type=DATA_TYPE
)


nouns_formatted = {}

for noun_vals in nouns_list:
@@ -8,6 +8,7 @@

import json
import os
import sys

from scribe_data.translation.translation_utils import (
translate_to_other_languages,
@@ -24,7 +25,8 @@

translations = {}
translated_words_path = os.path.join(
translate_script_dir, "../formatted_data/translated_words.json"
translate_script_dir,
f"{os.path.dirname(sys.path[0]).split('scribe_data')[0]}/../language_data_export/{SRC_LANG}/translated_words.json",
)
if os.path.exists(translated_words_path):
with open(translated_words_path, "r", encoding="utf-8") as file:
@@ -8,6 +8,7 @@

import json
import os
import sys

from scribe_data.translation.translation_utils import (
translate_to_other_languages,
@@ -24,7 +25,8 @@

translations = {}
translated_words_path = os.path.join(
translate_script_dir, "../formatted_data/translated_words.json"
translate_script_dir,
f"{os.path.dirname(sys.path[0]).split('scribe_data')[0]}/../language_data_export/{SRC_LANG}/translated_words.json",
)
if os.path.exists(translated_words_path):
with open(translated_words_path, "r", encoding="utf-8") as file:
@@ -8,6 +8,7 @@

import json
import os
import sys

from scribe_data.translation.translation_utils import (
translate_to_other_languages,
@@ -24,7 +25,8 @@

translations = {}
translated_words_path = os.path.join(
translate_script_dir, "../formatted_data/translated_words.json"
translate_script_dir,
f"{os.path.dirname(sys.path[0]).split('scribe_data')[0]}/../language_data_export/{SRC_LANG}/translated_words.json",
)
if os.path.exists(translated_words_path):
with open(translated_words_path, "r", encoding="utf-8") as file:
@@ -21,7 +21,6 @@
file_path=file_path, language=LANGUAGE, data_type=DATA_TYPE
)


nouns_formatted = {}

for noun_vals in nouns_list:
@@ -8,6 +8,7 @@

import json
import os
import sys

from scribe_data.translation.translation_utils import (
translate_to_other_languages,
@@ -24,7 +25,8 @@

translations = {}
translated_words_path = os.path.join(
translate_script_dir, "../formatted_data/translated_words.json"
translate_script_dir,
f"{os.path.dirname(sys.path[0]).split('scribe_data')[0]}/../language_data_export/{SRC_LANG}/translated_words.json",
)
if os.path.exists(translated_words_path):
with open(translated_words_path, "r", encoding="utf-8") as file:
8 changes: 6 additions & 2 deletions src/scribe_data/load/data_to_sqlite.py
@@ -74,7 +74,9 @@
language_word_type_dict = {
lang: [
f.split(".json")[0]
for f in os.listdir(f"{PATH_TO_LANGUAGE_DIRS}{lang}/formatted_data")
for f in os.listdir(
f"{os.path.dirname(sys.path[0]).split('scribe_data')[0]}/../language_data_export/{lang}"
)
if f.split(".json")[0] in word_types
]
for lang in languages_update
@@ -139,7 +141,9 @@ def table_insert(word_type, keys):
for wt in language_word_type_dict[lang]:
print(f"Creating {lang} {wt} table...")
json_data = json.load(
open(f"{PATH_TO_LANGUAGE_DIRS}{lang}/formatted_data/{wt}.json")
open(
f"{os.path.dirname(sys.path[0]).split('scribe_data')[0]}/../language_data_export/{lang}/{wt}.json"
)
)

if wt == "nouns":
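The table-creation loop above reads each `language_data_export/<lang>/<word type>.json` file and inserts its entries into SQLite. A minimal self-contained sketch of that load step, using an in-memory database and invented sample data in place of the real export files:

```python
import json
import sqlite3

# Invented sample standing in for a language_data_export/<lang>/nouns.json file.
nouns_json = '{"Haus": {"plural": "Häuser", "form": "N"}}'
nouns = json.loads(nouns_json)

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE nouns (noun TEXT PRIMARY KEY, plural TEXT, form TEXT)")
for noun, vals in nouns.items():
    # Parameter substitution avoids quoting issues in word data.
    cursor.execute(
        "INSERT INTO nouns VALUES (?, ?, ?)",
        (noun, vals["plural"], vals["form"]),
    )
conn.commit()

rows = cursor.execute("SELECT * FROM nouns").fetchall()
```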
7 changes: 4 additions & 3 deletions src/scribe_data/translation/translation_utils.py
@@ -3,12 +3,13 @@
"""

import json
import os
import signal
import sys

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

from scribe_data.utils import (
get_language_dir_path,
get_language_iso,
get_target_langcodes,
)
@@ -31,7 +32,7 @@ def translation_interrupt_handler(source_language, translations):
)

with open(
f"{get_language_dir_path(source_language)}/formatted_data/translated_words.json",
f"{os.path.dirname(sys.path[0]).split('scribe_data')[0]}/../language_data_export/{source_language}/translated_words.json",
"w",
encoding="utf-8",
) as file:
@@ -90,7 +91,7 @@ def translate_to_other_languages(source_language, word_list, translations, batch
print(f"Batch {i//batch_size + 1} translation completed.")

with open(
f"{get_language_dir_path(source_language)}/formatted_data/translated_words.json",
f"{os.path.dirname(sys.path[0]).split('scribe_data')[0]}/../language_data_export/{source_language}/translated_words.json",
"w",
encoding="utf-8",
) as file:
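`translation_interrupt_handler` above persists whatever has been translated so far when a long batch job is interrupted, so work is not lost. A stripped-down sketch of the save-on-interrupt pattern; the paths and helper names here are illustrative, not the module's API:

```python
import json
import os
import signal
import sys
import tempfile

def save_partial_translations(path: str, translations: dict) -> None:
    # Mirror the handler's behavior: dump whatever exists so far.
    with open(path, "w", encoding="utf-8") as file:
        json.dump(translations, file, ensure_ascii=False, indent=0)

def make_handler(path: str, translations: dict):
    def handler(signum, frame):
        save_partial_translations(path, translations)
        sys.exit(0)  # exit cleanly after saving
    return handler

out_path = os.path.join(tempfile.mkdtemp(), "translated_words.json")
translations = {"bonjour": {"en": "hello"}}
signal.signal(signal.SIGINT, make_handler(out_path, translations))

# Simulate the save the real handler performs on Ctrl-C.
save_partial_translations(out_path, translations)
with open(out_path, encoding="utf-8") as file:
    saved = json.load(file)
```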
11 changes: 8 additions & 3 deletions src/scribe_data/unicode/process_unicode.py
@@ -5,6 +5,8 @@
import csv
import fileinput
import json
import os
import sys
from importlib.resources import files

import emoji
@@ -54,7 +56,7 @@ def gen_emoji_lexicon(
Whether to export whether the emoji is a base character as well as its rank.
update_local_data : bool (default=False)
Saves the created dictionaries as JSONs in the local formatted_data directories.
Saves the created dictionaries as JSONs in the target directories.
verbose : bool (default=True)
Whether to show a tqdm progress bar for the process.
@@ -167,7 +169,10 @@ def gen_emoji_lexicon(
)

# Check nouns files for plurals and update their data with the emojis for their singular forms.
with open(f"./{language}/formatted_data/nouns.json", encoding="utf-8") as f:
with open(
f"{os.path.dirname(sys.path[0]).split('scribe_data')[0]}/../language_data_export/{language}/nouns.json",
encoding="utf-8",
) as f:
noun_data = json.load(f)

plurals_to_singulars_dict = {
@@ -209,7 +214,7 @@
if update_local_data:
path_to_formatted_data = (
get_path_from_wikidata_dir()
+ f"/Scribe-Data/src/scribe_data/language_data_extraction/{language.capitalize()}/formatted_data/emoji_keywords.json"
+ f"{os.path.dirname(sys.path[0]).split('scribe_data')[0]}/../language_data_export/{language}/emoji_keywords.json"
)

with open(path_to_formatted_data, "w", encoding="utf-8") as file:
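The nouns check above lets plural forms inherit the emoji keywords of their singulars. The mapping step can be sketched like this, with a made-up noun dictionary in place of a real `nouns.json`; the `"isPlural"` marker is an assumption for nouns without a distinct plural:

```python
# Made-up noun data in the shape the formatting scripts produce:
# each entry may carry a "plural" field naming its plural form.
noun_data = {
    "cat": {"plural": "cats"},
    "mouse": {"plural": "mice"},
    "water": {"plural": "isPlural"},  # assumed marker for no distinct plural
}

# Invert the mapping: plural surface form -> its singular.
plurals_to_singulars = {
    vals["plural"]: noun
    for noun, vals in noun_data.items()
    if vals.get("plural") not in (None, "isPlural")
}

emoji_keywords = {"cat": ["🐱"], "mouse": ["🐭"]}
# Plurals inherit their singular's emoji keywords.
for plural, singular in plurals_to_singulars.items():
    if singular in emoji_keywords:
        emoji_keywords[plural] = emoji_keywords[singular]
```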
29 changes: 7 additions & 22 deletions src/scribe_data/utils.py
@@ -217,24 +217,6 @@ def get_language_words_to_ignore(language: str) -> list[str]:
)


def get_language_dir_path(language):
"""
Returns the directory path for a specific language within the Scribe-Data project.
Parameters
----------
language : str
The language for which the directory path is needed.
Returns
-------
str
The directory path for the specified language.
"""
PATH_TO_SCRIBE_ORG = os.path.dirname(sys.path[0]).split("Scribe-Data")[0]
return f"{PATH_TO_SCRIBE_ORG}/Scribe-Data/src/scribe_data/language_data_extraction/{language}"


def load_queried_data(file_path, language, data_type):
"""
Loads queried data from a JSON file for a specific language and data type.
@@ -261,7 +243,9 @@ def load_queried_data(file_path, language, data_type):
data_path = queried_data_file
else:
update_data_in_use = True
data_path = f"{get_language_dir_path(language)}/{data_type}/{queried_data_file}"
PATH_TO_SCRIBE_ORG = os.path.dirname(sys.path[0]).split("Scribe-Data")[0]
LANG_DIR_PATH = f"{PATH_TO_SCRIBE_ORG}/Scribe-Data/src/scribe_data/language_data_extraction/{language}"
data_path = f"{LANG_DIR_PATH}/{data_type}/{queried_data_file}"

with open(data_path, encoding="utf-8") as f:
return json.load(f), update_data_in_use, data_path
@@ -287,14 +271,15 @@ def export_formatted_data(formatted_data, update_data_in_use, language, data_typ
None
"""
if update_data_in_use:
export_path = (
f"{get_language_dir_path(language)}/formatted_data/{data_type}.json"
)
PATH_TO_SCRIBE_ORG = os.path.dirname(sys.path[0]).split("Scribe-Data")[0]
export_path = f"{PATH_TO_SCRIBE_ORG}/Scribe-Data/src/language_data_export/{language}/{data_type}.json"

else:
export_path = f"{data_type}.json"

with open(export_path, "w", encoding="utf-8") as file:
json.dump(formatted_data, file, ensure_ascii=False, indent=0)

print(f"Wrote file {data_type}.json with {len(formatted_data):,} {data_type}.")


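`export_formatted_data` above writes either into the repository-level export directory or into the current directory, depending on `update_data_in_use`. A reduced sketch of that branch; the directory names follow the diff, but the function signature here is simplified for illustration (the export root is passed as an argument instead of being derived from `sys.path`):

```python
import json
import os
import tempfile

def export_formatted_data(formatted_data, update_data_in_use, export_root, language, data_type):
    if update_data_in_use:
        # Write into language_data_export/<language>/<data_type>.json.
        export_path = os.path.join(export_root, language, f"{data_type}.json")
        os.makedirs(os.path.dirname(export_path), exist_ok=True)
    else:
        # Fall back to the current working directory.
        export_path = f"{data_type}.json"

    with open(export_path, "w", encoding="utf-8") as file:
        json.dump(formatted_data, file, ensure_ascii=False, indent=0)
    return export_path

root = os.path.join(tempfile.mkdtemp(), "language_data_export")
out = export_formatted_data({"Haus": {"form": "N"}}, True, root, "German", "nouns")
```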
4 changes: 2 additions & 2 deletions src/scribe_data/wikidata/update_data.py
@@ -215,9 +215,9 @@
f"python {PATH_TO_LANGUAGE_EXTRACTION_FILES}/{lang}/{target_type}/format_{target_type}.py"
)

# Check current data within for formatted_data directories.
# Check current data within formatted data directories.
with open(
f"{PATH_TO_LANGUAGE_EXTRACTION_FILES}/{lang.capitalize()}/formatted_data/{target_type}.json",
f"{os.path.dirname(sys.path[0]).split('scribe_data')[0]}/../language_data_export/{lang.capitalize()}/{target_type}.json",
encoding="utf-8",
) as json_file:
new_keyboard_data = json.load(json_file)