#75 Italian translation process and reorder directory structure
andrewtavis committed Mar 24, 2024
1 parent 02f220e commit 2b72e64
Showing 22 changed files with 213 additions and 121 deletions.
7 changes: 3 additions & 4 deletions CHANGELOG.md
@@ -12,10 +12,7 @@ Emojis for the following are chosen based on [gitmoji](https://gitmoji.dev/).

## [Upcoming] Scribe-Data 3.3.0

<!-- - The translation process has been updated to allow for translations from non-English languages.
- Scribe-Data now outputs an SQLite table that has keys for target languages for each base language. -->
<!-- - English has been added to the data ETL process. -->

- The translation process has been updated to allow for translations from non-English languages ([#72](https://github.com/scribe-org/Scribe-Data/issues/72), [#73](https://github.com/scribe-org/Scribe-Data/issues/73), [#74](https://github.com/scribe-org/Scribe-Data/issues/74), [#75](https://github.com/scribe-org/Scribe-Data/issues/75), [#76](https://github.com/scribe-org/Scribe-Data/issues/76), [#77](https://github.com/scribe-org/Scribe-Data/issues/77), [#78](https://github.com/scribe-org/Scribe-Data/issues/78), [#79](https://github.com/scribe-org/Scribe-Data/issues/79)).
- The documentation has been given a new layout with the logo in the top left ([#90](https://github.com/scribe-org/Scribe-Data/issues/90)).
- The documentation now has links to the code at the top of each page ([#91](https://github.com/scribe-org/Scribe-Data/issues/91)).

@@ -25,6 +22,8 @@ Emojis for the following are chosen based on [gitmoji](https://gitmoji.dev/).
- A Ruff based GitHub workflow was added to check the code formatting and lint the codebase on each pull request ([#109](https://github.com/scribe-org/Scribe-Data/issues/109)).
- The `_update_files` directory was renamed `update_files` as these files are used in non-internal manners now ([#57](https://github.com/scribe-org/Scribe-Data/issues/57)).
- A common function has been created to map Wikidata ids to noun genders ([#69](https://github.com/scribe-org/Scribe-Data/issues/69)).
- Files in the `extract_transform` directory were moved based on whether they access Wikidata, Wikipedia or Unicode.
- Translation files were further moved to their own directory.

## Scribe-Data 3.2.2

10 changes: 5 additions & 5 deletions README.md
@@ -15,7 +15,7 @@

## Wikidata and Wikipedia language data extraction

**Scribe-Data** contains the scripts for extracting and formatting data from [Wikidata](https://www.wikidata.org/) and [Wikipedia](https://www.wikipedia.org/) for Scribe applications. Updates to the language keyboard and interface data can be done using [scribe_data/load/update_data.py](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/load/update_data.py) and the notebooks within the [scribe_data/load](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/load) directory.
**Scribe-Data** contains the scripts for extracting and formatting data from [Wikidata](https://www.wikidata.org/) and [Wikipedia](https://www.wikipedia.org/) for Scribe applications. Updates to the language keyboard and interface data can be done using [scribe_data/extract_transform/wikidata/update_data.py](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/extract_transform/wikidata/update_data.py) and the notebooks within the [scribe_data/load](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/load) directory.

> [!NOTE]\
> The [contributing](#contributing) section has information for those interested, with the articles and presentations in [featured by](#featured-by) also being good resources for learning more about Scribe.
@@ -38,14 +38,14 @@ Check out Scribe's [architecture diagrams](https://github.com/scribe-org/Organiz

# Process [`⇧`](#contents)

[scribe_data/extract_transform/update_data.py](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/extract_transform/update_data.py) and the notebooks within the [scribe_data/extract_transform](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/extract_transform) directory are used to update all data for [Scribe-iOS](https://github.com/scribe-org/Scribe-iOS), with this functionality later being expanded to update [Scribe-Android](https://github.com/scribe-org/Scribe-Android) and [Scribe-Desktop](https://github.com/scribe-org/Scribe-Desktop) when they're active.
[scribe_data/extract_transform/wikidata/update_data.py](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/extract_transform/wikidata/update_data.py) and the notebooks within the [scribe_data/extract_transform](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/extract_transform) directory are used to update all data for [Scribe-iOS](https://github.com/scribe-org/Scribe-iOS), with this functionality later being expanded to update [Scribe-Android](https://github.com/scribe-org/Scribe-Android) and [Scribe-Desktop](https://github.com/scribe-org/Scribe-Desktop) when they're active.

The main data update process in [update_data.py](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/extract_transform/update_data.py) triggers [SPARQL queries](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/extract_transform/languages) to query language data from [Wikidata](https://www.wikidata.org/) using [SPARQLWrapper](https://github.com/RDFLib/sparqlwrapper) as a URI. The autosuggestion process derives popular words from [Wikipedia](https://www.wikipedia.org/) as well as those words that normally follow them for an effective baseline feature until natural language processing methods are employed. Functions to generate autosuggestions are run in [gen_autosuggestions.ipynb](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/extract_transform/wikipedia/gen_autosuggestions.ipynb). Emojis are further sourced from [Unicode CLDR](https://github.com/unicode-org/cldr), with this process being run in [gen_emoji_lexicon.ipynb](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/extract_transform/unicode/gen_emoji_lexicon.ipynb).
The main data update process in [update_data.py](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/extract_transform/wikidata/update_data.py) triggers [SPARQL queries](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/extract_transform/languages) to query language data from [Wikidata](https://www.wikidata.org/) using [SPARQLWrapper](https://github.com/RDFLib/sparqlwrapper) as a URI. The autosuggestion process derives popular words from [Wikipedia](https://www.wikipedia.org/) as well as those words that normally follow them for an effective baseline feature until natural language processing methods are employed. Functions to generate autosuggestions are run in [gen_autosuggestions.ipynb](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/extract_transform/wikipedia/gen_autosuggestions.ipynb). Emojis are further sourced from [Unicode CLDR](https://github.com/unicode-org/cldr), with this process being run in [gen_emoji_lexicon.ipynb](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/extract_transform/unicode/gen_emoji_lexicon.ipynb).

Running [update_data.py](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/extract_transform/update_data.py) is done via the following CLI command:
Running [update_data.py](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/extract_transform/wikidata/update_data.py) is done via the following CLI command:

```bash
python3 src/scribe_data/extract_transform/update_data.py
python3 src/scribe_data/extract_transform/wikidata/update_data.py
```
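
For orientation, the following is a minimal sketch of how such a query can be sent to [Wikidata](https://www.wikidata.org/) with SPARQLWrapper. The lexeme query here is an illustrative stand-in, not one of the repository's query files:

```python
# Minimal sketch of the SPARQL querying step. The query below is an
# illustrative stand-in, not one of Scribe-Data's actual query files.
from SPARQLWrapper import JSON, SPARQLWrapper

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery(
    """
    SELECT ?lexeme ?lemma WHERE {
      ?lexeme dct:language wd:Q188;  # lexemes in German (Q188)
              wikibase:lemma ?lemma.
    }
    LIMIT 5
    """
)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["lemma"]["value"])
```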

The ultimate goal is that this repository will house language packs that are periodically updated with new [Wikidata](https://www.wikidata.org/) lexicographical data and data from other sources. These packs would then be available to download by users of Scribe applications.
@@ -14,7 +14,7 @@
PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src"
sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC)

from scribe_data.utils import translate_to_other_languages # noqa: E402
from scribe_data.extract_transform.translation.translation_utils import (  # noqa: E402
    translate_to_other_languages,
)

SRC_LANG = "English"
translate_script_dir = os.path.dirname(os.path.abspath(__file__))
@@ -14,7 +14,7 @@
PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src"
sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC)

from scribe_data.utils import translate_to_other_languages # noqa: E402
from scribe_data.extract_transform.translation.translation_utils import (  # noqa: E402
    translate_to_other_languages,
)

SRC_LANG = "French"
translate_script_dir = os.path.dirname(os.path.abspath(__file__))
@@ -14,7 +14,7 @@
PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src"
sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC)

from scribe_data.utils import translate_to_other_languages # noqa: E402
from scribe_data.extract_transform.translation.translation_utils import (  # noqa: E402
    translate_to_other_languages,
)

SRC_LANG = "German"
translate_script_dir = os.path.dirname(os.path.abspath(__file__))
@@ -0,0 +1,43 @@
"""
Translates the Italian words queried from Wikidata to all other Scribe languages.
Example
-------
python3 src/scribe_data/extract_transform/languages/Italian/translations/translate_words.py
"""

import json
import os
import sys

PATH_TO_SCRIBE_ORG = os.path.dirname(sys.path[0]).split("Scribe-Data")[0]
PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src"
sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC)

from scribe_data.extract_transform.translation.translation_utils import ( # noqa: E402
translate_to_other_languages,
)

SRC_LANG = "Italian"
translate_script_dir = os.path.dirname(os.path.abspath(__file__))
words_to_translate_path = os.path.join(translate_script_dir, "words_to_translate.json")

with open(words_to_translate_path, "r", encoding="utf-8") as file:
json_data = json.load(file)

word_list = [item["word"] for item in json_data]

translations = {}
translated_words_path = os.path.join(
translate_script_dir, "../formatted_data/translated_words.json"
)
if os.path.exists(translated_words_path):
with open(translated_words_path, "r", encoding="utf-8") as file:
translations = json.load(file)

translate_to_other_languages(
source_language=SRC_LANG,
word_list=word_list,
translations=translations,
batch_size=100,
)
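
A note on the input this script expects: `words_to_translate.json` should be a list of objects, each with a `word` key. A hypothetical stub for local testing could be generated like so (the words below are placeholders, not Scribe-Data query output):

```python
# Hypothetical stub generator for words_to_translate.json. The words are
# placeholders for illustration, not actual Wikidata query results.
import json

stub_words = [{"word": w} for w in ["ciao", "grazie", "acqua"]]

with open("words_to_translate.json", "w", encoding="utf-8") as file:
    json.dump(stub_words, file, ensure_ascii=False, indent=4)
```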
@@ -14,7 +14,7 @@
PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src"
sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC)

from scribe_data.utils import translate_to_other_languages # noqa: E402
from scribe_data.extract_transform.translation.translation_utils import (  # noqa: E402
    translate_to_other_languages,
)

SRC_LANG = "Portuguese"
translate_script_dir = os.path.dirname(os.path.abspath(__file__))
@@ -14,7 +14,7 @@
PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src"
sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC)

from scribe_data.utils import translate_to_other_languages # noqa: E402
from scribe_data.extract_transform.translation.translation_utils import (  # noqa: E402
    translate_to_other_languages,
)

SRC_LANG = "Russian"
translate_script_dir = os.path.dirname(os.path.abspath(__file__))
@@ -14,7 +14,7 @@
PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src"
sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC)

from scribe_data.utils import translate_to_other_languages # noqa: E402
from scribe_data.extract_transform.translation.translation_utils import (  # noqa: E402
    translate_to_other_languages,
)

SRC_LANG = "Spanish"
translate_script_dir = os.path.dirname(os.path.abspath(__file__))
@@ -14,7 +14,7 @@
PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src"
sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC)

from scribe_data.utils import translate_to_other_languages # noqa: E402
from scribe_data.extract_transform.translation.translation_utils import (  # noqa: E402
    translate_to_other_languages,
)

SRC_LANG = "Swedish"
translate_script_dir = os.path.dirname(os.path.abspath(__file__))
111 changes: 111 additions & 0 deletions src/scribe_data/extract_transform/translation/translation_utils.py
@@ -0,0 +1,111 @@
"""
Utility functions for the machine translation process.
Contents:
translation_interrupt_handler,
translate_to_other_languages
"""

import json
import os
import signal
import sys

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

PATH_TO_SCRIBE_ORG = os.path.dirname(sys.path[0]).split("Scribe-Data")[0]
PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src"
sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC)

from scribe_data.utils import ( # noqa: E402
get_language_dir_path,
get_language_iso,
get_target_langcodes,
)


def translation_interrupt_handler(source_language, translations):
"""
Handles interrupt signals and saves the current translation progress.
Parameters
----------
source_language : str
The source language being translated from.
translations : list[dict]
The current list of translations.
"""
print(
"\nThe interrupt signal has been caught and the current progress is being saved..."
)

with open(
f"{get_language_dir_path(source_language)}/formatted_data/translated_words.json",
"w",
encoding="utf-8",
) as file:
json.dump(translations, file, ensure_ascii=False, indent=4)

print("The current progress is saved to the translated_words.json file.")
exit()


def translate_to_other_languages(source_language, word_list, translations, batch_size):
"""
Translates a list of words from the source language to other target languages using batch processing.
Parameters
----------
source_language : str
The source language being translated from.
word_list : list[str]
The list of words to translate.
translations : dict
The current dictionary of translations.
batch_size : int
The number of words to translate in each batch.
"""
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

signal.signal(
signal.SIGINT,
lambda sig, frame: translation_interrupt_handler(source_language, translations),
)

for i in range(0, len(word_list), batch_size):
batch_words = word_list[i : i + batch_size]
print(f"Translating batch {i//batch_size + 1}: {batch_words}")

for lang_code in get_target_langcodes(source_language):
tokenizer.src_lang = get_language_iso(source_language)
encoded_words = tokenizer(batch_words, return_tensors="pt", padding=True)
generated_tokens = model.generate(
**encoded_words, forced_bos_token_id=tokenizer.get_lang_id(lang_code)
)
translated_words = tokenizer.batch_decode(
generated_tokens, skip_special_tokens=True
)

for word, translation in zip(batch_words, translated_words):
if word not in translations:
translations[word] = {}

translations[word][lang_code] = translation

print(f"Batch {i//batch_size + 1} translation completed.")

with open(
f"{get_language_dir_path(source_language)}/formatted_data/translated_words.json",
"w",
encoding="utf-8",
) as file:
json.dump(translations, file, ensure_ascii=False, indent=4)

print(
"Translation results for all words are saved to the translated_words.json file."
)
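
A hedged usage sketch of `translate_to_other_languages` outside of the per-language scripts — the source language, word list and batch size below are made up for illustration, and the call downloads the M2M100 model on first run:

```python
# Illustrative call to translate_to_other_languages; the source language,
# word list and batch size are example values, not Scribe-Data defaults.
from scribe_data.extract_transform.translation.translation_utils import (
    translate_to_other_languages,
)

translations = {}  # or load a previously saved translated_words.json
translate_to_other_languages(
    source_language="German",
    word_list=["Haus", "Wasser"],
    translations=translations,
    batch_size=2,
)
# Afterwards translations maps each word to its per-language renderings,
# e.g. translations["Haus"]["fr"] for the French translation.
```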
@@ -8,7 +8,7 @@
Example
-------
python update_words_to_translate.py '["French", "German"]'
python3 src/scribe_data/extract_transform/translation/update_words_to_translate.py '["French", "German"]'
"""

import json
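
The bracketed argument is a JSON-encoded list of languages. A sketch of how such an argument can be parsed is below; whether update_words_to_translate.py does exactly this is an assumption based on its example invocation:

```python
# Hypothetical parsing of a CLI argument like '["French", "German"]'.
# The actual argument handling in update_words_to_translate.py may differ.
import json
import sys

languages = json.loads(sys.argv[1])  # e.g. ["French", "German"]
print(f"Updating words to translate for: {languages}")
```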
@@ -35,9 +35,7 @@
"source": [
"import os\n",
"import sys\n",
"import json\n",
"\n",
"from tqdm.auto import tqdm\n",
"from IPython.display import display, HTML\n",
"display(HTML(\"<style>.container { width:99% !important; }</style>\"))"
]
@@ -71,7 +69,7 @@
},
"outputs": [],
"source": [
"from scribe_data.extract_transform.process_unicode import gen_emoji_lexicon"
"from scribe_data.extract_transform.unicode.process_unicode import gen_emoji_lexicon"
]
},
{
@@ -14,13 +14,13 @@
from icu import Char, UProperty
from tqdm.auto import tqdm

from scribe_data.extract_transform.emoji_utils import get_emoji_codes_to_ignore
from scribe_data.extract_transform.unicode.emoji_utils import get_emoji_codes_to_ignore
from scribe_data.utils import (
    get_language_iso,
    get_path_from_et_dir,
)

from . import _resources
from .. import _resources
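
(A note on the change above: the relative import moves from `from . import _resources` to `from .. import _resources` because process_unicode.py now lives one level deeper in the new `unicode` subdirectory, so `_resources` is reached from the parent `extract_transform` package.)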

emoji_codes_to_ignore = get_emoji_codes_to_ignore()

@@ -11,7 +11,7 @@
Example
-------
python update_data.py '["French", "German"]' '["nouns", "verbs"]'
python3 src/scribe_data/extract_transform/wikidata/update_data.py '["French", "German"]' '["nouns", "verbs"]'
"""

import itertools