#75 Italian translation process and reorder directory structure
andrewtavis committed Mar 24, 2024
1 parent 02f220e commit 2b72e64
Showing 22 changed files with 213 additions and 121 deletions.
7 changes: 3 additions & 4 deletions CHANGELOG.md
@@ -12,10 +12,7 @@ Emojis for the following are chosen based on [gitmoji](https://gitmoji.dev/).

## [Upcoming] Scribe-Data 3.3.0

<!-- - The translation process has been updated to allow for translations from non-English languages.
- Scribe-Data now outputs an SQLite table that has keys for target languages for each base language. -->
<!-- - English has been added to the data ETL process. -->

- The translation process has been updated to allow for translations from non-English languages ([#72](https://github.com/scribe-org/Scribe-Data/issues/72), [#73](https://github.com/scribe-org/Scribe-Data/issues/73), [#74](https://github.com/scribe-org/Scribe-Data/issues/74), [#75](https://github.com/scribe-org/Scribe-Data/issues/75), [#76](https://github.com/scribe-org/Scribe-Data/issues/76), [#77](https://github.com/scribe-org/Scribe-Data/issues/77), [#78](https://github.com/scribe-org/Scribe-Data/issues/78), [#79](https://github.com/scribe-org/Scribe-Data/issues/79)).
- The documentation has been given a new layout with the logo in the top left ([#90](https://github.com/scribe-org/Scribe-Data/issues/90)).
- The documentation now has links to the code at the top of each page ([#91](https://github.com/scribe-org/Scribe-Data/issues/91)).

@@ -25,6 +22,8 @@ Emojis for the following are chosen based on [gitmoji](https://gitmoji.dev/).
- A Ruff based GitHub workflow was added to check the code formatting and lint the codebase on each pull request ([#109](https://github.com/scribe-org/Scribe-Data/issues/109)).
- The `_update_files` directory was renamed `update_files` as these files are used in non-internal manners now ([#57](https://github.com/scribe-org/Scribe-Data/issues/57)).
- A common function has been created to map Wikidata ids to noun genders ([#69](https://github.com/scribe-org/Scribe-Data/issues/69)).
- Files in the `extract_transform` directory were moved based on whether they access Wikidata, Wikipedia or Unicode.
- Translation files were further moved to their own directory.

## Scribe-Data 3.2.2

10 changes: 5 additions & 5 deletions README.md
@@ -15,7 +15,7 @@

## Wikidata and Wikipedia language data extraction

**Scribe-Data** contains the scripts for extracting and formatting data from [Wikidata](https://www.wikidata.org/) and [Wikipedia](https://www.wikipedia.org/) for Scribe applications. Updates to the language keyboard and interface data can be done using [scribe_data/load/update_data.py](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/load/update_data.py) and the notebooks within the [scribe_data/load](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/load) directory.
**Scribe-Data** contains the scripts for extracting and formatting data from [Wikidata](https://www.wikidata.org/) and [Wikipedia](https://www.wikipedia.org/) for Scribe applications. Updates to the language keyboard and interface data can be done using [scribe_data/extract_transform/wikidata/update_data.py](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/extract_transform/wikidata/update_data.py) and the notebooks within the [scribe_data/load](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/load) directory.

> [!NOTE]\
> The [contributing](#contributing) section has information for those interested, with the articles and presentations in [featured by](#featured-by) also being good resources for learning more about Scribe.
@@ -38,14 +38,14 @@ Check out Scribe's [architecture diagrams](https://github.com/scribe-org/Organiz

# Process [`⇧`](#contents)

[scribe_data/extract_transform/update_data.py](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/extract_transform/update_data.py) and the notebooks within the [scribe_data/extract_transform](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/extract_transform) directory are used to update all data for [Scribe-iOS](https://github.com/scribe-org/Scribe-iOS), with this functionality later being expanded to update [Scribe-Android](https://github.com/scribe-org/Scribe-Android) and [Scribe-Desktop](https://github.com/scribe-org/Scribe-Desktop) when they're active.
[scribe_data/extract_transform/wikidata/update_data.py](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/extract_transform/wikidata/update_data.py) and the notebooks within the [scribe_data/extract_transform](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/extract_transform) directory are used to update all data for [Scribe-iOS](https://github.com/scribe-org/Scribe-iOS), with this functionality later being expanded to update [Scribe-Android](https://github.com/scribe-org/Scribe-Android) and [Scribe-Desktop](https://github.com/scribe-org/Scribe-Desktop) when they're active.

The main data update process in [update_data.py](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/extract_transform/update_data.py) triggers [SPARQL queries](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/extract_transform/languages) to query language data from [Wikidata](https://www.wikidata.org/) using [SPARQLWrapper](https://github.com/RDFLib/sparqlwrapper) as a URI. The autosuggestion process derives popular words from [Wikipedia](https://www.wikipedia.org/) as well as those words that normally follow them for an effective baseline feature until natural language processing methods are employed. Functions to generate autosuggestions are run in [gen_autosuggestions.ipynb](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/extract_transform/wikipedia/gen_autosuggestions.ipynb). Emojis are further sourced from [Unicode CLDR](https://github.com/unicode-org/cldr), with this process being run in [gen_emoji_lexicon.ipynb](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/extract_transform/unicode/gen_emoji_lexicon.ipynb).
The main data update process in [update_data.py](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/extract_transform/wikidata/update_data.py) triggers [SPARQL queries](https://github.com/scribe-org/Scribe-Data/tree/main/src/scribe_data/extract_transform/languages) to query language data from [Wikidata](https://www.wikidata.org/) using [SPARQLWrapper](https://github.com/RDFLib/sparqlwrapper) as a URI. The autosuggestion process derives popular words from [Wikipedia](https://www.wikipedia.org/) as well as those words that normally follow them for an effective baseline feature until natural language processing methods are employed. Functions to generate autosuggestions are run in [gen_autosuggestions.ipynb](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/extract_transform/wikipedia/gen_autosuggestions.ipynb). Emojis are further sourced from [Unicode CLDR](https://github.com/unicode-org/cldr), with this process being run in [gen_emoji_lexicon.ipynb](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/extract_transform/unicode/gen_emoji_lexicon.ipynb).

Running [update_data.py](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/extract_transform/update_data.py) is done via the following CLI command:
Running [update_data.py](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/extract_transform/wikidata/update_data.py) is done via the following CLI command:

```bash
python3 src/scribe_data/extract_transform/update_data.py
python3 src/scribe_data/extract_transform/wikidata/update_data.py
```
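
For orientation, the following is a minimal sketch of how such a query can be sent to [Wikidata](https://www.wikidata.org/) with SPARQLWrapper. The lexeme query here is an illustrative stand-in, not one of the repository's query files:

```python
# Minimal sketch of the SPARQL querying step. The query below is an
# illustrative stand-in, not one of Scribe-Data's actual query files.
from SPARQLWrapper import JSON, SPARQLWrapper

sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
sparql.setReturnFormat(JSON)
sparql.setQuery(
    """
    SELECT ?lexeme ?lemma WHERE {
      ?lexeme dct:language wd:Q188;  # lexemes in German (Q188)
              wikibase:lemma ?lemma.
    }
    LIMIT 5
    """
)

results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["lemma"]["value"])
```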

The ultimate goal is that this repository will house language packs that are periodically updated with new [Wikidata](https://www.wikidata.org/) lexicographical data and data from other sources. These packs would then be available to download by users of Scribe applications.
@@ -14,7 +14,7 @@
PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src"
sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC)

from scribe_data.utils import translate_to_other_languages # noqa: E402
from scribe_data.extract_transform.translation.translation_utils import (  # noqa: E402
    translate_to_other_languages,
)

SRC_LANG = "English"
translate_script_dir = os.path.dirname(os.path.abspath(__file__))
@@ -14,7 +14,7 @@
PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src"
sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC)

from scribe_data.utils import translate_to_other_languages # noqa: E402
from scribe_data.extract_transform.translation.translation_utils import (  # noqa: E402
    translate_to_other_languages,
)

SRC_LANG = "French"
translate_script_dir = os.path.dirname(os.path.abspath(__file__))
@@ -14,7 +14,7 @@
PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src"
sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC)

from scribe_data.utils import translate_to_other_languages # noqa: E402
from scribe_data.extract_transform.translation.translation_utils import (  # noqa: E402
    translate_to_other_languages,
)

SRC_LANG = "German"
translate_script_dir = os.path.dirname(os.path.abspath(__file__))
@@ -0,0 +1,43 @@
"""
Translates the Italian words queried from Wikidata to all other Scribe languages.
Example
-------
python3 src/scribe_data/extract_transform/languages/Italian/translations/translate_words.py
"""

import json
import os
import sys

PATH_TO_SCRIBE_ORG = os.path.dirname(sys.path[0]).split("Scribe-Data")[0]
PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src"
sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC)

from scribe_data.extract_transform.translation.translation_utils import ( # noqa: E402
translate_to_other_languages,
)

SRC_LANG = "Italian"
translate_script_dir = os.path.dirname(os.path.abspath(__file__))
words_to_translate_path = os.path.join(translate_script_dir, "words_to_translate.json")

with open(words_to_translate_path, "r", encoding="utf-8") as file:
json_data = json.load(file)

word_list = [item["word"] for item in json_data]

translations = {}
translated_words_path = os.path.join(
translate_script_dir, "../formatted_data/translated_words.json"
)
if os.path.exists(translated_words_path):
with open(translated_words_path, "r", encoding="utf-8") as file:
translations = json.load(file)

translate_to_other_languages(
source_language=SRC_LANG,
word_list=word_list,
translations=translations,
batch_size=100,
)
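
A note on the input this script expects: `words_to_translate.json` should be a list of objects, each with a `word` key. A hypothetical stub for local testing could be generated like so (the words below are placeholders, not Scribe-Data query output):

```python
# Hypothetical stub generator for words_to_translate.json. The words are
# placeholders for illustration, not actual Wikidata query results.
import json

stub_words = [{"word": w} for w in ["ciao", "grazie", "acqua"]]

with open("words_to_translate.json", "w", encoding="utf-8") as file:
    json.dump(stub_words, file, ensure_ascii=False, indent=4)
```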
@@ -14,7 +14,7 @@
PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src"
sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC)

from scribe_data.utils import translate_to_other_languages # noqa: E402
from scribe_data.extract_transform.translation.translation_utils import (  # noqa: E402
    translate_to_other_languages,
)

SRC_LANG = "Portuguese"
translate_script_dir = os.path.dirname(os.path.abspath(__file__))
@@ -14,7 +14,7 @@
PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src"
sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC)

from scribe_data.utils import translate_to_other_languages # noqa: E402
from scribe_data.extract_transform.translation.translation_utils import (  # noqa: E402
    translate_to_other_languages,
)

SRC_LANG = "Russian"
translate_script_dir = os.path.dirname(os.path.abspath(__file__))
@@ -14,7 +14,7 @@
PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src"
sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC)

from scribe_data.utils import translate_to_other_languages # noqa: E402
from scribe_data.extract_transform.translation.translation_utils import (  # noqa: E402
    translate_to_other_languages,
)

SRC_LANG = "Spanish"
translate_script_dir = os.path.dirname(os.path.abspath(__file__))
@@ -14,7 +14,7 @@
PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src"
sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC)

from scribe_data.utils import translate_to_other_languages # noqa: E402
from scribe_data.extract_transform.translation.translation_utils import (  # noqa: E402
    translate_to_other_languages,
)

SRC_LANG = "Swedish"
translate_script_dir = os.path.dirname(os.path.abspath(__file__))
111 changes: 111 additions & 0 deletions src/scribe_data/extract_transform/translation/translation_utils.py
@@ -0,0 +1,111 @@
"""
Utility functions for the machine translation process.
Contents:
translation_interrupt_handler,
translate_to_other_languages
"""

import json
import os
import signal
import sys

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

PATH_TO_SCRIBE_ORG = os.path.dirname(sys.path[0]).split("Scribe-Data")[0]
PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src"
sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC)

from scribe_data.utils import ( # noqa: E402
get_language_dir_path,
get_language_iso,
get_target_langcodes,
)


def translation_interrupt_handler(source_language, translations):
"""
Handles interrupt signals and saves the current translation progress.
Parameters
----------
source_language : str
The source language being translated from.
translations : list[dict]
The current list of translations.
"""
print(
"\nThe interrupt signal has been caught and the current progress is being saved..."
)

with open(
f"{get_language_dir_path(source_language)}/formatted_data/translated_words.json",
"w",
encoding="utf-8",
) as file:
json.dump(translations, file, ensure_ascii=False, indent=4)

print("The current progress is saved to the translated_words.json file.")
exit()


def translate_to_other_languages(source_language, word_list, translations, batch_size):
"""
Translates a list of words from the source language to other target languages using batch processing.
Parameters
----------
source_language : str
The source language being translated from.
word_list : list[str]
The list of words to translate.
translations : dict
The current dictionary of translations.
batch_size : int
The number of words to translate in each batch.
"""
model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

signal.signal(
signal.SIGINT,
lambda sig, frame: translation_interrupt_handler(source_language, translations),
)

for i in range(0, len(word_list), batch_size):
batch_words = word_list[i : i + batch_size]
print(f"Translating batch {i//batch_size + 1}: {batch_words}")

for lang_code in get_target_langcodes(source_language):
tokenizer.src_lang = get_language_iso(source_language)
encoded_words = tokenizer(batch_words, return_tensors="pt", padding=True)
generated_tokens = model.generate(
**encoded_words, forced_bos_token_id=tokenizer.get_lang_id(lang_code)
)
translated_words = tokenizer.batch_decode(
generated_tokens, skip_special_tokens=True
)

for word, translation in zip(batch_words, translated_words):
if word not in translations:
translations[word] = {}

translations[word][lang_code] = translation

print(f"Batch {i//batch_size + 1} translation completed.")

with open(
f"{get_language_dir_path(source_language)}/formatted_data/translated_words.json",
"w",
encoding="utf-8",
) as file:
json.dump(translations, file, ensure_ascii=False, indent=4)

print(
"Translation results for all words are saved to the translated_words.json file."
)
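
A hedged usage sketch of `translate_to_other_languages` outside of the per-language scripts — the source language, word list and batch size below are made up for illustration, and the call downloads the M2M100 model on first run:

```python
# Illustrative call to translate_to_other_languages; the source language,
# word list and batch size are example values, not Scribe-Data defaults.
from scribe_data.extract_transform.translation.translation_utils import (
    translate_to_other_languages,
)

translations = {}  # or load a previously saved translated_words.json
translate_to_other_languages(
    source_language="German",
    word_list=["Haus", "Wasser"],
    translations=translations,
    batch_size=2,
)
# Afterwards translations maps each word to its per-language renderings,
# e.g. translations["Haus"]["fr"] for the French translation.
```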
@@ -8,7 +8,7 @@
Example
-------
python update_words_to_translate.py '["French", "German"]'
python3 src/scribe_data/extract_transform/translation/update_words_to_translate.py '["French", "German"]'
"""

import json
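
The bracketed argument is a JSON-encoded list of languages. A sketch of how such an argument can be parsed is below; whether update_words_to_translate.py does exactly this is an assumption based on its example invocation:

```python
# Hypothetical parsing of a CLI argument like '["French", "German"]'.
# The actual argument handling in update_words_to_translate.py may differ.
import json
import sys

languages = json.loads(sys.argv[1])  # e.g. ["French", "German"]
print(f"Updating words to translate for: {languages}")
```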
@@ -35,9 +35,7 @@
"source": [
"import os\n",
"import sys\n",
"import json\n",
"\n",
"from tqdm.auto import tqdm\n",
"from IPython.display import display, HTML\n",
"display(HTML(\"<style>.container { width:99% !important; }</style>\"))"
]
@@ -71,7 +69,7 @@
},
"outputs": [],
"source": [
"from scribe_data.extract_transform.process_unicode import gen_emoji_lexicon"
"from scribe_data.extract_transform.unicode.process_unicode import gen_emoji_lexicon"
]
},
{
@@ -14,13 +14,13 @@
from icu import Char, UProperty
from tqdm.auto import tqdm

from scribe_data.extract_transform.emoji_utils import get_emoji_codes_to_ignore
from scribe_data.extract_transform.unicode.emoji_utils import get_emoji_codes_to_ignore
from scribe_data.utils import (
    get_language_iso,
    get_path_from_et_dir,
)

from . import _resources
from .. import _resources
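
(A note on the change above: the relative import moves from `from . import _resources` to `from .. import _resources` because process_unicode.py now lives one level deeper in the new `unicode` subdirectory, so `_resources` is reached from the parent `extract_transform` package.)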

emoji_codes_to_ignore = get_emoji_codes_to_ignore()

@@ -11,7 +11,7 @@
Example
-------
python update_data.py '["French", "German"]' '["nouns", "verbs"]'
python3 src/scribe_data/extract_transform/wikidata/update_data.py '["French", "German"]' '["nouns", "verbs"]'
"""

import itertools