diff --git a/.github/docs/CONTRIBUTING_NO_BACK_LINKS.md b/.github/docs/CONTRIBUTING_NO_BACK_LINKS.md deleted file mode 100644 index fe7e4b7c7..000000000 --- a/.github/docs/CONTRIBUTING_NO_BACK_LINKS.md +++ /dev/null @@ -1,229 +0,0 @@ -# Contributing to Scribe-Data - -Thank you for your interest in contributing! - -Please take a moment to review this document in order to make the contribution process easy and effective for everyone involved. - -Following these guidelines helps to communicate that you respect the time of the developers managing and developing this open-source project. In return, and in accordance with this project's [code of conduct](https://github.com/scribe-org/Scribe-Data/blob/main/.github/CODE_OF_CONDUCT.md), other contributors will reciprocate that respect in addressing your issue or assessing changes and features. - -If you have questions or would like to communicate with the team, please [join us in our public Matrix chat rooms](https://matrix.to/#/#scribe_community:matrix.org). We'd be happy to hear from you! - - - -## Contents - -- [First steps as a contributor](#first-steps) -- [Learning the tech stack](#learning-the-tech) -- [Development environment](#dev-env) -- [Issues and projects](#issues-projects) -- [Bug reports](#bug-reports) -- [Feature requests](#feature-requests) -- [Pull requests](#pull-requests) -- [Data edits](#data-edits) -- [Documentation](#documentation) - - - -## First steps as a contributor - -Thank you for your interest in contributing to Scribe-Data! 
We look forward to welcoming you to the community and working with you to build an tools for language learners to communicate effectively :) The following are some suggested steps for people interested in joining our community: - -- Please join the [public Matrix chat](https://matrix.to/#/#scribe_community:matrix.org) to connect with the community - - [Matrix](https://matrix.org/) is a network for secure, decentralized communication - - Scribe would suggest that you use the [Element](https://element.io/) client - - The [General](https://matrix.to/#/!yQJjLmluvlkWttNhKo:matrix.org?via=matrix.org) and [Data](https://matrix.to/#/#ScribeData:matrix.org) channels would be great places to start! - - Feel free to introduce yourself and tell us what your interests are if you're comfortable :) -- Read through this contributing guide for all the information you need to contribute -- Look into issues marked [`good first issue`](https://github.com/scribe-org/Scribe-Data/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22) and the [Projects board](https://github.com/orgs/scribe-org/projects/1) to get a better understanding of what you can work on -- Check out our [public designs on Figma](https://www.figma.com/file/c8945w2iyoPYVhsqW7vRn6/scribe_public_designs?type=design&node-id=405-464&mode=design&t=E3ccS9Z8MDVSizQ4-0) to understand Scribes's goals and direction -- Consider joining our [bi-weekly developer sync](https://etherpad.wikimedia.org/p/scribe-dev-sync)! - -> [!NOTE] -> Those new to Python or wanting to work on their Python skills are more than welcome to contribute! The team would be happy to help you on your development journey :) - - - -## Learning the tech stack - -Scribe is very open to contributions from people in the early stages of their coding journey! The following is a select list of documentation pages to help you understand the technologies we use. - -
Docs for those new to programming -

- -- [Mozilla Developer Network Learning Area](https://developer.mozilla.org/en-US/docs/Learn) - - Doing MDN sections for HTML, CSS and JavaScript is the best ways to get into web development! - -

-
- -
Python learning docs -

- -- [Python getting started guide](https://docs.python.org/3/tutorial/introduction.html) -- [Python getting started resources](https://www.python.org/about/gettingstarted/) - -

-
- - - -## Development environment - -The development environment for Scribe-Data can be installed via the following steps: - -1. [Fork](https://docs.github.com/en/get-started/quickstart/fork-a-repo) the [Scribe-Data repo](https://github.com/scribe-org/Scribe-Data), clone your fork, and configure the remotes: - -> [!NOTE] -> ->
Consider using SSH -> ->

-> -> Alternatively to using HTTPS as in the instructions below, consider SSH to interact with GitHub from the terminal. SSH allows you to connect without a user-pass authentication flow. -> -> To run git commands with SSH, remember then to substitute the HTTPS URL, `https://github.com/...`, with the SSH one, `git@github.com:...`. -> -> - e.g. Cloning now becomes `git clone git@github.com:/Scribe-Data.git` -> -> GitHub also has their documentation on how to [Generate a new SSH key](https://docs.github.com/en/authentication/connecting-to-github-with-ssh/generating-a-new-ssh-key-and-adding-it-to-the-ssh-agent) 🔑 -> ->

->
- -```bash -# Clone your fork of the repo into the current directory. -git clone https://github.com//Scribe-Data.git -# Navigate to the newly cloned directory. -cd Scribe-Data -# Assign the original repo to a remote called "upstream". -git remote add upstream https://github.com/scribe-org/Scibe-Data.git -``` - -- Now, if you run `git remote -v` you should see two remote repositories named: - - `origin` (forked repository) - - `upstream` (Scribe-Data repository) - -2. Use [Python venv](https://docs.python.org/3/library/venv.html) to create the local development environment within your Scribe-Data directory: - - ```bash - python3 -m venv venv # make an environment venv - pip install --upgrade pip # make sure that pip is at the latest version - pip install -r requirements.txt # install dependencies - ``` - -> [!NOTE] -> Feel free to contact the team in the [Data room on Matrix](https://matrix.to/#/#ScribeData:matrix.org) if you're having problems getting your environment setup! - - - -## Issues and projects - -The [issue tracker for Scribe-Data](https://github.com/scribe-org/Scribe-Data/issues) is the preferred channel for [bug reports](#bug-reports), [features requests](#feature-requests) and [submitting pull requests](#pull-requests). Scribe also organizes related issues into [projects](https://github.com/scribe-org/Scribe-Data/projects). - -> [!NOTE]\ -> Just because an issue is assigned on GitHub doesn't mean that the team isn't interested in your contribution! Feel free to write [in the issues](https://github.com/scribe-org/Scribe-Data/issues) and we can potentially reassign it to you. 
- -Be sure to check the [`-next release-`](https://github.com/scribe-org/Scribe-Data/labels/-next%20release-) and [`-priority-`](https://github.com/scribe-org/Scribe-Data/labels/-priority-) labels in the [issues](https://github.com/scribe-org/Scribe-Data/issues) for those that are most important, as well as those marked [`good first issue`](https://github.com/scribe-org/Scribe-Data/issues?q=is%3Aissue+is%3Aopen+label%3A%22good+first+issue%22) that are tailored for first time contributors. - - - -## Bug reports - -A bug is a _demonstrable problem_ that is caused by the code in the repository. Good bug reports are extremely helpful - thank you! - -Guidelines for bug reports: - -1. **Use the GitHub issue search** to check if the issue has already been reported. - -2. **Check if the issue has been fixed** by trying to reproduce it using the latest `main` or development branch in the repository. - -3. **Isolate the problem** to make sure that the code in the repository is _definitely_ responsible for the issue. - -**Great Bug Reports** tend to have: - -- A quick summary -- Steps to reproduce -- What you expected would happen -- What actually happens -- Notes (why this might be happening, things tried that didn't work, etc) - -To make the above steps easier, the Scribe team asks that contributors report bugs using the [bug report template](https://github.com/scribe-org/Scribe-Data/issues/new?assignees=&labels=feature&template=bug_report.yml), with these issues further being marked with the [`bug`](https://github.com/scribe-org/Scribe-Data/issues?q=is%3Aopen+is%3Aissue+label%3Abug) label. - -Again, thank you for your time in reporting issues! - - - -## Feature requests - -Feature requests are more than welcome! Please take a moment to find out whether your idea fits with the scope and aims of the project. When making a suggestion, provide as much detail and context as possible, and further make clear the degree to which you would like to contribute in its development. 
Feature requests are marked with the [`feature`](https://github.com/scribe-org/Scribe-Data/issues?q=is%3Aopen+is%3Aissue+label%3Afeature) label, and can be made using the [feature request](https://github.com/scribe-org/Scribe-Data/issues/new?assignees=&labels=feature&template=feature_request.yml) template. - - - -## Pull requests - -Good pull requests - patches, improvements and new features - are the foundation of our community making Scribe-Data. They should remain focused in scope and avoid containing unrelated commits. Note that all contributions to this project will be made under [the specified license](https://github.com/scribe-org/Scribe-Data/blob/main/LICENSE.txt) and should follow the coding indentation and style standards ([contact us](https://matrix.to/#/#scribe_community:matrix.org) if unsure). - -**Please ask first** before embarking on any significant pull request (implementing features, refactoring code, etc), otherwise you risk spending a lot of time working on something that the developers might not want to merge into the project. With that being said, major additions are very appreciated! - -When making a contribution, adhering to the [GitHub flow](https://guides.github.com/introduction/flow/index.html) process is the best way to get your work merged: - -1. If you cloned a while ago, get the latest changes from upstream: - - ```bash - git checkout - git pull upstream - ``` - -2. Create a new topic branch (off the main project development branch) to contain your feature, change, or fix: - - ```bash - git checkout -b - ``` - -3. Commit your changes in logical chunks, and please try to adhere to [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/). 
- -> [!NOTE] -> The following are tools and methods to help you write good commit messages ✨ -> -> - [commitlint](https://commitlint.io/) helps write [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) -> - Git's [interactive rebase](https://docs.github.com/en/github/getting-started-with-github/about-git-rebase) cleans up commits - -4. Locally merge (or rebase) the upstream development branch into your topic branch: - - ```bash - git pull --rebase upstream - ``` - -5. Push your topic branch up to your fork: - - ```bash - git push origin - ``` - -6. [Open a Pull Request](https://help.github.com/articles/using-pull-requests/) with a clear title and description. - -Thank you in advance for your contributions! - - - -## Data edits - -> [!NOTE]\ -> Please see the [Wikidata and Scribe Guide](https://github.com/scribe-org/Organization/blob/main/WIKIDATAGUIDE.md) for an overview of [Wikidata](https://www.wikidata.org/) and how Scribe uses it. - -Scribe does not accept direct edits to the grammar JSON files as they are sourced from [Wikidata](https://www.wikidata.org/). Edits can be discussed and the [Scribe-Data](https://github.com/scribe-org/Scribe-Data) queries will be changed and ran before an update. If there is a problem with one of the files, then the fix should be made on [Wikidata](https://www.wikidata.org/) and not on Scribe. Feel free to let us know that edits have been made by [opening an issue](https://github.com/scribe-org/Scribe-Data/issues) and we'll be happy to integrate them! - - - -## Documentation - -The documentation for Scribe-Data can be found at [scribe-data.readthedocs.io](https://scribe-data.readthedocs.io/en/latest/). Documentation is an invaluable way to contribute to coding projects as it allows others to more easily understand the project structure and contribute. Issues related to documentation are marked with the [`documentation`](https://github.com/scribe-org/Scribe-Data/labels/documentation) label. 
- -Use the following commands to build the documentation locally: - -```bash -cd docs -make html -``` - -You can then open `index.html` within `docs/build/html` to check the local version of the documentation. diff --git a/.vscode/extensions.json b/.vscode/extensions.json index 7d33d594d..5b7579ff3 100644 --- a/.vscode/extensions.json +++ b/.vscode/extensions.json @@ -1,3 +1,7 @@ { - "recommendations": ["blokhinnv.wikidataqidlabels"] + "recommendations": [ + "blokhinnv.wikidataqidlabels", + "charliermarsh.ruff", + "streetsidesoftware.code-spell-checker" + ] } diff --git a/CHANGELOG.md b/CHANGELOG.md index c36bc4b57..5689d33e5 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -16,6 +16,15 @@ Emojis for the following are chosen based on [gitmoji](https://gitmoji.dev/). - Scribe-Data now outputs an SQLite table that has keys for target languages for each base language. --> +- The documentation has been given a new layout with the logo in the top left ([#90](https://github.com/scribe-org/Scribe-Data/issues/90)). +- The documentation now has links to the code at the top of each page ([#91](https://github.com/scribe-org/Scribe-Data/issues/91)). + +### ♻️ Code Refactoring + +- The `_update_files` directory was renamed `update_files` as these files are used in non-internal manners now ([#57](https://github.com/scribe-org/Scribe-Data/issues/57)). +- A common function has been created to map Wikidata ids to noun genders ([#69](https://github.com/scribe-org/Scribe-Data/issues/69)). +- Code formatting was shifted from [black](https://github.com/psf/black) to [Ruff](https://github.com/astral-sh/ruff). + ## Scribe-Data 3.2.2 - Minor fixes to documentation index and file docstrings to fix errors. diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 807d1c5d0..cd63eba00 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -69,6 +69,21 @@ Scribe is very open to contributions from people in the early stages of their co ## Development environment [`⇧`](#contents) +> [!IMPORTANT] +> +>
Suggested IDE extensions +> +>

+> +> VS Code +> +> - [blokhinnv.wikidataqidlabels](https://marketplace.visualstudio.com/items?itemName=blokhinnv.wikidataqidlabels) +> - [charliermarsh.ruff](https://marketplace.visualstudio.com/items?itemName=charliermarsh.ruff) +> - [streetsidesoftware.code-spell-checker](https://marketplace.visualstudio.com/items?itemName=streetsidesoftware.code-spell-checker) + +>

+>
+ The development environment for Scribe-Data can be installed via the following steps: 1. [Fork](https://docs.github.com/en/get-started/quickstart/fork-a-repo) the [Scribe-Data repo](https://github.com/scribe-org/Scribe-Data), clone your fork, and configure the remotes: @@ -105,11 +120,26 @@ git remote add upstream https://github.com/scribe-org/Scribe-Data.git 2. Use [Python venv](https://docs.python.org/3/library/venv.html) to create the local development environment within your Scribe-Data directory: - ```bash - python3 -m venv venv # make an environment venv - pip install --upgrade pip # make sure that pip is at the latest version - pip install -r requirements.txt # install dependencies - ``` +- On Unix or MacOS, run: + + ```bash + python3 -m venv venv # make an environment named venv + source venv/bin/activate # activate the environment + ``` + +- On Windows (using Command Prompt), run: + + ```bash + python -m venv venv + venv\Scripts\activate.bat + ``` + +After activating the virtual environment, install the required dependencies by running: + +```bash +pip install --upgrade pip # make sure that pip is at the latest version +pip install -r requirements.txt # install dependencies +``` > [!NOTE] > Feel free to contact the team in the [Data room on Matrix](https://matrix.to/#/#ScribeData:matrix.org) if you're having problems getting your environment setup! 
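The activation step above differs per platform, so before installing dependencies it can help to confirm that the virtual environment is actually active. The following is an optional sanity check, not part of the official setup steps:

```bash
# Optional sanity check: inside an active venv, sys.prefix differs from
# sys.base_prefix, so the first command prints True; outside a venv it
# prints False.
python3 -c "import sys; print(sys.prefix != sys.base_prefix)"

# pip should also resolve; `python3 -m pip` avoids accidentally picking
# up a system-wide pip.
python3 -m pip --version
```

If the first command prints `False`, re-run the activation command for your platform before running `pip install -r requirements.txt`.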
diff --git a/README.md b/README.md index 72998d687..f4975e115 100644 --- a/README.md +++ b/README.md @@ -12,7 +12,7 @@ [![coc](https://img.shields.io/badge/Contributor%20Covenant-ff69b4.svg)](https://github.com/scribe-org/Scribe-Data/blob/main/.github/CODE_OF_CONDUCT.md) [![mastodon](https://img.shields.io/badge/Mastodon-6364FF.svg?logo=mastodon&logoColor=ffffff)](https://wikis.world/@scribe) [![matrix](https://img.shields.io/badge/Matrix-000000.svg?logo=matrix&logoColor=ffffff)](https://matrix.to/#/#scribe_community:matrix.org) -[![codestyle](https://img.shields.io/badge/black-000000.svg)](https://github.com/psf/black) +[![codestyle](https://img.shields.io/badge/Ruff-26122F.svg?logo=Ruff)](https://github.com/astral-sh/ruff) ## Wikidata and Wikipedia language data extraction @@ -46,7 +46,7 @@ The main data update process in [update_data.py](https://github.com/scribe-org/S Running [update_data.py](https://github.com/scribe-org/Scribe-Data/blob/main/src/scribe_data/extract_transform/update_data.py) is done via the following CLI command: ```bash -python src/scribe_data/extract_transform/update_data.py +python3 src/scribe_data/extract_transform/update_data.py ``` The ultimate goal is that this repository will house language packs that are periodically updated with new [Wikidata](https://www.wikidata.org/) lexicographical data and data from other sources. These packs would then be available to download by users of Scribe applications. @@ -90,6 +90,21 @@ Scribe does not accept direct edits to the grammar JSON files as they are source # Environment Setup [`⇧`](#contents) +> [!IMPORTANT] +> +>
Suggested IDE extensions +> +>

+> +> VS Code +> +> - [blokhinnv.wikidataqidlabels](https://marketplace.visualstudio.com/items?itemName=blokhinnv.wikidataqidlabels) +> - [charliermarsh.ruff](https://marketplace.visualstudio.com/items?itemName=charliermarsh.ruff) +> - [streetsidesoftware.code-spell-checker](https://marketplace.visualstudio.com/items?itemName=streetsidesoftware.code-spell-checker) +> +>

+>
+ +The development environment for Scribe-Data can be installed via the following steps: 1. [Fork](https://docs.github.com/en/get-started/quickstart/fork-a-repo) the [Scribe-Data repo](https://github.com/scribe-org/Scribe-Data), clone your fork, and configure the remotes: @@ -126,24 +141,26 @@ git remote add upstream https://github.com/scribe-org/Scribe-Data.git 2. Use [Python venv](https://docs.python.org/3/library/venv.html) to create the local development environment within your Scribe-Data directory: - - On Unix or MacOS, run: +- On Unix or MacOS, run: + + ```bash + python3 -m venv venv # make an environment named venv + source venv/bin/activate # activate the environment + ``` - ```bash - python3 -m venv venv # make an environment named venv - source venv/bin/activate # activate the environment - ``` - On Windows (using Command Prompt), run: - - ```bash - python -m venv venv - venv\Scripts\activate.bat - ``` + + ```bash + python -m venv venv + venv\Scripts\activate.bat + ``` + After activating the virtual environment, install the required dependencies by running: - ```bash - pip install --upgrade pip # make sure that pip is at the latest version - pip install -r requirements.txt # install dependencies - ``` +```bash +pip install --upgrade pip # make sure that pip is at the latest version +pip install -r requirements.txt # install dependencies +``` > [!NOTE] > Feel free to contact the team in the [Data room on Matrix](https://matrix.to/#/#ScribeData:matrix.org) if you're having problems getting your environment setup! diff --git a/docs/source/_static/CONTRIBUTING.rst b/docs/source/_static/CONTRIBUTING.rst new file mode 100644 index 000000000..b4800bfcc --- /dev/null +++ b/docs/source/_static/CONTRIBUTING.rst @@ -0,0 +1,257 @@ +Contributing to Scribe-Data +=========================== + +Thank you for your interest in contributing! + +Please take a moment to review this document to make the contribution process easy and effective for everyone involved. 
+ +Following these guidelines helps to communicate that you respect the time of the developers managing and developing this open-source project. In return, and in accordance with this project's `code of conduct `__, other contributors will reciprocate that respect in addressing your issue or assessing changes and features. + +If you have questions or would like to communicate with the team, please `join us in our public Matrix chat +rooms `__. We'd be happy to hear from you! + +Contents +-------- + +- `First steps as a contributor <#first-steps-as-a-contributor>`__ +- `Learning the tech stack <#learning-the-tech-stack>`__ +- `Development environment <#development-environment>`__ +- `Issues and projects <#issues-projects>`__ +- `Bug reports <#bug-reports>`__ +- `Feature requests <#feature-requests>`__ +- `Pull requests <#pull-requests>`__ +- `Data edits <#data-edits>`__ +- `Documentation <#documentation>`__ + +First steps as a contributor +---------------------------- + +Thank you for your interest in contributing to Scribe-Data! We look +forward to welcoming you to the community and working with you to build +a tool for language learners to communicate effectively. :) The +following are some suggested steps for people interested in joining our +community: + +- Please join the `public Matrix chat `__ to connect with the community + + - `Matrix `__ is a network for secure, decentralized communication + - Scribe would suggest that you use the `Element `__ client + - The `General `__ and `Data `__ channels would be great places to start! 
- Feel free to introduce yourself and tell us what your interests are if you're comfortable :) + +- Read through this contributing guide for all the information you need to contribute +- Look into issues marked `good first issue `__ and the `Projects board `__ to get a better understanding of what you can work on +- Check out our `public designs on Figma `__ to understand Scribe's goals and direction +- Consider joining our `bi-weekly developer sync `__! + +.. + + | **Note** + | Those new to Python or wanting to work on their Python skills are more than welcome to contribute! The team would be happy to help you on your development journey :) + +Learning the tech stack +----------------------- + +Scribe is very open to contributions from people in the early stages of their coding journey! The following is a select list of documentation pages to help you understand the technologies we use. + +.. raw:: html + +
Docs for those new to programming

+ +- `Mozilla Developer Network Learning Area `__ + + - Doing MDN sections for HTML, CSS, and JavaScript is the best way to get into web development! + +.. raw:: html + +

+ +.. raw:: html + +
Python learning docs

+ +- `Python getting started guide `__ +- `Python getting started resources `__ + +.. raw:: html + +


+ +Development environment +----------------------- + +The development environment for Scribe-Data can be installed via the following steps: + +- `Fork `__ the `Scribe-Data repo `__, clone your fork, and configure the remotes: + +.. + +.. raw:: html + +
Note: Consider using SSH

+ +As an alternative to using HTTPS as in the instructions below, consider SSH to interact with GitHub from the terminal. SSH allows you to connect without a user-pass authentication flow. + +To run git commands with SSH, remember to substitute the HTTPS URL, ``https://github.com/...``, with the SSH one, ``git@github.com:...``. + +- e.g. Cloning now becomes ``git clone git@github.com:/Scribe-Data.git`` + +GitHub also has documentation on how to `Generate a new SSH key `__ 🔑 + +.. raw:: html + +


+ +.. + +.. code:: bash + + # Clone your fork of the repo into the current directory. + git clone https://github.com//Scribe-Data.git + # Navigate to the newly cloned directory. + cd Scribe-Data + # Assign the original repo to a remote called "upstream". + git remote add upstream https://github.com/scribe-org/Scribe-Data.git + +.. + +- Now, if you run ``git remote -v`` you should see two remote repositories named: + + - ``origin`` (forked repository) + - ``upstream`` (Scribe-Data repository) + +.. + +- Use `Python venv `__ to create the local development environment within your Scribe-Data directory: + +.. code:: bash + + python3 -m venv venv # make an environment named venv + source venv/bin/activate # activate the environment + pip install --upgrade pip # make sure that pip is at the latest version + pip install -r requirements.txt # install dependencies + +.. + + | **Note** + | Feel free to contact the team in the `Data room on Matrix `__ if you're having problems getting your environment set up! + +Issues and projects +------------------- + +The `issue tracker for Scribe-Data `__ is the +preferred channel for `bug reports <#bug-reports>`__, `feature requests <#feature-requests>`__ and `submitting pull +requests <#pull-requests>`__. Scribe also organizes related issues into `projects `__. + +.. + + | **Note** + | Just because an issue is assigned on GitHub doesn't mean that the team isn't interested in your contribution! Feel free to write `in the issues `__ and we can potentially reassign it to you. + +Be sure to check the `-next release- `__ +and `-priority- `__ +labels in the `issues `__ for those +that are most important, as well as those marked `good first issue `__ that are tailored for first-time contributors. + +Bug reports +----------- + +A bug is a *demonstrable problem* that is caused by the code in the repository. Good bug reports are extremely helpful - thank you! + +Guidelines for bug reports: + +1. **Use the GitHub issue search** to check if the issue has already been reported. + +2. 
**Check if the issue has been fixed** by trying to reproduce it using the latest ``main`` or development branch in the repository. + +3. **Isolate the problem** to make sure that the code in the repository is *definitely* responsible for the issue. + +**Great Bug Reports** tend to have: + +- A quick summary +- Steps to reproduce +- What you expected would happen +- What actually happens +- Notes (why this might be happening, things tried that didn't work, etc) + +To make the above steps easier, the Scribe team asks that contributors report bugs using the `bug report +template `__, with these issues further being marked with the `bug `__ label. + +Again, thank you for your time in reporting issues! + +Feature requests +---------------- + +Feature requests are more than welcome! Please take a moment to find out whether your idea fits with the scope and aims of the project. When making a suggestion, provide as much detail and context as possible, and make clear the degree to which you would like to contribute to its development. Feature requests are marked with the +`feature `__ label, and can be made using the `feature request `__ template. + +Pull requests +------------- + +Good pull requests - patches, improvements and new features - are the foundation of our community making Scribe-Data. They should remain focused in scope and avoid containing unrelated commits. Note that all contributions to this project will be made under `the specified license `__ and should follow the coding indentation and style standards (`contact us `__ if unsure). + +**Please ask first** before embarking on any significant pull request (implementing features, refactoring code, etc), otherwise you risk spending a lot of time working on something that the developers might not want to merge into the project. With that being said, major additions are very appreciated! 
+ +When making a contribution, adhering to the `GitHub flow `__ process is the best way to get your work merged: + +1. If you cloned a while ago, get the latest changes from upstream: + +.. code:: bash + + git checkout + git pull upstream + +2. Create a new topic branch (off the main project development branch) to contain your feature, change, or fix: + +.. code:: bash + + git checkout -b + +3. Commit your changes in logical chunks, and please try to adhere to `Conventional Commits `__. + +.. + + | **Note** + | The following are tools and methods to help you write good commit messages ✨ + | • `commitlint `__ helps write `Conventional Commits `__ + | • Git's `interactive rebase `__ cleans up commits + +4. Locally merge (or rebase) the upstream development branch into your topic branch: + +.. code:: bash + + git pull --rebase upstream + +5. Push your topic branch up to your fork: + +.. code:: bash + + git push origin + +6. `Open a Pull Request `__ with a clear title and description. + +Thank you in advance for your contributions! + +Data edits +---------- + +.. + + | **Note** + | Please see the `Wikidata and Scribe Guide `__ for an overview of `Wikidata `__ and how Scribe uses it. + +Scribe does not accept direct edits to the grammar JSON files as they are sourced from `Wikidata `__. Edits can be discussed and the `Scribe-Data `__ queries will be changed and run before an update. If there is a problem with one of the files, then the fix should be made on `Wikidata `__ and not on Scribe. Feel free to let us know that edits have been made by `opening an issue `__ and we’ll be happy to integrate them! + +Documentation +------------- + +The documentation for Scribe-Data can be found at `scribe-data.readthedocs.io `__. Documentation is an invaluable way to contribute to coding projects as it allows others to more easily understand the project structure and contribute. Issues related to documentation are marked with the `documentation `__ label. 
+ +Use the following commands to build the documentation locally: + +.. code:: bash + + cd docs + make html + +You can then open ``index.html`` within ``docs/build/html`` to check the +local version of the documentation. diff --git a/docs/source/_static/ScribeDataLogo.png b/docs/source/_static/ScribeDataLogo.png new file mode 100644 index 000000000..98c81615a Binary files /dev/null and b/docs/source/_static/ScribeDataLogo.png differ diff --git a/docs/source/_static/custom.css b/docs/source/_static/custom.css new file mode 100644 index 000000000..bef8928d4 --- /dev/null +++ b/docs/source/_static/custom.css @@ -0,0 +1,39 @@ +/* Change the sidebar header color to match the new logo background. */ +.wy-side-nav-search { + background-color: #f2f5f8 !important; +} + +/* Update the sidebar background color. */ +.wy-nav-side { + background-color: #f2f5f8 !important; +} + +/* Update the version color to match the background color. */ +.version { + color: #000 !important; +} + +/* Change sidebar text color to black for all sidebar items. */ +.wy-nav-side .wy-nav-content, +.wy-nav-side .wy-menu-vertical a { + color: #000 !important; +} + +/* Sidebar link text color on hover. */ +.wy-nav-side .wy-nav-content a:hover, +.wy-nav-side .wy-menu-vertical a:hover { + color: #000 !important; +} +/* Sidebar item background color on hover. */ +.wy-nav-side .wy-menu-vertical li.toctree-l1.current > a:hover, +.wy-nav-side .wy-menu-vertical li.toctree-l1 > a:hover { + background-color: #5495c9 !important; /* lighter Python blue background */ +} +/* Logo styling. 
*/ +.wy-side-nav-search .wy-nav-top img { + padding: 15px; + max-height: 80px; + max-width: 100%; + display: block; + margin: 0 auto; +} diff --git a/docs/source/conf.py b/docs/source/conf.py index fefbcbead..90854084c 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -18,7 +18,7 @@ # import sphinx_rtd_theme -sys.path.insert(0, os.path.abspath("..")) +sys.path.insert(0, os.path.abspath("../../src")) # -- Project information ----------------------------------------------------- @@ -175,3 +175,15 @@ "Miscellaneous", ) ] + +# Adding logo to the docs sidebar. +html_logo = "_static/ScribeDataLogo.png" +html_theme_options = { + "logo_only": True, + "display_version": True, +} + +# Importing custom css for theme customization. +html_css_files = [ + "custom.css", +] diff --git a/docs/source/index.rst b/docs/source/index.rst index 15c51f5fb..e06ebd1bb 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -35,8 +35,8 @@ .. |matrix| image:: https://img.shields.io/badge/Matrix-000000.svg?logo=matrix&logoColor=ffffff :target: https://matrix.to/#/#scribe_community:matrix.org -.. |codestyle| image:: https://img.shields.io/badge/black-000000.svg - :target: https://github.com/psf/black +.. |codestyle| image:: https://img.shields.io/badge/Ruff-26122F.svg?logo=Ruff + :target: https://github.com/astral-sh/ruff Wikidata and Wikipedia language data extraction diff --git a/docs/source/notes.rst b/docs/source/notes.rst index 4ca2eb081..9d9aa20d0 100644 --- a/docs/source/notes.rst +++ b/docs/source/notes.rst @@ -1,9 +1,9 @@ +.. mdinclude:: _static/CONTRIBUTING.rst + License ======= .. literalinclude:: ../../LICENSE.txt :language: text -.. mdinclude:: ../../.github/docs/CONTRIBUTING_NO_BACK_LINKS.md - .. 
mdinclude:: ../../CHANGELOG.md diff --git a/docs/source/scribe_data/check_language_data.rst b/docs/source/scribe_data/check_language_data.rst new file mode 100644 index 000000000..6c27c4afe --- /dev/null +++ b/docs/source/scribe_data/check_language_data.rst @@ -0,0 +1,4 @@ +check_language_data +=================== + +`View code on Github `_ diff --git a/docs/source/scribe_data/checkquery.rst b/docs/source/scribe_data/checkquery.rst index c935119cd..ec8f3cc95 100644 --- a/docs/source/scribe_data/checkquery.rst +++ b/docs/source/scribe_data/checkquery.rst @@ -1,6 +1,8 @@ checkquery ========== +`View code on Github `_ + .. automodule:: scribe_data.checkquery :members: :private-members: diff --git a/docs/source/scribe_data/extract_transform/index.rst b/docs/source/scribe_data/extract_transform/index.rst index 053586c51..067fae692 100644 --- a/docs/source/scribe_data/extract_transform/index.rst +++ b/docs/source/scribe_data/extract_transform/index.rst @@ -1,2 +1,12 @@ extract_transform ================= + +`View code on Github `_ + +.. 
toctree:: + :maxdepth: 1 + + languages/index + unicode/index + wikidata/index + wikipedia/index diff --git a/docs/source/scribe_data/extract_transform/languages/index.rst b/docs/source/scribe_data/extract_transform/languages/index.rst new file mode 100644 index 000000000..faaa2e319 --- /dev/null +++ b/docs/source/scribe_data/extract_transform/languages/index.rst @@ -0,0 +1,4 @@ +languages +========= + +`View code on Github `_ diff --git a/docs/source/scribe_data/extract_transform/unicode/index.rst b/docs/source/scribe_data/extract_transform/unicode/index.rst new file mode 100644 index 000000000..df5051271 --- /dev/null +++ b/docs/source/scribe_data/extract_transform/unicode/index.rst @@ -0,0 +1,4 @@ +unicode +======= + +`View code on Github `_ diff --git a/docs/source/scribe_data/extract_transform/wikidata/index.rst b/docs/source/scribe_data/extract_transform/wikidata/index.rst new file mode 100644 index 000000000..94e602812 --- /dev/null +++ b/docs/source/scribe_data/extract_transform/wikidata/index.rst @@ -0,0 +1,4 @@ +wikidata +======== + +`View code on Github `_ diff --git a/docs/source/scribe_data/extract_transform/wikipedia/index.rst b/docs/source/scribe_data/extract_transform/wikipedia/index.rst new file mode 100644 index 000000000..c7708b453 --- /dev/null +++ b/docs/source/scribe_data/extract_transform/wikipedia/index.rst @@ -0,0 +1,4 @@ +wikipedia +========= + +`View code on Github `_ diff --git a/docs/source/scribe_data/index.rst b/docs/source/scribe_data/index.rst index 9b33212f2..63d6ee368 100644 --- a/docs/source/scribe_data/index.rst +++ b/docs/source/scribe_data/index.rst @@ -4,7 +4,8 @@ scribe_data .. 
toctree:: :maxdepth: 1 + check_language_data + checkquery extract_transform/index load/index - checkquery utils diff --git a/docs/source/scribe_data/load/data_to_sqlite.rst b/docs/source/scribe_data/load/data_to_sqlite.rst new file mode 100644 index 000000000..86adb4e6c --- /dev/null +++ b/docs/source/scribe_data/load/data_to_sqlite.rst @@ -0,0 +1,4 @@ +data_to_sqlite +============== + +`View code on Github `_ diff --git a/docs/source/scribe_data/load/databases/index.rst b/docs/source/scribe_data/load/databases/index.rst new file mode 100644 index 000000000..29b25e055 --- /dev/null +++ b/docs/source/scribe_data/load/databases/index.rst @@ -0,0 +1,4 @@ +databases +========= + +`View code on Github `_ diff --git a/docs/source/scribe_data/load/index.rst b/docs/source/scribe_data/load/index.rst index 74b060d72..fbc2d3a60 100644 --- a/docs/source/scribe_data/load/index.rst +++ b/docs/source/scribe_data/load/index.rst @@ -1,2 +1,10 @@ load ==== + +`View code on Github `_ + +.. toctree:: + databases/index.rst + update_files/index.rst + data_to_sqlite + send_dbs_to_scribe diff --git a/docs/source/scribe_data/load/send_dbs_to_scribe.rst b/docs/source/scribe_data/load/send_dbs_to_scribe.rst new file mode 100644 index 000000000..e9334bb26 --- /dev/null +++ b/docs/source/scribe_data/load/send_dbs_to_scribe.rst @@ -0,0 +1,4 @@ +send_dbs_to_scribe +================== + +`View code on Github `_ diff --git a/docs/source/scribe_data/load/update_files/index.rst b/docs/source/scribe_data/load/update_files/index.rst new file mode 100644 index 000000000..4743cace3 --- /dev/null +++ b/docs/source/scribe_data/load/update_files/index.rst @@ -0,0 +1,4 @@ +update_files +============ + +`View code on Github `_ diff --git a/docs/source/scribe_data/utils.rst b/docs/source/scribe_data/utils.rst index e016b8eb4..074e9b125 100644 --- a/docs/source/scribe_data/utils.rst +++ b/docs/source/scribe_data/utils.rst @@ -1,6 +1,8 @@ utils ===== +`View code on Github `_ + .. 
automodule:: scribe_data.utils :members: :private-members: diff --git a/requirements.txt b/requirements.txt index 25b0e1e75..a83ef5520 100644 --- a/requirements.txt +++ b/requirements.txt @@ -1,5 +1,4 @@ beautifulsoup4==4.9.3 -black>=19.10b0 certifi>=2020.12.5 defusedxml==0.7.1 emoji>=2.2.0 @@ -14,6 +13,7 @@ pyarrow>=15.0.0 PyICU>=2.10.2 pytest-cov>=3.0.0 regex>=2023.3.23 +ruff>=0.3.3 sentencepiece>=0.1.95 SPARQLWrapper>=2.0.0 sphinx-rtd-theme>=2.0.0 diff --git a/src/scribe_data/extract_transform/emoji_utils.py b/src/scribe_data/extract_transform/emoji_utils.py index bde8e327c..9a510fe2b 100644 --- a/src/scribe_data/extract_transform/emoji_utils.py +++ b/src/scribe_data/extract_transform/emoji_utils.py @@ -1,7 +1,4 @@ """ -Emoji Utilities ---------------- - Module for a function to get emojis we want to filter from suggestions. Contents: diff --git a/src/scribe_data/extract_transform/extract_wiki.py b/src/scribe_data/extract_transform/extract_wiki.py index 0c1814c26..1466fd8de 100644 --- a/src/scribe_data/extract_transform/extract_wiki.py +++ b/src/scribe_data/extract_transform/extract_wiki.py @@ -1,7 +1,4 @@ """ -Extract Wiki ------------- - Module for downloading and creating workable files from Wikipedia dumps. 
Contents: diff --git a/src/scribe_data/extract_transform/languages/English/formatted_data/translated_words.json b/src/scribe_data/extract_transform/languages/English/formatted_data/translated_words.json new file mode 100644 index 000000000..deb33e9e3 --- /dev/null +++ b/src/scribe_data/extract_transform/languages/English/formatted_data/translated_words.json @@ -0,0 +1,849 @@ +[ + { + "after": { + "fr": "Après", + "de": "Nach dem", + "it": "Dopo", + "pt": "Depois de", + "ru": "После", + "es": "Después", + "sv": "Efter" + } + }, + { + "around": { + "fr": "autour", + "de": "Umgeben", + "it": "Intorno", + "pt": "ao redor", + "ru": "вокруг", + "es": "alrededor", + "sv": "Omkring" + } + }, + { + "beside": { + "fr": "à côté", + "de": "Neben", + "it": "accanto", + "pt": "ao lado", + "ru": "рядом", + "es": "Al lado", + "sv": "Bredvid" + } + }, + { + "through": { + "fr": "à travers", + "de": "durch", + "it": "attraverso", + "pt": "Por meio", + "ru": "через", + "es": "A través", + "sv": "genom" + } + }, + { + "past": { + "fr": "Le passé", + "de": "Vergangenheit", + "it": "Il passato", + "pt": "passado", + "ru": "прошлое", + "es": "pasado", + "sv": "förflutna" + } + }, + { + "despite": { + "fr": "Malgré", + "de": "Trotz", + "it": "Nonostante", + "pt": "Apesar de", + "ru": "Несмотря", + "es": "A pesar de", + "sv": "Trots att" + } + }, + { + "always": { + "fr": "toujours", + "de": "immer", + "it": "sempre", + "pt": "sempre", + "ru": "Всегда", + "es": "Siempre", + "sv": "Alltid" + } + }, + { + "orange": { + "fr": "Orange", + "de": "Orange", + "it": "L’arancia", + "pt": "Laranja", + "ru": "Оранжевый", + "es": "Orange", + "sv": "orange" + } + }, + { + "beautiful": { + "fr": "Beaucoup", + "de": "Schönes", + "it": "bella", + "pt": "bonita", + "ru": "Красивый", + "es": "hermosa", + "sv": "vacker" + } + }, + { + "large": { + "fr": "grand grand", + "de": "Große", + "it": "Grande", + "pt": "Grande", + "ru": "Большой", + "es": "Grandes", + "sv": "Stora" + } + }, + { + "serious": { + "fr": 
"sérieux", + "de": "ernst", + "it": "serio", + "pt": "sério", + "ru": "Серьезный", + "es": "serio", + "sv": "allvarligt" + } + }, + { + "bright": { + "fr": "Brille", + "de": "Brille", + "it": "Il brillo", + "pt": "brilhante", + "ru": "Яркий", + "es": "brillo", + "sv": "ljusa" + } + }, + { + "strong": { + "fr": "Forte", + "de": "starke", + "it": "forte", + "pt": "forte", + "ru": "Сильный", + "es": "fuerte", + "sv": "starka" + } + }, + { + "sweet": { + "fr": "doux", + "de": "Süßes", + "it": "dolce", + "pt": "doce", + "ru": "Сладкий", + "es": "dulce", + "sv": "söta" + } + }, + { + "clear": { + "fr": "clair", + "de": "klar", + "it": "chiaro", + "pt": "Claro", + "ru": "Яркий", + "es": "claramente", + "sv": "tydligt" + } + }, + { + "deep": { + "fr": "profondeur", + "de": "tief", + "it": "profondo", + "pt": "profundidade", + "ru": "глубокий", + "es": "profundidad", + "sv": "djupt" + } + }, + { + "different": { + "fr": "Différents", + "de": "unterschiedlich", + "it": "Differenza", + "pt": "Diferentes", + "ru": "Разное", + "es": "diferentes", + "sv": "annorlunda" + } + }, + { + "difficult": { + "fr": "Difficile", + "de": "Schwierig", + "it": "Difficile", + "pt": "Difícil", + "ru": "Трудное", + "es": "difícil", + "sv": "svåra" + } + }, + { + "come": { + "fr": "Venez", + "de": "Kommen", + "it": "Vieni", + "pt": "Venha", + "ru": "Приходите", + "es": "Venga", + "sv": "Kom och" + } + }, + { + "wait": { + "fr": "Attendre", + "de": "Warten", + "it": "Aspetta", + "pt": "Aguardando", + "ru": "ждать", + "es": "Espera", + "sv": "Väntar" + } + }, + { + "except": { + "fr": "Sauf", + "de": "Ausgenommen", + "it": "tranne che", + "pt": "Excepção", + "ru": "за исключением", + "es": "excepto", + "sv": "Förutom" + } + }, + { + "purpose": { + "fr": "Objectif", + "de": "Zweck", + "it": "Obiettivo", + "pt": "Objetivo", + "ru": "Цель", + "es": "Objetivo", + "sv": "syftet" + } + }, + { + "begin": { + "fr": "Début", + "de": "beginnt", + "it": "Iniziamo", + "pt": "Começando", + "ru": "Начало", + 
"es": "Inicio", + "sv": "Börja" + } + }, + { + "throw": { + "fr": "Jouer", + "de": "Schießen", + "it": "lancio", + "pt": "lança", + "ru": "Стрельба", + "es": "lanzar", + "sv": "kastar" + } + }, + { + "teach": { + "fr": "enseignant", + "de": "Lehre", + "it": "insegnare", + "pt": "Ensino", + "ru": "Учитель", + "es": "enseñar", + "sv": "Lär dig" + } + }, + { + "slope": { + "fr": "Le Slope", + "de": "Schlange", + "it": "di Slope", + "pt": "Limpeza", + "ru": "Слайд", + "es": "El Slope", + "sv": "Slippa" + } + }, + { + "smash": { + "fr": "Smoothie", + "de": "Schmutz", + "it": "di smash", + "pt": "Mãe Smash", + "ru": "Смаш", + "es": "El Smash", + "sv": "Smash" + } + }, + { + "smile": { + "fr": "sourire", + "de": "Lächeln", + "it": "Il sorriso", + "pt": "sorriso", + "ru": "улыбка", + "es": "sonrisas", + "sv": "Ett leende" + } + }, + { + "liberate": { + "fr": "libérée", + "de": "Befreiung", + "it": "Liberato", + "pt": "Libertação", + "ru": "освобождать", + "es": "Liberación", + "sv": "befriade" + } + }, + { + "rate": { + "fr": "taux", + "de": "Rate", + "it": "Tasso", + "pt": "Taxa", + "ru": "Уровень", + "es": "La tasa", + "sv": "Rättigheter" + } + }, + { + "point": { + "fr": "point", + "de": "Punkt", + "it": "Il punto", + "pt": "ponto", + "ru": "Точка", + "es": "punto", + "sv": "Poäng" + } + }, + { + "print": { + "fr": "Printé", + "de": "Druck", + "it": "stampa", + "pt": "Impressão", + "ru": "Принтер", + "es": "impresión", + "sv": "tryck" + } + }, + { + "bar": { + "fr": "Bar", + "de": "Bar", + "it": "Bar", + "pt": "Bar", + "ru": "Бар", + "es": "bar", + "sv": "Bar" + } + }, + { + "break": { + "fr": "La pause", + "de": "Pause", + "it": "pausa", + "pt": "Pausa", + "ru": "Перерыв", + "es": "La pausa", + "sv": "Avbrott" + } + }, + { + "call": { + "fr": "Appel", + "de": "Anrufe", + "it": "Chiamate", + "pt": "Chamado", + "ru": "Звонок", + "es": "llamadas", + "sv": "ringer" + } + }, + { + "initiate": { + "fr": "Initiation", + "de": "Initiieren", + "it": "iniziare", + "pt": 
"Iniciação", + "ru": "Инициативы", + "es": "Inicio", + "sv": "initiera" + } + }, + { + "contribute": { + "fr": "Contribuer", + "de": "Beiträge", + "it": "contributi", + "pt": "Contribuição", + "ru": "Вклад", + "es": "Contribución", + "sv": "Bidrag" + } + }, + { + "test": { + "fr": "Tests", + "de": "Test", + "it": "Il test", + "pt": "Testes", + "ru": "Тест", + "es": "Testes", + "sv": "Testning" + } + }, + { + "deal": { + "fr": "Accord", + "de": "Vereinbarung", + "it": "Accordo", + "pt": "acordo", + "ru": "Договор", + "es": "Acuerdo", + "sv": "Avtal" + } + }, + { + "dine": { + "fr": "Dîne", + "de": "Ihre", + "it": "Il tuo", + "pt": "Dinheiro", + "ru": "Тёни", + "es": "Tiene", + "sv": "Dina" + } + }, + { + "meet": { + "fr": "Rencontre", + "de": "Treffen", + "it": "Incontrare", + "pt": "Encontro", + "ru": "Встреча", + "es": "Encuentro", + "sv": "möter" + } + }, + { + "area": { + "fr": "La zone", + "de": "Region", + "it": "Regione", + "pt": "Área", + "ru": "Область", + "es": "Área", + "sv": "Område" + } + }, + { + "today": { + "fr": "Aujourd’hui", + "de": "Heute", + "it": "Oggi", + "pt": "Hoje em dia", + "ru": "Сегодня", + "es": "Hoy hoy", + "sv": "i dag" + } + }, + { + "wear": { + "fr": "Porter", + "de": "Kleidung", + "it": "indossare", + "pt": "Usando", + "ru": "носить", + "es": "Usar", + "sv": "bära" + } + }, + { + "wave": { + "fr": "La vague", + "de": "Wellen", + "it": "Il Wave", + "pt": "A onda", + "ru": "Волна", + "es": "Las ondas", + "sv": "våg" + } + }, + { + "wind": { + "fr": "Le vent", + "de": "Wind", + "it": "Il vento", + "pt": "Vento", + "ru": "Ветер", + "es": "El viento", + "sv": "Vinden" + } + }, + { + "floor": { + "fr": "Le sol", + "de": "Boden", + "it": "Il pavimento", + "pt": "piso", + "ru": "Поверхность", + "es": "El piso", + "sv": "golv" + } + }, + { + "man": { + "fr": "L’homme", + "de": "Mann", + "it": "uomo", + "pt": "Homem", + "ru": "Человек", + "es": "El hombre", + "sv": "Människa" + } + }, + { + "word": { + "fr": "Paroles", + "de": "Wort", + 
"it": "Parola", + "pt": "Palavra", + "ru": "Слово", + "es": "Palabras", + "sv": "ord" + } + }, + { + "bath": { + "fr": "baignade", + "de": "Bad", + "it": "Il bagno", + "pt": "banho", + "ru": "ванны", + "es": "baño", + "sv": "Badrum" + } + }, + { + "bear": { + "fr": "Les ours", + "de": "Bären", + "it": "Il miele", + "pt": "A Beira", + "ru": "Медведь", + "es": "El Bear", + "sv": "Björn" + } + }, + { + "bell": { + "fr": "Bélon", + "de": "Bellen", + "it": "di Bell", + "pt": "Bela", + "ru": "Белл", + "es": "Bellas", + "sv": "Klocka" + } + }, + { + "tooth": { + "fr": "Les dents", + "de": "Zähne", + "it": "Il dente", + "pt": "Dentes", + "ru": "Зуб", + "es": "Dientes", + "sv": "tänder" + } + }, + { + "thumb": { + "fr": "Tumeur", + "de": "Dumm", + "it": "Peccato", + "pt": "Duma", + "ru": "Дюм", + "es": "Tumba", + "sv": "Dumma" + } + }, + { + "lightning": { + "fr": "éclairage", + "de": "Leuchten", + "it": "illuminazione", + "pt": "Iluminação", + "ru": "светильник", + "es": "Iluminación", + "sv": "Ljuset" + } + }, + { + "thunder": { + "fr": "Thunder", + "de": "Thunder", + "it": "di Thunder", + "pt": "Tandem", + "ru": "Тондер", + "es": "El Thunder", + "sv": "Thunder" + } + }, + { + "ticket": { + "fr": "Les billets", + "de": "Tickets", + "it": "biglietto", + "pt": "Bilhete", + "ru": "Билеты", + "es": "Título", + "sv": "Biljett" + } + }, + { + "tray": { + "fr": "Trois", + "de": "Dreie", + "it": "Trai", + "pt": "Três", + "ru": "Трей", + "es": "Trio", + "sv": "Tray" + } + }, + { + "tree": { + "fr": "Arbre", + "de": "Bäume", + "it": "albero", + "pt": "Árvore", + "ru": "Деревья", + "es": "árboles", + "sv": "Träd" + } + }, + { + "salt": { + "fr": "Le sel", + "de": "Salz", + "it": "Il sale", + "pt": "Sal", + "ru": "Соль", + "es": "Sal", + "sv": "salt" + } + }, + { + "secretary": { + "fr": "Secrétaire", + "de": "Sekretär", + "it": "Segretario", + "pt": "Secretário", + "ru": "Секретарь", + "es": "Secretario", + "sv": "Sekreterare" + } + }, + { + "shelf": { + "fr": "Shelleau", + "de": 
"Schiff", + "it": "Il Shell", + "pt": "Shelby", + "ru": "Шелф", + "es": "El shelf", + "sv": "Shelby" + } + }, + { + "hat": { + "fr": "Chapeau", + "de": "Hatten", + "it": "Il cappello", + "pt": "Cabeça", + "ru": "Шапка", + "es": "Caballero", + "sv": "Hatten" + } + }, + { + "dress": { + "fr": "Vêtements", + "de": "Kleidung", + "it": "vestito", + "pt": "vestido", + "ru": "Одежда", + "es": "El vestido", + "sv": "Klänning" + } + }, + { + "daughter": { + "fr": "Fille", + "de": "Tochter", + "it": "figlia", + "pt": "Filha", + "ru": "дочь", + "es": "hija", + "sv": "dotter" + } + }, + { + "son": { + "fr": "Fils", + "de": "Sohn", + "it": "Figlio", + "pt": "Filho", + "ru": "Сын", + "es": "hijo", + "sv": "Sonen" + } + }, + { + "soup": { + "fr": "soupe", + "de": "Suppe", + "it": "La zuppa", + "pt": "Sopa", + "ru": "Суп", + "es": "Sopa", + "sv": "Sopp" + } + }, + { + "space": { + "fr": "Espace", + "de": "Raum", + "it": "Spazio", + "pt": "Espaço", + "ru": "пространство", + "es": "Espacio", + "sv": "utrymme" + } + }, + { + "car": { + "fr": "voiture", + "de": "Autos", + "it": "auto", + "pt": "carro", + "ru": "Автомобили", + "es": "El coche", + "sv": "Bilen" + } + }, + { + "circle": { + "fr": "Cirque", + "de": "Kreis", + "it": "Circolo", + "pt": "Círculo", + "ru": "Круг", + "es": "Círculo", + "sv": "cirkel" + } + }, + { + "sphere": { + "fr": "sphère", + "de": "Sphäre", + "it": "La sfera", + "pt": "Espécie", + "ru": "сфера", + "es": "Esfera", + "sv": "Sfera" + } + }, + { + "steel": { + "fr": "Acier", + "de": "Stahl", + "it": "Acciaio", + "pt": "Aço", + "ru": "сталь", + "es": "El acero", + "sv": "stål" + } + }, + { + "stomach": { + "fr": "Le ventre", + "de": "Magen", + "it": "dello stomaco", + "pt": "O estômago", + "ru": "желудок", + "es": "El estómago", + "sv": "magen" + } + }, + { + "store": { + "fr": "Boutique", + "de": "Geschäfte", + "it": "negozio", + "pt": "Loja", + "ru": "магазин", + "es": "La tienda", + "sv": "Butiken" + } + }, + { + "range": { + "fr": "Range", + "de": "Range", 
+ "it": "Rango", + "pt": "Rango", + "ru": "Ранги", + "es": "Rango", + "sv": "Range" + } + }, + { + "pig": { + "fr": "Le porc", + "de": "Schweine", + "it": "Il maiale", + "pt": "Porco", + "ru": "Свинья", + "es": "El cerdo", + "sv": "Svin" + } + }, + { + "rice": { + "fr": "Le riz", + "de": "Riesen", + "it": "Il riso", + "pt": "O arroz", + "ru": "Рис", + "es": "El arroz", + "sv": "Rice" + } + } +] diff --git a/src/scribe_data/extract_transform/languages/English/nouns/format_nouns.py b/src/scribe_data/extract_transform/languages/English/nouns/format_nouns.py index 49b8ac07f..502a2f220 100644 --- a/src/scribe_data/extract_transform/languages/English/nouns/format_nouns.py +++ b/src/scribe_data/extract_transform/languages/English/nouns/format_nouns.py @@ -1,36 +1,22 @@ """ -Format Nouns ------------- - -Formats the nouns queried from Wikidata using query_nouns.sparql. +Formats the English nouns queried from Wikidata using query_nouns.sparql. """ import collections -import json import os import sys -LANGUAGE = "English" -QUERIED_DATA_TYPE = "nouns" -QUERIED_DATA_FILE = f"{QUERIED_DATA_TYPE}_queried.json" PATH_TO_SCRIBE_ORG = os.path.dirname(sys.path[0]).split("Scribe-Data")[0] -LANGUAGES_DIR_PATH = ( - f"{PATH_TO_SCRIBE_ORG}/Scribe-Data/src/scribe_data/extract_transform/languages" -) +PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src" +sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC) -file_path = sys.argv[0] +from scribe_data.utils import export_formatted_data, load_queried_data -update_data_in_use = False # check if update_data.py is being used -if f"languages/{LANGUAGE}/{QUERIED_DATA_TYPE}/" not in file_path: - data_path = QUERIED_DATA_FILE -else: - update_data_in_use = True - data_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/{QUERIED_DATA_TYPE}/{QUERIED_DATA_FILE}" - ) +file_path = sys.argv[0] -with open(data_path, encoding="utf-8") as f: - nouns_list = json.load(f) +nouns_list, update_data_in_use, data_path = load_queried_data( + file_path=file_path, 
language="English", data_type="nouns" +) nouns_formatted = {} @@ -94,21 +80,11 @@ nouns_formatted = collections.OrderedDict(sorted(nouns_formatted.items())) -export_path = f"../formatted_data/{QUERIED_DATA_TYPE}.json" -if update_data_in_use: - export_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/formatted_data/{QUERIED_DATA_TYPE}.json" - ) - -with open( - export_path, - "w", - encoding="utf-8", -) as file: - json.dump(nouns_formatted, file, ensure_ascii=False, indent=0) - -print( - f"Wrote file {QUERIED_DATA_TYPE}.json with {len(nouns_formatted):,} {QUERIED_DATA_TYPE}." +export_formatted_data( + formatted_data=nouns_formatted, + update_data_in_use=update_data_in_use, + language="English", + data_type="nouns", ) os.remove(data_path) diff --git a/src/scribe_data/extract_transform/languages/English/translations/translate_words.py b/src/scribe_data/extract_transform/languages/English/translations/translate_words.py new file mode 100644 index 000000000..1efff8aac --- /dev/null +++ b/src/scribe_data/extract_transform/languages/English/translations/translate_words.py @@ -0,0 +1,86 @@ +""" +Translates the English words queried from Wikidata to all other Scribe languages. 
+""" + +import json +import os +import signal + +from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer + + +def translate_words(words_path: str): + with open(words_path, "r", encoding="utf-8") as file: + words_json_data = json.load(file) + + word_list = [] + + for item in words_json_data: + word_list.append(item["word"]) + + model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M") + tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M") + + with open( + "../../../../../scribe_data/resources/language_meta_data.json", + "r", + encoding="utf-8", + ) as file: + lang_json_data = json.load(file) + iso_list = [lang["iso"] for lang in lang_json_data["languages"]] + + target_languages = iso_list + + translations = [] + + if os.path.exists("../formatted_data/translated_words.json"): + with open( + "../formatted_data/translated_words.json", "r", encoding="utf-8" + ) as file: + translations = json.load(file) + + def signal_handler(sig, frame): + print( + "\nThe interrupt signal has been caught and the current progress is being saved..." 
+ ) + with open( + "../formatted_data/translated_words.json", "w", encoding="utf-8" + ) as file: + json.dump(translations, file, ensure_ascii=False, indent=4) + file.write("\n") + + print("The current progress has been saved to the translated_words.json file.") + exit() + + signal.signal(signal.SIGINT, signal_handler) + + for word in word_list[len(translations) :]: + word_translations = {word: {}} + for lang_code in target_languages: + tokenizer.src_lang = "en" + encoded_word = tokenizer(word, return_tensors="pt") + generated_tokens = model.generate( + **encoded_word, forced_bos_token_id=tokenizer.get_lang_id(lang_code) + ) + translated_word = tokenizer.batch_decode( + generated_tokens, skip_special_tokens=True + )[0] + word_translations[word][lang_code] = translated_word + + translations.append(word_translations) + + with open( + "../formatted_data/translated_words.json", "w", encoding="utf-8" + ) as file: + json.dump(translations, file, ensure_ascii=False, indent=4) + file.write("\n") + + print(f"Translation results for the word '{word}' have been saved.") + + print( + "Translation results for all words are saved to the translated_words.json file." + ) + + +if __name__ == "__main__": + translate_words("words_to_translate.json") diff --git a/src/scribe_data/extract_transform/languages/English/verbs/format_verbs.py b/src/scribe_data/extract_transform/languages/English/verbs/format_verbs.py index 17004ca39..242199227 100644 --- a/src/scribe_data/extract_transform/languages/English/verbs/format_verbs.py +++ b/src/scribe_data/extract_transform/languages/English/verbs/format_verbs.py @@ -1,36 +1,22 @@ """ -Format Verbs ------------- - -Formats the verbs queried from Wikidata using query_verbs.sparql. +Formats the English verbs queried from Wikidata using query_verbs.sparql. 
""" import collections -import json import os import sys -LANGUAGE = "English" -QUERIED_DATA_TYPE = "verbs" -QUERIED_DATA_FILE = f"{QUERIED_DATA_TYPE}_queried.json" PATH_TO_SCRIBE_ORG = os.path.dirname(sys.path[0]).split("Scribe-Data")[0] -LANGUAGES_DIR_PATH = ( - f"{PATH_TO_SCRIBE_ORG}/Scribe-Data/src/scribe_data/extract_transform/languages" -) +PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src" +sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC) -file_path = sys.argv[0] +from scribe_data.utils import export_formatted_data, load_queried_data -update_data_in_use = False # check if update_data.py is being used -if f"languages/{LANGUAGE}/{QUERIED_DATA_TYPE}/" not in file_path: - data_path = QUERIED_DATA_FILE -else: - update_data_in_use = True - data_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/{QUERIED_DATA_TYPE}/{QUERIED_DATA_FILE}" - ) +file_path = sys.argv[0] -with open(data_path, encoding="utf-8") as f: - verbs_list = json.load(f) +verbs_list, update_data_in_use, data_path = load_queried_data( + file_path=file_path, language="English", data_type="verbs" +) verbs_formatted = {} @@ -81,21 +67,11 @@ verbs_formatted = collections.OrderedDict(sorted(verbs_formatted.items())) -export_path = f"../formatted_data/{QUERIED_DATA_TYPE}.json" -if update_data_in_use: - export_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/formatted_data/{QUERIED_DATA_TYPE}.json" - ) - -with open( - export_path, - "w", - encoding="utf-8", -) as file: - json.dump(verbs_formatted, file, ensure_ascii=False, indent=0) - -print( - f"Wrote file {QUERIED_DATA_TYPE}.json with {len(verbs_formatted):,} {QUERIED_DATA_TYPE}." 
+export_formatted_data( + formatted_data=verbs_formatted, + update_data_in_use=update_data_in_use, + language="English", + data_type="verbs", ) os.remove(data_path) diff --git a/src/scribe_data/extract_transform/languages/French/nouns/format_nouns.py b/src/scribe_data/extract_transform/languages/French/nouns/format_nouns.py index c5346618b..569c81ffb 100644 --- a/src/scribe_data/extract_transform/languages/French/nouns/format_nouns.py +++ b/src/scribe_data/extract_transform/languages/French/nouns/format_nouns.py @@ -1,48 +1,22 @@ """ -Format Nouns ------------- - -Formats the nouns queried from Wikidata using query_nouns.sparql. +Formats the French nouns queried from Wikidata using query_nouns.sparql. """ import collections -import json import os import sys -LANGUAGE = "French" -QUERIED_DATA_TYPE = "nouns" -QUERIED_DATA_FILE = f"{QUERIED_DATA_TYPE}_queried.json" PATH_TO_SCRIBE_ORG = os.path.dirname(sys.path[0]).split("Scribe-Data")[0] -LANGUAGES_DIR_PATH = ( - f"{PATH_TO_SCRIBE_ORG}/Scribe-Data/src/scribe_data/extract_transform/languages" -) - -file_path = sys.argv[0] - -update_data_in_use = False # check if update_data.py is being used -if f"languages/{LANGUAGE}/{QUERIED_DATA_TYPE}/" not in file_path: - data_path = QUERIED_DATA_FILE -else: - update_data_in_use = True - data_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/{QUERIED_DATA_TYPE}/{QUERIED_DATA_FILE}" - ) +PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src" +sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC) -with open(data_path, encoding="utf-8") as f: - nouns_list = json.load(f) +from scribe_data.utils import export_formatted_data, load_queried_data, map_genders +file_path = sys.argv[0] -def map_genders(wikidata_gender): - """ - Maps those genders from Wikidata to succinct versions. 
- """ - if wikidata_gender in ["masculine", "Q499327"]: - return "M" - elif wikidata_gender in ["feminine", "Q1775415"]: - return "F" - else: - return "" # nouns could have a gender that is not valid as an attribute +nouns_list, update_data_in_use, data_path = load_queried_data( + file_path=file_path, language="French", data_type="nouns" +) def order_annotations(annotation): @@ -125,21 +99,11 @@ def order_annotations(annotation): nouns_formatted = collections.OrderedDict(sorted(nouns_formatted.items())) -export_path = f"../formatted_data/{QUERIED_DATA_TYPE}.json" -if update_data_in_use: - export_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/formatted_data/{QUERIED_DATA_TYPE}.json" - ) - -with open( - export_path, - "w", - encoding="utf-8", -) as file: - json.dump(nouns_formatted, file, ensure_ascii=False, indent=0) - -print( - f"Wrote file {QUERIED_DATA_TYPE}.json with {len(nouns_formatted):,} {QUERIED_DATA_TYPE}." +export_formatted_data( + formatted_data=nouns_formatted, + update_data_in_use=update_data_in_use, + language="French", + data_type="nouns", ) os.remove(data_path) diff --git a/src/scribe_data/extract_transform/languages/French/verbs/format_verbs.py b/src/scribe_data/extract_transform/languages/French/verbs/format_verbs.py index 7f19437dc..8310504a4 100644 --- a/src/scribe_data/extract_transform/languages/French/verbs/format_verbs.py +++ b/src/scribe_data/extract_transform/languages/French/verbs/format_verbs.py @@ -1,36 +1,22 @@ """ -Format Verbs ------------- - -Formats the verbs queried from Wikidata using query_verbs.sparql. +Formats the French verbs queried from Wikidata using query_verbs.sparql. 
""" import collections -import json import os import sys -LANGUAGE = "French" -QUERIED_DATA_TYPE = "verbs" -QUERIED_DATA_FILE = f"{QUERIED_DATA_TYPE}_queried.json" PATH_TO_SCRIBE_ORG = os.path.dirname(sys.path[0]).split("Scribe-Data")[0] -LANGUAGES_DIR_PATH = ( - f"{PATH_TO_SCRIBE_ORG}/Scribe-Data/src/scribe_data/extract_transform/languages" -) +PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src" +sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC) -file_path = sys.argv[0] +from scribe_data.utils import export_formatted_data, load_queried_data -update_data_in_use = False # check if update_data.py is being used -if f"languages/{LANGUAGE}/{QUERIED_DATA_TYPE}/" not in file_path: - data_path = QUERIED_DATA_FILE -else: - update_data_in_use = True - data_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/{QUERIED_DATA_TYPE}/{QUERIED_DATA_FILE}" - ) +file_path = sys.argv[0] -with open(data_path, encoding="utf-8") as f: - verbs_list = json.load(f) +verbs_list, update_data_in_use, data_path = load_queried_data( + file_path=file_path, language="French", data_type="verbs" +) verbs_formatted = {} @@ -78,21 +64,11 @@ verbs_formatted = collections.OrderedDict(sorted(verbs_formatted.items())) -export_path = f"../formatted_data/{QUERIED_DATA_TYPE}.json" -if update_data_in_use: - export_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/formatted_data/{QUERIED_DATA_TYPE}.json" - ) - -with open( - export_path, - "w", - encoding="utf-8", -) as file: - json.dump(verbs_formatted, file, ensure_ascii=False, indent=0) - -print( - f"Wrote file {QUERIED_DATA_TYPE}.json with {len(verbs_formatted):,} {QUERIED_DATA_TYPE}." 
+export_formatted_data( + formatted_data=verbs_formatted, + update_data_in_use=update_data_in_use, + language="French", + data_type="verbs", ) os.remove(data_path) diff --git a/src/scribe_data/extract_transform/languages/German/nouns/format_nouns.py b/src/scribe_data/extract_transform/languages/German/nouns/format_nouns.py index 3dd8a2f1a..a28c8700c 100644 --- a/src/scribe_data/extract_transform/languages/German/nouns/format_nouns.py +++ b/src/scribe_data/extract_transform/languages/German/nouns/format_nouns.py @@ -1,55 +1,22 @@ """ -Format Nouns ------------- - -Formats the nouns queried from Wikidata using query_nouns.sparql. +Formats the German nouns queried from Wikidata using query_nouns.sparql. """ import collections -import json import os import sys -LANGUAGE = "German" -QUERIED_DATA_TYPE = "nouns" -QUERIED_DATA_FILE = f"{QUERIED_DATA_TYPE}_queried.json" PATH_TO_SCRIBE_ORG = os.path.dirname(sys.path[0]).split("Scribe-Data")[0] -LANGUAGES_DIR_PATH = ( - f"{PATH_TO_SCRIBE_ORG}/Scribe-Data/src/scribe_data/extract_transform/languages" -) - -file_path = sys.argv[0] - -update_data_in_use = False # check if update_data.py is being used -if f"languages/{LANGUAGE}/{QUERIED_DATA_TYPE}/" not in file_path: - data_path = QUERIED_DATA_FILE -else: - update_data_in_use = True - data_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/{QUERIED_DATA_TYPE}/{QUERIED_DATA_FILE}" - ) - -with open(data_path, encoding="utf-8") as f: - nouns_list = json.load(f) +PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src" +sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC) +from scribe_data.utils import export_formatted_data, load_queried_data, map_genders -def map_genders(wikidata_gender): - """ - Maps those genders from Wikidata to succinct versions. 
+file_path = sys.argv[0] - Parameters - ---------- - wikidata_gender : str - The gender of the noun that was queried from WikiData - """ - if wikidata_gender in ["masculine", "Q499327"]: - return "M" - elif wikidata_gender in ["feminine", "Q1775415"]: - return "F" - elif wikidata_gender in ["neuter", "Q1775461"]: - return "N" - else: - return "" # nouns could have a gender that is not valid as an attribute +nouns_list, update_data_in_use, data_path = load_queried_data( + file_path=file_path, language="German", data_type="nouns" +) def order_annotations(annotation): @@ -194,21 +161,11 @@ def order_annotations(annotation): nouns_formatted = collections.OrderedDict(sorted(nouns_formatted.items())) -export_path = f"../formatted_data/{QUERIED_DATA_TYPE}.json" -if update_data_in_use: - export_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/formatted_data/{QUERIED_DATA_TYPE}.json" - ) - -with open( - export_path, - "w", - encoding="utf-8", -) as file: - json.dump(nouns_formatted, file, ensure_ascii=False, indent=0) - -print( - f"Wrote file {QUERIED_DATA_TYPE}.json with {len(nouns_formatted):,} {QUERIED_DATA_TYPE}." +export_formatted_data( + formatted_data=nouns_formatted, + update_data_in_use=update_data_in_use, + language="German", + data_type="nouns", ) os.remove(data_path) diff --git a/src/scribe_data/extract_transform/languages/German/prepositions/format_prepositions.py b/src/scribe_data/extract_transform/languages/German/prepositions/format_prepositions.py index 85332f7ed..69486dcb7 100644 --- a/src/scribe_data/extract_transform/languages/German/prepositions/format_prepositions.py +++ b/src/scribe_data/extract_transform/languages/German/prepositions/format_prepositions.py @@ -1,36 +1,22 @@ """ -Format Prepositions -------------------- - -Formats the prepositions queried from Wikidata using query_prepositions.sparql. +Formats the German prepositions queried from Wikidata using query_prepositions.sparql. 
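Throughout this diff the per-file `map_genders` helpers are deleted in favor of a single import from `scribe_data.utils`. The shared implementation is not shown in the patch; the sketch below merges the branches of the deleted German/Russian (M/F/N) and Swedish (C/N) variants, and is an assumption about what the consolidated helper looks like, not the actual utils code:

```python
def map_genders(wikidata_gender):
    """
    Maps genders from Wikidata to succinct versions.

    Parameters
    ----------
    wikidata_gender : str
        The gender of the noun that was queried from WikiData.
    """
    if wikidata_gender in ["masculine", "Q499327"]:
        return "M"
    elif wikidata_gender in ["feminine", "Q1775415"]:
        return "F"
    elif wikidata_gender in ["common gender", "Q1305037"]:
        return "C"
    elif wikidata_gender in ["neuter", "Q1775461"]:
        return "N"
    else:
        return ""  # nouns could have a gender that is not valid as an attribute
```

Accepting either the label or the Wikidata QID keeps the helper usable whether the SPARQL query returns a resolved label or a raw entity ID.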
""" import collections -import json import os import sys -LANGUAGE = "German" -QUERIED_DATA_TYPE = "prepositions" -QUERIED_DATA_FILE = f"{QUERIED_DATA_TYPE}_queried.json" PATH_TO_SCRIBE_ORG = os.path.dirname(sys.path[0]).split("Scribe-Data")[0] -LANGUAGES_DIR_PATH = ( - f"{PATH_TO_SCRIBE_ORG}/Scribe-Data/src/scribe_data/extract_transform/languages" -) +PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src" +sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC) -file_path = sys.argv[0] +from scribe_data.utils import export_formatted_data, load_queried_data -update_data_in_use = False # check if update_data.py is being used -if f"languages/{LANGUAGE}/{QUERIED_DATA_TYPE}/" not in file_path: - data_path = QUERIED_DATA_FILE -else: - update_data_in_use = True - data_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/{QUERIED_DATA_TYPE}/{QUERIED_DATA_FILE}" - ) +file_path = sys.argv[0] -with open(data_path, encoding="utf-8") as f: - prepositions_list = json.load(f) +prepositions_list, update_data_in_use, data_path = load_queried_data( + file_path=file_path, language="German", data_type="prepositions" +) def convert_cases(case): @@ -119,21 +105,11 @@ def order_annotations(annotation): prepositions_formatted = collections.OrderedDict(sorted(prepositions_formatted.items())) -export_path = f"../formatted_data/{QUERIED_DATA_TYPE}.json" -if update_data_in_use: - export_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/formatted_data/{QUERIED_DATA_TYPE}.json" - ) - -with open( - export_path, - "w", - encoding="utf-8", -) as file: - json.dump(prepositions_formatted, file, ensure_ascii=False, indent=0) - -print( - f"Wrote file {QUERIED_DATA_TYPE}.json with {len(prepositions_formatted):,} {QUERIED_DATA_TYPE}." 
+export_formatted_data( + formatted_data=prepositions_formatted, + update_data_in_use=update_data_in_use, + language="German", + data_type="prepositions", ) os.remove(data_path) diff --git a/src/scribe_data/extract_transform/languages/German/verbs/format_verbs.py b/src/scribe_data/extract_transform/languages/German/verbs/format_verbs.py index b5c20a0a5..042170bdc 100644 --- a/src/scribe_data/extract_transform/languages/German/verbs/format_verbs.py +++ b/src/scribe_data/extract_transform/languages/German/verbs/format_verbs.py @@ -1,8 +1,5 @@ """ -Format Verbs ------------- - -Formats the verbs queried from Wikidata using query_verbs.sparql. +Formats the German verbs queried from Wikidata using query_verbs.sparql. Attn: The formatting in the file is significantly more complex than for other verbs. - We have two queries: query_verbs_1 and query_verbs_2. @@ -11,31 +8,20 @@ """ import collections -import json import os import sys -LANGUAGE = "German" -QUERIED_DATA_TYPE = "verbs" -QUERIED_DATA_FILE = f"{QUERIED_DATA_TYPE}_queried.json" PATH_TO_SCRIBE_ORG = os.path.dirname(sys.path[0]).split("Scribe-Data")[0] -LANGUAGES_DIR_PATH = ( - f"{PATH_TO_SCRIBE_ORG}/Scribe-Data/src/scribe_data/extract_transform/languages" -) +PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src" +sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC) -file_path = sys.argv[0] +from scribe_data.utils import export_formatted_data, load_queried_data -update_data_in_use = False # check if update_data.py is being used -if f"languages/{LANGUAGE}/{QUERIED_DATA_TYPE}/" not in file_path: - data_path = QUERIED_DATA_FILE -else: - update_data_in_use = True - data_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/{QUERIED_DATA_TYPE}/{QUERIED_DATA_FILE}" - ) +file_path = sys.argv[0] -with open(data_path, encoding="utf-8") as f: - verbs_list = json.load(f) +verbs_list, update_data_in_use, data_path = load_queried_data( + file_path=file_path, language="German", data_type="verbs" +) verbs_formatted = {} @@ -157,21 
+143,11 @@ def assign_past_participle(verb, tense): verbs_formatted = collections.OrderedDict(sorted(verbs_formatted.items())) -export_path = f"../formatted_data/{QUERIED_DATA_TYPE}.json" -if update_data_in_use: - export_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/formatted_data/{QUERIED_DATA_TYPE}.json" - ) - -with open( - export_path, - "w", - encoding="utf-8", -) as file: - json.dump(verbs_formatted, file, ensure_ascii=False, indent=0) - -print( - f"Wrote file {QUERIED_DATA_TYPE}.json with {len(verbs_formatted):,} {QUERIED_DATA_TYPE}." +export_formatted_data( + formatted_data=verbs_formatted, + update_data_in_use=update_data_in_use, + language="German", + data_type="verbs", ) os.remove(data_path) diff --git a/src/scribe_data/extract_transform/languages/Greek/nouns/query_nouns.sparql b/src/scribe_data/extract_transform/languages/Greek/nouns/query_nouns.sparql new file mode 100644 index 000000000..b4b125ea9 --- /dev/null +++ b/src/scribe_data/extract_transform/languages/Greek/nouns/query_nouns.sparql @@ -0,0 +1,39 @@ +# All Greek (Q36510) nouns, their plural and their gender. +# Enter this query at https://query.wikidata.org/. + +SELECT DISTINCT ?singular ?plural ?gender WHERE { + + # Nouns and pronouns. + VALUES ?nounTypes { wd:Q1084 wd:Q147276 } + ?lexeme a ontolex:LexicalEntry ; + dct:language wd:Q36510; + wikibase:lexicalCategory ?noun . + FILTER(?noun = ?nounTypes) + + # Optional selection of singular forms. + OPTIONAL { + ?lexeme ontolex:lexicalForm ?singularForm . + ?singularForm ontolex:representation ?singular ; + wikibase:grammaticalFeature wd:Q131105 ; + wikibase:grammaticalFeature wd:Q110786 ; + } . + + # Optional selection of plural forms. + OPTIONAL { + ?lexeme ontolex:lexicalForm ?pluralForm . + ?pluralForm ontolex:representation ?plural ; + wikibase:grammaticalFeature wd:Q131105 ; + wikibase:grammaticalFeature wd:Q146786 ; + } . + + # Optional selection of genders. + OPTIONAL { + ?lexeme wdt:P5185 ?nounGender . 
+ FILTER NOT EXISTS { ?lexeme wdt:P31 wd:Q48277} + } . + + SERVICE wikibase:label { + bd:serviceParam wikibase:language "[AUTO_LANGUAGE]". + ?nounGender rdfs:label ?gender . + } +} diff --git a/src/scribe_data/extract_transform/languages/Italian/nouns/format_nouns.py b/src/scribe_data/extract_transform/languages/Italian/nouns/format_nouns.py index 23f449591..7c6186167 100644 --- a/src/scribe_data/extract_transform/languages/Italian/nouns/format_nouns.py +++ b/src/scribe_data/extract_transform/languages/Italian/nouns/format_nouns.py @@ -1,48 +1,22 @@ """ -Format Nouns ------------- - -Formats the nouns queried from Wikidata using query_nouns.sparql. +Formats the Italian nouns queried from Wikidata using query_nouns.sparql. """ import collections -import json import os import sys -LANGUAGE = "Italian" -QUERIED_DATA_TYPE = "nouns" -QUERIED_DATA_FILE = f"{QUERIED_DATA_TYPE}_queried.json" PATH_TO_SCRIBE_ORG = os.path.dirname(sys.path[0]).split("Scribe-Data")[0] -LANGUAGES_DIR_PATH = ( - f"{PATH_TO_SCRIBE_ORG}/Scribe-Data/src/scribe_data/extract_transform/languages" -) - -file_path = sys.argv[0] - -update_data_in_use = False # check if update_data.py is being used -if f"languages/{LANGUAGE}/{QUERIED_DATA_TYPE}/" not in file_path: - data_path = QUERIED_DATA_FILE -else: - update_data_in_use = True - data_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/{QUERIED_DATA_TYPE}/{QUERIED_DATA_FILE}" - ) +PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src" +sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC) -with open(data_path, encoding="utf-8") as f: - nouns_list = json.load(f) +from scribe_data.utils import export_formatted_data, load_queried_data, map_genders +file_path = sys.argv[0] -def map_genders(wikidata_gender): - """ - Maps those genders from Wikidata to succinct versions. 
- """ - if wikidata_gender in ["masculine", "Q499327"]: - return "M" - elif wikidata_gender in ["feminine", "Q1775415"]: - return "F" - else: - return "" # nouns could have a gender that is not valid as an attribute +nouns_list, update_data_in_use, data_path = load_queried_data( + file_path=file_path, language="Italian", data_type="nouns" +) def order_annotations(annotation): @@ -126,21 +100,11 @@ def order_annotations(annotation): nouns_formatted = collections.OrderedDict(sorted(nouns_formatted.items())) -export_path = f"../formatted_data/{QUERIED_DATA_TYPE}.json" -if update_data_in_use: - export_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/formatted_data/{QUERIED_DATA_TYPE}.json" - ) - -with open( - export_path, - "w", - encoding="utf-8", -) as file: - json.dump(nouns_formatted, file, ensure_ascii=False, indent=0) - -print( - f"Wrote file {QUERIED_DATA_TYPE}.json with {len(nouns_formatted):,} {QUERIED_DATA_TYPE}." +export_formatted_data( + formatted_data=nouns_formatted, + update_data_in_use=update_data_in_use, + language="Italian", + data_type="nouns", ) os.remove(data_path) diff --git a/src/scribe_data/extract_transform/languages/Italian/verbs/format_verbs.py b/src/scribe_data/extract_transform/languages/Italian/verbs/format_verbs.py index 53585db4b..6b905c212 100644 --- a/src/scribe_data/extract_transform/languages/Italian/verbs/format_verbs.py +++ b/src/scribe_data/extract_transform/languages/Italian/verbs/format_verbs.py @@ -1,36 +1,22 @@ """ -Format Verbs ------------- - -Formats the verbs queried from Wikidata using query_verbs.sparql. +Formats the Italian verbs queried from Wikidata using query_verbs.sparql. 
""" import collections -import json import os import sys -LANGUAGE = "Italian" -QUERIED_DATA_TYPE = "verbs" -QUERIED_DATA_FILE = f"{QUERIED_DATA_TYPE}_queried.json" PATH_TO_SCRIBE_ORG = os.path.dirname(sys.path[0]).split("Scribe-Data")[0] -LANGUAGES_DIR_PATH = ( - f"{PATH_TO_SCRIBE_ORG}/Scribe-Data/src/scribe_data/extract_transform/languages" -) +PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src" +sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC) -file_path = sys.argv[0] +from scribe_data.utils import export_formatted_data, load_queried_data -update_data_in_use = False # check if update_data.py is being used -if f"languages/{LANGUAGE}/{QUERIED_DATA_TYPE}/" not in file_path: - data_path = QUERIED_DATA_FILE -else: - update_data_in_use = True - data_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/{QUERIED_DATA_TYPE}/{QUERIED_DATA_FILE}" - ) +file_path = sys.argv[0] -with open(data_path, encoding="utf-8") as f: - verbs_list = json.load(f) +verbs_list, update_data_in_use, data_path = load_queried_data( + file_path=file_path, language="Italian", data_type="verbs" +) verbs_formatted = {} @@ -72,21 +58,11 @@ verbs_formatted = collections.OrderedDict(sorted(verbs_formatted.items())) -export_path = f"../formatted_data/{QUERIED_DATA_TYPE}.json" -if update_data_in_use: - export_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/formatted_data/{QUERIED_DATA_TYPE}.json" - ) - -with open( - export_path, - "w", - encoding="utf-8", -) as file: - json.dump(verbs_formatted, file, ensure_ascii=False, indent=0) - -print( - f"Wrote file {QUERIED_DATA_TYPE}.json with {len(verbs_formatted):,} {QUERIED_DATA_TYPE}." 
+export_formatted_data( + formatted_data=verbs_formatted, + update_data_in_use=update_data_in_use, + language="Italian", + data_type="verbs", ) os.remove(data_path) diff --git a/src/scribe_data/extract_transform/languages/Portuguese/nouns/format_nouns.py b/src/scribe_data/extract_transform/languages/Portuguese/nouns/format_nouns.py index f6b35ecfe..280aeda50 100644 --- a/src/scribe_data/extract_transform/languages/Portuguese/nouns/format_nouns.py +++ b/src/scribe_data/extract_transform/languages/Portuguese/nouns/format_nouns.py @@ -1,48 +1,22 @@ """ -Format Nouns ------------- - -Formats the nouns queried from Wikidata using query_nouns.sparql. +Formats the Portuguese nouns queried from Wikidata using query_nouns.sparql. """ import collections -import json import os import sys -LANGUAGE = "Portuguese" -QUERIED_DATA_TYPE = "nouns" -QUERIED_DATA_FILE = f"{QUERIED_DATA_TYPE}_queried.json" PATH_TO_SCRIBE_ORG = os.path.dirname(sys.path[0]).split("Scribe-Data")[0] -LANGUAGES_DIR_PATH = ( - f"{PATH_TO_SCRIBE_ORG}/Scribe-Data/src/scribe_data/extract_transform/languages" -) - -file_path = sys.argv[0] - -update_data_in_use = False # check if update_data.py is being used -if f"languages/{LANGUAGE}/{QUERIED_DATA_TYPE}/" not in file_path: - data_path = QUERIED_DATA_FILE -else: - update_data_in_use = True - data_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/{QUERIED_DATA_TYPE}/{QUERIED_DATA_FILE}" - ) +PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src" +sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC) -with open(data_path, encoding="utf-8") as f: - nouns_list = json.load(f) +from scribe_data.utils import export_formatted_data, load_queried_data, map_genders +file_path = sys.argv[0] -def map_genders(wikidata_gender): - """ - Maps those genders from Wikidata to succinct versions. 
- """ - if wikidata_gender in ["masculine", "Q499327"]: - return "M" - elif wikidata_gender in ["feminine", "Q1775415"]: - return "F" - else: - return "" # nouns could have a gender that is not valid as an attribute +nouns_list, update_data_in_use, data_path = load_queried_data( + file_path=file_path, language="Portuguese", data_type="nouns" +) def order_annotations(annotation): @@ -126,21 +100,11 @@ def order_annotations(annotation): nouns_formatted = collections.OrderedDict(sorted(nouns_formatted.items())) -export_path = f"../formatted_data/{QUERIED_DATA_TYPE}.json" -if update_data_in_use: - export_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/formatted_data/{QUERIED_DATA_TYPE}.json" - ) - -with open( - export_path, - "w", - encoding="utf-8", -) as file: - json.dump(nouns_formatted, file, ensure_ascii=False, indent=0) - -print( - f"Wrote file {QUERIED_DATA_TYPE}.json with {len(nouns_formatted):,} {QUERIED_DATA_TYPE}." +export_formatted_data( + formatted_data=nouns_formatted, + update_data_in_use=update_data_in_use, + language="Portuguese", + data_type="nouns", ) os.remove(data_path) diff --git a/src/scribe_data/extract_transform/languages/Portuguese/verbs/format_verbs.py b/src/scribe_data/extract_transform/languages/Portuguese/verbs/format_verbs.py index 0a5b2d0b2..285bb7cf0 100644 --- a/src/scribe_data/extract_transform/languages/Portuguese/verbs/format_verbs.py +++ b/src/scribe_data/extract_transform/languages/Portuguese/verbs/format_verbs.py @@ -1,36 +1,22 @@ """ -Format Verbs ------------- - -Formats the verbs queried from Wikidata using query_verbs.sparql. +Formats the Portuguese verbs queried from Wikidata using query_verbs.sparql. 
""" import collections -import json import os import sys -LANGUAGE = "Portuguese" -QUERIED_DATA_TYPE = "verbs" -QUERIED_DATA_FILE = f"{QUERIED_DATA_TYPE}_queried.json" PATH_TO_SCRIBE_ORG = os.path.dirname(sys.path[0]).split("Scribe-Data")[0] -LANGUAGES_DIR_PATH = ( - f"{PATH_TO_SCRIBE_ORG}/Scribe-Data/src/scribe_data/extract_transform/languages" -) +PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src" +sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC) -file_path = sys.argv[0] +from scribe_data.utils import export_formatted_data, load_queried_data -update_data_in_use = False # check if update_data.py is being used -if f"languages/{LANGUAGE}/{QUERIED_DATA_TYPE}/" not in file_path: - data_path = QUERIED_DATA_FILE -else: - update_data_in_use = True - data_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/{QUERIED_DATA_TYPE}/{QUERIED_DATA_FILE}" - ) +file_path = sys.argv[0] -with open(data_path, encoding="utf-8") as f: - verbs_list = json.load(f) +verbs_list, update_data_in_use, data_path = load_queried_data( + file_path=file_path, language="Portuguese", data_type="verbs" +) verbs_formatted = {} @@ -72,21 +58,11 @@ verbs_formatted = collections.OrderedDict(sorted(verbs_formatted.items())) -export_path = f"../formatted_data/{QUERIED_DATA_TYPE}.json" -if update_data_in_use: - export_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/formatted_data/{QUERIED_DATA_TYPE}.json" - ) - -with open( - export_path, - "w", - encoding="utf-8", -) as file: - json.dump(verbs_formatted, file, ensure_ascii=False, indent=0) - -print( - f"Wrote file {QUERIED_DATA_TYPE}.json with {len(verbs_formatted):,} {QUERIED_DATA_TYPE}." 
+export_formatted_data( + formatted_data=verbs_formatted, + update_data_in_use=update_data_in_use, + language="Portuguese", + data_type="verbs", ) os.remove(data_path) diff --git a/src/scribe_data/extract_transform/languages/Russian/nouns/format_nouns.py b/src/scribe_data/extract_transform/languages/Russian/nouns/format_nouns.py index e681de4ec..4c3ae8533 100644 --- a/src/scribe_data/extract_transform/languages/Russian/nouns/format_nouns.py +++ b/src/scribe_data/extract_transform/languages/Russian/nouns/format_nouns.py @@ -1,55 +1,22 @@ """ -Format Nouns ------------- - -Formats the nouns queried from Wikidata using query_nouns.sparql. +Formats the Russian nouns queried from Wikidata using query_nouns.sparql. """ import collections -import json import os import sys -LANGUAGE = "Russian" -QUERIED_DATA_TYPE = "nouns" -QUERIED_DATA_FILE = f"{QUERIED_DATA_TYPE}_queried.json" PATH_TO_SCRIBE_ORG = os.path.dirname(sys.path[0]).split("Scribe-Data")[0] -LANGUAGES_DIR_PATH = ( - f"{PATH_TO_SCRIBE_ORG}/Scribe-Data/src/scribe_data/extract_transform/languages" -) - -file_path = sys.argv[0] - -update_data_in_use = False # check if update_data.py is being used -if f"languages/{LANGUAGE}/{QUERIED_DATA_TYPE}/" not in file_path: - data_path = QUERIED_DATA_FILE -else: - update_data_in_use = True - data_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/{QUERIED_DATA_TYPE}/{QUERIED_DATA_FILE}" - ) - -with open(data_path, encoding="utf-8") as f: - nouns_list = json.load(f) +PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src" +sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC) +from scribe_data.utils import export_formatted_data, load_queried_data, map_genders -def map_genders(wikidata_gender): - """ - Maps those genders from Wikidata to succinct versions. 
+file_path = sys.argv[0] - Parameters - ---------- - wikidata_gender : str - The gender of the noun that was queried from WikiData - """ - if wikidata_gender in ["masculine", "Q499327"]: - return "M" - elif wikidata_gender in ["feminine", "Q1775415"]: - return "F" - elif wikidata_gender in ["neuter", "Q1775461"]: - return "N" - else: - return "" # nouns could have a gender that is not valid as an attribute +nouns_list, update_data_in_use, data_path = load_queried_data( + file_path=file_path, language="Russian", data_type="nouns" +) def order_annotations(annotation): @@ -194,21 +161,11 @@ def order_annotations(annotation): nouns_formatted = collections.OrderedDict(sorted(nouns_formatted.items())) -export_path = f"../formatted_data/{QUERIED_DATA_TYPE}.json" -if update_data_in_use: - export_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/formatted_data/{QUERIED_DATA_TYPE}.json" - ) - -with open( - export_path, - "w", - encoding="utf-8", -) as file: - json.dump(nouns_formatted, file, ensure_ascii=False, indent=0) - -print( - f"Wrote file {QUERIED_DATA_TYPE}.json with {len(nouns_formatted):,} {QUERIED_DATA_TYPE}." +export_formatted_data( + formatted_data=nouns_formatted, + update_data_in_use=update_data_in_use, + language="Russian", + data_type="nouns", ) os.remove(data_path) diff --git a/src/scribe_data/extract_transform/languages/Russian/prepositions/format_prepositions.py b/src/scribe_data/extract_transform/languages/Russian/prepositions/format_prepositions.py index 7a3c78133..4dd10a134 100644 --- a/src/scribe_data/extract_transform/languages/Russian/prepositions/format_prepositions.py +++ b/src/scribe_data/extract_transform/languages/Russian/prepositions/format_prepositions.py @@ -1,36 +1,22 @@ """ -Format Prepositions -------------------- - -Formats the prepositions queried from Wikidata using query_prepositions.sparql. +Formats the Russian prepositions queried from Wikidata using query_prepositions.sparql. 
""" import collections -import json import os import sys -LANGUAGE = "Russian" -QUERIED_DATA_TYPE = "prepositions" -QUERIED_DATA_FILE = f"{QUERIED_DATA_TYPE}_queried.json" PATH_TO_SCRIBE_ORG = os.path.dirname(sys.path[0]).split("Scribe-Data")[0] -LANGUAGES_DIR_PATH = ( - f"{PATH_TO_SCRIBE_ORG}/Scribe-Data/src/scribe_data/extract_transform/languages" -) +PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src" +sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC) -file_path = sys.argv[0] +from scribe_data.utils import export_formatted_data, load_queried_data -update_data_in_use = False # check if update_data.py is being used -if f"languages/{LANGUAGE}/{QUERIED_DATA_TYPE}/" not in file_path: - data_path = QUERIED_DATA_FILE -else: - update_data_in_use = True - data_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/{QUERIED_DATA_TYPE}/{QUERIED_DATA_FILE}" - ) +file_path = sys.argv[0] -with open(data_path, encoding="utf-8") as f: - prepositions_list = json.load(f) +prepositions_list, update_data_in_use, data_path = load_queried_data( + file_path=file_path, language="Russian", data_type="prepositions" +) def convert_cases(case): @@ -91,21 +77,11 @@ def order_annotations(annotation): prepositions_formatted = collections.OrderedDict(sorted(prepositions_formatted.items())) -export_path = f"../formatted_data/{QUERIED_DATA_TYPE}.json" -if update_data_in_use: - export_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/formatted_data/{QUERIED_DATA_TYPE}.json" - ) - -with open( - export_path, - "w", - encoding="utf-8", -) as file: - json.dump(prepositions_formatted, file, ensure_ascii=False, indent=0) - -print( - f"Wrote file {QUERIED_DATA_TYPE}.json with {len(prepositions_formatted):,} {QUERIED_DATA_TYPE}." 
+export_formatted_data( + formatted_data=prepositions_formatted, + update_data_in_use=update_data_in_use, + language="Russian", + data_type="prepositions", ) os.remove(data_path) diff --git a/src/scribe_data/extract_transform/languages/Russian/verbs/format_verbs.py b/src/scribe_data/extract_transform/languages/Russian/verbs/format_verbs.py index c81102886..4c914bee0 100644 --- a/src/scribe_data/extract_transform/languages/Russian/verbs/format_verbs.py +++ b/src/scribe_data/extract_transform/languages/Russian/verbs/format_verbs.py @@ -1,36 +1,22 @@ """ -Format Verbs ------------- - -Formats the verbs queried from Wikidata using query_verbs.sparql. +Formats the Russian verbs queried from Wikidata using query_verbs.sparql. """ import collections -import json import os import sys -LANGUAGE = "Russian" -QUERIED_DATA_TYPE = "verbs" -QUERIED_DATA_FILE = f"{QUERIED_DATA_TYPE}_queried.json" PATH_TO_SCRIBE_ORG = os.path.dirname(sys.path[0]).split("Scribe-Data")[0] -LANGUAGES_DIR_PATH = ( - f"{PATH_TO_SCRIBE_ORG}/Scribe-Data/src/scribe_data/extract_transform/languages" -) +PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src" +sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC) -file_path = sys.argv[0] +from scribe_data.utils import export_formatted_data, load_queried_data -update_data_in_use = False # check if update_data.py is being used -if f"languages/{LANGUAGE}/{QUERIED_DATA_TYPE}/" not in file_path: - data_path = QUERIED_DATA_FILE -else: - update_data_in_use = True - data_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/{QUERIED_DATA_TYPE}/{QUERIED_DATA_FILE}" - ) +file_path = sys.argv[0] -with open(data_path, encoding="utf-8") as f: - verbs_list = json.load(f) +verbs_list, update_data_in_use, data_path = load_queried_data( + file_path=file_path, language="Russian", data_type="verbs" +) verbs_formatted = {} @@ -58,21 +44,11 @@ verbs_formatted = collections.OrderedDict(sorted(verbs_formatted.items())) -export_path = f"../formatted_data/{QUERIED_DATA_TYPE}.json" -if 
update_data_in_use: - export_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/formatted_data/{QUERIED_DATA_TYPE}.json" - ) - -with open( - export_path, - "w", - encoding="utf-8", -) as file: - json.dump(verbs_formatted, file, ensure_ascii=False, indent=0) - -print( - f"Wrote file {QUERIED_DATA_TYPE}.json with {len(verbs_formatted):,} {QUERIED_DATA_TYPE}." +export_formatted_data( + formatted_data=verbs_formatted, + update_data_in_use=update_data_in_use, + language="Russian", + data_type="verbs", ) os.remove(data_path) diff --git a/src/scribe_data/extract_transform/languages/Spanish/nouns/format_nouns.py b/src/scribe_data/extract_transform/languages/Spanish/nouns/format_nouns.py index fd1f898f8..4339fb38a 100644 --- a/src/scribe_data/extract_transform/languages/Spanish/nouns/format_nouns.py +++ b/src/scribe_data/extract_transform/languages/Spanish/nouns/format_nouns.py @@ -1,48 +1,22 @@ """ -Format Nouns ------------- - -Formats the nouns queried from Wikidata using query_nouns.sparql. +Formats the Spanish nouns queried from Wikidata using query_nouns.sparql. 
""" import collections -import json import os import sys -LANGUAGE = "Spanish" -QUERIED_DATA_TYPE = "nouns" -QUERIED_DATA_FILE = f"{QUERIED_DATA_TYPE}_queried.json" PATH_TO_SCRIBE_ORG = os.path.dirname(sys.path[0]).split("Scribe-Data")[0] -LANGUAGES_DIR_PATH = ( - f"{PATH_TO_SCRIBE_ORG}/Scribe-Data/src/scribe_data/extract_transform/languages" -) - -file_path = sys.argv[0] - -update_data_in_use = False # check if update_data.py is being used -if f"languages/{LANGUAGE}/{QUERIED_DATA_TYPE}/" not in file_path: - data_path = QUERIED_DATA_FILE -else: - update_data_in_use = True - data_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/{QUERIED_DATA_TYPE}/{QUERIED_DATA_FILE}" - ) +PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src" +sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC) -with open(data_path, encoding="utf-8") as f: - nouns_list = json.load(f) +from scribe_data.utils import export_formatted_data, load_queried_data, map_genders +file_path = sys.argv[0] -def map_genders(wikidata_gender): - """ - Maps those genders from Wikidata to succinct versions. - """ - if wikidata_gender in ["masculine", "Q499327"]: - return "M" - elif wikidata_gender in ["feminine", "Q1775415"]: - return "F" - else: - return "" # nouns could have a gender that is not valid as an attribute +nouns_list, update_data_in_use, data_path = load_queried_data( + file_path=file_path, language="Spanish", data_type="nouns" +) def order_annotations(annotation): @@ -126,21 +100,11 @@ def order_annotations(annotation): nouns_formatted = collections.OrderedDict(sorted(nouns_formatted.items())) -export_path = f"../formatted_data/{QUERIED_DATA_TYPE}.json" -if update_data_in_use: - export_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/formatted_data/{QUERIED_DATA_TYPE}.json" - ) - -with open( - export_path, - "w", - encoding="utf-8", -) as file: - json.dump(nouns_formatted, file, ensure_ascii=False, indent=0) - -print( - f"Wrote file {QUERIED_DATA_TYPE}.json with {len(nouns_formatted):,} {QUERIED_DATA_TYPE}." 
+export_formatted_data( + formatted_data=nouns_formatted, + update_data_in_use=update_data_in_use, + language="Spanish", + data_type="nouns", ) os.remove(data_path) diff --git a/src/scribe_data/extract_transform/languages/Spanish/verbs/format_verbs.py b/src/scribe_data/extract_transform/languages/Spanish/verbs/format_verbs.py index 4685eb34a..43ede52ea 100644 --- a/src/scribe_data/extract_transform/languages/Spanish/verbs/format_verbs.py +++ b/src/scribe_data/extract_transform/languages/Spanish/verbs/format_verbs.py @@ -1,36 +1,22 @@ """ -Format Verbs ------------- - -Formats the verbs queried from Wikidata using query_verbs.sparql. +Formats the Spanish verbs queried from Wikidata using query_verbs.sparql. """ import collections -import json import os import sys -LANGUAGE = "Spanish" -QUERIED_DATA_TYPE = "verbs" -QUERIED_DATA_FILE = f"{QUERIED_DATA_TYPE}_queried.json" PATH_TO_SCRIBE_ORG = os.path.dirname(sys.path[0]).split("Scribe-Data")[0] -LANGUAGES_DIR_PATH = ( - f"{PATH_TO_SCRIBE_ORG}/Scribe-Data/src/scribe_data/extract_transform/languages" -) +PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src" +sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC) -file_path = sys.argv[0] +from scribe_data.utils import export_formatted_data, load_queried_data -update_data_in_use = False # check if update_data.py is being used -if f"languages/{LANGUAGE}/{QUERIED_DATA_TYPE}/" not in file_path: - data_path = QUERIED_DATA_FILE -else: - update_data_in_use = True - data_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/{QUERIED_DATA_TYPE}/{QUERIED_DATA_FILE}" - ) +file_path = sys.argv[0] -with open(data_path, encoding="utf-8") as f: - verbs_list = json.load(f) +verbs_list, update_data_in_use, data_path = load_queried_data( + file_path=file_path, language="Spanish", data_type="verbs" +) verbs_formatted = {} @@ -72,21 +58,11 @@ verbs_formatted = collections.OrderedDict(sorted(verbs_formatted.items())) -export_path = f"../formatted_data/{QUERIED_DATA_TYPE}.json" -if update_data_in_use: 
- export_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/formatted_data/{QUERIED_DATA_TYPE}.json" - ) - -with open( - export_path, - "w", - encoding="utf-8", -) as file: - json.dump(verbs_formatted, file, ensure_ascii=False, indent=0) - -print( - f"Wrote file {QUERIED_DATA_TYPE}.json with {len(verbs_formatted):,} {QUERIED_DATA_TYPE}." +export_formatted_data( + formatted_data=verbs_formatted, + update_data_in_use=update_data_in_use, + language="Spanish", + data_type="verbs", ) os.remove(data_path) diff --git a/src/scribe_data/extract_transform/languages/Swedish/nouns/format_nouns.py b/src/scribe_data/extract_transform/languages/Swedish/nouns/format_nouns.py index f6d92284b..6fca6254f 100644 --- a/src/scribe_data/extract_transform/languages/Swedish/nouns/format_nouns.py +++ b/src/scribe_data/extract_transform/languages/Swedish/nouns/format_nouns.py @@ -1,48 +1,22 @@ """ -Format Nouns ------------- - -Formats the nouns queried from Wikidata using query_nouns.sparql. +Formats the Swedish nouns queried from Wikidata using query_nouns.sparql. 
""" import collections -import json import os import sys -LANGUAGE = "Swedish" -QUERIED_DATA_TYPE = "nouns" -QUERIED_DATA_FILE = f"{QUERIED_DATA_TYPE}_queried.json" PATH_TO_SCRIBE_ORG = os.path.dirname(sys.path[0]).split("Scribe-Data")[0] -LANGUAGES_DIR_PATH = ( - f"{PATH_TO_SCRIBE_ORG}/Scribe-Data/src/scribe_data/extract_transform/languages" -) - -file_path = sys.argv[0] +PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src" +sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC) -update_data_in_use = False # check if update_data.py is being used -if f"languages/{LANGUAGE}/{QUERIED_DATA_TYPE}/" not in file_path: - data_path = QUERIED_DATA_FILE -else: - update_data_in_use = True - data_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/{QUERIED_DATA_TYPE}/{QUERIED_DATA_FILE}" - ) - -with open(data_path, encoding="utf-8") as f: - nouns_list = json.load(f) +from scribe_data.utils import export_formatted_data, load_queried_data, map_genders +file_path = sys.argv[0] -def map_genders(wikidata_gender): - """ - Maps those genders from Wikidata to succinct versions. - """ - if wikidata_gender in ["common gender", "Q1305037"]: - return "C" - elif wikidata_gender in ["neuter", "Q1775461"]: - return "N" - else: - return "" # nouns could have a gender that is not valid as an attribute +nouns_list, update_data_in_use, data_path = load_queried_data( + file_path=file_path, language="Swedish", data_type="nouns" +) def order_annotations(annotation): @@ -91,9 +65,9 @@ def order_annotations(annotation): # Plural is same as singular. 
else: - nouns_formatted[noun_vals["nominativeSingular"]][ - "plural" - ] = noun_vals["nominativePlural"] + nouns_formatted[noun_vals["nominativeSingular"]]["plural"] = ( + noun_vals["nominativePlural"] + ) nouns_formatted[noun_vals["nominativeSingular"]]["form"] = ( nouns_formatted[noun_vals["nominativeSingular"]]["form"] + "/PL" ) @@ -109,9 +83,9 @@ def order_annotations(annotation): ) elif nouns_formatted[noun_vals["nominativeSingular"]]["gender"] == "": - nouns_formatted[noun_vals["nominativeSingular"]][ - "gender" - ] = map_genders(noun_vals["gender"]) + nouns_formatted[noun_vals["nominativeSingular"]]["gender"] = ( + map_genders(noun_vals["gender"]) + ) elif "genitiveSingular" in noun_vals.keys(): if noun_vals["genitiveSingular"] not in nouns_formatted: @@ -138,9 +112,9 @@ def order_annotations(annotation): # Plural is same as singular. else: - nouns_formatted[noun_vals["genitiveSingular"]][ - "plural" - ] = noun_vals["genitivePlural"] + nouns_formatted[noun_vals["genitiveSingular"]]["plural"] = ( + noun_vals["genitivePlural"] + ) nouns_formatted[noun_vals["genitiveSingular"]]["form"] = ( nouns_formatted[noun_vals["genitiveSingular"]]["form"] + "/PL" ) @@ -156,9 +130,9 @@ def order_annotations(annotation): ) elif nouns_formatted[noun_vals["genitiveSingular"]]["gender"] == "": - nouns_formatted[noun_vals["genitiveSingular"]][ - "gender" - ] = map_genders(noun_vals["gender"]) + nouns_formatted[noun_vals["genitiveSingular"]]["gender"] = ( + map_genders(noun_vals["gender"]) + ) # Plural only noun. elif "nominativePlural" in noun_vals.keys(): @@ -170,9 +144,9 @@ def order_annotations(annotation): # Plural is same as singular. 
else: - nouns_formatted[noun_vals["nominativeSingular"]][ - "nominativePlural" - ] = noun_vals["nominativePlural"] + nouns_formatted[noun_vals["nominativeSingular"]]["nominativePlural"] = ( + noun_vals["nominativePlural"] + ) nouns_formatted[noun_vals["nominativeSingular"]]["form"] = ( nouns_formatted[noun_vals["nominativeSingular"]]["form"] + "/PL" ) @@ -187,9 +161,9 @@ def order_annotations(annotation): # Plural is same as singular. else: - nouns_formatted[noun_vals["genitiveSingular"]][ - "genitivePlural" - ] = noun_vals["genitivePlural"] + nouns_formatted[noun_vals["genitiveSingular"]]["genitivePlural"] = ( + noun_vals["genitivePlural"] + ) nouns_formatted[noun_vals["genitiveSingular"]]["form"] = ( nouns_formatted[noun_vals["genitiveSingular"]]["form"] + "/PL" ) @@ -199,21 +173,11 @@ def order_annotations(annotation): nouns_formatted = collections.OrderedDict(sorted(nouns_formatted.items())) -export_path = f"../formatted_data/{QUERIED_DATA_TYPE}.json" -if update_data_in_use: - export_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/formatted_data/{QUERIED_DATA_TYPE}.json" - ) - -with open( - export_path, - "w", - encoding="utf-8", -) as file: - json.dump(nouns_formatted, file, ensure_ascii=False, indent=0) - -print( - f"Wrote file {QUERIED_DATA_TYPE}.json with {len(nouns_formatted):,} {QUERIED_DATA_TYPE}." +export_formatted_data( + formatted_data=nouns_formatted, + update_data_in_use=update_data_in_use, + language="Swedish", + data_type="nouns", ) os.remove(data_path) diff --git a/src/scribe_data/extract_transform/languages/Swedish/verbs/format_verbs.py b/src/scribe_data/extract_transform/languages/Swedish/verbs/format_verbs.py index c18234096..01767e6e5 100644 --- a/src/scribe_data/extract_transform/languages/Swedish/verbs/format_verbs.py +++ b/src/scribe_data/extract_transform/languages/Swedish/verbs/format_verbs.py @@ -1,36 +1,22 @@ """ -Format Verbs ------------- - -Formats the verbs queried from Wikidata using query_verbs.sparql. 
+Formats the Swedish verbs queried from Wikidata using query_verbs.sparql. """ import collections -import json import os import sys -LANGUAGE = "Swedish" -QUERIED_DATA_TYPE = "verbs" -QUERIED_DATA_FILE = f"{QUERIED_DATA_TYPE}_queried.json" PATH_TO_SCRIBE_ORG = os.path.dirname(sys.path[0]).split("Scribe-Data")[0] -LANGUAGES_DIR_PATH = ( - f"{PATH_TO_SCRIBE_ORG}/Scribe-Data/src/scribe_data/extract_transform/languages" -) +PATH_TO_SCRIBE_DATA_SRC = f"{PATH_TO_SCRIBE_ORG}Scribe-Data/src" +sys.path.insert(0, PATH_TO_SCRIBE_DATA_SRC) -file_path = sys.argv[0] +from scribe_data.utils import export_formatted_data, load_queried_data -update_data_in_use = False # check if update_data.py is being used -if f"languages/{LANGUAGE}/{QUERIED_DATA_TYPE}/" not in file_path: - data_path = QUERIED_DATA_FILE -else: - update_data_in_use = True - data_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/{QUERIED_DATA_TYPE}/{QUERIED_DATA_FILE}" - ) +file_path = sys.argv[0] -with open(data_path, encoding="utf-8") as f: - verbs_list = json.load(f) +verbs_list, update_data_in_use, data_path = load_queried_data( + file_path=file_path, language="Swedish", data_type="verbs" +) verbs_formatted = {} @@ -65,21 +51,11 @@ verbs_formatted = collections.OrderedDict(sorted(verbs_formatted.items())) -export_path = f"../formatted_data/{QUERIED_DATA_TYPE}.json" -if update_data_in_use: - export_path = ( - f"{LANGUAGES_DIR_PATH}/{LANGUAGE}/formatted_data/{QUERIED_DATA_TYPE}.json" - ) - -with open( - export_path, - "w", - encoding="utf-8", -) as file: - json.dump(verbs_formatted, file, ensure_ascii=False, indent=0) - -print( - f"Wrote file {QUERIED_DATA_TYPE}.json with {len(verbs_formatted):,} {QUERIED_DATA_TYPE}." 
+export_formatted_data( + formatted_data=verbs_formatted, + update_data_in_use=update_data_in_use, + language="Swedish", + data_type="verbs", ) os.remove(data_path) diff --git a/src/scribe_data/extract_transform/process_unicode.py b/src/scribe_data/extract_transform/process_unicode.py index 1e8083c3f..5185cd1e2 100644 --- a/src/scribe_data/extract_transform/process_unicode.py +++ b/src/scribe_data/extract_transform/process_unicode.py @@ -1,14 +1,10 @@ """ -Process Unicode ---------------- - Module for processing Unicode based corpuses for autosuggestion and autocompletion generation. Contents: gen_emoji_lexicon """ - import csv import fileinput import json @@ -227,7 +223,7 @@ def gen_emoji_lexicon( path_to_data_table = ( get_path_from_et_dir() - + "/Scribe-Data/src/scribe_data/load/_update_files/data_table.txt" + + "/Scribe-Data/src/scribe_data/load/update_files/data_table.txt" ) for line in fileinput.input(path_to_data_table, inplace=True): @@ -243,7 +239,7 @@ def gen_emoji_lexicon( path_to_total_data = ( get_path_from_et_dir() - + "/Scribe-Data/src/scribe_data/load/_update_files/total_data.json" + + "/Scribe-Data/src/scribe_data/load/update_files/total_data.json" ) with open(path_to_total_data, encoding="utf-8") as f: diff --git a/src/scribe_data/extract_transform/process_wiki.py b/src/scribe_data/extract_transform/process_wiki.py index c29a9cbae..c937d05a6 100644 --- a/src/scribe_data/extract_transform/process_wiki.py +++ b/src/scribe_data/extract_transform/process_wiki.py @@ -1,7 +1,4 @@ """ -Process Wiki ------------- - Module for cleaning Wikipedia based corpuses for autosuggestion generation. Contents: @@ -380,9 +377,7 @@ def gen_autosuggestions( print("Nothing returned by the WDQS server for query_profanity.sparql") else: # Subset the returned JSON and the individual results before saving. 
- query_results = results["results"][ # pylint: disable=unsubscriptable-object - "bindings" - ] + query_results = results["results"]["bindings"] # pylint: disable=unsubscriptable-object for r in query_results: # query_results is also a list r_dict = {k: r[k]["value"] for k in r.keys()} diff --git a/src/scribe_data/extract_transform/translate.py b/src/scribe_data/extract_transform/translate.py index bc1150a6d..538e0f214 100644 --- a/src/scribe_data/extract_transform/translate.py +++ b/src/scribe_data/extract_transform/translate.py @@ -1,7 +1,4 @@ """ -Translate ---------- - Translates the words queried from Wikidata using query_words_to_translate.sparql. Example diff --git a/src/scribe_data/extract_transform/translation/__init__.py b/src/scribe_data/extract_transform/translation/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/src/scribe_data/extract_transform/unicode/__init__.py b/src/scribe_data/extract_transform/unicode/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/src/scribe_data/extract_transform/gen_emoji_lexicon.ipynb b/src/scribe_data/extract_transform/unicode/gen_emoji_lexicon.ipynb similarity index 100% rename from src/scribe_data/extract_transform/gen_emoji_lexicon.ipynb rename to src/scribe_data/extract_transform/unicode/gen_emoji_lexicon.ipynb diff --git a/src/scribe_data/extract_transform/update_data.py b/src/scribe_data/extract_transform/update_data.py index adf7844de..23d274459 100644 --- a/src/scribe_data/extract_transform/update_data.py +++ b/src/scribe_data/extract_transform/update_data.py @@ -1,7 +1,4 @@ """ -Update Data ------------ - Updates data for Scribe by running all or desired WDQS queries and formatting scripts. 
Parameters @@ -37,7 +34,7 @@ SCRIBE_DATA_SRC_PATH = "src/scribe_data" PATH_TO_ET_LANGUAGE_FILES = f"{SCRIBE_DATA_SRC_PATH}/extract_transform/languages" -PATH_TO_UPDATE_FILES = f"{SCRIBE_DATA_SRC_PATH}/load/_update_files" +PATH_TO_UPDATE_FILES = f"{SCRIBE_DATA_SRC_PATH}/load/update_files" # Set SPARQLWrapper query conditions. sparql = SPARQLWrapper("https://query.wikidata.org/sparql") diff --git a/src/scribe_data/extract_transform/update_words_to_translate.py b/src/scribe_data/extract_transform/update_words_to_translate.py index ca2389042..ab12a44d5 100644 --- a/src/scribe_data/extract_transform/update_words_to_translate.py +++ b/src/scribe_data/extract_transform/update_words_to_translate.py @@ -1,7 +1,4 @@ """ -Update Words to Translate -------------------------- - Updates words to translate by running the WDQS query for the given languages. Parameters diff --git a/src/scribe_data/extract_transform/wikidata/__init__.py b/src/scribe_data/extract_transform/wikidata/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/src/scribe_data/extract_transform/wikipedia/__init__.py b/src/scribe_data/extract_transform/wikipedia/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/src/scribe_data/extract_transform/gen_autosuggestions.ipynb b/src/scribe_data/extract_transform/wikipedia/gen_autosuggestions.ipynb similarity index 100% rename from src/scribe_data/extract_transform/gen_autosuggestions.ipynb rename to src/scribe_data/extract_transform/wikipedia/gen_autosuggestions.ipynb diff --git a/src/scribe_data/load/data_to_sqlite.py b/src/scribe_data/load/data_to_sqlite.py index 4e1922437..50adde110 100644 --- a/src/scribe_data/load/data_to_sqlite.py +++ b/src/scribe_data/load/data_to_sqlite.py @@ -1,7 +1,4 @@ """ -Data to SQLite --------------- - Converts all or desired JSON data generated by update_data into SQLite databases. 
 Parameters
@@ -30,7 +27,7 @@

 PATH_TO_ET_FILES = "../extract_transform/"

-with open("_update_files/total_data.json", encoding="utf-8") as f:
+with open("update_files/total_data.json", encoding="utf-8") as f:
     current_data = json.load(f)

 current_languages = list(current_data.keys())
diff --git a/src/scribe_data/load/send_dbs_to_scribe.py b/src/scribe_data/load/send_dbs_to_scribe.py
index afcdf487f..4da7b93a4 100644
--- a/src/scribe_data/load/send_dbs_to_scribe.py
+++ b/src/scribe_data/load/send_dbs_to_scribe.py
@@ -1,7 +1,4 @@
 """
-Send Databases to Scribe
-------------------------
-
 Updates Scribe apps with the SQLite language databases that are found in src/load/databases.

 Example
diff --git a/src/scribe_data/load/_update_files/data_table.txt b/src/scribe_data/load/update_files/data_table.txt
similarity index 100%
rename from src/scribe_data/load/_update_files/data_table.txt
rename to src/scribe_data/load/update_files/data_table.txt
diff --git a/src/scribe_data/load/_update_files/data_updates.txt b/src/scribe_data/load/update_files/data_updates.txt
similarity index 100%
rename from src/scribe_data/load/_update_files/data_updates.txt
rename to src/scribe_data/load/update_files/data_updates.txt
diff --git a/src/scribe_data/load/_update_files/total_data.json b/src/scribe_data/load/update_files/total_data.json
similarity index 100%
rename from src/scribe_data/load/_update_files/total_data.json
rename to src/scribe_data/load/update_files/total_data.json
diff --git a/src/scribe_data/utils.py b/src/scribe_data/utils.py
index de40b9962..b73bb2ab0 100644
--- a/src/scribe_data/utils.py
+++ b/src/scribe_data/utils.py
@@ -12,6 +12,8 @@
     get_language_words_to_ignore,
     get_language_dir_path,
     get_path_from_format_file,
+    load_queried_data,
+    export_formatted_data,
     get_path_from_load_dir,
     get_path_from_et_dir,
     get_ios_data_path,
@@ -21,7 +24,8 @@
     check_and_return_command_line_args,
     translation_interrupt_handler,
     get_target_langcodes,
-    translate_to_other_languages
+    translate_to_other_languages,
+    map_genders
 """

 import ast
@@ -266,6 +270,67 @@ def get_language_dir_path(language):
     return f"{PATH_TO_SCRIBE_ORG}/Scribe-Data/src/scribe_data/extract_transform/languages/{language}"


+def load_queried_data(file_path, language, data_type):
+    """
+    Loads queried data from a JSON file for a specific language and data type.
+
+    Parameters
+    ----------
+    file_path : str
+        The path to the file containing the queried data.
+    language : str
+        The language for which the data is being loaded.
+    data_type : str
+        The type of data being loaded (e.g., 'nouns', 'verbs').
+
+    Returns
+    -------
+    tuple
+        A tuple containing the loaded data, a boolean indicating whether
+        update_data.py is in use, and the path to the data file.
+    """
+    queried_data_file = f"{data_type}_queried.json"
+    update_data_in_use = False
+
+    if f"languages/{language}/{data_type}/" not in file_path:
+        data_path = queried_data_file
+    else:
+        update_data_in_use = True
+        data_path = f"{get_language_dir_path(language)}/{data_type}/{queried_data_file}"
+
+    with open(data_path, encoding="utf-8") as f:
+        return json.load(f), update_data_in_use, data_path
+
+
+def export_formatted_data(formatted_data, update_data_in_use, language, data_type):
+    """
+    Exports formatted data to a JSON file for a specific language and data type.
+
+    Parameters
+    ----------
+    formatted_data : dict
+        The data to be exported.
+    update_data_in_use : bool
+        A flag indicating whether update_data.py is currently in use.
+    language : str
+        The language for which the data is being exported.
+    data_type : str
+        The type of data being exported (e.g., 'nouns', 'verbs').
+
+    Returns
+    -------
+    None
+    """
+    if update_data_in_use:
+        export_path = f"{get_language_dir_path(language)}/formatted_data/{data_type}.json"
+    else:
+        export_path = f"{data_type}.json"
+
+    with open(export_path, "w", encoding="utf-8") as file:
+        json.dump(formatted_data, file, ensure_ascii=False, indent=0)
+        print(f"Wrote file {data_type}.json with {len(formatted_data):,} {data_type}.")
+
+
 def get_path_from_format_file() -> str:
     """
     Returns the directory path from a data formatting file to scribe-org.
@@ -529,3 +594,24 @@ def translate_to_other_languages(source_language, word_list, translations, batch
         json.dump(translations, file, ensure_ascii=False, indent=4)

     print("Translation results for all words are saved to the translated_words.json file.")
+
+
+def map_genders(wikidata_gender):
+    """
+    Maps genders from Wikidata to succinct one-letter versions.
+
+    Parameters
+    ----------
+    wikidata_gender : str
+        The gender of the noun that was queried from Wikidata.
+    """
+    if wikidata_gender in ["masculine", "Q499327"]:
+        return "M"
+    elif wikidata_gender in ["feminine", "Q1775415"]:
+        return "F"
+    elif wikidata_gender in ["common gender", "Q1305037"]:
+        return "C"
+    elif wikidata_gender in ["neuter", "Q1775461"]:
+        return "N"
+    else:
+        return ""  # nouns could have a gender that is not valid as an attribute
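For quick review, the `map_genders` helper moved into `utils.py` can be exercised on its own, since it has no dependencies beyond the standard library. The sketch below copies its mapping logic from the diff (docstring shortened); the sample inputs are the labels and QIDs the function itself handles, plus one arbitrary unhandled label (`"animate"` is illustrative, not taken from the queries):

```python
def map_genders(wikidata_gender):
    """Maps genders from Wikidata to succinct one-letter versions."""
    if wikidata_gender in ["masculine", "Q499327"]:
        return "M"
    elif wikidata_gender in ["feminine", "Q1775415"]:
        return "F"
    elif wikidata_gender in ["common gender", "Q1305037"]:
        return "C"
    elif wikidata_gender in ["neuter", "Q1775461"]:
        return "N"
    else:
        return ""  # genders that aren't valid as attributes map to an empty string


# Both the human-readable label and the corresponding Wikidata QID
# resolve to the same one-letter code.
print(map_genders("common gender"))  # C
print(map_genders("Q1775461"))  # N
print(repr(map_genders("animate")))  # ''
```

Centralizing this table is what lets the per-language `format_nouns.py` scripts above drop their local copies: Swedish needs `C`/`N`, while other languages use `M`/`F` from the same function.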