Phonetic transcription #395

Open
wants to merge 4 commits into
base: main
springlaughing
Contributor

refinery

  • Tested by creator on refinery
  • Tested by reviewer on refinery
  • Ensured that output of brick conforms with refinery structure (to be checked by reviewer)

API

  • Tested by creator on localhost:8000/docs
  • Tested by reviewer on localhost:8000/docs

common code

  • Common code tested in notebook/ script by creator
  • Common code tested in notebook/ script by reviewer
  • Common code contains docstrings and type hints

additional points:

  • Docstring and README exist
  • Import statements (in __init__.py)
  • (If necessary) Added dependency to requirements.txt
  • (If necessary) Added dependency to issue for refinery env here
  • Published brick to Strapi CMS (locally)

Testing procedure:
When testing in refinery, please ensure that the output of the brick conforms with the structure of refinery.
For extraction bricks, this would be a tuple like ("label", span_start, span_end).
For classification bricks, this would be a string representing a label.
For generator bricks, this would be either a float, integer, string, boolean or a list, depending on the situation.
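
The output shapes listed above can be checked mechanically before a brick goes into refinery. The following is only a minimal sketch: the helper names are hypothetical and are not part of bricks or refinery, and the checks cover only the structure described in this template (for example, they do not say whether span offsets count tokens or characters).

```python
from typing import Any


def is_extraction_output(value: Any) -> bool:
    """True for a tuple shaped like ("label", span_start, span_end)."""
    if not isinstance(value, tuple) or len(value) != 3:
        return False
    label, start, end = value
    # bool is a subclass of int in Python, so exclude it explicitly.
    offsets_are_ints = all(
        isinstance(v, int) and not isinstance(v, bool) for v in (start, end)
    )
    return isinstance(label, str) and offsets_are_ints and start <= end


def is_classification_output(value: Any) -> bool:
    """True for a single string label."""
    return isinstance(value, str)


def is_generator_output(value: Any) -> bool:
    """True for a float, integer, string, boolean or list."""
    return isinstance(value, (float, int, str, bool, list))
```

Running such checks over several different texts, not just one sample project, helps catch outputs that only conform on simple inputs.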

When testing the bricks, avoid relying on a single source of data: use not only the clickbait sample
project, but also different texts with longer or more complex strings.

A small refinery example project with a variation of texts called bricks-test-data-project.zip can be found in the bricks repository.

@springlaughing
Contributor Author

This one implements issue #278.

Hello, I'm trying to make another brick, this time a phonetic transcriptor.
There are some things to note about this one:

  1. In general, Linux or WSL is required (at least for English, due to Flite)
  2. A CEDICT .txt file is required for Chinese

Here are the steps to set up the environment needed to run the package:

Install epitran: pip install epitran
Install jieba: pip install jieba

Get Flite for English:
git clone http://github.com/festvox/flite
cd flite
./configure
make
sudo make install
cd testsuite
make lex_lookup
sudo cp lex_lookup /usr/local/bin

Get CEDICT for Chinese:
Download and unpack the file from https://www.mdbg.net/chinese/dictionary?page=cedict, then pass its path as cedict_path to the phonetic_transcriptor function.
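
To make the setup requirements above easier to verify, here is a rough sketch of a prerequisite check followed by a transcription call. It is not the code from this PR: the function names `missing_prerequisites` and `transcribe` are hypothetical, and the language codes (`eng-Latn`, `cmn-Hans`) and the `cedict_file` argument are taken from the epitran README as I understand it. The check assumes Flite is only needed for English and CEDICT and jieba only for Chinese, as described in this thread.

```python
import importlib.util
import os
import shutil
from typing import List, Optional


def missing_prerequisites(language_code: str, cedict_path: Optional[str] = None) -> List[str]:
    """Return the names of prerequisites that cannot be found on this machine."""
    missing = []
    if importlib.util.find_spec("epitran") is None:
        missing.append("epitran")
    # English transcription relies on Flite's lex_lookup binary being on PATH.
    if language_code == "eng-Latn" and shutil.which("lex_lookup") is None:
        missing.append("lex_lookup")
    # Chinese needs the jieba tokenizer and an unpacked CEDICT text file.
    if language_code.startswith("cmn"):
        if importlib.util.find_spec("jieba") is None:
            missing.append("jieba")
        if not cedict_path or not os.path.isfile(cedict_path):
            missing.append("cedict")
    return missing


def transcribe(text: str, language_code: str = "eng-Latn", cedict_path: Optional[str] = None) -> str:
    """Transcribe text with epitran, failing early if something is not installed."""
    missing = missing_prerequisites(language_code, cedict_path)
    if missing:
        raise RuntimeError("Missing prerequisites: " + ", ".join(missing))
    import epitran  # imported lazily so the check above works without it

    if language_code.startswith("cmn"):
        import jieba

        epi = epitran.Epitran(language_code, cedict_file=cedict_path)
        # Tokenize first, then transcribe token by token.
        return " ".join(epi.transliterate(token) for token in jieba.cut(text))
    return epitran.Epitran(language_code).transliterate(text)
```

Calling `missing_prerequisites("eng-Latn")` on a fresh machine should list whatever still needs to be installed, which may be useful when adding the dependencies to the refinery environment.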

@LeonardPuettmannKern
Contributor

Hi @springlaughing, thank you for the contribution! Code looks good so far, will test more thoroughly, though. As this brick will require some dependencies to be installed, we will most likely wait until the next release to merge this, as our dev team can then also add the requirements to our tool refinery for the bricks integration. Do you know if flite is definitely needed, or if only epitran or jieba are needed for this? :)

@springlaughing
Contributor Author

Yes, Flite is needed for epitran to produce phonetic transcriptions for English. Here is a screenshot from the epitran GitHub page, https://github.com/dmort27/epitran:
image
The same goes for Chinese: CEDICT is needed for epitran to produce phonetic transcriptions for Chinese, as mentioned on the epitran page:
image
Additionally, I have used jieba as the tokenizer for Chinese, but that shouldn't be a problem, as it is a simple dependency to install and is MIT-licensed.
:)


2 participants