Add translation funcs to utils #88

shashank-iitbhu
Contributor

Contributor checklist


Description

  • Implemented batch processing for word translation with a batch_size of 100, which makes translation significantly faster.
  • Added four functions: get_language_dir_path, translation_interrupt_handler, get_target_langcodes and translate_to_other_languages.
  • Added docstrings.
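
As a rough illustration of the batching idea in the description, the sketch below splits a word list into chunks of at most 100 and hands each chunk to a translation callable. The names `iter_batches`, `translate_in_batches` and the stand-in `fake_translate_batch` are hypothetical and used only for illustration; they are not the functions added in this PR, and the real code calls a translation model rather than a stand-in.

```python
from typing import Callable, Dict, Iterator, List


def iter_batches(words: List[str], batch_size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of `words` holding at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(words), batch_size):
        yield words[start : start + batch_size]


def translate_in_batches(
    words: List[str],
    translate_batch: Callable[[List[str]], List[str]],
    batch_size: int = 100,
) -> Dict[str, str]:
    """Map each source word to its translation, calling the model once per batch."""
    translations: Dict[str, str] = {}
    for batch in iter_batches(words, batch_size):
        results = translate_batch(batch)  # one call covers the whole batch
        translations.update(zip(batch, results))
    return translations


def fake_translate_batch(batch: List[str]) -> List[str]:
    """Stand-in for a real model call (assumption): upper-cases each word."""
    return [word.upper() for word in batch]
```

With a real model, `translate_batch` would wrap the tokenize, generate and decode steps, so the per-call overhead is paid once per 100 words instead of once per word.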

Related issue

Signed-off-by: Shashank Mittal <[email protected]>

github-actions bot commented Mar 4, 2024

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. It'd be great to have you!

Maintainer checklist

  • The commit messages for the remote branch should be checked to make sure the contributor's email is set up correctly so that they receive credit for their contribution

    • The contributor's name and icon in remote commits should be the same as what appears in the PR
    • If there's a mismatch, the contributor needs to make sure that the email they use for GitHub matches what they have for git config user.email in their local Scribe-Data repo
  • The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

print(f"Translating batch {i//batch_size + 1}: {batch_words}")
for lang_code in get_target_langcodes(source_language):
    tokenizer.src_lang = get_language_iso(source_language)
    encoded_words = tokenizer(batch_words, return_tensors="pt", padding=True)
Contributor Author

Here, the tokenizer encodes the whole batch of words in a single call and returns padded tensors in the format the model expects. Because the model then processes a full batch at once rather than one word at a time, translation is noticeably faster.
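
To make the role of padding=True concrete, here is a minimal pure-Python sketch of what padding a batch means: shorter token sequences are filled up to the length of the longest one, and an attention mask records which positions are real. The function `pad_batch`, the pad id of 0 and right-side padding are assumptions chosen for illustration; this is not the tokenizer's actual implementation, and real tokenizers use their own pad token and padding side.

```python
from typing import Dict, List


def pad_batch(sequences: List[List[int]], pad_id: int = 0) -> Dict[str, List[List[int]]]:
    """Right-pad token id sequences to a common length and build an attention mask."""
    max_len = max((len(seq) for seq in sequences), default=0)
    input_ids: List[List[int]] = []
    attention_mask: List[List[int]] = []
    for seq in sequences:
        n_pad = max_len - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)  # fill with the pad id
        attention_mask.append([1] * len(seq) + [0] * n_pad)  # 1 = real token, 0 = padding
    return {"input_ids": input_ids, "attention_mask": attention_mask}
```

Once every sequence has the same length, the batch can be stacked into one rectangular tensor, which is what lets the model handle all the words in a single forward pass.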

@@ -10,18 +10,23 @@
get_language_from_iso,
Member

Comparing this PR with #89, it appears to me all changes here are already reflected in the other PR.

I am thinking of simply closing this one here and continuing on #89 directly - is that alright? Mostly to avoid making any changes here and then later having to redo/reconcile the same on the other PR.

CC @andrewtavis

Contributor Author

Yeah! The other PR was checked out from this branch, so merging that one makes sense.

Member

Great! Sounds good then - I'll do my review on that other PR 👍

Member

Thanks both! 😊

@wkyoshida wkyoshida closed this Mar 16, 2024
3 participants