Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fixes #72: A script teanslates English to other languages #81

Merged
merged 8 commits into from
Mar 17, 2024

Conversation

Linfye
Copy link
Contributor

@Linfye Linfye commented Feb 27, 2024

Contributor checklist


Description

The script can run on Google Colab and that's where I code on. Because running the script will take too much time, the translated_words are only a part of all the words. But it shows the feasibility of the program.

Looking forward to your code review

Related issue

Copy link

github-actions bot commented Feb 27, 2024

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. It'd be great to have you!

Maintainer checklist

  • The commit messages for the remote branch should be checked to make sure the contributor's email is set up correctly so that they receive credit for their contribution

    • The contributor's name and icon in remote commits should be the same as what appears in the PR
    • If there's a mismatch, the contributor needs to make sure that the email they use for GitHub matches what they have for git config user.email in their local Scribe-Data repo
  • The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

@andrewtavis
Copy link
Member

Thanks for this, @Linfye! I'll get back to you with a review soon! Once this is merged you'd be welcome to work on the other languages :)

Copy link
Member

@wkyoshida wkyoshida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome contribution @Linfye 🚀 Thanks so much for the work here!!

I'll let @andrewtavis add his review here as well, but just adding some comments of my own too with some ideas 😉

"sv": "Rice"
}
}
]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:
Add a newline at the end of both files


There are some uncommon issues that can occasionally occur with the absence of an ending newline. I've mostly seen it happen with large files 🙃

@@ -0,0 +1,49 @@
from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does it make sense to leave the __init__.py file here as empty as in other places throughout the repo? The code here could then be moved to a different file, such as a translate_words.py file or something alike perhaps.

Comment on lines 6 to 49
with open('words_to_translate.json', 'r', encoding='utf-8') as file:
json_data = json.load(file)

word_list = []

for item in json_data:
word_list.append(item["word"])

#print(word_list[0])

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

target_languages = ["fr", "de", "it", "pt", "ru", "es", "sv"]

translations = []

if os.path.exists('../formatted_data/translated_words.json'):
with open('../formatted_data/translated_words.json', 'r', encoding='utf-8') as file:
translations = json.load(file)

def signal_handler(sig, frame):
print("\nThe interrupt signal has been caught and the current progress is being saved...")
with open('../formatted_data/translated_words.json', 'w', encoding='utf-8') as file:
json.dump(translations, file, ensure_ascii=False, indent=4)
print("The current progress is saved to the translated_words.json file.")
exit()

signal.signal(signal.SIGINT, signal_handler)

for word in word_list[len(translations):]:
word_translations = {word: {}}
for lang_code in target_languages:
tokenizer.src_lang = "en"
encoded_word = tokenizer(word, return_tensors="pt")
generated_tokens = model.generate(**encoded_word, forced_bos_token_id=tokenizer.get_lang_id(lang_code))
translated_word = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
word_translations[word][lang_code] = translated_word
translations.append(word_translations)
with open('../formatted_data/translated_words.json', 'w', encoding='utf-8') as file:
json.dump(translations, file, ensure_ascii=False, indent=4)
print(f"Translation results for the word '{word}' have been saved.")

print("Translation results for all words are saved to the translated_words.json file.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking that it likely makes sense to put the code within a function. That way it can be more easily callable from elsewhere in the project.

for item in json_data:
word_list.append(item["word"])

#print(word_list[0])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:
We can go ahead and remove commented-out code

model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")

target_languages = ["fr", "de", "it", "pt", "ru", "es", "sv"]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another idea could be to use the utils.py that we have already to get the languages (and then their ISO codes) that we would need to translate to.

That way we won't have to remember to update the list here too whenever a new language is supported 😄

"it": "Il riso",
"pt": "O arroz",
"ru": "Рис",
"es": "El arroz",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There appear to be several instances where the article is also brought in, e.g. El in "El arroz" here.
Is there a way to ignore articles when finding the translations?


I guess - does ignoring articles make sense? CC: @andrewtavis

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could make sense, but maybe we could do a separate issue for it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to include it in here if @Linfye is comfortable with it :)

Copy link
Contributor

@shashank-iitbhu shashank-iitbhu Mar 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This could make sense, but maybe we could do a separate issue for it?

@andrewtavis I already have a branch in my local where i have moved the tranlation functions to utils.py, If you allow i can open a separate PR for this. or can open a PR for #77 russian translation process.
then other contributors can take reference from that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A separate PR for this sounds good, @shashank-iitbhu! From there we can jump to the Russian :) Appreciate you reaching out here!

@Linfye
Copy link
Contributor Author

Linfye commented Mar 4, 2024

I fixed the problems you mentioned except the last one. @wkyoshida I wonder now should I work on it or others works on it now. Looking forward to your reply.
cc @andrewtavis

@shashank-iitbhu
Copy link
Contributor

I fixed the problems you mentioned except the last one. @wkyoshida I wonder now should I work on it or others works on it now. Looking forward to your reply.
cc @andrewtavis

Can you refer to #88 and #89 ? I have implemented a different approach i.e batch processing of words for translation. This way it is relatively faster.

We can decide on a single approach, if the requirement is to iterate over each word rather than batch processing then we can go ahead with this PR.
cc @andrewtavis @wkyoshida

@Linfye
Copy link
Contributor Author

Linfye commented Mar 10, 2024

I fixed the problems you mentioned except the last one. @wkyoshida I wonder now should I work on it or others works on it now. Looking forward to your reply.
cc @andrewtavis

Can you refer to #88 and #89 ? I have implemented a different approach i.e batch processing of words for translation. This way it is relatively faster.

We can decide on a single approach, if the requirement is to iterate over each word rather than batch processing then we can go ahead with this PR. cc @andrewtavis @wkyoshida

I check the code and wonder if continue downloading from last progress cause the words are too much. If it works better, we can adopt yours.

@andrewtavis
Copy link
Member

Sorry for the delay on all of this, all :) I was on vacation and then sick right after... Checked and sent along some formatting in 2460584. I'll bring this in shortly as well as the work that's @shashank-iitbhu mentioned. I'll give it all a test to see how things are working. I'd say batch processing and having the process in the utils makes sense to me 😊

Copy link
Member

@andrewtavis andrewtavis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great first contribution, @Linfye! Very important step in the the new translation process 😊 Let us know if there are other issues you're interested in!

@andrewtavis andrewtavis merged commit ea05b32 into scribe-org:main Mar 17, 2024
1 check passed
@andrewtavis
Copy link
Member

Ah, and a quick note on this: let's be sure to remove as much whitespace from JSON outputs as possible in the future as that does bring the file size down slightly 😊

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants