fixes #72: A script translates English to other languages #81

Linfye · 2024-02-27T07:27:44Z

Contributor checklist

[] This pull request is on a separate branch and not the main branch

Description

The script runs on Google Colab, which is where I wrote it. Because a full run takes too long, translated_words contains only part of the word list, but it is enough to show that the approach is feasible.

Looking forward to your code review

Related issue

Create English to all other languages translation process #72

github-actions · 2024-02-27T07:28:03Z

Thank you for the pull request!

The Scribe team will do our best to address your contribution as soon as we can. The following is a checklist for maintainers to make sure this process goes as well as possible. Feel free to address the points below yourself in further commits if you realize that actions are needed :)

If you're not already a member of our public Matrix community, please consider joining! We'd suggest using Element as your Matrix client, and definitely join the General and Data rooms once you're in. It'd be great to have you!

Maintainer checklist

- The commit messages for the remote branch should be checked to make sure the contributor's email is set up correctly so that they receive credit for their contribution
  - The contributor's name and icon in remote commits should be the same as what appears in the PR
  - If there's a mismatch, the contributor needs to make sure that the email they use for GitHub matches what they have for git config user.email in their local Scribe-Data repo
- The CHANGELOG has been updated with a description of the changes for the upcoming release and the corresponding issue (if necessary)

andrewtavis · 2024-02-27T08:33:42Z

Thanks for this, @Linfye! I'll get back to you with a review soon! Once this is merged you'd be welcome to work on the other languages :)

wkyoshida

Awesome contribution @Linfye 🚀 Thanks so much for the work here!!

I'll let @andrewtavis add his review here as well, but just adding some comments of my own too with some ideas 😉

wkyoshida · 2024-03-04T01:38:04Z

src/scribe_data/extract_transform/languages/English/formatted_data/translated_words.json

+            "sv": "Rice"
+        }
+    }
+]


nit:
Add a newline at the end of both files

There are some uncommon issues that can occasionally occur with the absence of an ending newline. I've mostly seen it happen with large files 🙃
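One way to address this nit, sketched under the assumption that the files are written with Python's json module as in the script under review: json.dump does not emit a final newline, so it has to be written explicitly. The helper name write_json_with_newline and the sample entry are hypothetical and not part of the PR.

```python
import json
import os
import tempfile


def write_json_with_newline(data, path):
    """Write data as indented JSON and terminate the file with a newline."""
    with open(path, "w", encoding="utf-8") as file:
        json.dump(data, file, ensure_ascii=False, indent=4)
        file.write("\n")  # json.dump itself does not add a final newline


# Small demonstration with a made-up entry and a temporary file.
demo_path = os.path.join(tempfile.mkdtemp(), "translated_words.json")
write_json_with_newline([{"word": {"xx": "translation"}}], demo_path)
with open(demo_path, encoding="utf-8") as file:
    demo_text = file.read()
```

Many editors can also be configured to add the final newline on save, which covers hand-edited files as well.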

wkyoshida · 2024-03-04T01:44:04Z

src/scribe_data/extract_transform/languages/English/translations/__init__.py

@@ -0,0 +1,49 @@
+from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer


Does it make sense to leave the __init__.py file here empty, as in other places throughout the repo? The code could then be moved to a different file, such as translate_words.py or something similar.

wkyoshida · 2024-03-04T01:50:11Z

src/scribe_data/extract_transform/languages/English/translations/__init__.py

+with open('words_to_translate.json', 'r', encoding='utf-8') as file:
+    json_data = json.load(file)
+
+word_list = []
+
+for item in json_data:
+    word_list.append(item["word"])
+
+#print(word_list[0])
+
+model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
+tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
+
+target_languages = ["fr", "de", "it", "pt", "ru", "es", "sv"]
+
+translations = []
+
+if os.path.exists('../formatted_data/translated_words.json'):
+    with open('../formatted_data/translated_words.json', 'r', encoding='utf-8') as file:
+        translations = json.load(file)
+
+def signal_handler(sig, frame):
+    print("\nThe interrupt signal has been caught and the current progress is being saved...")
+    with open('../formatted_data/translated_words.json', 'w', encoding='utf-8') as file:
+        json.dump(translations, file, ensure_ascii=False, indent=4)
+    print("The current progress is saved to the translated_words.json file.")
+    exit()
+
+signal.signal(signal.SIGINT, signal_handler)
+
+for word in word_list[len(translations):]:
+    word_translations = {word: {}}
+    for lang_code in target_languages:
+        tokenizer.src_lang = "en"
+        encoded_word = tokenizer(word, return_tensors="pt")
+        generated_tokens = model.generate(**encoded_word, forced_bos_token_id=tokenizer.get_lang_id(lang_code))
+        translated_word = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)[0]
+        word_translations[word][lang_code] = translated_word
+    translations.append(word_translations)
+    with open('../formatted_data/translated_words.json', 'w', encoding='utf-8') as file:
+        json.dump(translations, file, ensure_ascii=False, indent=4)
+    print(f"Translation results for the word '{word}' have been saved.")
+
+print("Translation results for all words are saved to the translated_words.json file.")


Thinking that it likely makes sense to put the code within a function. That way it can be more easily callable from elsewhere in the project.
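A rough sketch of what that could look like, keeping the resume-and-save behaviour of the diff above. The function name translate_words and its parameters are hypothetical. The model call is passed in as a plain callable so that the sketch runs without downloading M2M100; in the real script that callable would wrap the tokenizer and model.generate calls shown in the diff.

```python
import json
import os
import tempfile


def translate_words(word_list, target_languages, translate, output_path):
    """Translate each word into every target language and save progress.

    `translate` is any callable (word, lang_code) -> str, so the model can be
    swapped out or stubbed. All names here are illustrative.
    """
    translations = []
    # Resume from earlier progress if an output file already exists.
    if os.path.exists(output_path):
        with open(output_path, "r", encoding="utf-8") as file:
            translations = json.load(file)

    for word in word_list[len(translations):]:
        entry = {word: {lang: translate(word, lang) for lang in target_languages}}
        translations.append(entry)
        # Save after every word so an interruption loses at most one entry.
        with open(output_path, "w", encoding="utf-8") as file:
            json.dump(translations, file, ensure_ascii=False, indent=4)

    return translations


def fake_translate(word, lang_code):
    """Stand-in for the M2M100 call in the PR; it only tags the word."""
    return f"{word}-{lang_code}"


out_path = os.path.join(tempfile.mkdtemp(), "translated_words.json")
result = translate_words(["rice", "book"], ["fr", "de"], fake_translate, out_path)
```

Calling translate_words again with a longer word list and the same output path only translates the words that are not yet in the file, which mirrors the word_list[len(translations):] slice in the original script.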

wkyoshida · 2024-03-04T01:51:20Z

src/scribe_data/extract_transform/languages/English/translations/__init__.py

+for item in json_data:
+    word_list.append(item["word"])
+
+#print(word_list[0])


nit:
We can go ahead and remove commented-out code

wkyoshida · 2024-03-04T01:55:00Z

src/scribe_data/extract_transform/languages/English/translations/__init__.py

+model = M2M100ForConditionalGeneration.from_pretrained("facebook/m2m100_418M")
+tokenizer = M2M100Tokenizer.from_pretrained("facebook/m2m100_418M")
+
+target_languages = ["fr", "de", "it", "pt", "ru", "es", "sv"]


Another idea could be to use the utils.py that we have already to get the languages (and then their ISO codes) that we would need to translate to.

That way we won't have to remember to update the list here too whenever a new language is supported 😄
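A minimal sketch of that idea. The contents of utils.py are not shown in this thread, so the mapping LANGUAGE_ISO_CODES and the helper get_target_lang_codes below are hypothetical stand-ins for whatever the repo already exposes; the ISO codes simply mirror the hard-coded list in the diff.

```python
# Hypothetical central mapping; in the project this would come from utils.py.
LANGUAGE_ISO_CODES = {
    "English": "en",
    "French": "fr",
    "German": "de",
    "Italian": "it",
    "Portuguese": "pt",
    "Russian": "ru",
    "Spanish": "es",
    "Swedish": "sv",
}


def get_target_lang_codes(source_language):
    """Return ISO codes of every supported language except the source."""
    return [
        iso
        for language, iso in LANGUAGE_ISO_CODES.items()
        if language != source_language
    ]


target_languages = get_target_lang_codes("English")
```

With this shape, supporting a new language means adding one entry to the shared mapping, and every translation script picks it up.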

wkyoshida · 2024-03-04T01:59:13Z

src/scribe_data/extract_transform/languages/English/formatted_data/translated_words.json

+            "it": "Il riso",
+            "pt": "O arroz",
+            "ru": "Рис",
+            "es": "El arroz",


There appear to be several instances where the article is also brought in, e.g. El in "El arroz" here.
Is there a way to ignore articles when finding the translations?

I guess - does ignoring articles make sense? CC: @andrewtavis

This could make sense, but maybe we could do a separate issue for it?

Happy to include it in here if @Linfye is comfortable with it :)
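For reference, one simple post-processing approach would be to drop a leading article from each translation. This is only a sketch: the ARTICLES lists are illustrative and incomplete, the function name is hypothetical, and elided forms such as French l' are not handled.

```python
# Illustrative, incomplete article lists keyed by ISO code (an assumption,
# not data from the repo).
ARTICLES = {
    "fr": ["le", "la", "les", "un", "une", "des"],
    "de": ["der", "die", "das", "ein", "eine"],
    "it": ["il", "lo", "la", "i", "gli", "le", "un", "uno", "una"],
    "pt": ["o", "a", "os", "as", "um", "uma"],
    "es": ["el", "la", "los", "las", "un", "una"],
    "sv": ["en", "ett"],
}


def strip_leading_article(translation, lang_code):
    """Drop one leading article if the translation has more than one word."""
    parts = translation.split(maxsplit=1)
    if len(parts) == 2 and parts[0].lower() in ARTICLES.get(lang_code, []):
        return parts[1]
    return translation


cleaned = strip_leading_article("El arroz", "es")
```

A word list like this can misfire when the first word is not actually an article (for example Portuguese "a" used as a preposition), so the results would still need spot checks.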

> This could make sense, but maybe we could do a separate issue for it?

@andrewtavis I already have a local branch where I have moved the translation functions to utils.py. If you agree, I can open a separate PR for this, or I can open a PR for #77 (Russian translation process).
Other contributors can then use that as a reference.

A separate PR for this sounds good, @shashank-iitbhu! From there we can jump to the Russian :) Appreciate you reaching out here!

Linfye · 2024-03-04T15:13:51Z

I fixed the problems you mentioned except the last one. @wkyoshida, should I work on that one now, or will someone else take it? Looking forward to your reply.
cc @andrewtavis

shashank-iitbhu · 2024-03-05T05:51:21Z

> I fixed the problems you mentioned except the last one. @wkyoshida, should I work on that one now, or will someone else take it? Looking forward to your reply.
> cc @andrewtavis

Can you refer to #88 and #89? I have implemented a different approach, i.e. batch processing of words for translation, which is relatively faster.

We can decide on a single approach. If the requirement is to iterate over each word rather than batch processing, then we can go ahead with this PR.
cc @andrewtavis @wkyoshida
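To make the difference between the two approaches concrete, here is a hedged sketch of batch translation. It is not the code from #88 or #89, which is not shown in this thread. The helper names and the batch size are assumptions; the M2M100 calls are the same ones used in this PR's diff, applied to a list of words with padding instead of one word at a time. The demo uses a stub so that it runs without downloading the model.

```python
def chunked(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def translate_in_batches(words, lang_code, translate_batch, batch_size=32):
    """Translate `words` batch by batch with a caller-supplied function.

    `translate_batch` is any callable (list_of_words, lang_code) -> list of str.
    """
    results = []
    for batch in chunked(words, batch_size):
        results.extend(translate_batch(batch, lang_code))
    return dict(zip(words, results))


def make_m2m100_batch_translator(model_name="facebook/m2m100_418M"):
    """Build a batch translator backed by M2M100 (needs transformers and torch)."""
    from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

    model = M2M100ForConditionalGeneration.from_pretrained(model_name)
    tokenizer = M2M100Tokenizer.from_pretrained(model_name)
    tokenizer.src_lang = "en"

    def translate_batch(batch, lang_code):
        # padding=True lets words of different token lengths share one tensor.
        encoded = tokenizer(batch, return_tensors="pt", padding=True)
        generated = model.generate(
            **encoded, forced_bos_token_id=tokenizer.get_lang_id(lang_code)
        )
        return tokenizer.batch_decode(generated, skip_special_tokens=True)

    return translate_batch


def upper_stub(batch, lang_code):
    """Stub used for the demo so no model download is required."""
    return [f"{word.upper()}:{lang_code}" for word in batch]


demo = translate_in_batches(["rice", "book", "water"], "fr", upper_stub, batch_size=2)
```

The speed-up comes from calling model.generate once per batch rather than once per word. Progress can still be saved between batches, so batching does not rule out resuming an interrupted run.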

Linfye · 2024-03-10T13:55:12Z

> > I fixed the problems you mentioned except the last one. @wkyoshida, should I work on that one now, or will someone else take it? Looking forward to your reply.
> > cc @andrewtavis
>
> Can you refer to #88 and #89? I have implemented a different approach, i.e. batch processing of words for translation, which is relatively faster.
>
> We can decide on a single approach. If the requirement is to iterate over each word rather than batch processing, then we can go ahead with this PR. cc @andrewtavis @wkyoshida

I checked the code and wonder whether it can resume from the last saved progress, since there are too many words to finish in one run. If it works better, we can adopt yours.

andrewtavis · 2024-03-17T13:03:56Z

Sorry for the delay on all of this, all :) I was on vacation and then sick right after... Checked and sent along some formatting in 2460584. I'll bring this in shortly as well as the work that @shashank-iitbhu mentioned. I'll give it all a test to see how things are working. I'd say batch processing and having the process in the utils makes sense to me 😊

andrewtavis

Thanks for the great first contribution, @Linfye! Very important step in the new translation process 😊 Let us know if there are other issues you're interested in!

andrewtavis · 2024-03-17T13:07:21Z

Ah, and a quick note on this: let's be sure to remove as much whitespace from JSON outputs as possible in the future as that does bring the file size down slightly 😊
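As an illustration of the whitespace point, Python's json module can write compact output by dropping indent and passing explicit separators. The sample entry below is made up for the demonstration.

```python
import json

sample = [{"rice": {"es": "arroz", "ru": "Рис"}}]  # made-up sample entry

# Indented output, as written by the script in this PR.
pretty = json.dumps(sample, ensure_ascii=False, indent=4)
# separators=(",", ":") removes the spaces json normally puts after , and :
compact = json.dumps(sample, ensure_ascii=False, separators=(",", ":"))

print(len(pretty), len(compact))
```

Both strings load back to the same data, so the choice only affects file size and how readable the file is in a diff.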

Linfye added 3 commits February 27, 2024 14:53

English trans finished

1fa59ea

change dic

6e9d5e8

change file name

8311358

file name changed

960c4f6

wkyoshida reviewed Mar 4, 2024


wkyoshida and others added 3 commits March 3, 2024 23:04

Merge branch 'main' into main

11bb74e

minus fixed

ef8be67

Merge branch 'main' of https://github.com/Linfye/Scribe-Data

e2d0cdc

andrewtavis mentioned this pull request Mar 9, 2024

Remove articles from machine translation process #96

Closed

2 tasks

scribe-org#72 formatting for translation file and adding docstring

2460584

andrewtavis approved these changes Mar 17, 2024


andrewtavis merged commit ea05b32 into scribe-org:main Mar 17, 2024
1 check passed
