Use big LLM to better align source and enhance source corpora #622

Open
johnml1135 opened this issue Jan 8, 2025 · 4 comments

Comments

@johnml1135
Collaborator

johnml1135 commented Jan 8, 2025

So, this is a crazy idea. LLMs are very good at producing English text, rewording things, and understanding context. What if we gave an LLM a source (such as the ESV) and a backtranslation and said, "make more of the backtranslation using the ESV as a source"? It could add explications, supply different contexts, and imitate phrase reordering. Moreover, we could also add Bible reference material to the context, and the LLM should be able to give the source better target context, mirroring what the existing backtranslations have, both scripturally and culturally.

We could take this newly generated "target-aligned source" and then (optionally) give it to the translators to correct it to be more accurate to what it should say. After that optional step, we could feed it to an NLLB model trained only on backtranslation and target data, and it would then spit out pretty close target data.

@ddaspit - what do you think?

@johnml1135 johnml1135 changed the title Use big LLM to convert source text to backtranslation Use big LLM to better align source to target context Jan 9, 2025
@johnml1135
Collaborator Author

johnml1135 commented Jan 9, 2025

More ideas:

  • Use the LLM to revise the backtranslation into better, more natural English.
  • Train an NLLB200 model to do this "source to backtranslation" work instead of using an LLM.
  • Keep all of this behind the scenes for the translator - we are just using an "enhanced source", so the user does not have to change their process.
  • Concern: the LLM could hallucinate or soften Biblical passages (and add whatever tech bias is baked in), and that could be subtly added to the Bible.
  • Add Biblical resources to the context window as a "source of truth" to ensure that the "enhanced source" has the desired added context.
  • Create multiple "enhanced sources" that are (1) middle of the road, (2) more literal, (3) more dynamic, (4) more context.
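The "multiple enhanced sources" idea could be as simple as varying one style instruction per variant. A minimal sketch, assuming the variant names from the list above; the instruction wording itself is a hypothetical prompt, not something that has been tested:

```python
# Hypothetical style instructions, one per enhanced-source variant.
# The four variant names come from the list above; the wording is an assumption.
VARIANT_INSTRUCTIONS = {
    "middle_of_the_road": "Rewrite the verse in plain English, staying close to the source.",
    "more_literal": "Rewrite the verse word-for-word, preserving source phrase order.",
    "more_dynamic": "Rewrite the verse for natural flow, reordering phrases as needed.",
    "more_context": "Rewrite the verse, making implicit cultural context explicit.",
}

def build_variant_prompt(variant: str, esv_verse: str) -> str:
    """Compose the instruction + verse prompt for one enhanced-source variant."""
    instruction = VARIANT_INSTRUCTIONS[variant]
    return f"{instruction}\n\nSource (ESV): {esv_verse}"
```

Generating all four variants for a verse is then just a loop over `VARIANT_INSTRUCTIONS`.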

How would we prompt the LLM? What would we tell it? What would we put in the context window?
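One plausible answer, sketched below under assumptions: the context window holds a task instruction, a few ESV-to-backtranslation example pairs from aligned corpora, optional Bible reference notes as the "source of truth", and the verse to convert. The function name and prompt wording are hypothetical, not an existing API:

```python
# Hypothetical context-window layout for the "enhanced source" task.
# All prompt text here is an assumption for illustration.

def build_prompt(examples, reference_notes, esv_verse):
    """examples: list of (esv, backtranslation) pairs from aligned corpora."""
    parts = ["Rewrite the ESV verse in the style of the backtranslation examples."]
    for esv, bt in examples:
        # Few-shot pairs teach the LLM the target's phrasing and reordering habits.
        parts.append(f"ESV: {esv}\nBacktranslation: {bt}")
    if reference_notes:
        # Reference material anchors added context to a "source of truth".
        parts.append("Reference notes (source of truth): " + reference_notes)
    # The verse to convert goes last, with an open slot for the LLM to fill.
    parts.append(f"ESV: {esv_verse}\nBacktranslation:")
    return "\n\n".join(parts)
```

Few-shot pairs carry the style signal; the reference notes address the hallucination concern above by giving the model something to stay anchored to.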

Should we fine-tune a 7GB model each time? A 70GB model (using Apollo) once on the two H100s? Or run inference off of a 400GB model? Which would give the best results?

Reinforcement learning using a whole bunch of translations and their back-translations?

A way to do a "spike" without LLMs
Compare the BLEU score for 4 different generation types. Assume we have a full target Bible and a full backtranslation.

  1. Fine-tune the NLLB200 on the target to the ESV. Generate pretranslations from the ESV.
  2. Fine-tune the NLLB200 on the target to a mixed source of ESV and backtranslation. Generate pretranslations from the ESV.
  3. Fine-tune the NLLB200 on the target to the backtranslation. Generate pretranslations from the backtranslation.
  4. Fine-tune the NLLB200 on the target to the backtranslation. Generate pretranslations from the "LLM-enhanced source".

For each type, use the following amounts of training data:
A. Mark
B. 1/4 of NT books
C. 1/2 of NT books
D. All NT books
E. All books of the Bible but 1
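The full experiment grid above can be enumerated mechanically. A minimal sketch, assuming the type and size labels from the two lists (the label strings themselves are shorthand, not fixed identifiers):

```python
from itertools import product

# Sketch of the proposed experiment grid: 4 generation types x 5 training sizes.
# Type/size labels are shorthand for the lists above.
TYPES = {
    1: ("target->ESV", "ESV"),
    2: ("target->mixed(ESV+BT)", "ESV"),
    3: ("target->BT", "backtranslation"),
    4: ("target->BT", "LLM-enhanced source"),
}
SIZES = ["Mark", "1/4 NT", "1/2 NT", "all NT", "all books but 1"]

experiments = [
    {"type": t, "train_pairs": pair, "inference_source": src, "train_size": size}
    for (t, (pair, src)), size in product(TYPES.items(), SIZES)
]
# 4 types x 5 sizes = 20 fine-tuning runs, each scored with BLEU on held-out text
```

Enumerating the grid up front makes it easy to see the cost (20 fine-tuning runs) and to prioritize, e.g. running only types 1-3 first as recommended below.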

This test would show us the upper limit (type 3) for this concept - both whether it helps with partial NTs (crossbow) and across the whole Bible.

We need to find a set of Bibles with back-translations to be used as references for these experiments.

@johnml1135
Collaborator Author

Recommendation - get at least 5 Bibles and do types 1-3 (no LLM) and use that data to direct and prioritize future work.

@johnml1135 johnml1135 changed the title Use big LLM to better align source to target context Use big LLM to better align source and enhance source corpora Jan 9, 2025
@johnml1135
Collaborator Author

@woodwardmw - this may be interesting to you as well. I don't know if you want to test it out.

@woodwardmw

Yeah, very interesting. I like the idea of training on the back translation and then creating extra "back translation" to use as a source for inference. As long as it can be generated without going too far from the actual Bible text.

My feeling is that the way forward in general is to keep the current NLLB (or MADLAD) model as the main translation engine, and to focus on LLM pre- and post-processing to improve results.
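That split of responsibilities can be sketched as a three-stage pipeline. All three stage functions below are hypothetical stubs for illustration; none of them is a real API, and a real system would call an LLM and a fine-tuned NLLB/MADLAD model where the placeholders are:

```python
# Sketch of the suggested architecture: NLLB/MADLAD stays the translation
# engine, wrapped by LLM pre- and post-processing. All functions are stubs.

def llm_enhance_source(verse: str) -> str:
    """Pre-process: rewrite the source toward backtranslation style (stub)."""
    return verse  # placeholder; a real system would call an LLM here

def nllb_translate(verse: str) -> str:
    """Core engine: fine-tuned NLLB/MADLAD translation (stub)."""
    return verse  # placeholder for the trained model's output

def llm_postprocess(draft: str) -> str:
    """Post-process: LLM cleanup of the draft translation (stub)."""
    return draft  # placeholder

def translate(verse: str) -> str:
    """Chain the three stages: enhance -> translate -> clean up."""
    return llm_postprocess(nllb_translate(llm_enhance_source(verse)))
```

The point of the structure is that each stage can be improved or swapped independently, and the translator-facing workflow stays unchanged.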
