How do I train on a translation dataset? (e.g. {"english": "...", "chinese": "..."}) #1046

Nero10578 · 2024-01-05T09:05:59Z

I am trying to fine-tune Mistral 7B on Sundanese, and I am confused about how exactly the Chinese-Alpaca fine-tuners trained their model on the translation training dataset described here: https://github.com/ymcui/Chinese-LLaMA-Alpaca/wiki/Training-Details#Training-Data

They showed an example dataset structure like so:
{"english": "In Italy, there is no real public pressure for a new, fairer tax system.", "chinese": "在意大利，公众不会真的向政府施压，要求实行新的、更公平的税收制度。"}

Of course, I don't see a dataset format option that fits that, so do I just use the completion (raw corpus) dataset type and put that inside the text key? Like so:

{"text": "\"english\": \"In Italy, there is no real public pressure for a new, fairer tax system.\", \"chinese\": \"在意大利，公众不会真的向政府施压，要求实行新的、更公平的税收制度。\""}

I am of course going to use my own Sundanese dataset that I've made, but I am asking to figure out the best way to do this so that the model learns the translations as well as possible. Any help would be appreciated! Thank you.
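
For reference, here is a minimal sketch of how a translation pair could be packed into a single "text" field for a completion-style dataset without breaking the JSON. It assumes the loader expects one JSON object per line with a "text" key, as described above; the helper name `pair_to_completion_line` is hypothetical, and embedding the whole pair object as a string is only one possible layout, not necessarily what the Chinese-Alpaca authors did.

```python
import json

def pair_to_completion_line(pair: dict) -> str:
    """Serialize one translation pair as a JSONL line with a single "text" key."""
    # Turn the pair itself into a string first (keeps non-ASCII characters readable).
    text = json.dumps(pair, ensure_ascii=False)
    # json.dumps escapes the inner quotes, so the outer object stays valid JSON.
    return json.dumps({"text": text}, ensure_ascii=False)

pair = {
    "english": "In Italy, there is no real public pressure for a new, fairer tax system.",
    "chinese": "在意大利，公众不会真的向政府施压，要求实行新的、更公平的税收制度。",
}
line = pair_to_completion_line(pair)
print(line)
```

Writing one such line per pair yields a JSONL file. Building the string with `json.dumps` rather than by hand avoids the unescaped nested quotes that make a hand-written record invalid.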

NanoCode012 · 2024-02-23T17:43:33Z

Collaborator

In the alpaca dataset, there are examples where the instruction is "Translate the following input from __ to ___", so you could format your translation pairs the same way.
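
A rough sketch of that conversion, assuming the usual alpaca record layout with `instruction`, `input`, and `output` keys. The helper name `pair_to_alpaca` and the exact instruction wording are illustrative choices, not something taken from a specific dataset.

```python
import json

def pair_to_alpaca(pair: dict, src: str = "english", tgt: str = "chinese") -> dict:
    """Turn one translation pair into an alpaca-style instruction record."""
    return {
        # The instruction names the direction, mirroring the template quoted above.
        "instruction": f"Translate the following input from {src.capitalize()} to {tgt.capitalize()}.",
        "input": pair[src],    # source-language sentence
        "output": pair[tgt],   # target-language sentence the model should produce
    }

pair = {
    "english": "In Italy, there is no real public pressure for a new, fairer tax system.",
    "chinese": "在意大利，公众不会真的向政府施压，要求实行新的、更公平的税收制度。",
}
record = pair_to_alpaca(pair)
print(json.dumps(record, ensure_ascii=False))
```

Calling the helper with `src` and `tgt` swapped produces the reverse direction from the same pair, if training on both directions is wanted.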

2 replies

Nero10578 Feb 23, 2024
Author

Thanks! I will try that, but so far I'm already having success just giving it labeled raw text in each language, as well as a typical instruction dataset telling it to translate, like in dolphin.

NanoCode012 Feb 28, 2024
Collaborator

Could you elaborate on what you mean by labeled raw text?
