Replies: 1 comment · 2 replies
-
Hey, chunking was recently added in the past few months.
-
I'm a noob who recently upgraded from training LoRAs in oobabooga's text-generation-webui. I've mostly migrated my pipeline to axolotl now and I'm getting trained models out, but they seem very off and don't really respond the way I'd expect based on my older models.
Now, in ooba there are parameters you can set so that it takes an arbitrarily long text file and auto-generates a bunch of N-length chunks that overlap by M tokens. I don't see anything like that in the axolotl config, so I'm wondering if that's related to the issues I'm noticing.
I just can't really find any information whatsoever about how axolotl is chunking the data. Is it truncating it? Blindly splitting it by sequence length? Something else?
To be clear, I'm using a local completion dataset in my yml, and the dataset consists of about 100 txt files ranging from 4,000 to 200,000 tokens each. What is the correct way to handle this?
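For reference, here's roughly what I mean by the ooba-style chunking, as a minimal sketch. The function name and parameters are mine for illustration, not the actual API of either tool:

```python
# Sketch of overlapping sliding-window chunking: split a long token
# sequence into windows of chunk_len tokens, each overlapping the
# previous window by `overlap` tokens. Names are illustrative only.

def chunk_tokens(tokens, chunk_len, overlap):
    """Yield chunks of chunk_len tokens, consecutive chunks sharing `overlap` tokens."""
    if overlap >= chunk_len:
        raise ValueError("overlap must be smaller than chunk_len")
    step = chunk_len - overlap  # how far the window advances each time
    for start in range(0, len(tokens), step):
        yield tokens[start:start + chunk_len]
        if start + chunk_len >= len(tokens):
            break  # last window already covered the end of the sequence

# Example: 10 tokens, windows of 4 overlapping by 2 (so the window slides by 2)
chunks = list(chunk_tokens(list(range(10)), chunk_len=4, overlap=2))
# → [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

The overlap is what keeps context from being severed at every chunk boundary, which is why I'm asking whether axolotl does anything equivalent or just hard-cuts at sequence_len.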