-
I did not use the ooba trainer, no. I used Colab.
-
Freeze until killed by system sounds like OOM. How big is the dataset?

EDIT: Tested locally, got an OOM too. I think something got broken in the loader code?

EDIT2:

```python
def generate_prompt(data_point: dict[str, str]):
    print(f"Generating prompt for data-point: {data_point}")
    (...)

def generate_and_tokenize_prompt(data_point):
    prompt = generate_prompt(data_point)
    return tokenize(prompt)

print("Loading JSON datasets...")
data = load_dataset("json", data_files=clean_path('training/datasets', f'{dataset}.json'))
print("Start mapping it")
train_data = data['train'].map(generate_and_tokenize_prompt)
```

prints:

Our options:
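If anyone wants to check whether this is specific to the webui, here is a minimal standalone sketch of the same load-and-map path. The tokenizer name, dataset path, and field names below are placeholders I picked, not the real configuration:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

# Placeholder tokenizer; substitute whatever base model you are training against.
tokenizer = AutoTokenizer.from_pretrained("huggyllama/llama-7b")

def generate_and_tokenize_prompt(data_point):
    # Same shape as the webui code above: build a prompt string, then tokenize it.
    prompt = f"{data_point.get('instruction', '')}\n{data_point.get('output', '')}"
    return tokenizer(prompt, truncation=True, max_length=256)

# Placeholder path to a small JSON dataset.
data = load_dataset("json", data_files="training/datasets/my_dataset.json")

# If the hang/OOM comes from the map/fingerprint step, it should reproduce here,
# before any per-example output appears.
train_data = data["train"].map(generate_and_tokenize_prompt)
print(train_data)
```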
-
I see you were able to reproduce the problem, but to answer the question: it was literally the exact text that I put in my first post. I was originally hoping the dataset was just too big for my computer to handle. I tried zetavg/LLaMA-LoRA-Tuner and it was able to parse the dataset, so maybe it uses an older version of the HF code, or maybe something different entirely. I'm kind of busy this week, so I don't have time to step through all of the code, but I did try stepping through the webui code to see exactly where it was hanging. I got to parts where I couldn't understand what was actually happening, then just held down the step button for a while and let it slowly run through the code. The last thing I remember was being in

FWIW, I am training in 8-bit, not 4.
-
I got it working. I'm a dummy and forgot to turn my swapfile on. It is weird watching the RAM usage crank up so high and then drop. Seems like there has to be something that could be better optimized in there. I guess some loop is cranking out temporary variables.
-
The RAM usage is unreasonably massive even on small datasets. 1 KiB of JSON should not be filling up my 64 GiB of RAM, no matter how stupid the internal code is.
-
I have the same issue. The process crashes when calling 'update_fingerprint' in 'arrow_dataset.py'. 'update_fingerprint' is called in the dataset's 'map' function, which is used in 'training.py'. 'update_fingerprint' calls a hasher to hash the function used for mapping text to tokens, which seems to be what leads to the problem. As a temporary workaround I passed a custom fingerprint to the 'map' call in 'training.py' in the modules folder.
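A minimal sketch of that kind of workaround, assuming the 'map' call shown earlier in this thread (the random hex string is just an arbitrary fingerprint value):

```python
import random

# Supplying new_fingerprint makes `datasets` skip hashing the mapping
# function, which is the step where update_fingerprint / the Hasher chokes.
# Note: a random fingerprint also means the cache will not be reused between runs.
train_data = data['train'].map(
    generate_and_tokenize_prompt,
    new_fingerprint='%030x' % random.randrange(16**30),
)
```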
-
I'm losing my mind. I changed nothing and I can't replicate it anymore. I was going to test the fix loanMaster suggested, but I can't replicate the "before" state of it overloading now. And I don't know why :(
-
It works now.
-
I don't really want to submit an issue, because it might be something on my end. If I use raw text, it works fine, but if I try using any kind of JSON file, my whole system freezes until eventually (half an hour or so) the program is killed by the system (running on Linux). My data is formatted like this:

And my formats file looks like this:
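For illustration only, here is a hypothetical alpaca-style dataset and format file of the general kind the training tab consumes; the field names, template layout, and file names are assumptions, not the actual files from this post:

```python
import json

# Hypothetical dataset: a list of records keyed by field name (not the poster's data).
dataset = [
    {"instruction": "Continue the story.", "output": "Once upon a time..."},
    {"instruction": "Translate to French.", "input": "Hello", "output": "Bonjour"},
]

# Hypothetical format file: each key lists the fields a record may contain,
# each value is a prompt template with %field% placeholders (assumed layout).
formats = {
    "instruction,output": "### Instruction:\n%instruction%\n\n### Response:\n%output%",
    "instruction,input,output": "### Instruction:\n%instruction%\n\n### Input:\n%input%\n\n### Response:\n%output%",
}

with open("my_dataset.json", "w") as f:
    json.dump(dataset, f, indent=2)

with open("my_format.json", "w") as f:
    json.dump(formats, f, indent=2)
```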
@kaiokendev when you trained your LoRA, did you use the training tab in ooba, or something else? Did you ever have this problem?