-
Anyway, I'm training with webdatasets, so I thought it'd be best to use […]. Any tips on how to load and write JSONs quickly would be appreciated.
-
I recommend JSONL; every other format is problematic in some way. We may remove JSON support at some point, or make it actually use JSONL.
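
Since the question asks for tips on loading and writing JSONs quickly, here is a minimal standard-library sketch of the JSONL pattern being recommended: one record per line, streamed rather than parsed as a single array. This is illustrative, not project code.

```python
import json

def write_jsonl(path, records):
    # One JSON object per line; appending is cheap and no enclosing
    # structure (like a JSON array) ever has to be rewritten.
    with open(path, "w", encoding="utf-8") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")

def read_jsonl(path):
    # Stream records one line at a time instead of loading the whole file.
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            if line.strip():
                yield json.loads(line)  # orjson.loads is a faster drop-in

records = list(read_jsonl("cuts.jsonl"))  # filename is illustrative
```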
-
There was a direction I was hoping we could go in order to easily support very large datasets, which is to start using stateless samplers that always select elements fully i.i.d., with no attempt to avoid duplicates. The sampler would be initialized with a list of manifests, and the sampling procedure would be to randomly select a manifest, then randomly select an item from it. I believe Piotr implemented this. To make it efficient you need to create a file that tells you the start character of each line in the JSONL file; I believe this was done as well.

With that in place, there isn't a strong need to process the entire dataset into a single JSONL file; you can just have a list of JSONL files, and they don't even have to be homogeneous in any sense. I suspect all the tools to do this already exist, although I forget the class names / PR numbers.

The only thing is, if you restart training you have to reset the seed to something that depends on the (restarted) batch number, so that it doesn't use the exact same items as it used on batch 0. The idea is that, generally speaking, we wouldn't use the concept of an epoch for training, just the batch number; see the Eden2 learning rate schedule, which supports this, if you were previously using the (epoch-dependent) Eden.
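
To make the shape of this concrete, here is a hedged sketch of the scheme described above. The names are mine for illustration; they are not the actual classes or PRs Piotr implemented. It builds a byte-offset index per JSONL manifest so items can be read by `seek()`, and derives the RNG seed from the batch number so a restart doesn't replay earlier batches.

```python
import json
import random

def build_offset_index(jsonl_path):
    """Record the start byte of every line so items can be read by seek()."""
    offsets = []
    with open(jsonl_path, "rb") as f:
        pos = 0
        for line in f:
            offsets.append(pos)
            pos += len(line)
    return offsets

class StatelessSampler:
    """Selects items fully i.i.d.: pick a manifest uniformly at random,
    then a line uniformly at random. No duplicate avoidance, no epochs."""

    def __init__(self, jsonl_paths, base_seed=0):
        self.paths = list(jsonl_paths)
        self.indexes = [build_offset_index(p) for p in self.paths]
        self.base_seed = base_seed

    def sample_batch(self, batch_idx, batch_size):
        # Seed the RNG from the (possibly restarted) batch number, so a
        # restart at batch N doesn't reuse the items from batch 0 onward.
        rng = random.Random(self.base_seed * 1_000_000_007 + batch_idx)
        batch = []
        for _ in range(batch_size):
            m = rng.randrange(len(self.paths))
            i = rng.randrange(len(self.indexes[m]))
            with open(self.paths[m], "rb") as f:
                f.seek(self.indexes[m][i])
                batch.append(json.loads(f.readline()))
        return batch
```

A real implementation would presumably keep file handles open rather than reopening per item, and store the offset index on disk next to each manifest, but the idea is the same: sampling depends only on the batch number and seed, never on accumulated state.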