No support for different sample types in same data format #122

Open
wvaske opened this issue Dec 6, 2023 · 4 comments

wvaske commented Dec 6, 2023

Currently, a data format (npz, tfrecord, etc.) maps to a single sample type (image, text, etc.).

We will need a way for the various container formats to support different sample types. I would recommend adding a SampleType abstraction that describes how to create or process a type of record; a reader would use the SampleType instead of hard-coded methods.

For example, TFRecord currently only supports images, but it should support the samples used for DLRM as well.
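A minimal sketch of what such an abstraction might look like (the class names, the `generate`/`decode` interface, and the DLRM record layout are all hypothetical, not existing code):

```python
from abc import ABC, abstractmethod
import numpy as np

class SampleType(ABC):
    """Describes how to create and decode one kind of record."""

    @abstractmethod
    def generate(self) -> bytes:
        """Produce one serialized sample."""

    @abstractmethod
    def decode(self, raw: bytes):
        """Turn raw bytes from the container back into an in-memory sample."""

class ImageSample(SampleType):
    def __init__(self, record_size: int):
        self.record_size = record_size

    def generate(self) -> bytes:
        # Synthetic image payload: record_size random uint8 pixels.
        return np.random.randint(0, 256, self.record_size, dtype=np.uint8).tobytes()

    def decode(self, raw: bytes):
        return np.frombuffer(raw, dtype=np.uint8)

class DLRMSample(SampleType):
    def __init__(self, num_dense: int, num_sparse: int):
        self.num_dense = num_dense
        self.num_sparse = num_sparse

    def generate(self) -> bytes:
        # Dense float32 features followed by sparse int64 category IDs.
        dense = np.random.rand(self.num_dense).astype(np.float32)
        sparse = np.random.randint(0, 2**31, self.num_sparse, dtype=np.int64)
        return dense.tobytes() + sparse.tobytes()

    def decode(self, raw: bytes):
        split = self.num_dense * 4  # float32 dense block precedes int64 sparse block
        return (np.frombuffer(raw[:split], dtype=np.float32),
                np.frombuffer(raw[split:], dtype=np.int64))
```

A reader would then hold a `SampleType` instance and call `decode` on whatever bytes the container hands back, rather than assuming images.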

hariharan-devarajan (Collaborator) commented

For any format, what the bytes represent is inconsequential. Sure, there is an in-memory encoding into images, but the samples themselves on disk are just bytes.

So if you have text, you can store the bytes; it would not change the access pattern within TFRecord. Only the protobuf decoding would change, which is essentially not I/O.
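A short illustration of the point (hand-written, not the benchmark's actual writer): both calls below produce the same on-disk TFRecord layout and access pattern; only the later decode of the payload differs.

```python
import tensorflow as tf

def write_samples(path: str, payloads: list[bytes]):
    # Each payload is stored as an opaque byte string in a tf.train.Example.
    with tf.io.TFRecordWriter(path) as writer:
        for payload in payloads:
            example = tf.train.Example(features=tf.train.Features(feature={
                "data": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[payload])),
            }))
            writer.write(example.SerializeToString())

# Image bytes or UTF-8 text bytes: identical I/O, different decode later.
write_samples("image.tfrecord", [b"\x89PNG..."])  # placeholder image bytes
write_samples("text.tfrecord", ["hello world".encode("utf-8")])
```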

Thoughts? @wvaske

hariharan-devarajan self-assigned this Dec 7, 2023
wvaske commented Dec 8, 2023

In general I'd like to agree, but I'm looking at DLRM, and it's something of a unique case.

One popular format for DLRM data is Parquet. Since it's columnar, the size of each data type determines the data layout. Also, since DLRM is model-parallel, different accelerators will read different columns (instead of a single accelerator reading every column of a single data sample). I don't know how to support this functionality in the current framework without hardcoding a bunch of values.
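To make that access pattern concrete, here is a rough sketch (the column names and rank-to-column mapping are made up for illustration) of model-parallel ranks each reading only their own columns with pyarrow:

```python
import pyarrow.parquet as pq

# Hypothetical mapping of accelerator rank -> columns it owns.
RANK_TO_COLUMNS = {
    0: ["dense_0", "dense_1"],
    1: ["sparse_0"],
    2: ["sparse_1"],
}

def read_for_rank(path: str, rank: int):
    # Reading a column subset touches only that subset's column chunks
    # on disk, so each rank sees different byte ranges of the same file.
    return pq.read_table(path, columns=RANK_TO_COLUMNS[rank])
```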

hariharan-devarajan (Collaborator) commented

For Parquet, the granularity of reading is generally row groups. So we can emulate different column types using a bytes dtype; we would just need to know what the columns represent.

We could add a new param to the dataset config called types: [size of each column], add a validation restricting it to Parquet only for now, and then emulate the columns using a bytes type in Parquet.
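A rough sketch of that emulation, assuming the param shape described above (the `generate_parquet` function and column names are hypothetical, not the benchmark's actual generator):

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

def generate_parquet(path: str, fmt: str, num_rows: int, types: list[int]):
    # Proposed validation: 'types' is honored for parquet only for now.
    if types and fmt != "parquet":
        raise ValueError("'types' is currently supported for parquet only")
    arrays, names = [], []
    for i, size in enumerate(types):
        # Emulate a typed column with fixed-width random bytes of that size.
        col = [np.random.bytes(size) for _ in range(num_rows)]
        arrays.append(pa.array(col, type=pa.binary(size)))
        names.append(f"col_{i}")
    pq.write_table(pa.table(arrays, names=names), path)

# e.g. 13 dense float32-sized columns (4 bytes) and 2 int64-sized columns (8 bytes)
generate_parquet("dlrm.parquet", "parquet", 1024, [4] * 13 + [8] * 2)
```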

@zhenghh04 @wvaske Thoughts?

wvaske commented Dec 14, 2023

That seems reasonable. Good thinking!
