No support for different sample types in same data format #122

Open
wvaske opened this issue Dec 6, 2023 · 4 comments

wvaske commented Dec 6, 2023

Currently, a data format (npz, tfrecord, etc.) maps to a single sample type (image, text, etc.).

We will need a way for the various container formats to support different sample types. I would recommend adding a SampleType abstraction that describes how to create or process a type of record; a reader would use the SampleType instead of hard-coded methods.

For example, TFRecord currently only supports images, but it should support the samples used for DLRM as well.
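A minimal sketch of what such an abstraction might look like (the class names, the `generate`/`decode` interface, and the DLRM record layout are all hypothetical, not existing code):

```python
from abc import ABC, abstractmethod
import numpy as np

class SampleType(ABC):
    """Describes how to create and decode one kind of record."""

    @abstractmethod
    def generate(self) -> bytes:
        """Produce one serialized sample."""

    @abstractmethod
    def decode(self, raw: bytes):
        """Turn raw bytes from the container back into an in-memory sample."""

class ImageSample(SampleType):
    def __init__(self, record_size: int):
        self.record_size = record_size

    def generate(self) -> bytes:
        # Synthetic image payload: record_size random uint8 pixels.
        return np.random.randint(0, 256, self.record_size, dtype=np.uint8).tobytes()

    def decode(self, raw: bytes):
        return np.frombuffer(raw, dtype=np.uint8)

class DLRMSample(SampleType):
    def __init__(self, num_dense: int, num_sparse: int):
        self.num_dense = num_dense
        self.num_sparse = num_sparse

    def generate(self) -> bytes:
        # Dense float32 features followed by sparse int64 category IDs.
        dense = np.random.rand(self.num_dense).astype(np.float32)
        sparse = np.random.randint(0, 2**31, self.num_sparse, dtype=np.int64)
        return dense.tobytes() + sparse.tobytes()

    def decode(self, raw: bytes):
        split = self.num_dense * 4  # float32 dense block precedes int64 sparse block
        return (np.frombuffer(raw[:split], dtype=np.float32),
                np.frombuffer(raw[split:], dtype=np.int64))
```

A reader would then hold a `SampleType` instance and call `decode` on whatever bytes the container hands back, rather than assuming images.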

hariharan-devarajan (Collaborator) commented

For any format, what the bytes represent is inconsequential. Sure, there is an in-memory encoding into images, but the samples themselves on disk are just bytes.

So if you have text, you can store the bytes; it would not change the access pattern within TFRecord. Only the protobuf decoding would change, which is essentially not I/O.
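A short illustration of the point (hand-written, not the benchmark's actual writer): both calls below produce the same on-disk TFRecord layout and access pattern; only the later decode of the payload differs.

```python
import tensorflow as tf

def write_samples(path: str, payloads: list[bytes]):
    # Each payload is stored as an opaque byte string in a tf.train.Example.
    with tf.io.TFRecordWriter(path) as writer:
        for payload in payloads:
            example = tf.train.Example(features=tf.train.Features(feature={
                "data": tf.train.Feature(
                    bytes_list=tf.train.BytesList(value=[payload])),
            }))
            writer.write(example.SerializeToString())

# Image bytes or UTF-8 text bytes: identical I/O, different decode later.
write_samples("image.tfrecord", [b"\x89PNG..."])  # placeholder image bytes
write_samples("text.tfrecord", ["hello world".encode("utf-8")])
```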

Thoughts? @wvaske

hariharan-devarajan self-assigned this Dec 7, 2023
wvaske commented Dec 8, 2023

In general I'd like to agree, but I'm looking at DLRM, and it's something of a unique case.

One popular format for DLRM data is Parquet. Since it's columnar, the size of each data type determines the data layout. Also, since DLRM is model-parallel, different accelerators will read different columns (instead of a single accelerator reading every column of a single data sample). I don't know how to support this functionality in the current framework without hardcoding a bunch of values.
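To make that access pattern concrete, here is a rough sketch (the column names and rank-to-column mapping are made up for illustration) of model-parallel ranks each reading only their own columns with pyarrow:

```python
import pyarrow.parquet as pq

# Hypothetical mapping of accelerator rank -> columns it owns.
RANK_TO_COLUMNS = {
    0: ["dense_0", "dense_1"],
    1: ["sparse_0"],
    2: ["sparse_1"],
}

def read_for_rank(path: str, rank: int):
    # Reading a column subset touches only that subset's column chunks
    # on disk, so each rank sees different byte ranges of the same file.
    return pq.read_table(path, columns=RANK_TO_COLUMNS[rank])
```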

hariharan-devarajan (Collaborator) commented

For Parquet, the granularity of reading is generally row groups. So we can emulate different column types using a bytes dtype; we would just need to know what the columns represent.

We could add a new param to the dataset config called types: [size of each column], add a validation restricting it to Parquet only for now, and then emulate the columns using a bytes type in Parquet.
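A rough sketch of that emulation, assuming the param shape described above (the `generate_parquet` function and column names are hypothetical, not the benchmark's actual generator):

```python
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq

def generate_parquet(path: str, fmt: str, num_rows: int, types: list[int]):
    # Proposed validation: 'types' is honored for parquet only for now.
    if types and fmt != "parquet":
        raise ValueError("'types' is currently supported for parquet only")
    arrays, names = [], []
    for i, size in enumerate(types):
        # Emulate a typed column with fixed-width random bytes of that size.
        col = [np.random.bytes(size) for _ in range(num_rows)]
        arrays.append(pa.array(col, type=pa.binary(size)))
        names.append(f"col_{i}")
    pq.write_table(pa.table(arrays, names=names), path)

# e.g. 13 dense float32-sized columns (4 bytes) and 2 int64-sized columns (8 bytes)
generate_parquet("dlrm.parquet", "parquet", 1024, [4] * 13 + [8] * 2)
```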

@zhenghh04 @wvaske Thoughts?

wvaske commented Dec 14, 2023

That seems reasonable. Good thinking!
