Inferring the target types during benchmark creation is slow with Zarr datasets #154
Comments
Could you verify that you're using the latest Polaris release? We significantly sped up benchmark creation in #148.

Version 7.4 took 46 minutes to create the benchmark for a dataset with 625,229 rows.

I'm going to run a profiler against the above.
When running a profiler against benchmark creation, pretty much all of the overhead turns out to be coming from the target types validator.

Loading the Zarr chunks into memory takes the vast majority of that time, and it appears to be because of how the values are fetched from the Zarr archive: a chunk is fetched and decoded for every single index. If we write another method for fetching an entire Zarr column that reads all values within a chunk at once, it could greatly speed this process up.
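To illustrate the access pattern being described, here is a minimal sketch contrasting per-index reads with chunk-aligned reads of a Zarr column. The store path and column layout are placeholders, not taken from the Polaris codebase:

```python
import numpy as np
import zarr

# Hypothetical 1-D Zarr column; the path is a placeholder.
col = zarr.open("dataset.zarr/my_column", mode="r")

# Slow: every index triggers a separate chunk fetch + decompression.
values_slow = np.array([col[i] for i in range(col.shape[0])])

# Faster: read in chunk-aligned slices so each chunk is fetched and
# decoded only once.
chunk_len = col.chunks[0]
values_fast = np.concatenate(
    [col[start:start + chunk_len] for start in range(0, col.shape[0], chunk_len)]
)
```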
Thanks @Andrewq11 and @zhu0619! Every chunk access in Zarr indeed comes at a penalty: you copy that chunk's data and you decompress it, so only using a small piece of a chunk is wasteful. Unfortunately, I don't think we can assume we can always load an entire column into memory; a single column may be gigabytes.

The more structural solution would be to make an appropriate choice of chunk size and compression level during dataset creation. For example, if every single float were its own chunk and we disabled compression, there wouldn't be much of a performance penalty. (This leads to other issues due to the high number of files it creates, but Zarr supports sharding for this. Note, however, that sharding is still experimental!) For datasets that have a poorly chosen chunk size, we could support rechunking the dataset locally as an optimization step a user can run, as sketched below.

Another way we could improve this is, as @Andrewq11 suggested, by improving the access pattern: if you load some data from a chunk, you try to use as much data from that chunk as possible. This is relevant e.g. for data loaders, where out-of-core ML is more efficient than random access. I do think we can apply it here too, but it may get complex...

I wrote two docs with some experiments about Zarr performance here and here. The main outcomes from those docs are also documented in the Zarr tutorial.
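As a rough sketch of what a local rechunking step could look like (the store paths, chunk size, and compressor settings below are illustrative assumptions, not part of Polaris):

```python
import numcodecs
import zarr

# Open the poorly chunked source array (path is a placeholder).
src = zarr.open("dataset.zarr/targets", mode="r")

# Create a destination array with a friendlier chunk size and compression.
dst = zarr.open(
    "dataset_rechunked.zarr/targets",
    mode="w",
    shape=src.shape,
    chunks=(10_000,) + src.shape[1:],  # larger chunks along the row axis
    dtype=src.dtype,
    compressor=numcodecs.Blosc(cname="zstd", clevel=3),
)

# Copy in slices so the full column is never held in memory at once.
step = 10_000
for start in range(0, src.shape[0], step):
    dst[start:start + step] = src[start:start + step]
```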
I'm circling back to this issue as I'm planning support for ultra-large datasets. I think I initially misunderstood the issue. Reading your comments again, @Andrewq11 is correct that this line of code is very poorly written (my bad!). In theory, I agree with the proposed solution of handling the case of loading an entire column separately. To solve the issue @zhu0619 originally raised, however, I think we could sample only a subset of rows (e.g. the first 100) to infer the target type. Once we get to datasets with a billion rows, we can no longer assume we can load an entire column into memory.
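A rough sketch of the sampling idea (the `get_data` accessor, the 100-row default, and the type heuristic are assumptions for illustration, not the actual Polaris implementation):

```python
import numpy as np

def infer_target_type_from_sample(dataset, target_col: str, n_samples: int = 100) -> str:
    """Guess the target type from the first few rows instead of the full column.

    Assumes ``dataset.get_data(row, col)`` returns a single value; the heuristic
    below is illustrative only.
    """
    n = min(n_samples, len(dataset))
    sample = np.asarray([dataset.get_data(i, target_col) for i in range(n)])

    # Many unique floating-point values -> regression; otherwise classification.
    if np.issubdtype(sample.dtype, np.floating) and np.unique(sample).size > 10:
        return "regression"
    return "classification"
```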
@zhu0619 For an even more immediate solution, you can simply specify the target types yourself to prevent the benchmark from trying to infer them automatically.
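For example, something along these lines should skip the inference entirely. The class, field, and column names below are assumptions based on this thread and are not verified against the current Polaris API:

```python
from polaris.benchmark import SingleTaskBenchmarkSpecification

benchmark = SingleTaskBenchmarkSpecification(
    dataset=dataset,                        # an already-created Polaris dataset
    input_cols=["smiles"],
    target_cols=["pIC50"],
    split=split,
    metrics=["mean_absolute_error"],
    target_types={"pIC50": "regression"},   # explicit, so the validator skips inference
)
```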
Is your feature request related to a problem? Please describe.
Recently, as more large datasets have been added to the hub, the creation of `Benchmark` objects based on these datasets has become increasingly time-consuming. This is primarily due to the slow performance of Pydantic validators and other model initialization functions, such as checksum calculation and Zarr processing.

An example:
Describe the solution you'd like