
Inferring the target types during benchmark creation is slow with Zarr datasets #154

zhu0619 opened this issue Jul 23, 2024 · 8 comments

zhu0619 commented Jul 23, 2024

Is your feature request related to a problem? Please describe.

Recently, as more large datasets have been added to the hub, creating Benchmark objects from these datasets has become increasingly time-consuming. This is primarily due to the slow performance of Pydantic validators and other model-initialization functions, such as checksum calculation and Zarr processing.

An example:
[screenshot omitted]

Describe the solution you'd like

  • Identify the validators and functions that contribute to the slowdown
  • Increase the efficiency of the identified modules.
zhu0619 changed the title to "Accelerate the benchmark creation for large datasets" Jul 23, 2024
cwognum commented Jul 23, 2024

Could you verify that you're using the latest Polaris release? We significantly sped up the benchmark creation in #148.

zhu0619 commented Jul 23, 2024

> Could you verify that you're using the latest Polaris release? We significantly sped up the benchmark creation in #148.

With version 7.4, benchmark creation took 46 minutes for a dataset with 625,229 rows.

@Andrewq11

I'm going to run a profiler against the above BenchmarkSpecification and see what is taking the longest. The profiler adds significant overhead, so if the plain run took ~50 minutes, the profiled run will probably be done by tonight.
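For reference, a minimal way to do that with the standard library might look like the sketch below (this is not necessarily the profiler used here, and build_benchmark() is a placeholder for the SingleTaskBenchmarkSpecification construction shown further down in this thread):

import cProfile
import pstats

# Profile the benchmark creation and print the 20 most expensive calls by cumulative time.
# `build_benchmark()` is a placeholder for the benchmark construction code.
with cProfile.Profile() as profiler:
    benchmark = build_benchmark()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)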

@Andrewq11

When running a profiler against the creation of a SingleTaskBenchmarkSpecification object with this code (similar to what @zhu0619 was using):

import numpy as np
import polaris as po
from polaris.benchmark import SingleTaskBenchmarkSpecification  # import path assumed

dataset = po.load_dataset("mlls/bend-histone-modification")
dataset.cache()

# Train on everything outside the held-out chromosomes; test/valid on the held-out ones.
split = [
    np.where(~dataset.table['chromosome'].isin(['chr1', 'chr8', 'chr9', 'chr2', 'chr4']))[0].tolist(),
    {
        "test": np.where(dataset.table['chromosome'].isin(['chr1', 'chr8', 'chr9']))[0].tolist(),
        "valid": np.where(dataset.table['chromosome'].isin(['chr2', 'chr4']))[0].tolist(),
    },
]

benchmark = SingleTaskBenchmarkSpecification(
    name='bend_histone_modification',
    dataset=dataset,
    target_cols="label",
    input_cols=["sequence"],
    split=split,
    metrics="roc_auc",
)

I get the following results:

[profiler results screenshot]

As we can see, pretty much all of the overhead is coming from the target types validator in the BenchmarkSpecification class, and that is because of this line. This is usually fine for non-Zarr datasets, but the dataset used in the above example uses Zarr for the single target column. Fetching this single, entire Zarr column is the issue.

The loading of the Zarr chunks into memory seems to take the vast majority of the time, and it appears to be because of how we wrote the __getitem__ method in the Dataset class. Cas (purposely not pinging you), is it correct that in the current implementation and context, we are decoding a Zarr chunk from storage for each target value in the dataset (when retrieving the entire Zarr target column as shown above)?

If so, this is where all the overhead is coming from (fetching and decoding a Zarr chunk for each index). If we write another method for fetching an entire Zarr column that reads all values within a chunk at once, it would likely speed this process up considerably.
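To illustrate the difference, here is a standalone sketch using the zarr library directly (not Polaris code; the array shape and chunking are made up):

import numpy as np
import zarr

# Toy column: one float per row, stored in chunks of 10,000 rows.
arr = zarr.open("toy_column.zarr", mode="w", shape=(625_229,), chunks=(10_000,), dtype="f4")
arr[:] = np.random.rand(625_229)

# Per-index access: every arr[i] locates, reads, and decompresses the chunk
# containing row i, so each chunk is decoded thousands of times.
slow = np.array([arr[i] for i in range(arr.shape[0])])

# Bulk access: each chunk is read and decompressed exactly once.
fast = arr[:]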

@Andrewq11

To get some confirmation of how impactful fetching all values within a Zarr column at once could be (for the scenario where a target column is stored via Zarr), I quickly wrote up this function:

    def _fetch_entire_zarr_column(self, col):
        # Use the first row only to resolve the Zarr path backing this column.
        value = self.table.at[0, col]
        path, index = self._split_index_from_path(value)

        # Load the whole array at once so each chunk is decoded a single time,
        # instead of fetching and decoding a chunk per row index.
        arr = self.zarr_data[path]
        data = arr[:]

        return data

It's not pretty; it's just an example. The following are the results for creating the same SingleTaskBenchmarkSpecification as above:

[profiler results screenshot after the change]

In summary, it seems to do the trick (assuming I'm not missing something else, of course). We should strongly consider using something like this when we know we need to fetch all values within a Zarr column. If there are no objections, I can give it a go soon. It would be great to have Cas' thoughts here, but I won't ping him.

cwognum commented Jul 31, 2024

Thanks @Andrewq11 and @zhu0619 !

Every chunk access in Zarr indeed comes at a penalty: you copy the data of that chunk and you decompress it. Using only a small piece of the data in a chunk is thus wasteful.

Unfortunately, I don't think we can assume that we can always load an entire column in memory. A single column may be gigabytes.

The more structural solution would be to make an appropriate choice of chunk size and compression level during dataset creation. For example, if every single float is its own chunk and we disable compression, there wouldn't be as much of a performance penalty. (This leads to some other issues due to the high number of files it creates, but Zarr supports sharding for that. Note, however, that sharding is still experimental!)
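For illustration, that creation-time choice could look like this with plain zarr (the path and shape are made up):

import zarr

# One value per chunk and no compression: random access never reads or
# decompresses more data than it needs (at the cost of many small files).
arr = zarr.open("per_element_column.zarr", mode="w",
                shape=(625_229,), chunks=(1,), dtype="f4", compressor=None)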

For datasets that have a poorly chosen chunk size, we could support rechunking the dataset locally as an optimization step a user can do.
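A local rechunking pass could be as simple as copying into a new array with a read-friendly chunk size, processed in bounded slices so the column never has to fit in memory. A sketch with plain zarr, assuming a 1-D column (a dedicated tool like the rechunker package would also work):

import zarr

src = zarr.open("dataset_column.zarr", mode="r")

# New array with larger chunks; copy slice by slice to keep memory bounded.
dst = zarr.open("dataset_column_rechunked.zarr", mode="w",
                shape=src.shape, chunks=(100_000,), dtype=src.dtype)

step = 100_000
for start in range(0, src.shape[0], step):
    dst[start:start + step] = src[start:start + step]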

Another way we could improve this is, as @Andrewq11 suggested, by improving the access pattern: if you load some data from a chunk, you should try to use as much data from that chunk as possible. This is relevant for data loaders, for example, where out-of-core ML is more efficient than random access. I do think we can apply it here too, but it may get complex...

I wrote two docs with some experiments about Zarr performance here and here. The main outcomes from these docs are also documented in the Zarr tutorial.

cwognum commented Aug 7, 2024

I'm circling back to this issue again as I'm planning for the support of ultra-large datasets.

I think I initially misunderstood the issue. Reading your comments again now, @Andrewq11 is correct that this line of code is incredibly poorly written (my bad!). In theory, I agree with the proposed solution of handling the case of loading an entire column separately.

To solve the issue @zhu0619 originally raised, however, I think we could sample only a subset (e.g. the first 100 rows) to infer the target type. Once we get to datasets with a billion rows, we can no longer assume we can load an entire column into memory.
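A rough sketch of that idea (the function name and the heuristic are hypothetical, not the actual Polaris validator):

import itertools

import numpy as np

def infer_target_type(values, sample_size: int = 100) -> str:
    """Guess a target type from only the first `sample_size` values of a column.

    `values` can be any lazy iterable over the target column, so only
    `sample_size` rows are ever fetched and decoded.
    """
    sample = np.asarray(list(itertools.islice(iter(values), sample_size)))
    # Illustrative heuristic: floating-point values with many distinct entries
    # look like a regression target; everything else is treated as classification.
    if np.issubdtype(sample.dtype, np.floating) and len(np.unique(sample)) > 10:
        return "regression"
    return "classification"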

cwognum changed the title from "Accelerate the benchmark creation for large datasets" to "Inferring the target types during benchmark creation is slow with Zarr datasets" Aug 7, 2024
cwognum commented Aug 7, 2024

@zhu0619 To provide an even more immediate solution, you can simply specify the target types yourself, which prevents the benchmark from trying to infer them automatically.
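For example, reusing the earlier setup (the `target_types` keyword and its accepted values are an assumption here; check the Polaris docs for the exact field name):

# Same dataset/split as in the example above; `target_types` is assumed to map
# each target column to its type so no inference over the Zarr column is needed.
benchmark = SingleTaskBenchmarkSpecification(
    name="bend_histone_modification",
    dataset=dataset,
    target_cols="label",
    input_cols=["sequence"],
    split=split,
    metrics="roc_auc",
    target_types={"label": "classification"},
)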
