
Inferring the target types during benchmark creation is slow with Zarr datasets #154

zhu0619 opened this issue Jul 23, 2024 · 8 comments

zhu0619 commented Jul 23, 2024

Is your feature request related to a problem? Please describe.

Recently, as more large datasets have been added to the hub, creating Benchmark objects from these datasets has become increasingly time-consuming. This is primarily due to the slow performance of Pydantic validators and other model-initialization functions, such as checksum calculation and Zarr processing.

An example:
[screenshot omitted]

Describe the solution you'd like

  • Identify the validators and functions that contribute to the slowdown
  • Increase the efficiency of the identified modules.
zhu0619 changed the title to "Accelerate the benchmark creation for large datasets" Jul 23, 2024
cwognum commented Jul 23, 2024

Could you verify that you're using the latest Polaris release? We significantly sped up the benchmark creation in #148.

zhu0619 commented Jul 23, 2024

> Could you verify that you're using the latest Polaris release? We significantly sped up the benchmark creation in #148.

With version 7.4, benchmark creation took 46 minutes for a dataset with 625,229 rows.

@Andrewq11

I'm going to run a profiler against the above BenchmarkSpecification and see what is taking the longest. The profiler adds significant overhead, so if the plain run took ~50 minutes, the profiled run will probably be done by tonight.
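For reference, a minimal way to do that with the standard library might look like the sketch below (this is not necessarily the profiler used here, and build_benchmark() is a placeholder for the SingleTaskBenchmarkSpecification construction shown further down in this thread):

import cProfile
import pstats

# Profile the benchmark creation and print the 20 most expensive calls by cumulative time.
# `build_benchmark()` is a placeholder for the benchmark construction code.
with cProfile.Profile() as profiler:
    benchmark = build_benchmark()

pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)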

@Andrewq11

When running a profiler against the creation of a SingleTaskBenchmarkSpecification object with this code (similar to what @zhu0619 was using):

import numpy as np
import polaris as po
from polaris.benchmark import SingleTaskBenchmarkSpecification  # import path assumed

dataset = po.load_dataset("mlls/bend-histone-modification")
dataset.cache()

# Train on everything outside the held-out chromosomes; test/valid on the held-out ones.
split = [
    np.where(~dataset.table['chromosome'].isin(['chr1', 'chr8', 'chr9', 'chr2', 'chr4']))[0].tolist(),
    {
        "test": np.where(dataset.table['chromosome'].isin(['chr1', 'chr8', 'chr9']))[0].tolist(),
        "valid": np.where(dataset.table['chromosome'].isin(['chr2', 'chr4']))[0].tolist(),
    },
]

benchmark = SingleTaskBenchmarkSpecification(
    name='bend_histone_modification',
    dataset=dataset,
    target_cols="label",
    input_cols=["sequence"],
    split=split,
    metrics="roc_auc",
)

I get the following results:

[profiler results screenshot]

As we can see, pretty much all of the overhead is coming from the target types validator in the BenchmarkSpecification class, and that is because of this line. This is usually fine for non-Zarr datasets, but the dataset used in the above example uses Zarr for the single target column. Fetching this single, entire Zarr column is the issue.

The loading of the Zarr chunks into memory seems to take the vast majority of the time, and it appears to be because of how we wrote the __getitem__ method in the Dataset class. Cas (purposely not pinging you), is it correct that in the current implementation and context, we are decoding a Zarr chunk from storage for each target value in the dataset (when retrieving the entire Zarr target column as shown above)?

If so, this is where all the overhead is coming from (fetching and decoding a Zarr chunk for each index). If we write another method for fetching an entire Zarr column that reads all values within a chunk at once, it would likely speed this process up considerably.
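To illustrate the difference, here is a standalone sketch using the zarr library directly (not Polaris code; the array shape and chunking are made up):

import numpy as np
import zarr

# Toy column: one float per row, stored in chunks of 10,000 rows.
arr = zarr.open("toy_column.zarr", mode="w", shape=(625_229,), chunks=(10_000,), dtype="f4")
arr[:] = np.random.rand(625_229)

# Per-index access: every arr[i] locates, reads, and decompresses the chunk
# containing row i, so each chunk is decoded thousands of times.
slow = np.array([arr[i] for i in range(arr.shape[0])])

# Bulk access: each chunk is read and decompressed exactly once.
fast = arr[:]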

@Andrewq11

To get some confirmation of how impactful fetching all values within a Zarr column at once could be (for the scenario where a target column is stored via Zarr), I quickly wrote up this function:

    def _fetch_entire_zarr_column(self, col):
        # Use the first row only to resolve the Zarr path backing this column.
        value = self.table.at[0, col]
        path, index = self._split_index_from_path(value)

        # Load the whole array at once so each chunk is decoded a single time,
        # instead of fetching and decoding a chunk per row index.
        arr = self.zarr_data[path]
        data = arr[:]

        return data

It's not pretty; it's just an example. The following are the results for creating the same SingleTaskBenchmarkSpecification as above:

[profiler results screenshot after the change]

In summary, it seems to do the trick (assuming I'm not missing something else, of course). We should strongly consider using something like this when we know we need to fetch all values within a Zarr column. If there are no objections, I can give it a go soon. It would be great to have Cas' thoughts here, but I won't ping him.

cwognum commented Jul 31, 2024

Thanks @Andrewq11 and @zhu0619 !

Every chunk access in Zarr indeed comes at a penalty: you copy the data of that chunk and you decompress it. Using only a small piece of the data in a chunk is thus wasteful.

Unfortunately, I don't think we can assume that we can always load an entire column in memory. A single column may be gigabytes.

The more structural solution would be to make an appropriate choice of chunk size and compression level during dataset creation. For example, if every single float is its own chunk and we disable compression, there wouldn't be as much of a performance penalty. (This leads to some other issues due to the high number of files it creates, but Zarr supports sharding for that. Note, however, that sharding is still experimental!)
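For illustration, that creation-time choice could look like this with plain zarr (the path and shape are made up):

import zarr

# One value per chunk and no compression: random access never reads or
# decompresses more data than it needs (at the cost of many small files).
arr = zarr.open("per_element_column.zarr", mode="w",
                shape=(625_229,), chunks=(1,), dtype="f4", compressor=None)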

For datasets that have a poorly chosen chunk size, we could support rechunking the dataset locally as an optimization step a user can do.
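A local rechunking pass could be as simple as copying into a new array with a read-friendly chunk size, processed in bounded slices so the column never has to fit in memory. A sketch with plain zarr, assuming a 1-D column (a dedicated tool like the rechunker package would also work):

import zarr

src = zarr.open("dataset_column.zarr", mode="r")

# New array with larger chunks; copy slice by slice to keep memory bounded.
dst = zarr.open("dataset_column_rechunked.zarr", mode="w",
                shape=src.shape, chunks=(100_000,), dtype=src.dtype)

step = 100_000
for start in range(0, src.shape[0], step):
    dst[start:start + step] = src[start:start + step]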

Another way we could improve this is, as @Andrewq11 suggested, by improving the access pattern: if you load some data from a chunk, you should try to use as much data from that chunk as possible. This is relevant for data loaders, for example, where out-of-core ML is more efficient than random access. I do think we can apply it here too, but it may get complex...

I wrote two docs with some experiments about Zarr performance here and here. The main outcomes from these docs are also documented in the Zarr tutorial.

cwognum commented Aug 7, 2024

I'm circling back to this issue again as I'm planning for the support of ultra-large datasets.

I think I initially misunderstood the issue. Reading your comments again now, @Andrewq11 is correct that this line of code is incredibly poorly written (my bad!). In theory, I agree with the proposed solution of handling the case of loading an entire column separately.

To solve the issue @zhu0619 originally raised, however, I think we could sample only a subset (e.g. the first 100 rows) to infer the target type. Once we get to datasets with a billion rows, we can no longer assume we can load an entire column into memory.
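A rough sketch of that idea (the function name and the heuristic are hypothetical, not the actual Polaris validator):

import itertools

import numpy as np

def infer_target_type(values, sample_size: int = 100) -> str:
    """Guess a target type from only the first `sample_size` values of a column.

    `values` can be any lazy iterable over the target column, so only
    `sample_size` rows are ever fetched and decoded.
    """
    sample = np.asarray(list(itertools.islice(iter(values), sample_size)))
    # Illustrative heuristic: floating-point values with many distinct entries
    # look like a regression target; everything else is treated as classification.
    if np.issubdtype(sample.dtype, np.floating) and len(np.unique(sample)) > 10:
        return "regression"
    return "classification"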

cwognum changed the title from "Accelerate the benchmark creation for large datasets" to "Inferring the target types during benchmark creation is slow with Zarr datasets" Aug 7, 2024
cwognum commented Aug 7, 2024

@zhu0619 To provide an even more immediate solution, you can simply specify the target types yourself, which prevents the benchmark from trying to infer them automatically.
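For example, reusing the earlier setup (the `target_types` keyword and its accepted values are an assumption here; check the Polaris docs for the exact field name):

# Same dataset/split as in the example above; `target_types` is assumed to map
# each target column to its type so no inference over the Zarr column is needed.
benchmark = SingleTaskBenchmarkSpecification(
    name="bend_histone_modification",
    dataset=dataset,
    target_cols="label",
    input_cols=["sequence"],
    split=split,
    metrics="roc_auc",
    target_types={"label": "classification"},
)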
