pcat.search(...).to_dataset_dict() sometimes slower than it should #253

coxipi · 2023-09-08T15:04:05Z

Setup Information

xscen version: 0.6.12-beta
Python version: 3.11.4
Operating System: CentOS 7 (Doris)

Context

I store my files on "jarre", which is considered a slow disk AFAIK.

Sometimes, pcat.search(...).to_dataset_dict() will take forever to access my files (bad behaviour), while this homemade function:

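import xarray as xr

def my_search(kwargs):
    # Open each matching zarr store directly, skipping the catalog aggregation step
    paths = list(pcat.search(**kwargs).df.path)
    return {p: xr.open_zarr(p) for p in paths}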

runs at a speed similar to the good, expected behaviour of pcat.search(...).to_dataset_dict().

I can't tell what conditions on the server could be related to this problem. The problem sometimes comes, stays for a bit, and then stops.
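Since the slowdown is intermittent, it may help to record timings for both approaches whenever it shows up. The following is a minimal sketch using only the standard library; `time_call` is a hypothetical helper name, not part of xscen or intake-esm.

```python
import time

def time_call(fn, *args, repeats=3, **kwargs):
    """Call fn(*args, **kwargs) `repeats` times; return wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return durations

# Intended use (assumes `pcat`, `kwargs` and `my_search` from the post above):
# slow = time_call(lambda: pcat.search(**kwargs).to_dataset_dict())
# fast = time_call(my_search, kwargs)
```

Repeated calls may be served from the operating system's file cache, so the first duration in each list is probably the most representative of a cold read from the slow disk.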

Is this issue known?


RondeauG · 2023-09-08T15:15:58Z

to_dataset_dict() does more than just open the files. It groups together the files associated with a given dataset, based on the aggregation controls specified in the JSON (by default: id, processing_level, domain, frequency). There's also a semi-custom call to open_dataset --> combine_by_coords, instead of open_mfdataset, although I don't quite remember their reasoning behind it.

For very big catalogs, I could thus see a substantial difference in speed compared to simply opening the files.

That being said, we could see if there are speedups to be accomplished.
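To illustrate the grouping step described above, here is a rough sketch of how catalog rows could be bucketed by the default aggregation columns before any file is opened. This is not intake-esm's actual implementation; `group_paths`, the "."-joined key, and the sample rows are all assumptions made for illustration.

```python
from collections import defaultdict

# Default aggregation columns named in the comment above.
GROUPBY_COLUMNS = ("id", "processing_level", "domain", "frequency")

def group_paths(rows, columns=GROUPBY_COLUMNS):
    """Group catalog rows (dicts) by the given columns; return {key: [paths]}."""
    groups = defaultdict(list)
    for row in rows:
        # Hypothetical key format: column values joined with "."
        key = ".".join(str(row[c]) for c in columns)
        groups[key].append(row["path"])
    return dict(groups)

# Made-up rows: two files belonging to one dataset, one file to another.
common = {"processing_level": "raw", "domain": "QC", "frequency": "day"}
rows = [
    {"id": "simA", **common, "path": "simA_1.zarr"},
    {"id": "simA", **common, "path": "simA_2.zarr"},
    {"id": "simB", **common, "path": "simB.zarr"},
]
groups = group_paths(rows)
```

With one path per key, as in the catalog discussed in this issue, every group holds a single file and there is nothing to combine, so the grouping itself should add little overhead.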

aulemahal · 2023-09-08T15:37:11Z

@coxipi Is your catalog supposed to have aggregation, or is it indeed just a list of independent datasets?

The aggregation can often be sped up by passing these options to to_dataset_dict:

xarray_combine_by_coords_kwargs={'data_vars': 'minimal', 'coords': 'minimal', 'compat': 'override'}

assuming all the elements to be aggregated are well behaved (no overlap between files, all variables of the same name have the same dimensions and the exact same coordinates on the non-appended dims, etc).
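A minimal sketch of passing these options through, assuming `cat` is a catalog object exposing `search(...)` and `to_dataset_dict(...)` as in this thread; the wrapper name `search_to_datasets` is hypothetical.

```python
# Options suggested above. compat="override" tells xarray to take
# non-concatenated variables from the first dataset without comparing them.
FAST_COMBINE = {"data_vars": "minimal", "coords": "minimal", "compat": "override"}

def search_to_datasets(cat, **search_kwargs):
    """Search the catalog and open the results with the lighter combine options."""
    return cat.search(**search_kwargs).to_dataset_dict(
        xarray_combine_by_coords_kwargs=FAST_COMBINE
    )

# Intended use (the search argument is only an example):
# dsets = search_to_datasets(pcat, processing_level="raw")
```

Because these options skip consistency checks, they are only safe under the conditions listed above; with overlapping or inconsistent files the result could be silently wrong.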

coxipi · 2023-09-08T15:49:01Z

Not sure what you mean by "independent datasets". Each key in the dataset dict represents a different simulation (each with its own single path to a zarr) as created in previous steps of the xscen workflow.

aulemahal · 2023-09-08T15:51:12Z

I meant that they are not meant to be merged into a single dataset, the way open_mfdataset would do.

In that case, I'm not sure why to_dataset_dict would be dramatically slower than your function...

aulemahal · 2023-09-08T15:53:20Z

> There's also a semi-custom call to open_dataset --> combine_by_coords, instead of open_mfdataset, although I don't quite remember their reasoning behind it.

@RondeauG, in to_dataset_dict the aggregation is entirely driven by the catalog columns and configuration. In open_mfdataset, the aggregation is guessed by xarray by analyzing the coordinates.

Note: if the path column contains a *, open_mfdataset will be used, so one can combine both methods.

coxipi changed the title from "Homemade pcat search function sometimes faster" to "pcat.search(...).to_dataset_dict() sometimes slower than it should" · Sep 8, 2023
