Quantile static #32

Merged
merged 11 commits into main on Mar 26, 2024
Conversation

WillyChap
Collaborator

This pull request provides access to the quantile transform scaler (via the bridgescaler package). It returns the exact same tensor structure as our current scaling, so it should cause no pipeline issues. I am optimistic it will help training a lot.

Here is the usage:

Dataset = ERA5Dataset(
    filenames=[FNS[nunu]],
    history_len=history_len,
    forecast_len=forecast_len,
    skip_periods=1,
    transform=transforms.Compose([
        NormalizeState_Quantile(scaler_file=conf["data"]["quant_path"]),
        ToTensor(history_len=history_len, forecast_len=forecast_len, static_variables=conf["data"]["static_vars"]),
    ]),
)
print(FNS[nunu])
BB_trancs_quant = Dataset.__getitem__(8784)

Dataset = ERA5Dataset(
    filenames=[FNS[nunu]],
    history_len=history_len,
    forecast_len=forecast_len,
    skip_periods=1,
    transform=transforms.Compose([
        NormalizeState(conf["data"]["mean_path"], conf["data"]["std_path"]),
        ToTensor(history_len=history_len, forecast_len=forecast_len, static_variables=conf["data"]["static_vars"]),
    ]),
)
print(FNS[nunu])
BB_trancs_std = Dataset.__getitem__(8784)
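As a quick sanity check that the two transforms really do produce identically structured samples (this assumes the sample is a dict of tensors keyed by field name, which is my assumption, not a confirmed detail):

for key in BB_trancs_std:
    assert BB_trancs_quant[key].shape == BB_trancs_std[key].shape, key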

I am requesting that two variables be added to the data section of the config file, though I don't think the layout is set in stone (I am happy to adjust once I know what needs to be added). They are currently integrated into crossformer.yml.

The effect of the quantile scaler is apparent. Here is the difference in Q at upper levels between standard scaling (bottom) and quantile scaling (top).

[image: Q at upper levels, quantile scaling (top) vs standard scaling (bottom)]

Additionally, I have added a 'static_variables' option to the ToTensor transform. It now returns a field 'static' that provides the land-sea mask (scaled 0-1) and the topography (scaled 0-2). See below:

[image: the 'static' field, land-sea mask and topography]
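A minimal sketch of how the new field might be inspected (the 'static' key comes from the description above; the shape is an assumption):

sample = Dataset.__getitem__(8784)
static = sample["static"]          # land-sea mask + topography stacked on the variable axis
print(static.shape)                # assumed (2, H, W)
print(static.min(), static.max())  # mask in [0, 1], topography in [0, 2]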

This addresses #24

@djgagne
Collaborator

djgagne commented Mar 22, 2024

I am impressed with the difference in resolution provided by the quantile transformer. There appear to be some missing imports in transforms.py. Please add

import pandas as pd
from bridgescaler import read_scaler

at the top of transforms.py.

./build/lib/credit/transforms.py:108:26: F821 undefined name 'pd'
        self.scaler_df = pd.read_parquet(scaler_file)
                         ^
./build/lib/credit/transforms.py:109:61: F821 undefined name 'read_scaler'
        self.scaler_3ds = self.scaler_df["scaler_3d"].apply(read_scaler)
                                                            ^
./build/lib/credit/transforms.py:110:68: F821 undefined name 'read_scaler'
        self.scaler_surfs = self.scaler_df["scaler_surface"].apply(read_scaler)
                                                                   ^
./build/lib/credit/transforms.py:160:49: F821 undefined name 'pd'
                    e3d = xr.concat(var_slices, pd.Index(var_levels, name="variable")
                                                ^
./build/lib/credit/transforms.py:169:99: F821 undefined name 'pd'
                    e_surf = xr.concat([value[v].sel(time=time) for v in self.surface_variables], pd.Index(self.surface_variables, name="variable")
                                                                                                  ^
./credit/transforms.py:108:26: F821 undefined name 'pd'
        self.scaler_df = pd.read_parquet(scaler_file)
                         ^
./credit/transforms.py:109:61: F821 undefined name 'read_scaler'
        self.scaler_3ds = self.scaler_df["scaler_3d"].apply(read_scaler)
                                                            ^
./credit/transforms.py:110:68: F821 undefined name 'read_scaler'
        self.scaler_surfs = self.scaler_df["scaler_surface"].apply(read_scaler)
                                                                   ^
./credit/transforms.py:160:49: F821 undefined name 'pd'
                    e3d = xr.concat(var_slices, pd.Index(var_levels, name="variable")
                                                ^
./credit/transforms.py:169:99: F821 undefined name 'pd'
                    e_surf = xr.concat([value[v].sel(time=time) for v in self.surface_variables], pd.Index(self.surface_variables, name="variable")
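For clarity, here is a sketch of the fixed scaler loading in transforms.py, reconstructed from the flake8 output above (the surrounding class code is assumed):

import pandas as pd
from bridgescaler import read_scaler

class NormalizeState_Quantile:
    def __init__(self, scaler_file):
        # parquet table holding the serialized bridgescaler objects
        self.scaler_df = pd.read_parquet(scaler_file)
        self.scaler_3ds = self.scaler_df["scaler_3d"].apply(read_scaler)
        self.scaler_surfs = self.scaler_df["scaler_surface"].apply(read_scaler)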

@djgagne left a comment

I think some big speedups could be made if you use the channels_last=False transform functionality that I just added to bridgescaler this week. You shouldn't have to retrain the current scalers just yet.


def inverse_transform(self, x: torch.Tensor) -> torch.Tensor:
    device = x.device
    tensor = x[:, :(len(self.variables) * self.levels), :, :]  # B, Var, H, W

I would store len(self.variables) as an attribute in __init__ somewhere. That should save you some time, especially given how many times it is called here.
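A minimal sketch of that suggestion (the class is stripped down and the attribute name is illustrative, not the actual transform):

class NormalizeState_Quantile:
    def __init__(self, variables, levels):
        self.variables = variables
        self.levels = levels
        # cache once instead of recomputing len() on every call
        self.num_upper_air = len(self.variables) * self.levels

    def inverse_transform(self, x):
        # slice out the upper-air channels: B, Var, H, W
        return x[:, :self.num_upper_air, :, :]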

transformed_surface_tensor = surface_tensor.clone()
# 3d vars
rscal_3d = np.transpose(torch.Tensor.numpy(x[:, :(len(self.variables) * self.levels), :, :].values), (0, 2, 3, 1))
self.scaler_3d.inverse_transform(rscal_3d)

There's a channels_last flag in the bridgescaler distributed transform and inverse_transform methods that can be set to False to do the transform in channels_first order even if the scaler was trained on channels_last data. You shouldn't have to reshape to channels_last anymore.
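A sketch of what the call could then look like (channels_last is the bridgescaler flag described above; the rest, including the cached num_upper_air from the earlier sketch, is an assumption):

# inverse transform in channels_first order, no np.transpose needed
rscal_3d = x[:, :self.num_upper_air, :, :].cpu().numpy()  # B, C, H, W
inv_3d = self.scaler_3d.inverse_transform(rscal_3d, channels_last=False)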

@dkimpara
Collaborator

trainer.py and predict.py need to be updated to incorporate the new quantile scaler. I think another config flag needs to be added to facilitate this too - should we do this in this PR or another one?
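One possible shape for that wiring, as a hedged sketch ("scaler_type" is a hypothetical flag name; the class names come from this PR):

# hypothetical selection logic in trainer.py / predict.py
if conf["data"].get("scaler_type", "std") == "quantile":
    state_transform = NormalizeState_Quantile(scaler_file=conf["data"]["quant_path"])
else:
    state_transform = NormalizeState(conf["data"]["mean_path"], conf["data"]["std_path"])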

@dkimpara
Collaborator

dkimpara commented Mar 22, 2024

Will, as discussed, please:

  • remove hardcoded conf variables
  • have classes take in just conf as an arg
  • add logging to say which scaler is used and whether static vars are in use (see the sketch below)
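A minimal sketch of the last two items, assuming the conf keys shown earlier in this thread:

import logging

logger = logging.getLogger(__name__)

class NormalizeState_Quantile:
    def __init__(self, conf):
        # everything comes from conf; no hardcoded paths
        self.scaler_file = conf["data"]["quant_path"]
        self.static_variables = conf["data"].get("static_vars")
        logger.info("Scaler: quantile (%s)", self.scaler_file)
        logger.info("Static variables: %s", self.static_variables or "none")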

@dkimpara
Collaborator

static variables not yet in 1deg file: "/glade/u/home/wchapman/MLWPS/DataLoader/static_variables_ERA5_zhght_onedeg.nc"

@dkimpara left a comment

Quantile + static vars are integrated and tested in train.py, pending speedups to the quantile scaler in bridgescaler.

@yingkaisha please see the new transforms and integrate them into predict.py; let me know if you'd like any help. Not sure if the inverse transform for the quantile scaler exists yet; we might need to do that in another PR.

@WillyChap
Collaborator Author

static variables not yet in 1deg file: "/glade/u/home/wchapman/MLWPS/DataLoader/static_variables_ERA5_zhght_onedeg.nc"

See path: '/glade/u/home/wchapman/MLWPS/DataLoader/LSM_static_variables_ERA5_zhght_onedeg.nc'

@jsschreck merged commit 60429e2 into main Mar 26, 2024
1 check passed