Feature/behavior summary
Given that properties from different datasets can span large dynamic ranges and/or be strongly non-Gaussian, we should design a framework for modifying and transforming labels, ideally just before loss calculation. As part of this, it may be advantageous to calculate dataset-wide statistics on the fly, with caching.
Request attributes
Would this be a refactor of existing code?
Does this proposal require new package dependencies?
Would this change break backwards compatibility?
Does this proposal include a new model?
Does this proposal include a new dataset?
Does this proposal include a new task/workflow?
Related issues
#75 pertains to an issue with normalization not being applied; this solution would supersede it.
Solution description
One solution would be to implement this as a subclass of transform that mutates data in-place.
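A minimal sketch of what the base class could look like (the `AbstractLabelTransform` name comes from the subclasses below; the `apply` method and the dict-based sample layout are assumptions, not existing API):

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class AbstractLabelTransform(ABC):
    """Mutates one label of a sample dict in-place, just before loss calculation."""

    def __init__(self, key: str) -> None:
        # Which label in the sample dict this transform applies to
        self.key = key

    def __call__(self, data: Dict[str, Any]) -> Dict[str, Any]:
        # In-place mutation: downstream consumers see the transformed label
        data[self.key] = self.apply(data[self.key])
        return data

    @abstractmethod
    def apply(self, value: Any) -> Any:
        """Transform a single label value; implemented by concrete subclasses."""
```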
On-the-fly statistics could be calculated with a running accumulator and cached to disk, keyed on the dataset class and the dataset path. The main open question is synchronization: in DDP scenarios, we'd want to make sure the statistics are identical across data loader workers, which would probably require some reduction call (see the sketch below).
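As a rough illustration of that idea (not an existing API; `RunningStats` and `stats_cache_path` are hypothetical names): accumulate count/sum/sum-of-squares so the statistics can be merged exactly across ranks with a single `all_reduce`, and key the cache file on the dataset class and path:

```python
import hashlib
from pathlib import Path

import torch
import torch.distributed as dist


class RunningStats:
    """Accumulates count/sum/sum-of-squares so stats merge exactly across workers."""

    def __init__(self) -> None:
        self.count, self.total, self.total_sq = 0, 0.0, 0.0

    def update(self, x: torch.Tensor) -> None:
        x = x.detach().float().flatten()
        self.count += x.numel()
        self.total += x.sum().item()
        self.total_sq += (x * x).sum().item()

    @property
    def mean(self) -> float:
        return self.total / max(self.count, 1)

    @property
    def std(self) -> float:
        var = self.total_sq / max(self.count, 1) - self.mean**2
        return max(var, 0.0) ** 0.5

    def synchronize(self) -> None:
        # The reduction call: sum the sufficient statistics over all DDP ranks
        # so every worker derives identical mean/std.
        if dist.is_available() and dist.is_initialized():
            packed = torch.tensor([float(self.count), self.total, self.total_sq])
            dist.all_reduce(packed, op=dist.ReduceOp.SUM)
            self.count = int(packed[0].item())
            self.total, self.total_sq = packed[1].item(), packed[2].item()


def stats_cache_path(dataset_cls: type, dataset_path: str) -> Path:
    # Cache keyed on the dataset class and the dataset path, as suggested above
    digest = hashlib.md5(str(dataset_path).encode()).hexdigest()[:8]
    return Path(".label_stats") / f"{dataset_cls.__name__}-{digest}.json"
```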
We can then implement concrete versions of the transforms:
```python
class NormalTransform(AbstractLabelTransform):
    # rescales based on mean/std
    ...

class MinMaxTransform(AbstractLabelTransform):
    # rescales to [min, max] of specified value, or dataset
    ...

class LambdaTransform(AbstractLabelTransform):
    # this is a bit dicey, but apply an arbitrary function to a key
    ...

class ExponentialTransform(AbstractLabelTransform):
    # many properties have long-tailed distributions
    ...
```
The idea is that these can be freely composed, so that different labels are transformed in different ways.
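For instance, composition could look like the following; the label keys and constructor arguments here are purely illustrative:

```python
import torch

# Hypothetical label keys and arguments, one transform chain per dataset
transforms = [
    NormalTransform(key="energy"),                 # -> zero mean / unit std
    MinMaxTransform(key="band_gap"),               # -> rescaled to [0, 1]
    LambdaTransform(key="force", fn=torch.log1p),  # -> arbitrary function of a key
]


def transform_labels(data: dict) -> dict:
    # Each transform only touches its own key, so chains compose freely
    for t in transforms:
        data = t(data)
    return data
```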
Alternatively:
As a pl.Callback: since it has access to discrete before/after step hooks, it could be helpful for getting at batch data.
We could move the existing normalization steps out of _compute_losses into such a callback. However, caching and the like wouldn't be as flexible.
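A minimal sketch of the callback variant (`on_train_batch_start` is a real Lightning hook; the transform plumbing is assumed):

```python
import pytorch_lightning as pl


class LabelTransformCallback(pl.Callback):
    """Applies label transforms to each batch right before the training step."""

    def __init__(self, transforms) -> None:
        self.transforms = transforms

    # Hook signature per recent Lightning versions; older ones also pass dataloader_idx
    def on_train_batch_start(self, trainer, pl_module, batch, batch_idx) -> None:
        # Batch dicts are mutated in-place, so the step sees transformed labels
        for t in self.transforms:
            t(batch)
```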
Additional notes
A task list based on the transform-based solution (convert to issues/PRs for tracking):