Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

62 bug mae method doesnt like pdseries when using dimension preservation #64

Conversation

tennlee
Copy link
Collaborator

@tennlee tennlee commented Sep 18, 2023

It is not clear that this implementation is the best way to resolve the underlying issue, but does present a possible way forward for now, for these specific scores. Comments appreciated.

This approach aims to make the returns types consistent with what would be expected from a pure-Pandas score implementation. This means slightly second-guessing the results from the to_pandas() method from xarray which returns a numpy.ndarray at times when a numpy.float64 seems more appropriate.

Adding additional test coverage of different scenarios working with more complex pandas input data would be helpful.

@bomRob
Copy link
Collaborator

bomRob commented Sep 18, 2023

What a turn around! Service in this place is great.

5 Stars! - would recommend...

Copy link

@benowen-bom benowen-bom left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes in this PR put forward the solution of converting pd.Series inputs for fcst and obs args to xr.DataArray objects to facilitate the calculation and then convert back to pd.Series. In doing so this extends the capability of doing dimensional reduction for pd.Series objects enabling use of the preserve_dims (and reduce_dims) argument addressing the #62.

This choice seems reasonable, but it does set a precedent around supported behaviour for pd.Series and mixed type inputs; it is also not obvious to me whether these functions do support pd.DataFrame and xr.Dataset objects. What is the intended usage here and more generally across scores?

Based on the choices above, I think further tests should be included to capture the full array of possible usage cases.

Comment on lines 44 to 55
as_pandas_series = False
both_pandas = False
if type(fcst) == pandas.Series:
fcst = fcst.to_xarray()
as_pandas_series = True

if type(obs) == pandas.Series:
obs = obs.to_xarray()
as_pandas_series = True
if as_pandas_series == True:
both_pandas = True

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This snippet could be put into a separate function given it is also used in mae.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While that's true, I think it would make the implementation more complex. I'll keep it in mind for the future, particularly if the same approach is taken in any additional functions.

Comment on lines 71 to 74
if both_pandas:
_mse = _mse.to_pandas()
if isinstance(_mse, numpy.ndarray):
_mse = numpy.float64(_mse)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a couple of questions below which are more general considerations than exactly what has been put forward in this PR, but I think are important to the expected behaviour of the functions here:

  1. Has the code been tested where one or both of fcst and obs are pd.DataFrame or xr.Dataset. Given these types contain multiple variables, how do the fcst/obs data align in these instances?

  2. The solution in this PR imparts the following behaviour based on input datatypes (I've neglected xr.Datasets and pd.DataFrames here following on from 1):

    • fcst, obs both xr.DataArray -> return type is xr.DataArray
    • one of fcst, obs is pd.Series, other is xr.DataArray -> return type is xr.DataArray
    • fcst, obs both pd.Series -> return type is pd.Series

    Given the precedent set here, is the intended behaviour for other functions within scores library?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding (1) I have added more test cases to cover more varied types. It works well, but there are some requirements on the naming of coordinates for mixed types, as pandas when converted to xarray calls its variable 'index', so the obs dimension must also be called 'index'. This is a gotcha, but an elegant solution doesn't spring to mind. I will keep an eye out for more complex examples to add to the testing in future to try to make sure things work okay in practice.

Regarding (2), yes I think the behaviour in this MR a good precedence. Support pandas where there is the opportunity, but if xarray is 'in the mix', then that's what you get back. Thanks for raising it for mindful consideration.

Comment on lines 71 to 74
if both_pandas:
_mse = _mse.to_pandas()
if isinstance(_mse, numpy.ndarray):
_mse = numpy.float64(_mse)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Conversion here to np.float64. If inputs are of another type (say float32), should the return type match that of the inputs?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a really good question and I don't immediately know the answer. I think the current implementation is unlikely to cause major issues in that circumstance because you're going to get a compatible type rather than an incompatible one. It would, I suppose, be possible to capture the type of the original fcst/obs data and pick the highest precision of those types, but I'm not even sure that's the right thing to do. I'll have a bit more of a think on it. Feel free to contribute a test case if you think there's a situation where the current implementation might cause an issue.

Comment on lines 71 to 74
if both_pandas:
_mse = _mse.to_pandas()
if isinstance(_mse, numpy.ndarray):
_mse = numpy.float64(_mse)

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, the code snippet here is very similar to that within the mae function. Does this warrant putting this into a separate function?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per previously, I'm going with 'not quite yet'. If another function does the same thing, then I think it's worth it, until then, I think it complicates the implementation.

obs = pd.Series(scores.sample_data.simple_observations())

# Test MSE
xr_preserved = scores.continuous.mse(xda_fcst, xda_obs, preserve_dims='all')

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
xr_preserved = scores.continuous.mse(xda_fcst, xda_obs, preserve_dims='all')
xr_preserved = scores.continuous.mse(xda_fcst, xda_obs, preserve_dims='all')

xr_preserved = scores.continuous.mse(xda_fcst, xda_obs, preserve_dims='all')
pd_preserved = scores.continuous.mse(fcst, obs, preserve_dims='all')
assert (pd_preserved == pd.Series([1, 1, 1, 1, 9, 9, 9, 9])).all()
assert (pd_preserved == xr_preserved).all()

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To me this assert is comparing two different datatypes (xr.DataArray and pd.Series). I honestly haven'y tried this myself, but based on the logic here I presume it is comparing the underlying numpy.ndarray underneath both these objects?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's checking the values are equal, and that equality doesn't change depending on data type. What the actual implementation of the equal method is, I am not sure. The test is confirming that the expectation is preserved across the types.

Comment on lines 43 to 46
def test_pandas_series_preserve():
"""
Test calculation works correctly on pandas series
"""

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All the functions in src/scores/continuous.py explicitly state that Dimensional reduction is not supported for pandas, however here preserve_dims='all' is used. Given the change in how the function evaluates the metric for pd.Series objects, it would seem that dimension reduction is now supported for pd.Series objects? If that is true, the comments should be updated to reflect this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How dimension handling should be done for pandas is ... interesting. For the Series specifically, it is inherently a one-dimensional data type (even if the elements have a length), so preserving and conserving don't have an effect. So it's more like preserve/reduce doesn't specifically break for series. For a dataframe, however, they could have multiple dimensions, and preservation and reduction are not implemented for dataframes. It's worthy of consideration, but a lot of use cases will be supported through series/columns, rather than needing to pass in a full dataframe.

I have tried to extend the implementation just one step in this MR. I'll raise a separate issue for dataframe handling. I have clarified that dimensional reduction isn't supported for pandas dataframes.

@tennlee
Copy link
Collaborator Author

tennlee commented Sep 26, 2023

Thanks so much for your constructive feedback. Some of those questions will take some 'thinking time' to respond to, and it's a busy week this week, so it may take a little time to resolve. I just wanted to pop something in right away to let you know it's been read and appreciated.

Issue - unclear what the best return type should be for various arguments.
Pandas and xarray.to_pandas have some inconsistent edge cases
@tennlee tennlee force-pushed the 62-bug-mae-method-doesnt-like-pdseries-when-using-dimension-preservation branch from dbe0bc4 to e5a2e19 Compare October 19, 2023 05:36
@tennlee
Copy link
Collaborator Author

tennlee commented Oct 19, 2023

I have just added an improvement whereby the pandas series name is used to infer matching against the proper xarray dimension name to improve robustness, and an error message to catch the situation where this hasn't been done properly.

Copy link
Collaborator

@aidanjgriffiths aidanjgriffiths left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before this MR goes ahead we need to have a frank discussion on whether there is value in committing this multi-input type format. Right now scores is a candidate to replace some internal projects which heavily use xarray and I believe that should be the scope of this package. I think supporting more than just DataArrays and Datasets will add a lot of testing and development overhead which will out weigh any potential benefits. There's not really a barrier of entry if you want to use pd.Series objects since they are easily converted to xr.DataArray's but that should occur outside of the core metric functions.

Comment on lines +209 to +237
fcst_is_pandas = isinstance(fcst, pandas.Series)
obs_is_pandas = isinstance(obs, pandas.Series)
both_pandas = fcst_is_pandas and obs_is_pandas
either_pandas = fcst_is_pandas or obs_is_pandas
fcst_is_xarray = isinstance(fcst, (xarray.Dataset,xarray.DataArray))
obs_is_xarray = isinstance(obs, (xarray.Dataset,xarray.DataArray))
either_xarray = fcst_is_xarray or obs_is_xarray
mixed_pandas_xarray = either_pandas and either_xarray

if isinstance(fcst, pandas.Series):

fcst = fcst.to_xarray()
if mixed_pandas_xarray:
series_name = fcst.name
if series_name is None:
raise ValueError(PANDAS_NAME_MATCHING_WARNING)

fcst = fcst.rename({'index': series_name})

if isinstance(obs, pandas.Series):

obs = obs.to_xarray()

if mixed_pandas_xarray:
series_name = fcst.name
if series_name is None:
raise ValueError(PANDAS_NAME_MATCHING_WARNING)

obs = obs.rename({'index': series_name})
Copy link
Collaborator

@aidanjgriffiths aidanjgriffiths Oct 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @benowen-bom, at the very least this behaviour should be in its own function and if I had my own way, outside the scope of this function.

@tennlee tennlee closed this Oct 25, 2023
@tennlee tennlee deleted the 62-bug-mae-method-doesnt-like-pdseries-when-using-dimension-preservation branch March 29, 2024 01:34
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG - MAE method doesn't like pd.Series when using dimension preservation
4 participants