62 bug mae method doesnt like pdseries when using dimension preservation #64

tennlee · 2023-09-18T05:38:51Z

It is not clear that this implementation is the best way to resolve the underlying issue, but does present a possible way forward for now, for these specific scores. Comments appreciated.

This approach aims to make the returns types consistent with what would be expected from a pure-Pandas score implementation. This means slightly second-guessing the results from the to_pandas() method from xarray which returns a numpy.ndarray at times when a numpy.float64 seems more appropriate.

Adding additional test coverage of different scenarios working with more complex pandas input data would be helpful.

bomRob · 2023-09-18T05:48:52Z

What a turn around! Service in this place is great.

5 Stars! - would recommend...

benowen-bom

The changes in this PR put forward the solution of converting pd.Series inputs for fcst and obs args to xr.DataArray objects to facilitate the calculation and then convert back to pd.Series. In doing so this extends the capability of doing dimensional reduction for pd.Series objects enabling use of the preserve_dims (and reduce_dims) argument addressing the #62.

This choice seems reasonable, but it does set a precedent around supported behaviour for pd.Series and mixed type inputs; it is also not obvious to me whether these functions do support pd.DataFrame and xr.Dataset objects. What is the intended usage here and more generally across scores?

Based on the choices above, I think further tests should be included to capture the full array of possible usage cases.

benowen-bom · 2023-09-26T00:31:00Z

src/scores/continuous.py

+    as_pandas_series = False
+    both_pandas = False
+    if type(fcst) == pandas.Series:
+        fcst = fcst.to_xarray()
+        as_pandas_series = True
+
+    if type(obs) == pandas.Series:
+        obs = obs.to_xarray()
+        as_pandas_series = True
+        if as_pandas_series == True:
+            both_pandas = True
+


This snippet could be put into a separate function given it is also used in mae.

While that's true, I think it would make the implementation more complex. I'll keep it in mind for the future, particularly if the same approach is taken in any additional functions.

benowen-bom · 2023-09-26T00:45:20Z

src/scores/continuous.py

+    if both_pandas:
+        _mse = _mse.to_pandas()
+        if isinstance(_mse, numpy.ndarray):
+            _mse = numpy.float64(_mse)        


I have a couple of questions below which are more general considerations than exactly what has been put forward in this PR, but I think are important to the expected behaviour of the functions here:

Has the code been tested where one or both of fcst and obs are pd.DataFrame or xr.Dataset. Given these types contain multiple variables, how do the fcst/obs data align in these instances?

The solution in this PR imparts the following behaviour based on input datatypes (I've neglected xr.Datasets and pd.DataFrames here following on from 1):

fcst, obs both xr.DataArray -> return type is xr.DataArray

one of fcst, obs is pd.Series, other is xr.DataArray -> return type is xr.DataArray

fcst, obs both pd.Series -> return type is pd.Series

Given the precedent set here, is the intended behaviour for other functions within scores library?

Regarding (1) I have added more test cases to cover more varied types. It works well, but there are some requirements on the naming of coordinates for mixed types, as pandas when converted to xarray calls its variable 'index', so the obs dimension must also be called 'index'. This is a gotcha, but an elegant solution doesn't spring to mind. I will keep an eye out for more complex examples to add to the testing in future to try to make sure things work okay in practice.

Regarding (2), yes I think the behaviour in this MR a good precedence. Support pandas where there is the opportunity, but if xarray is 'in the mix', then that's what you get back. Thanks for raising it for mindful consideration.

benowen-bom · 2023-09-26T00:47:00Z

src/scores/continuous.py

+    if both_pandas:
+        _mse = _mse.to_pandas()
+        if isinstance(_mse, numpy.ndarray):
+            _mse = numpy.float64(_mse)        


Conversion here to np.float64. If inputs are of another type (say float32), should the return type match that of the inputs?

That's a really good question and I don't immediately know the answer. I think the current implementation is unlikely to cause major issues in that circumstance because you're going to get a compatible type rather than an incompatible one. It would, I suppose, be possible to capture the type of the original fcst/obs data and pick the highest precision of those types, but I'm not even sure that's the right thing to do. I'll have a bit more of a think on it. Feel free to contribute a test case if you think there's a situation where the current implementation might cause an issue.

benowen-bom · 2023-09-26T00:48:46Z

src/scores/continuous.py

+    if both_pandas:
+        _mse = _mse.to_pandas()
+        if isinstance(_mse, numpy.ndarray):
+            _mse = numpy.float64(_mse)        


Again, the code snippet here is very similar to that within the mae function. Does this warrant putting this into a separate function?

As per previously, I'm going with 'not quite yet'. If another function does the same thing, then I think it's worth it, until then, I think it complicates the implementation.

benowen-bom · 2023-09-26T00:51:40Z

tests/scores/test_continuous.py

+    obs = pd.Series(scores.sample_data.simple_observations())
+
+    # Test MSE
+    xr_preserved =  scores.continuous.mse(xda_fcst, xda_obs, preserve_dims='all')


Suggested change

xr_preserved = scores.continuous.mse(xda_fcst, xda_obs, preserve_dims='all')

xr_preserved = scores.continuous.mse(xda_fcst, xda_obs, preserve_dims='all')

benowen-bom · 2023-09-26T00:55:58Z

tests/scores/test_continuous.py

+    xr_preserved =  scores.continuous.mse(xda_fcst, xda_obs, preserve_dims='all')
+    pd_preserved = scores.continuous.mse(fcst, obs, preserve_dims='all')
+    assert (pd_preserved == pd.Series([1, 1, 1, 1, 9, 9, 9, 9])).all()
+    assert (pd_preserved == xr_preserved).all()


To me this assert is comparing two different datatypes (xr.DataArray and pd.Series). I honestly haven'y tried this myself, but based on the logic here I presume it is comparing the underlying numpy.ndarray underneath both these objects?

I think it's checking the values are equal, and that equality doesn't change depending on data type. What the actual implementation of the equal method is, I am not sure. The test is confirming that the expectation is preserved across the types.

benowen-bom · 2023-09-26T00:59:48Z

tests/scores/test_continuous.py

+def test_pandas_series_preserve():
+    """
+    Test calculation works correctly on pandas series
+    """


All the functions in src/scores/continuous.py explicitly state that Dimensional reduction is not supported for pandas, however here preserve_dims='all' is used. Given the change in how the function evaluates the metric for pd.Series objects, it would seem that dimension reduction is now supported for pd.Series objects? If that is true, the comments should be updated to reflect this.

How dimension handling should be done for pandas is ... interesting. For the Series specifically, it is inherently a one-dimensional data type (even if the elements have a length), so preserving and conserving don't have an effect. So it's more like preserve/reduce doesn't specifically break for series. For a dataframe, however, they could have multiple dimensions, and preservation and reduction are not implemented for dataframes. It's worthy of consideration, but a lot of use cases will be supported through series/columns, rather than needing to pass in a full dataframe.

I have tried to extend the implementation just one step in this MR. I'll raise a separate issue for dataframe handling. I have clarified that dimensional reduction isn't supported for pandas dataframes.

tennlee · 2023-09-26T02:06:44Z

Thanks so much for your constructive feedback. Some of those questions will take some 'thinking time' to respond to, and it's a busy week this week, so it may take a little time to resolve. I just wanted to pop something in right away to let you know it's been read and appreciated.

Issue - unclear what the best return type should be for various arguments. Pandas and xarray.to_pandas have some inconsistent edge cases

…pandas users

…das/xarray inputs

tennlee · 2023-10-19T07:47:02Z

I have just added an improvement whereby the pandas series name is used to infer matching against the proper xarray dimension name to improve robustness, and an error message to catch the situation where this hasn't been done properly.

aidanjgriffiths

Before this MR goes ahead we need to have a frank discussion on whether there is value in committing this multi-input type format. Right now scores is a candidate to replace some internal projects which heavily use xarray and I believe that should be the scope of this package. I think supporting more than just DataArrays and Datasets will add a lot of testing and development overhead which will out weigh any potential benefits. There's not really a barrier of entry if you want to use pd.Series objects since they are easily converted to xr.DataArray's but that should occur outside of the core metric functions.

aidanjgriffiths · 2023-10-20T04:05:08Z

src/scores/continuous/standard_impl.py

+    fcst_is_pandas = isinstance(fcst, pandas.Series)
+    obs_is_pandas = isinstance(obs, pandas.Series)
+    both_pandas = fcst_is_pandas and obs_is_pandas
+    either_pandas = fcst_is_pandas or obs_is_pandas
+    fcst_is_xarray = isinstance(fcst, (xarray.Dataset,xarray.DataArray))
+    obs_is_xarray = isinstance(obs, (xarray.Dataset,xarray.DataArray))
+    either_xarray = fcst_is_xarray or obs_is_xarray
+    mixed_pandas_xarray = either_pandas and either_xarray
+
+    if isinstance(fcst, pandas.Series):
+
+        fcst = fcst.to_xarray()        
+        if mixed_pandas_xarray:
+            series_name = fcst.name        
+            if series_name is None:
+                raise ValueError(PANDAS_NAME_MATCHING_WARNING)
+
+            fcst = fcst.rename({'index': series_name})
+
+    if isinstance(obs, pandas.Series):
+
+        obs = obs.to_xarray()
+
+        if mixed_pandas_xarray:
+            series_name = fcst.name
+            if series_name is None:
+                raise ValueError(PANDAS_NAME_MATCHING_WARNING)
+
+            obs = obs.rename({'index': series_name})


I agree with @benowen-bom, at the very least this behaviour should be in its own function and if I had my own way, outside the scope of this function.

tennlee linked an issue Sep 18, 2023 that may be closed by this pull request

BUG - MAE method doesn't like pd.Series when using dimension preservation #62

Closed

tennlee requested review from nicholasloveday and aidanjgriffiths September 19, 2023 00:59

benowen-bom reviewed Sep 26, 2023

View reviewed changes

tennlee added 2 commits October 19, 2023 16:35

Draft implementation of scoring for Pandas Series type

aabc71d

Issue - unclear what the best return type should be for various arguments. Pandas and xarray.to_pandas have some inconsistent edge cases

Adjust return from values to match what would be more consistent for …

e5a2e19

…pandas users

tennlee force-pushed the 62-bug-mae-method-doesnt-like-pdseries-when-using-dimension-preservation branch from dbe0bc4 to e5a2e19 Compare October 19, 2023 05:36

tennlee added 4 commits October 19, 2023 17:18

Added examples showing the use of data frames, and mixed types

a48069a

Clarify docstrings around dataframe dimension handling

72f5d73

Lint improvements

93e5fdd

Added more elegant handling and improved error messages for mixed pan…

ebd16f6

…das/xarray inputs

aidanjgriffiths reviewed Oct 20, 2023

View reviewed changes

tennlee closed this Oct 25, 2023

tennlee deleted the 62-bug-mae-method-doesnt-like-pdseries-when-using-dimension-preservation branch March 29, 2024 01:34

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

62 bug mae method doesnt like pdseries when using dimension preservation #64

62 bug mae method doesnt like pdseries when using dimension preservation #64

tennlee commented Sep 18, 2023

bomRob commented Sep 18, 2023

benowen-bom left a comment

benowen-bom Sep 26, 2023

tennlee Oct 19, 2023

benowen-bom Sep 26, 2023

tennlee Oct 19, 2023

benowen-bom Sep 26, 2023

tennlee Oct 19, 2023

benowen-bom Sep 26, 2023

tennlee Oct 19, 2023

benowen-bom Sep 26, 2023

benowen-bom Sep 26, 2023

tennlee Oct 19, 2023

benowen-bom Sep 26, 2023

tennlee Oct 19, 2023

tennlee commented Sep 26, 2023

tennlee commented Oct 19, 2023

aidanjgriffiths left a comment

aidanjgriffiths Oct 20, 2023 •

edited

Loading

	xr_preserved = scores.continuous.mse(xda_fcst, xda_obs, preserve_dims='all')
	xr_preserved = scores.continuous.mse(xda_fcst, xda_obs, preserve_dims='all')

62 bug mae method doesnt like pdseries when using dimension preservation #64

62 bug mae method doesnt like pdseries when using dimension preservation #64

Conversation

tennlee commented Sep 18, 2023

bomRob commented Sep 18, 2023

benowen-bom left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

tennlee commented Sep 26, 2023

tennlee commented Oct 19, 2023

aidanjgriffiths left a comment

Choose a reason for hiding this comment

aidanjgriffiths Oct 20, 2023 • edited Loading

Choose a reason for hiding this comment

aidanjgriffiths Oct 20, 2023 •

edited

Loading