
Let users optionally derive bulk statistics of the data points belonging to each feature #293

Merged

Conversation

JuliaKukulies
Member

Finally a PR that solves #153 by implementing optional extraction of statistics on the data points belonging to each feature. This is done in the segmentation step and, if requested by the user, produces additional columns in the segmentation output dataframe.

Added tests and updated example notebook and documentation accordingly.

As discussed in #153, a follow-up after the transition to xarray will be to save not only the bulk statistics of features but also the locations of all feature grid points. This is indirectly possible via the mask files, which can be combined with the actual data, but saving all feature data points would simplify this analysis step for the user.
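In rough terms, the new option reduces the data values inside each feature's segmentation mask to summary values and attaches them as columns. A minimal sketch of the idea in plain numpy/pandas (function and argument names are invented for illustration, not the actual tobac API):

```python
import numpy as np
import pandas as pd

def bulk_statistics(field, seg_mask, feature_ids, statistic=np.mean):
    """Compute one statistic over the data points of each segmented feature.

    field and seg_mask are arrays of the same shape; seg_mask holds the
    feature label at each grid point (0 = unsegmented). Hypothetical
    helper, not the actual tobac implementation.
    """
    values = {fid: statistic(field[seg_mask == fid]) for fid in feature_ids}
    return pd.Series(values, name=statistic.__name__)

field = np.array([[1.0, 2.0, 5.0],
                  [3.0, 4.0, 6.0]])
mask = np.array([[1, 1, 2],
                 [1, 0, 2]])
stats = bulk_statistics(field, mask, feature_ids=[1, 2])
# stats.loc[1] == 2.0 (mean of 1, 2, 3), stats.loc[2] == 5.5 (mean of 5, 6)
```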

  • Have you followed our guidelines in CONTRIBUTING.md?
  • Have you self-reviewed your code and corrected any misspellings?
  • Have you written documentation that is easy to understand?
  • Have you written descriptive commit messages?
  • Have you added NumPy docstrings for newly added functions?
  • Have you formatted your code using black?
  • If you have introduced a new functionality, have you added adequate unit tests? test_get_statistics() and test_segmentation_multiple_features()
  • Have all tests passed in your local clone?
  • If you have introduced a new functionality, have you added an example notebook? modified existing notebook
  • Have you kept your pull request small and limited so that it is easy to review?
  • Have the newest changes from this branch been merged? merged in latest changes from RC_v1.5.0

@JuliaKukulies JuliaKukulies added this to the Version 1.6 milestone Jun 5, 2023
@JuliaKukulies JuliaKukulies added the enhancement Addition of new features, or improved functionality of existing features label Jun 5, 2023
@JuliaKukulies JuliaKukulies changed the base branch from main to RC_v1.5.0 June 5, 2023 20:55
@codecov

codecov bot commented Jun 5, 2023

Codecov Report

Attention: 10 lines in your changes are missing coverage. Please review.

Comparison is base (f87ea1e) 55.75% compared to head (55f1d66) 56.35%.
Report is 2 commits behind head on RC_v1.5.x.

Additional details and impacted files
@@              Coverage Diff              @@
##           RC_v1.5.x     #293      +/-   ##
=============================================
+ Coverage      55.75%   56.35%   +0.59%     
=============================================
  Files             15       16       +1     
  Lines           3316     3384      +68     
=============================================
+ Hits            1849     1907      +58     
- Misses          1467     1477      +10     
Flag Coverage Δ
unittests 56.35% <86.66%> (+0.59%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
tobac/feature_detection.py 84.21% <100.00%> (+0.26%) ⬆️
tobac/segmentation.py 92.81% <100.00%> (-0.02%) ⬇️
tobac/utils/__init__.py 100.00% <100.00%> (ø)
tobac/utils/general.py 70.94% <100.00%> (+0.09%) ⬆️
tobac/utils/internal.py 87.78% <100.00%> (+0.11%) ⬆️
tobac/utils/bulk_statistics.py 82.75% <82.75%> (ø)


@w-k-jones
Member

Cool feature! Had a quick look through and it mostly looks good. Couple of questions/suggestions before I do a full review:

  1. Can this be run offline? If we implement a function to call this offline (i.e. calculate stats for all features/segments that are previously detected) it might make for a nicer implementation in segmentation_timestep. I can see why this is also nice to include within the segmentation process, as it removes the need for multiple loads from disk for large datasets.
  2. Performance. How long does this take to run? Having a look, it loops over each feature and does a comparison to find each region individually, which can be quite slow. I have some faster alternatives for larger datasets that I can include if you're happy.
  3. Bulk statistics on feature detection. Can we add it in the same way?
  4. Can we implement this in a way in which the user can supply a custom function to calculate over each segmented feature, rather than the current default statistics?

Most of these are beyond the scope of this PR, so more interesting thoughts raised by it than requirements!

Also segmentation_timestep is now 750 lines long 😬
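Suggestions 1 and 4 could be combined into a standalone postprocessing helper that takes a previously computed segmentation mask plus arbitrary user-supplied statistics — a hypothetical sketch (names and signature invented, not the tobac API):

```python
import numpy as np
import pandas as pd

def get_statistics_offline(features, field, seg_mask, statistics):
    """Attach bulk statistics to already-segmented features.

    features: dataframe with a 'feature' column of integer labels.
    statistics: dict mapping output column name -> callable on a 1-D array.
    Hypothetical postprocessing helper, not part of the tobac API.
    """
    out = features.copy()
    for name, func in statistics.items():
        out[name] = [
            func(field[seg_mask == fid]) if (seg_mask == fid).any() else np.nan
            for fid in out["feature"]
        ]
    return out

features = pd.DataFrame({"feature": [1, 2]})
field = np.array([[1.0, 2.0], [8.0, 9.0]])
mask = np.array([[1, 1], [2, 2]])
result = get_statistics_offline(features, field, mask,
                                {"mean": np.mean, "max": np.max})
# result["mean"] -> [1.5, 8.5]; result["max"] -> [2.0, 9.0]
```

Run this way, the statistics step never touches the segmentation internals, at the cost of re-reading the mask and data from disk.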

@JuliaKukulies
Member Author

JuliaKukulies commented Jun 6, 2023

Thanks for the great feedback, @w-k-jones! I really like your suggestions and I think that most of them could actually be implemented as part of this PR. Given that this is for v1.6 we are not in a hurry, so I can work more on this with your input!

1. Can this be run offline? If we implement a function to call this offline (i.e. calculate stats for all features/segments that are previously detected) it might make for a nicer implementation in `segmentation_timestep`. I can see why this is also nice to include within the `segmentation` process, as it removes the need for multiple loads from disk for large datasets.

Not entirely sure I understand what you mean here. Do you mean if it is possible to run as a postprocessing step rather than during the segmentation?

2. Performance. How long does this take to run? Having a look, it loops over each feature and does a comparison to find each region individually, which can be quite slow. I have some faster alternatives for larger datasets that I can include if you're happy

Note that we do the looping over features anyhow in the current code to get ncells. But you are right that it is not very efficient to extract each feature from the segmentation mask per iteration. Yes, please provide me with some faster alternatives for this :)

3. Bulk statistics on feature detection. Can we add it in the same way?

Sure thing! Are you thinking just the bulk statistics for the feature with the most extreme threshold (since this is the one ending up in the output dataframe)? Or would it make more sense to get the statistics for the feature that was first detected (the least extreme), or for all features nested within a feature?

4. Can we implement this in a way in which the user can supply a custom function to calculate over each segmented feature, rather than the current default statistics?

Yes, good point, that makes sense! But we should allow for an arbitrary number of custom functions so that multiple statistics can be derived from one segmentation process, right?

Most of these are beyond the scope of this PR, so more interesting thoughts raised by it than requirements!

Also segmentation_timestep is now 750 lines long 😬

Haha, yes definitely not ideal :) Hope that we can shorten this down or implement elsewhere following your points 1-4.

@JuliaKukulies JuliaKukulies marked this pull request as draft June 6, 2023 12:57
Collaborator

@lettlini lettlini left a comment


Thanks @JuliaKukulies for adding this!

I have only found a few minor things that should be looked at again. All in all very good work!

(Inline review comments on tobac/utils/general.py and tobac/segmentation.py, since resolved.)
@JuliaKukulies
Member Author

Thanks @JuliaKukulies for adding this!

I have only found a few minor things that should be looked at again. All in all very good work!

Thanks for the quick and clear review, @lettlini ! Really good points - I addressed them in my latest commit.

I might ask you for a re-review when I have considered some of the points that concern the main design made by @w-k-jones . I think some at least make sense to address in this PR

@w-k-jones
Member

1. Can this be run offline? If we implement a function to call this offline (i.e. calculate stats for all features/segments that are previously detected) it might make for a nicer implementation in `segmentation_timestep`. I can see why this is also nice to include within the `segmentation` process, as it removes the need for multiple loads from disk for large datasets.

Not entirely sure I understand what you mean here. Do you mean if it is possible to run as a postprocessing step rather than during the segmentation?

Yes, sorry for being unclear! Running as a postprocessing step is indeed what I meant.

2. Performance. How long does this take to run? Having a look, it loops over each feature and does a comparison to find each region individually, which can be quite slow. I have some faster alternatives for larger datasets that I can include if you're happy

Note that we do the looping over features anyhow in the current code to get ncells. But you are right that it is not very efficient to extract each feature from the segmentation mask per iteration. Yes, please provide me with some faster alternatives for this :)

Here's the function I use for calculating statistics over labelled regions:
https://github.com/w-k-jones/tobac-flow/blob/45b7541f0358fac670b94b78603935a15127d762/src/tobac_flow/utils/label_utils.py#L56-L100
It's similar to scipy.ndimage.labeled_comprehension but a bit more adaptable, including the ability to pass multiple fields as inputs to the functions, e.g. for calculating weighted averages. You can also give it a function with multiple outputs, so we could calculate all statistics at once.
This approach is faster than doing a groupby over labels.
I've been calculating a lot of bulk properties of tracked objects in my own work, so can provide some examples if that would help! :)
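The core of the speed-up in such approaches is touching the label array only once — e.g. sorting the flattened labels and splitting the co-sorted values at label boundaries, instead of doing one full-array comparison per feature. A rough sketch of that idea (not a copy of the linked tobac-flow function):

```python
import numpy as np

def apply_func_over_labels(func, labels, field):
    """Apply func to the field values of each nonzero label in one pass.

    Sorts the flattened labels once, then splits the co-sorted field
    values at label boundaries; avoids one full-array comparison per
    feature. Sketch of the idea, not the linked implementation.
    """
    order = np.argsort(labels.ravel(), kind="stable")
    sorted_labels = labels.ravel()[order]
    sorted_field = field.ravel()[order]
    # first index of each distinct label in the sorted array
    unique_labels, starts = np.unique(sorted_labels, return_index=True)
    groups = np.split(sorted_field, starts[1:])
    return {
        lab: func(vals)
        for lab, vals in zip(unique_labels, groups)
        if lab != 0  # skip unsegmented points
    }

labels = np.array([0, 1, 2, 1, 2, 2])
field = np.array([9.0, 1.0, 4.0, 3.0, 5.0, 6.0])
means = apply_func_over_labels(np.mean, labels, field)
# label 1 -> values [1, 3]; label 2 -> values [4, 5, 6]
# means == {1: 2.0, 2: 5.0}
```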

3. Bulk statistics on feature detection. Can we add it in the same way?

Sure thing! Are you thinking just the bulk statistics for the feature with the most extreme threshold (since this is the one ending up in the output dataframe)? Or would it make more sense to get the statistics for the feature that was first detected (the least extreme), or for all features nested within a feature?

I think just the most extreme threshold, as it would be difficult to do anything else without more complex logic/actually performing segmentation as well!

4. Can we implement this in a way in which the user can supply a custom function to calculate over each segmented feature, rather than the current default statistics?

Yes, good point, that makes sense! But we should allow for an arbitrary number of custom functions so that multiple statistics can be derived from one segmentation process, right?

Yes, we should allow for functions that return multiple values (so e.g. calculate mean and standard deviation), but also a list of different functions. This could be something added at a later date however

Most of these are beyond the scope of this PR, so more interesting thoughts raised by it than requirements!

Also segmentation_timestep is now 750 lines long 😬

Haha, yes definitely not ideal :) Hope that we can shorten this down or implement elsewhere following your points 1-4.

To be fair this is mainly due to the PBC and 3D changes, I just noticed it when looking at the code now! I'm sure with the switch to xarray in v1.6 we can do some refactoring and make it all a bit more manageable!

@JuliaKukulies
Member Author

Great, thanks @w-k-jones for more helpful input and clarifications! Yes, if you could point me to some of your examples, that would be great! Your function for calculating statistics over labeled regions helps already a lot.

@freemansw1 freemansw1 modified the milestones: Version 1.6, Version 1.5.x Jun 9, 2023
@JuliaKukulies JuliaKukulies changed the base branch from RC_v1.5.0 to RC_v1.5.x July 14, 2023 20:15
Member

@freemansw1 freemansw1 left a comment


I haven't run this fully yet on all the datasets I want to, but I am very happy with this so far! Nice addition!

(Inline review comments on doc/index.rst, doc/segmentation_output.rst and tobac/utils/general.py, the last one on this snippet:)

)
if bins[i] > bins[i - 1]
else default
for i in index
Member


A longer-term comment: it may be worth subsetting before calculating statistics. My guess is that it would be faster, but I'm not quite sure. Not something to hold up this PR, though!
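One possible reading of "subsetting" is slicing each feature's bounding box out of the arrays before the per-feature comparison, so the boolean test runs on a small subarray rather than the full domain — e.g. via scipy.ndimage.find_objects (an illustration of that reading, not necessarily what was meant here or what the PR does):

```python
import numpy as np
from scipy import ndimage

field = np.array([[1.0, 2.0, 0.0],
                  [3.0, 0.0, 7.0],
                  [0.0, 0.0, 8.0]])
mask = np.array([[1, 1, 0],
                 [1, 0, 2],
                 [0, 0, 2]])

# find_objects returns one bounding-box slice per label (1-indexed)
slices = ndimage.find_objects(mask)
stats = {}
for label, bbox in enumerate(slices, start=1):
    if bbox is None:  # label absent from the mask
        continue
    sub_mask = mask[bbox]    # small subarray instead of the full domain
    sub_field = field[bbox]
    stats[label] = sub_field[sub_mask == label].mean()
# stats[1] == 2.0 (mean of 1, 2, 3), stats[2] == 7.5 (mean of 7, 8)
```

Whether this beats a single pass over the full arrays would depend on how sparse the features are, so it would need benchmarking either way.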

Member Author


Can you clarify what you mean by subsetting in this case? :) Are you referring to the indices?

@JuliaKukulies
Member Author

From the discussion regarding type hints in #337, #341: we should make sure to add `from __future__ import annotations` and possibly consider using `from typing import Union` rather than `|` for union type hints, to ensure backward compatibility with Python versions before 3.10

I changed to `Union` for this PR.
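For reference, the compatibility point looks like this (function names invented; the `__future__` import defers annotation evaluation, which is what lets the `|` syntax also work on Python < 3.10):

```python
from __future__ import annotations  # annotations become strings at runtime

from typing import Union

# Parses and runs on all supported Python versions:
def get_statistics_compat(features, default: Union[int, None] = None) -> dict:
    return {}

# `int | None` needs Python >= 3.10 at runtime, unless the
# __future__ import above defers annotation evaluation:
def get_statistics_new(features, default: int | None = None) -> dict:
    return {}
```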

@JuliaKukulies
Member Author

@w-k-jones and @freemansw1 I think I addressed all your comments, so it would be nice if you could re-review whenever it is convenient for you! After we merge this PR, I will go ahead and prepare the 1.5.2 release with #337, #351 and #341, as we discussed in the last dev meeting

Member

@w-k-jones w-k-jones left a comment


Just a small change to replicate previous behaviour for the ncells when no segment is present, otherwise happy to merge

(Inline review comment on tobac/segmentation.py, since resolved.)
Set default value for ncells as 0 instead of None to preserve behaviour from the previous implementation

Co-authored-by: William Jones <[email protected]>
Member

@freemansw1 freemansw1 left a comment


Very nice change, @JuliaKukulies ! I've tested on a variety of datasets and seem to get reasonable results. I'm extremely happy that we're getting these changes in now.

@JuliaKukulies
Member Author

Very nice change, @JuliaKukulies ! I've tested on a variety of datasets and seem to get reasonable results. I'm extremely happy that we're getting these changes in now.

Thank you very much for checking this new feature with different datasets! I will go ahead and merge when the Jupyter tests pass (which should be the case now that I have merged Will's last PR with the zenodo path changes)

@w-k-jones
Member

Just found an annoying bug with missing segments, in which all the features that don't have a label in the segment mask before the first present label get the value None instead of the default. Will have a look into a quick fix...

@JuliaKukulies
Member Author

JuliaKukulies commented Nov 7, 2023

Just found an annoying bug with missing segments in which all the features that don't have a label in the segment mask before the first present label will get the value None instead of the default. Will have a look into a quick fix...

Thanks for these fixes @w-k-jones ! And ha, of course, I found and fixed another bug that caused the notebook check for Example_Track_on_Radar_Segment_on_Satellite/track_on_radar_segment_on_satellite.ipynb to fail. The issue was that utils.transform_featurepoints uses the irispandas_to_xarray decorator and xr.DataArray.where(), which converts the feature values from integers to floats (apparently this is intended behavior, but not very intuitive). Something to remember for the xarray transition! We had this issue in the notebook already before this PR, but now the segmentation step raises an exception because utils.get_statistics() needs the feature IDs to be integers.
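The upcasting is easy to reproduce outside xarray: masking an integer array with NaN forces a float result, since NaN has no integer representation (plain numpy shown here for brevity; xr.DataArray.where() behaves analogously when it fills masked points with NaN):

```python
import numpy as np

feature_ids = np.array([1, 2, 3])          # integer feature labels
keep = np.array([True, False, True])

# Filling masked points with NaN forces an upcast to float,
# because NaN is not representable as an integer:
masked = np.where(keep, feature_ids, np.nan)
# masked.dtype is float64 even though feature_ids was integer

# Dropping the NaNs and casting back restores integer labels:
restored = masked[~np.isnan(masked)].astype(int)
# restored == array([1, 3])
```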

Do you want to have another quick look @freemansw1 @w-k-jones or are we ready to merge?

And for the future: Would it make sense to implement tests that check the datatypes of the output data frames at least in all functions that modify the data frames?
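Such a check could be as simple as a helper asserting expected column dtypes on each output dataframe, run after every function that modifies the dataframes — a sketch with invented names and illustrative columns:

```python
import numpy as np
import pandas as pd

def check_output_dtypes(df, expected):
    """Assert that key columns of an output dataframe keep their dtypes.

    Hypothetical test helper; column names are illustrative only.
    """
    for column, dtype in expected.items():
        assert df[column].dtype == dtype, (
            f"{column}: expected {dtype}, got {df[column].dtype}"
        )

df = pd.DataFrame({"feature": np.array([1, 2], dtype=np.int64),
                   "mean": np.array([2.0, 5.5])})
check_output_dtypes(df, {"feature": np.int64, "mean": np.float64})
```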

@freemansw1
Member

Do you want to have another quick look @freemansw1 @w-k-jones or are we ready to merge?

I am happy for this to be merged.

And for the future: Would it make sense to implement tests that check the datatypes of the output data frames at least in all functions that modify the data frames?

Yes, we should probably introduce those as part of our typical suite.

@JuliaKukulies JuliaKukulies merged commit 85f8f3a into tobac-project:RC_v1.5.x Nov 7, 2023
7 checks passed
@JuliaKukulies JuliaKukulies mentioned this pull request Nov 8, 2023