
Let users optionally derive bulk statistics of the data points belonging to each feature #293

Merged

Conversation

JuliaKukulies
Member

Finally a PR that solves #153 by implementing optional extraction of statistics on the data points belonging to each feature. This is done in the segmentation step and, if requested by the user, produces additional columns in the segmentation output dataframe.

Added tests and updated example notebook and documentation accordingly.

As discussed in #153, a follow-up after the transition to xarray will be to save not only the bulk statistics of features but also the locations of all feature grid points. This is indirectly possible via the mask files, which can be combined with the actual data, but saving all feature data points would simplify this analysis step for the user.
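In rough terms, the new option reduces the data values inside each feature's segmentation mask to summary values and attaches them as columns. A minimal sketch of the idea in plain numpy/pandas (function and argument names are invented for illustration, not the actual tobac API):

```python
import numpy as np
import pandas as pd

def bulk_statistics(field, seg_mask, feature_ids, statistic=np.mean):
    """Compute one statistic over the data points of each segmented feature.

    field and seg_mask are arrays of the same shape; seg_mask holds the
    feature label at each grid point (0 = unsegmented). Hypothetical
    helper, not the actual tobac implementation.
    """
    values = {fid: statistic(field[seg_mask == fid]) for fid in feature_ids}
    return pd.Series(values, name=statistic.__name__)

field = np.array([[1.0, 2.0, 5.0],
                  [3.0, 4.0, 6.0]])
mask = np.array([[1, 1, 2],
                 [1, 0, 2]])
stats = bulk_statistics(field, mask, feature_ids=[1, 2])
# stats.loc[1] == 2.0 (mean of 1, 2, 3), stats.loc[2] == 5.5 (mean of 5, 6)
```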

  • Have you followed our guidelines in CONTRIBUTING.md?
  • Have you self-reviewed your code and corrected any misspellings?
  • Have you written documentation that is easy to understand?
  • Have you written descriptive commit messages?
  • Have you added NumPy docstrings for newly added functions?
  • Have you formatted your code using black?
  • If you have introduced a new functionality, have you added adequate unit tests? test_get_statistics() and test_segmentation_multiple_features()
  • Have all tests passed in your local clone?
  • If you have introduced a new functionality, have you added an example notebook? modified existing notebook
  • Have you kept your pull request small and limited so that it is easy to review?
  • Have the newest changes from this branch been merged? merged in latest changes from RC_v1.5.0

@JuliaKukulies JuliaKukulies added this to the Version 1.6 milestone Jun 5, 2023
@JuliaKukulies JuliaKukulies added the enhancement Addition of new features, or improved functionality of existing features label Jun 5, 2023
@JuliaKukulies JuliaKukulies changed the base branch from main to RC_v1.5.0 June 5, 2023 20:55
@codecov

codecov bot commented Jun 5, 2023

Codecov Report

Attention: 10 lines in your changes are missing coverage. Please review.

Comparison is base (f87ea1e) 55.75% compared to head (55f1d66) 56.35%.
Report is 2 commits behind head on RC_v1.5.x.

Additional details and impacted files
@@              Coverage Diff              @@
##           RC_v1.5.x     #293      +/-   ##
=============================================
+ Coverage      55.75%   56.35%   +0.59%     
=============================================
  Files             15       16       +1     
  Lines           3316     3384      +68     
=============================================
+ Hits            1849     1907      +58     
- Misses          1467     1477      +10     
Flag Coverage Δ
unittests 56.35% <86.66%> (+0.59%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
tobac/feature_detection.py 84.21% <100.00%> (+0.26%) ⬆️
tobac/segmentation.py 92.81% <100.00%> (-0.02%) ⬇️
tobac/utils/__init__.py 100.00% <100.00%> (ø)
tobac/utils/general.py 70.94% <100.00%> (+0.09%) ⬆️
tobac/utils/internal.py 87.78% <100.00%> (+0.11%) ⬆️
tobac/utils/bulk_statistics.py 82.75% <82.75%> (ø)


@w-k-jones
Member

Cool feature! Had a quick look through and it mostly looks good. Couple of questions/suggestions before I do a full review:

  1. Can this be run offline? If we implement a function to call this offline (i.e. calculate stats for all features/segments that are previously detected) it might make for a nicer implementation in segmentation_timestep. I can see why this is also nice to include within the segmentation process, as it removes the need for multiple loads from disk for large datasets.
  2. Performance. How long does this take to run? Having a look, it loops over each feature and does a comparison to find each region individually, which can be quite slow. I have some faster alternatives for larger datasets that I can include if you're happy.
  3. Bulk statistics on feature detection. Can we add it in the same way?
  4. Can we implement this in a way in which the user can supply a custom function to calculate over each segmented feature, rather than the current default statistics?

Most of these are beyond the scope of this PR, so more interesting thoughts raised by it than requirements!

Also segmentation_timestep is now 750 lines long 😬
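Suggestions 1 and 4 could be combined into a standalone postprocessing helper that takes a previously computed segmentation mask plus arbitrary user-supplied statistics — a hypothetical sketch (names and signature invented, not the tobac API):

```python
import numpy as np
import pandas as pd

def get_statistics_offline(features, field, seg_mask, statistics):
    """Attach bulk statistics to already-segmented features.

    features: dataframe with a 'feature' column of integer labels.
    statistics: dict mapping output column name -> callable on a 1-D array.
    Hypothetical postprocessing helper, not part of the tobac API.
    """
    out = features.copy()
    for name, func in statistics.items():
        out[name] = [
            func(field[seg_mask == fid]) if (seg_mask == fid).any() else np.nan
            for fid in out["feature"]
        ]
    return out

features = pd.DataFrame({"feature": [1, 2]})
field = np.array([[1.0, 2.0], [8.0, 9.0]])
mask = np.array([[1, 1], [2, 2]])
result = get_statistics_offline(features, field, mask,
                                {"mean": np.mean, "max": np.max})
# result["mean"] -> [1.5, 8.5]; result["max"] -> [2.0, 9.0]
```

Run this way, the statistics step never touches the segmentation internals, at the cost of re-reading the mask and data from disk.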

@JuliaKukulies
Member Author

JuliaKukulies commented Jun 6, 2023

Thanks for the great feedback, @w-k-jones! I really like your suggestions and I think that most of them could actually be implemented as part of this PR. Given that this is for v1.6 we are not in a hurry, so I can work more on this with your input!

1. Can this be run offline? If we implement a function to call this offline (i.e. calculate stats for all features/segments that are previously detected) it might make for a nicer implementation in `segmentation_timestep`. I can see why this is also nice to include within the `segmentation` process, as it removes the need for multiple loads from disk for large datasets.

Not entirely sure I understand what you mean here. Do you mean if it is possible to run as a postprocessing step rather than during the segmentation?

2. Performance. How long does this take to run? Having a look, it loops over each feature and does a comparison to find each region individually, which can be quite slow. I have some faster alternatives for larger datasets that I can include if you're happy

Note that we do the looping over features anyhow in the current code to get ncells. But you are right that it is not very efficient to extract each feature from the segmentation mask per iteration. Yes, please provide me with some faster alternatives for this :)

3. Bulk statistics on feature detection. Can we add it in the same way?

Sure thing! Are you thinking just the bulk statistics for the feature with the most extreme threshold (since this is the one ending up in the output dataframe)? Or would it make more sense to get the statistics for the feature that was first detected (the least extreme), or for all features nested within a feature?

4. Can we implement this in a way in which the user can supply a custom function to calculate over each segmented feature, rather than the current default statistics?

Yes, good point, that makes sense! But we should allow for an arbitrary number of custom functions so that multiple statistics can be derived from one segmentation process, right?

Most of these are beyond the scope of this PR, so more interesting thoughts raised by it than requirements!

Also segmentation_timestep is now 750 lines long 😬

Haha, yes definitely not ideal :) Hope that we can shorten this down or implement elsewhere following your points 1-4.

@JuliaKukulies JuliaKukulies marked this pull request as draft June 6, 2023 12:57
Collaborator

@lettlini lettlini left a comment


Thanks @JuliaKukulies for adding this!

I have only found a few minor things that should be looked at again. All in all very good work!

(Inline review comments on tobac/utils/general.py and tobac/segmentation.py, since resolved.)
@JuliaKukulies
Member Author

Thanks @JuliaKukulies for adding this!

I have only found a few minor things that should be looked at again. All in all very good work!

Thanks for the quick and clear review, @lettlini ! Really good points - I addressed them in my latest commit.

I might ask you for a re-review when I have considered some of the points that concern the main design made by @w-k-jones . I think some at least make sense to address in this PR

@w-k-jones
Member

1. Can this be run offline? If we implement a function to call this offline (i.e. calculate stats for all features/segments that are previously detected) it might make for a nicer implementation in `segmentation_timestep`. I can see why this is also nice to include within the `segmentation` process, as it removes the need for multiple loads from disk for large datasets.

Not entirely sure I understand what you mean here. Do you mean if it is possible to run as a postprocessing step rather than during the segmentation?

Yes, sorry for being unclear! Running as a postprocessing step is indeed what I meant.

2. Performance. How long does this take to run? Having a look, it loops over each feature and does a comparison to find each region individually, which can be quite slow. I have some faster alternatives for larger datasets that I can include if you're happy

Note that we do the looping over features anyhow in the current code to get ncells. But you are right that it is not very efficient to extract each feature from the segmentation mask per iteration. Yes, please provide me with some faster alternatives for this :)

Here's the function I use for calculating statistics over labelled regions:
https://github.com/w-k-jones/tobac-flow/blob/45b7541f0358fac670b94b78603935a15127d762/src/tobac_flow/utils/label_utils.py#L56-L100
It's similar to scipy.ndimage.labeled_comprehension but a bit more adaptable, including the ability to pass multiple fields as inputs to the functions, e.g. for calculating weighted averages. You can also give it a function with multiple outputs, so we could calculate all statistics at once.
This approach is faster than doing a groupby over labels.
I've been calculating a lot of bulk properties of tracked objects in my own work, so can provide some examples if that would help! :)
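The core of the speed-up in such approaches is touching the label array only once — e.g. sorting the flattened labels and splitting the co-sorted values at label boundaries, instead of doing one full-array comparison per feature. A rough sketch of that idea (not a copy of the linked tobac-flow function):

```python
import numpy as np

def apply_func_over_labels(func, labels, field):
    """Apply func to the field values of each nonzero label in one pass.

    Sorts the flattened labels once, then splits the co-sorted field
    values at label boundaries; avoids one full-array comparison per
    feature. Sketch of the idea, not the linked implementation.
    """
    order = np.argsort(labels.ravel(), kind="stable")
    sorted_labels = labels.ravel()[order]
    sorted_field = field.ravel()[order]
    # first index of each distinct label in the sorted array
    unique_labels, starts = np.unique(sorted_labels, return_index=True)
    groups = np.split(sorted_field, starts[1:])
    return {
        lab: func(vals)
        for lab, vals in zip(unique_labels, groups)
        if lab != 0  # skip unsegmented points
    }

labels = np.array([0, 1, 2, 1, 2, 2])
field = np.array([9.0, 1.0, 4.0, 3.0, 5.0, 6.0])
means = apply_func_over_labels(np.mean, labels, field)
# label 1 -> values [1, 3]; label 2 -> values [4, 5, 6]
# means == {1: 2.0, 2: 5.0}
```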

3. Bulk statistics on feature detection. Can we add it in the same way?

Sure thing! Are you thinking just the bulk statistics for the feature with the most extreme threshold (since this is the one ending up in the output dataframe)? Or would it make more sense to get the statistics for the feature that was first detected (the least extreme), or for all features nested within a feature?

I think just the most extreme threshold, as it would be difficult to do anything else without more complex logic/actually performing segmentation as well!

4. Can we implement this in a way in which the user can supply a custom function to calculate over each segmented feature, rather than the current default statistics?

Yes, good point, that makes sense! But we should allow for an arbitrary number of custom functions so that multiple statistics can be derived from one segmentation process, right?

Yes, we should allow for functions that return multiple values (so e.g. calculate mean and standard deviation), but also a list of different functions. This could be something added at a later date however

Most of these are beyond the scope of this PR, so more interesting thoughts raised by it than requirements!

Also segmentation_timestep is now 750 lines long 😬

Haha, yes definitely not ideal :) Hope that we can shorten this down or implement elsewhere following your points 1-4.

To be fair this is mainly due to the PBC and 3D changes, I just noticed it when looking at the code now! I'm sure with the switch to xarray in v1.6 we can do some refactoring and make it all a bit more manageable!

@JuliaKukulies
Member Author

Great, thanks @w-k-jones for more helpful input and clarifications! Yes, if you could point me to some of your examples, that would be great! Your function for calculating statistics over labeled regions helps already a lot.

@freemansw1 freemansw1 modified the milestones: Version 1.6, Version 1.5.x Jun 9, 2023
@JuliaKukulies JuliaKukulies changed the base branch from RC_v1.5.0 to RC_v1.5.x July 14, 2023 20:15
Member

@freemansw1 freemansw1 left a comment


I haven't run this fully yet on all the datasets I want to, but I am very happy with this so far! Nice addition!

(Inline review comments on doc/index.rst, doc/segmentation_output.rst and tobac/utils/general.py, the last one on this snippet:)

)
if bins[i] > bins[i - 1]
else default
for i in index
Member


A longer-term comment: it may be worth subsetting before calculating statistics. My guess is that it would be faster, but I'm not quite sure. Not something to hold up this PR, though!
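One possible reading of "subsetting" is slicing each feature's bounding box out of the arrays before the per-feature comparison, so the boolean test runs on a small subarray rather than the full domain — e.g. via scipy.ndimage.find_objects (an illustration of that reading, not necessarily what was meant here or what the PR does):

```python
import numpy as np
from scipy import ndimage

field = np.array([[1.0, 2.0, 0.0],
                  [3.0, 0.0, 7.0],
                  [0.0, 0.0, 8.0]])
mask = np.array([[1, 1, 0],
                 [1, 0, 2],
                 [0, 0, 2]])

# find_objects returns one bounding-box slice per label (1-indexed)
slices = ndimage.find_objects(mask)
stats = {}
for label, bbox in enumerate(slices, start=1):
    if bbox is None:  # label absent from the mask
        continue
    sub_mask = mask[bbox]    # small subarray instead of the full domain
    sub_field = field[bbox]
    stats[label] = sub_field[sub_mask == label].mean()
# stats[1] == 2.0 (mean of 1, 2, 3), stats[2] == 7.5 (mean of 7, 8)
```

Whether this beats a single pass over the full arrays would depend on how sparse the features are, so it would need benchmarking either way.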

Member Author


Can you clarify what you mean by subsetting in this case? :) Are you referring to the indices?

@JuliaKukulies
Member Author

From the discussion regarding type hints in #337, #341: we should make sure to add `from __future__ import annotations` and possibly consider using `from typing import Union` rather than `|` for union type hints, to ensure backward compatibility with Python versions before 3.10

I changed to `Union` for this PR.
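For reference, the compatibility point looks like this (function names invented; the `__future__` import defers annotation evaluation, which is what lets the `|` syntax also work on Python < 3.10):

```python
from __future__ import annotations  # annotations become strings at runtime

from typing import Union

# Parses and runs on all supported Python versions:
def get_statistics_compat(features, default: Union[int, None] = None) -> dict:
    return {}

# `int | None` needs Python >= 3.10 at runtime, unless the
# __future__ import above defers annotation evaluation:
def get_statistics_new(features, default: int | None = None) -> dict:
    return {}
```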

@JuliaKukulies
Member Author

@w-k-jones and @freemansw1 I think I addressed all your comments, so it would be nice if you could re-review whenever it is convenient for you! After we merge this PR, I will go ahead and prepare the 1.5.2 release with #337, #351 and #341, as we discussed in the last dev meeting

Member

@w-k-jones w-k-jones left a comment


Just a small change to replicate previous behaviour for the ncells when no segment is present, otherwise happy to merge

(Inline review comment on tobac/segmentation.py, since resolved.)
Set default value for ncells as 0 instead of None to preserve behaviour from the previous implementation

Co-authored-by: William Jones <[email protected]>
Member

@freemansw1 freemansw1 left a comment


Very nice change, @JuliaKukulies ! I've tested on a variety of datasets and seem to get reasonable results. I'm extremely happy that we're getting these changes in now.

@JuliaKukulies
Member Author

Very nice change, @JuliaKukulies ! I've tested on a variety of datasets and seem to get reasonable results. I'm extremely happy that we're getting these changes in now.

Thank you very much for checking this new feature with different datasets! I will go ahead and merge when the Jupyter tests pass (which should be the case now that I have merged Will's last PR with the zenodo path changes)

@w-k-jones
Member

Just found an annoying bug with missing segments, in which all the features that don't have a label in the segment mask before the first present label get the value None instead of the default. Will have a look into a quick fix...

@JuliaKukulies
Member Author

JuliaKukulies commented Nov 7, 2023

Just found an annoying bug with missing segments in which all the features that don't have a label in the segment mask before the first present label will get the value None instead of the default. Will have a look into a quick fix...

Thanks for these fixes @w-k-jones ! And ha, of course, I found and fixed another bug that caused the notebook check for Example_Track_on_Radar_Segment_on_Satellite/track_on_radar_segment_on_satellite.ipynb to fail. The issue was that utils.transform_featurepoints uses the irispandas_to_xarray decorator and xr.DataArray.where(), which converts the feature values from integers to floats (apparently this is intended behavior, but not very intuitive). Something to remember for the xarray transition! We had this issue in the notebook already before this PR, but now the segmentation step raises an exception because utils.get_statistics() needs the feature IDs to be integers.
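The upcasting is easy to reproduce outside xarray: masking an integer array with NaN forces a float result, since NaN has no integer representation (plain numpy shown here for brevity; xr.DataArray.where() behaves analogously when it fills masked points with NaN):

```python
import numpy as np

feature_ids = np.array([1, 2, 3])          # integer feature labels
keep = np.array([True, False, True])

# Filling masked points with NaN forces an upcast to float,
# because NaN is not representable as an integer:
masked = np.where(keep, feature_ids, np.nan)
# masked.dtype is float64 even though feature_ids was integer

# Dropping the NaNs and casting back restores integer labels:
restored = masked[~np.isnan(masked)].astype(int)
# restored == array([1, 3])
```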

Do you want to have another quick look @freemansw1 @w-k-jones or are we ready to merge?

And for the future: Would it make sense to implement tests that check the datatypes of the output data frames at least in all functions that modify the data frames?
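Such a check could be as simple as a helper asserting expected column dtypes on each output dataframe, run after every function that modifies the dataframes — a sketch with invented names and illustrative columns:

```python
import numpy as np
import pandas as pd

def check_output_dtypes(df, expected):
    """Assert that key columns of an output dataframe keep their dtypes.

    Hypothetical test helper; column names are illustrative only.
    """
    for column, dtype in expected.items():
        assert df[column].dtype == dtype, (
            f"{column}: expected {dtype}, got {df[column].dtype}"
        )

df = pd.DataFrame({"feature": np.array([1, 2], dtype=np.int64),
                   "mean": np.array([2.0, 5.5])})
check_output_dtypes(df, {"feature": np.int64, "mean": np.float64})
```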

@freemansw1
Member

Do you want to have another quick look @freemansw1 @w-k-jones or are we ready to merge?

I am happy for this to be merged.

And for the future: Would it make sense to implement tests that check the datatypes of the output data frames at least in all functions that modify the data frames?

Yes, we should probably introduce those as part of our typical suite.

@JuliaKukulies JuliaKukulies merged commit 85f8f3a into tobac-project:RC_v1.5.x Nov 7, 2023
7 checks passed
@JuliaKukulies JuliaKukulies mentioned this pull request Nov 8, 2023