Let users optionally derive bulk statistics of the data points belonging to each feature #293
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##           RC_v1.5.x     #293      +/-  ##
=============================================
+ Coverage      55.75%   56.35%   +0.59%
=============================================
  Files             15       16       +1
  Lines           3316     3384      +68
=============================================
+ Hits            1849     1907      +58
- Misses          1467     1477      +10
=============================================
```
Cool feature! Had a quick look through and it mostly looks good. Couple of questions/suggestions before I do a full review:
Most of these are beyond the scope of this PR, so they are more interesting thoughts raised by it than requirements!
Thanks for the great feedback, @w-k-jones! I really like your suggestions and I think that most of them could actually be implemented as part of this PR. Given that this is for
Not entirely sure I understand what you mean here. Do you mean whether it is possible to run this as a postprocessing step rather than during the segmentation?
Note that we do the looping over features anyhow in the current code to get
Sure thing! Are you thinking just the bulk statistics for the feature with the most extreme threshold (since this is the one ending up in the output dataframe)? Or would it make more sense to get the statistics for the feature that was detected first (the least extreme), or for all features nested within a feature?
Yes, good point, that makes sense! But we should allow for an arbitrary number of custom functions so that multiple statistics can be derived from one segmentation process, right?
Haha, yes definitely not ideal :) Hope that we can shorten this down or implement elsewhere following your points 1-4.
Thanks @JuliaKukulies for adding this!
I have only found a few minor things that should be looked at again. All in all very good work!
Thanks for the quick and clear review, @lettlini! Really good points - I addressed them in my latest commit. I might ask you for a re-review once I have considered some of the points raised by @w-k-jones that concern the main design. I think at least some of them make sense to address in this PR.
Yes, sorry for being unclear! Running as a postprocessing step is indeed what I meant.
Here's the function I use for calculating statistics over labelled regions:
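For illustration, one possible version of such a function, sketched with scipy.ndimage.labeled_comprehension (not necessarily the exact implementation referred to above):

```python
import numpy as np
from scipy import ndimage


def label_statistics(field, labels, func=np.mean, default=np.nan):
    # Apply `func` to the data points belonging to each label in `labels`;
    # labels that are missing from the mask get `default`.
    index = np.arange(1, labels.max() + 1)
    return ndimage.labeled_comprehension(field, labels, index, func, float, default)
```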
I think just the most extreme threshold, as it would be difficult to do anything else without more complex logic/actually performing segmentation as well!
Yes, we should allow for functions that return multiple values (so e.g. calculate mean and standard deviation), but also a list of different functions. This could be something added at a later date, however.
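For illustration, the kind of interface being discussed might look roughly like this (the names below are hypothetical, not the final API):

```python
import numpy as np

# Hypothetical mapping of output column names to statistic functions;
# a single function may also return multiple values (e.g. several percentiles).
statistics = {
    "feature_mean": np.mean,
    "feature_std": np.std,
    "feature_percentiles": lambda values: np.percentile(values, [25, 50, 75]),
}


def apply_statistics(values, statistics):
    # Apply every requested statistic to the data points of one feature.
    return {name: func(values) for name, func in statistics.items()}
```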
To be fair, this is mainly due to the PBC and 3D changes; I just noticed it when looking at the code now! I'm sure with the switch to
Great, thanks @w-k-jones for more helpful input and clarifications! Yes, if you could point me to some of your examples, that would be great! Your function for calculating statistics over labeled regions already helps a lot.
I haven't run this fully yet on all the datasets I want to, but I am very happy with this so far! Nice addition!
tobac/utils/general.py (outdated):
```python
)
if bins[i] > bins[i - 1]
else default
for i in index
```
A longer-term comment: it may be worth subsetting before calculating statistics. My guess is that it would be faster, but I'm not quite sure. Not something to hold up this PR, though!
Can you clarify what you mean by subsetting in this case? :) Are you referring to the indices?
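One possible reading of "subsetting" here is restricting the calculation to each feature's bounding box before applying the statistic, e.g. (a rough sketch using scipy.ndimage.find_objects):

```python
import numpy as np
from scipy import ndimage


def label_statistics_subset(field, labels, func=np.mean):
    # Compute the statistic only within each label's bounding box instead of
    # masking the full array for every feature, which can save time on
    # large domains with small features.
    results = {}
    for label, bbox in enumerate(ndimage.find_objects(labels), start=1):
        if bbox is None:  # label value not present in the mask
            continue
        subset = field[bbox]
        results[label] = func(subset[labels[bbox] == label])
    return results
```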
@w-k-jones and @freemansw1 I think I addressed all your comments, so it would be nice if you could re-review whenever it is convenient for you! After we merge this PR, I will go ahead and prepare the 1.5.2 release with #337, #351 and #341, as we discussed in the last dev meeting.
Just a small change to replicate previous behaviour for the ncells when no segment is present, otherwise happy to merge.
Set default value for ncells as 0 instead of None to preserve behaviour from the previous implementation
Co-authored-by: William Jones <[email protected]>
Very nice change, @JuliaKukulies ! I've tested on a variety of datasets and seem to get reasonable results. I'm extremely happy that we're getting these changes in now.
Thank you very much for checking this new feature with different datasets! I will go ahead and merge when the Jupyter tests pass (which should be the case now that I have merged Will's last PR with the zenodo path changes).
Co-authored-by: William Jones <[email protected]>
Just found an annoying bug with missing segments in which all the features which don't have a label in the segment mask before the first present label will get the value
…to floats and causes statistic calculation to fail
Thanks for these fixes @w-k-jones! And ha, of course, I found and fixed another bug that caused the failing tests. Do you want to have another quick look @freemansw1 @w-k-jones, or are we ready to merge? And for the future: would it make sense to implement tests that check the datatypes of the output data frames, at least in all functions that modify the data frames?
I am happy for this to be merged.
Yes, we should probably introduce those as part of our typical suite.
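For illustration, such a check could look roughly like this (the column names and expected dtypes here are assumptions, not the actual tobac schema):

```python
import numpy as np
import pandas as pd


def test_feature_dataframe_dtypes():
    # Sketch of a dtype check on a dataframe shaped like the feature output.
    df = pd.DataFrame({"feature": [1, 2], "ncells": [0, 25], "mean": [np.nan, 3.2]})
    assert pd.api.types.is_integer_dtype(df["feature"])
    assert pd.api.types.is_integer_dtype(df["ncells"])
    assert pd.api.types.is_float_dtype(df["mean"])
```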
Finally, a PR that solves #153 by implementing an optional extraction of statistics on the data points belonging to each feature. This is done in the segmentation step and, if requested by the user, produces additional columns in the segmentation output dataframe.
Added tests and updated example notebook and documentation accordingly.
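For illustration, the intended usage is roughly along these lines (the statistic keyword and exact call signature are assumptions here, and features, field and dxy stand in for the output of an earlier feature detection step):

```python
import numpy as np
import tobac

# Hypothetical call: `features`, `field` and `dxy` come from earlier feature
# detection; the `statistic` keyword is assumed rather than confirmed.
mask, features_out = tobac.segmentation_2D(
    features,
    field,
    dxy,
    threshold=1e-3,
    statistic={"mean_value": np.mean, "max_value": np.max},
)
# `features_out` gains one additional column per requested statistic.
```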
As discussed in #153, a follow-up on this after the transition to xarray will be to save not only the bulk statistics of features but also the locations of all grid feature points. This is indirectly done in the mask files that can be combined with the actual data, but saving all feature data points would facilitate this analysis step for the user.
Tests added: test_get_statistics() and test_segmentation_multiple_features()
RC_v1.5.0