Let users optionally derive bulk statistics of the data points belonging to each feature #293

Merged

Commits (55)
4d0e9d2
added optional function to derive bulk statistics of segmented features
May 23, 2023
796d1e3
some fixes
May 23, 2023
58dca76
fixed the saving of percentiles for each feature in dataframe
May 25, 2023
fc636d6
black formatting
May 25, 2023
a3cd532
black formatting
May 25, 2023
d642596
added documentation of new output columns
Jun 5, 2023
0b28766
added csv
Jun 5, 2023
a585359
added tests, example, and modified documentation
Jun 5, 2023
3b570a4
merged in latest changes from RC_v1.5.0 and solved merge conflict
Jun 5, 2023
a265c16
and black formatting again
Jun 5, 2023
b577f22
modified example to avoid error when grouping pd.DataFrame
Jun 5, 2023
d359c44
addressed Kolyas review points
Jun 8, 2023
5dd3ba2
solve merge conflict and merge in latest changes v1.5.0
Aug 21, 2023
d81a4f2
changed get_statistics according to Wills approach
Aug 21, 2023
18b3cfd
allow to give functions that have array-like output or require input …
Aug 22, 2023
4942d9a
some fixes for implementation in segmentation_timestep
Aug 22, 2023
86dbabf
allow input parameter for all given functions and calculate ncells wi…
Aug 23, 2023
29f6830
fixes test_utils
Aug 23, 2023
3291179
added function for off-line calculation of bulk statistics
Aug 24, 2023
aefb6d6
implement bulk statistics in feature detection as well
Aug 25, 2023
165cc74
add example notebooks
Aug 25, 2023
ced3997
added two more example notebooks
Aug 28, 2023
0abc765
fixed test issues and added a better unit test
Aug 28, 2023
02037ab
black formatting
Aug 28, 2023
6712b35
fixed issue in test_bulk_statistics()
Aug 28, 2023
275ff9b
fixed notebook and test failure
Aug 28, 2023
7ac006f
formatting
Aug 28, 2023
e0cad23
Fix indexing of get_statistics results
w-k-jones Oct 7, 2023
cc8932a
merge in RC_v1.5.x and solve merge conflict in tobac/utils/__init__.py
Oct 9, 2023
21ab402
Update tobac/utils/general.py
JuliaKukulies Oct 11, 2023
708e7f2
Update tobac/utils/general.py
JuliaKukulies Oct 11, 2023
593d6d6
Update tobac/segmentation.py
JuliaKukulies Oct 11, 2023
7bdabe6
Fixed some of Wills suggestions
Oct 11, 2023
d39e1e6
Merge branch 'feature_statistics' of https://github.com/JuliaKukulies…
Oct 11, 2023
3f9d737
replace input parameter name for func_dict with 'statistic'
Oct 11, 2023
ce5c7d0
fixed tests and formatting
Oct 11, 2023
77a6d55
fixed example notebooks
Oct 11, 2023
296dafa
move code for statistics computation to end of feature detection outs…
Oct 11, 2023
e9c4212
avoid code duplicates by using functools.partial
Oct 11, 2023
26a1e02
black formatting
Oct 11, 2023
9c8b01b
added notebooks for RTD page
Oct 11, 2023
be432c4
fixed inpu parameter name change here as well
Oct 11, 2023
0f43cca
create submodule and fixed Union typehints
Oct 12, 2023
bb5d909
black formatting, fixed imports and init files for utils
Oct 12, 2023
9607b4d
addressed Seans comments for documentation
Oct 12, 2023
67b383f
fixed import as functions moved to new submodule
Oct 12, 2023
1c84353
Update tobac/segmentation.py
JuliaKukulies Oct 25, 2023
8c6d670
Update path to zenodo files in notebooks
w-k-jones Oct 25, 2023
838357f
Update tobac/utils/bulk_statistics.py
JuliaKukulies Nov 1, 2023
def442a
Fix initial features being missed and wrong dtype for result datafram…
w-k-jones Nov 2, 2023
8f8dea5
Fix setting with copy warnings
w-k-jones Nov 2, 2023
bccab49
Fix merge conflict
w-k-jones Nov 2, 2023
80ef32c
updated zenodo paths in these new notebooks as well
Nov 7, 2023
09adaf2
fixed bug in transform_featurepoints where feature IDs are converted …
Nov 7, 2023
55f1d66
formatting
Nov 7, 2023
7 changes: 7 additions & 0 deletions doc/segmentation_out_vars_statistics.csv
@@ -0,0 +1,7 @@
Variable Name,Description,Units,Type
feature_mean,Mean of feature data points,same as input field,float
feature_max,Maximum value of feature data points,same as input field,float
feature_min,Minimum value of feature data points,same as input field,float
feature_sum,Sum of feature data points,same as input field,float
major_axis_length,The length of the major axis of the ellipse that has the same normalized second central moments as the feature area,"number of grid cells, multiply by dx to get distance unit",float
feature_percentiles,Percentiles from 0 to 100 (in steps of 1) of the feature data distribution,same as input field,ndarray
9 changes: 9 additions & 0 deletions doc/segmentation_output.rst
@@ -9,3 +9,12 @@ Segmentation also outputs the same `pandas` dataframe as obtained by Feature Detection
:file: ./segmentation_out_vars.csv
:widths: 3, 35, 3, 3
:header-rows: 1

One can optionally get bulk statistics of the data points belonging to each segmented feature (i.e. either the 2D area or the 3D volume assigned to the feature). This is done by setting `statistics=True` when calling :func:`tobac.segmentation.segmentation`, which adds the following columns to the output dataframe:

.. csv-table:: tobac Segmentation Output Variables
:file: ./segmentation_out_vars_statistics.csv
:widths: 3, 35, 3, 3
:header-rows: 1

Note that these statistics refer to the data field used as input for the segmentation. It is possible to run the segmentation with different input data (see :doc:`transform segmentation`) to get statistics of a feature based on different variables (e.g. statistics of cloud top temperatures as well as rain rates for a certain storm object).
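
A minimal usage sketch of this option (a sketch only: the dxy and threshold values are illustrative assumptions, and the synthetic data comes from tobac's testing utilities rather than from this PR):

from tobac.testing import make_simple_sample_data_2D
from tobac.feature_detection import feature_detection_multithreshold
from tobac.segmentation import segmentation

# synthetic single-blob sample field (iris.cube.Cube)
field = make_simple_sample_data_2D()

# detect features, then segment with bulk statistics enabled
features = feature_detection_multithreshold(field, dxy=1000, threshold=8)
seg_mask, features_stats = segmentation(
    features, field, dxy=1000, threshold=8, statistics=True
)

# the output dataframe now carries the statistics columns listed above
print(features_stats[["feature", "feature_mean", "feature_max", "feature_sum"]])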
170 changes: 109 additions & 61 deletions examples/Example_Precip_Tracking/Example_Precip_Tracking.ipynb

Large diffs are not rendered by default.

57 changes: 57 additions & 0 deletions tobac/segmentation.py
@@ -34,6 +34,7 @@

import skimage
import numpy as np
import pandas as pd

from . import utils as tb_utils
from .utils import periodic_boundaries as pbc_utils
@@ -263,6 +264,7 @@ def segmentation_3D(
max_distance=None,
PBC_flag="none",
seed_3D_flag="column",
statistics=False,
):
"""Wrapper for the segmentation()-function."""

@@ -277,6 +279,7 @@
max_distance=max_distance,
PBC_flag=PBC_flag,
seed_3D_flag=seed_3D_flag,
statistics=statistics,
)


@@ -291,6 +294,7 @@ def segmentation_2D(
max_distance=None,
PBC_flag="none",
seed_3D_flag="column",
statistics=False,
):
"""Wrapper for the segmentation()-function."""
return segmentation(
@@ -304,6 +308,7 @@
max_distance=max_distance,
PBC_flag=PBC_flag,
seed_3D_flag=seed_3D_flag,
statistics=statistics,
)


@@ -322,6 +327,7 @@ def segmentation_timestep(
seed_3D_size=5,
segment_number_below_threshold=0,
segment_number_unassigned=0,
statistics=False,
):
"""Perform watershedding for an individual time step of the data. Works
for both 2D and 3D data
@@ -383,6 +389,9 @@
the marker to use to indicate a segmentation point is below the threshold.
segment_number_unassigned: int
the marker to use to indicate a segmentation point is above the threshold but unsegmented.
statistics: boolean, optional
Default is False. If True, bulk statistics for the data points assigned to each feature are saved in the output dataframe.


Returns
-------
@@ -1011,6 +1020,47 @@ def segmentation_timestep(
row["feature"]
]

if statistics:
(
feature_mean,
feature_max,
feature_min,
feature_percentiles,
feature_sum,
feature_axis,
) = tb_utils.general.get_statistics(
row["feature"], segmentation_mask.copy(), field_in.data.copy()
)
# write the statistics to feature output dataframe
features_out.loc[
features_out.feature == row["feature"], "feature_mean"
] = feature_mean
features_out.loc[
features_out.feature == row["feature"], "feature_max"
] = feature_max
features_out.loc[
features_out.feature == row["feature"], "feature_min"
] = feature_min
features_out.loc[
features_out.feature == row["feature"], "feature_sum"
] = feature_sum
features_out.loc[
features_out.feature == row["feature"], "major_axis_length"
] = feature_axis

# save percentiles of data distribution within feature
if index == features_out[features_out.ncells > 0].index[0]:
# here, we need to initialize the column first since .loc indexing does not work with pd.Series
features_out["feature_percentiles"] = np.nan
# store numpy array with percentiles as cell in data frame
df = pd.DataFrame({"feature_percentiles": [feature_percentiles]})
# get row index rather than pd.Dataframe index value since we need to use .iloc indexing
row_idx = np.where(features_out.feature == row["feature"])[0]
features_out.iloc[
row_idx,
features_out.columns.get_loc("feature_percentiles"),
] = df.apply(lambda r: tuple(r), axis=1)

return segmentation_out, features_out


@@ -1093,6 +1143,7 @@ def segmentation(
seed_3D_size=5,
segment_number_below_threshold=0,
segment_number_unassigned=0,
statistics=False,
):
"""Use watershedding to determine region above a threshold
value around initial seeding position for all time steps of
@@ -1114,6 +1165,9 @@
dxy : float
Grid spacing of the input data.

statistics : boolean, optional
Default is False. If True, bulk statistics for the data points assigned to each feature are saved in the output dataframe.

Output:
segmentation_out: iris.cube.Cube
Cloud mask, 0 outside and integer numbers according to track inside the cloud
@@ -1163,6 +1217,8 @@
the marker to use to indicate a segmentation point is below the threshold.
segment_number_unassigned: int
the marker to use to indicate a segmentation point is above the threshold but unsegmented.
statistics: boolean, optional
Default is False. If True, bulk statistics for the data points assigned to each feature are saved in the output dataframe.


Returns
@@ -1222,6 +1278,7 @@
seed_3D_size=seed_3D_size,
segment_number_unassigned=segment_number_unassigned,
segment_number_below_threshold=segment_number_below_threshold,
statistics=statistics,
)
segmentation_out_list.append(segmentation_out_i)
features_out_list.append(features_out_i)
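The percentile block above works around the fact that a whole ndarray cannot be assigned to a dataframe cell through a plain `.loc` call, which is why the column is initialized first and then written positionally. A standalone sketch of an equivalent pattern (using `.at` on an object column; not the PR's exact code, and the values are made up):

import numpy as np
import pandas as pd

features_out = pd.DataFrame({"feature": [1, 2], "ncells": [10, 4]})

# initialize the column first and give it object dtype, so a single cell
# can hold an array instead of a scalar
features_out["feature_percentiles"] = np.nan
features_out["feature_percentiles"] = features_out["feature_percentiles"].astype(object)

# 101 percentile values (increments of 1) of some made-up feature data
percentiles = np.percentile(np.arange(10.0), range(101))

# .at writes one cell, so the ndarray is stored intact rather than broadcast
idx = features_out.index[features_out.feature == 1][0]
features_out.at[idx, "feature_percentiles"] = percentiles
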
13 changes: 12 additions & 1 deletion tobac/tests/test_segmentation.py
Review comment:
We should add a test to check that the values output by statistics are correct

@@ -620,13 +620,24 @@ def test_segmentation_multiple_features():

# perform segmentation
out_seg_mask, out_df = segmentation.segmentation_timestep(
field_in=test_data_iris, features_in=fd_output, dxy=test_dxy, threshold=1.5
field_in=test_data_iris,
features_in=fd_output,
dxy=test_dxy,
threshold=1.5,
statistics=True,
)
out_seg_mask_arr = out_seg_mask.core_data()

# assure that the number of grid cells belonging to each feature (ncells) are consistent with segmentation mask
assert int(out_df[out_df.feature == 1].ncells.values) == size_feature1
assert int(out_df[out_df.feature == 2].ncells.values) == size_feature2
# assure that bulk statistic columns are created in output
assert out_df.columns.size - fd_output.columns.size > 1
# assure that statistics are calculated everywhere where an area for ncells is found
assert (
out_df.ncells[out_df["ncells"] > 0].shape
== out_df.ncells[out_df["feature_mean"] > 0].shape
)


# TODO: add more tests to make sure buddy box code is run.
57 changes: 57 additions & 0 deletions tobac/tests/test_utils.py
Review comment:
Would it be possible to split the 2D and 3D into separate tests?

@@ -398,3 +398,60 @@ def test_combine_tobac_feats():
)
assert np.all(list(combined_feat["old_feat_column"].values) == [1, 1])
assert np.all(list(combined_feat["feature"].values) == [1, 2])


def test_get_statistics():
"""
Test to assure that bulk statistics for identified features are computed as expected.

"""
featureID = 1

### Test 2D data
data_2D = tb_test.make_simple_sample_data_2D()[10].data
data_2D[data_2D > 8] = 10
data_2D[data_2D < 8] = 0
segmentation_mask = data_2D.copy()
segmentation_mask[segmentation_mask > 8] = 1
segmentation_mask = segmentation_mask.astype(int)

# get bulk statistics of identified features
(
feature_mean,
feature_max,
feature_min,
feature_percentiles,
feature_sum,
feature_axis,
) = tb_utils.general.get_statistics(featureID, segmentation_mask, data_2D)

# expected results
assert feature_mean == feature_max == feature_min == 10
assert feature_percentiles.size == 101
assert np.unique(feature_percentiles)[0] == 10
assert feature_sum == 10 * data_2D[data_2D == 10].size

### Test 3D data
data_3D = tb_test.make_sample_data_3D_3blobs()[10].data

data_3D[data_3D > 8] = 10
data_3D[data_3D < 8] = 0
segmentation_mask_3D = data_3D.copy()
segmentation_mask_3D[segmentation_mask_3D > 8] = 1
segmentation_mask_3D = segmentation_mask_3D.astype(int)

# get bulk statistics
(
feature_mean,
feature_max,
feature_min,
feature_percentiles,
feature_sum,
feature_axis,
) = tb_utils.general.get_statistics(featureID, segmentation_mask_3D, data_3D)

# expected results
assert feature_mean == feature_max == feature_min == 10
assert feature_percentiles.size == 101
assert np.unique(feature_percentiles)[0] == 10
assert feature_sum == 10 * data_3D[data_3D == 10].size
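
One way the reviewer's suggestion of separate 2D and 3D cases could be implemented is via pytest parametrization. A sketch only, assuming the tb_test/tb_utils imports already used in this module and, as in the test above, that no grid point equals exactly 8:

import numpy as np
import pytest

@pytest.mark.parametrize(
    "sample_data_func",
    [
        lambda: tb_test.make_simple_sample_data_2D()[10].data,
        lambda: tb_test.make_sample_data_3D_3blobs()[10].data,
    ],
    ids=["2D", "3D"],
)
def test_get_statistics_parametrized(sample_data_func):
    # build a binary feature field and matching segmentation mask
    data = sample_data_func()
    data[data > 8] = 10
    data[data < 8] = 0
    segmentation_mask = (data > 8).astype(int)

    (
        feature_mean,
        feature_max,
        feature_min,
        feature_percentiles,
        feature_sum,
        feature_axis,
    ) = tb_utils.general.get_statistics(1, segmentation_mask, data)

    # all points in the feature have value 10
    assert feature_mean == feature_max == feature_min == 10
    assert feature_percentiles.size == 101
    assert feature_sum == 10 * np.count_nonzero(data == 10)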
60 changes: 60 additions & 0 deletions tobac/utils/general.py
@@ -551,3 +551,63 @@ def combine_tobac_feats(list_of_feats, preserve_old_feat_nums=None):
combined_sorted["feature"] = np.arange(1, len(combined_sorted) + 1)
combined_sorted = combined_sorted.reset_index(drop=True)
return combined_sorted


def get_statistics(feature_ID, segmentation_mask, field_in):
"""
Derive bulk statistics of all data points that are attributed to a certain feature
after segmentation.

Parameters
----------
feature_ID: int
The ID of a certain feature for which to extract the statistics.
segmentation_mask: ndarray
2D or 3D segmentation mask for the timestep wherein the feature occurs.
field_in: ndarray
2D or 3D field with data points for specific timestep (should have the same shape as the segmentation mask).

Returns
-------

feature_mean: float
mean value of data points within feature
feature_max: float
max value of data points within feature
feature_min: float
min value of data points within feature
feature_percentiles: ndarray
percentiles from 0 to 100 of data distribution within feature
feature_sum: float
sum of all data points within feature (e.g. total precipitation)
feature_axis: float
length of major axis of feature

"""
from skimage.measure import regionprops

# get data points that belong to feature
data_points = field_in[segmentation_mask == feature_ID]

# get statistics for these data points
feature_mean = np.nanmean(data_points)
feature_max = np.nanmax(data_points)
feature_min = np.nanmin(data_points)
feature_percentiles = np.percentile(data_points, range(101))
feature_sum = np.nansum(data_points)

# get other region properties
segmentation_mask[
segmentation_mask != feature_ID
] = 0 # set segmentation mask for other features to 0
regions = regionprops(segmentation_mask)
feature_axis = regions[0].major_axis_length

return (
feature_mean,
feature_max,
feature_min,
feature_percentiles,
feature_sum,
feature_axis,
)
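
A minimal usage sketch of get_statistics on synthetic data (the field and mask here are made up for illustration):

import numpy as np

# 10x10 field with a single 3x3 feature labelled 1 in the mask
field = np.zeros((10, 10))
field[3:6, 3:6] = 5.0
mask = (field > 0).astype(int)

mean, vmax, vmin, percentiles, total, axis_length = get_statistics(1, mask, field)
# mean == vmax == vmin == 5.0, total == 45.0, and percentiles holds 101 values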