DAS-1177: Add functions and CF overrides to create artificial bounds … #5

Merged: 11 commits, Mar 14, 2024
Changes from 1 commit
15 changes: 15 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,18 @@
## v1.0.2
### 2024-02-26

This version of HOSS correctly handles edge-aligned geographic collections by
adding the attribute `cell_alignment` with the value `edge` to
`hoss_config.json` for edge-aligned collections (namely, ATL16), and by adding
functions that create pseudo bounds for those collections, so that HOSS uses
the `dimension_utilities.py` function `get_dimension_indices_from_bounds`.
Member:

Given that these are our public release notes, it might be worth explicitly stating in some way that the pseudo-bounds variables are not returned in the actual HOSS output.


This change also includes an addition of a CF override that addresses an
issue with the ATL16 metadata for the variables `/spolar_asr_obs_grid` and
`/spolar_lorate_blowing_snow_freq` where their `grid_mapping` attribute points
to north polar variables instead of south polar variables. This CF Override
will have to be removed if/when the metadata is corrected.
Member:

Nit: It's a bit of a strong statement to say it will have to be removed. It can be (and I think it would be good to do so), but it's unlikely that it would break anything by having an override that has the same information as the attribute it's overriding.


## v1.0.1
### 2023-12-19

2 changes: 1 addition & 1 deletion docker/service_version.txt
@@ -1 +1 @@
1.0.1
1.0.2
141 changes: 138 additions & 3 deletions hoss/dimension_utilities.py
@@ -19,7 +19,7 @@
from harmony.message import Message
from harmony.message_utility import rgetattr
from harmony.util import Config
from varinfo import VarInfoFromDmr
from varinfo import VarInfoFromDmr, VariableFromDmr

from hoss.bbox_utilities import flatten_list
from hoss.exceptions import InvalidNamedDimension, InvalidRequestedRange
@@ -75,8 +75,143 @@ def prefetch_dimension_variables(opendap_url: str, varinfo: VarInfoFromDmr,

    logger.info('Variables being retrieved in prefetch request: '
                f'{format_variable_set_string(required_dimensions)}')
    return get_opendap_nc4(opendap_url, required_dimensions, output_dir,
                           logger, access_token, config)

    required_dimensions_nc4 = get_opendap_nc4(opendap_url, required_dimensions,
                                              output_dir, logger, access_token,
                                              config)

    # Create bounds variables if necessary.
    add_bounds_variables(required_dimensions_nc4, required_dimensions,
                         varinfo, logger)

    return required_dimensions_nc4


def add_bounds_variables(dimensions_nc4: str,
                         required_dimensions: Set[str],
                         varinfo: VarInfoFromDmr,
                         logger: Logger) -> None:
    """ Augment a NetCDF4 file with artificial bounds variables for each
        dimension variable that is edge-aligned and does not already
        have bounds variables.

        For each dimension variable:
        (1) Check if the variable needs a bounds variable.
        (2) If so, create a bounds array of minimum and maximum values.
        (3) Then write the bounds variable to the NetCDF4 file.

    """
owenlittlejohns marked this conversation as resolved.
    with Dataset(dimensions_nc4, 'r+') as datasets:
        for dimension_name in required_dimensions:
            dimension_variable = varinfo.get_variable(dimension_name)
            if needs_bounds(dimension_variable) is True:
Member:

I think it is more pythonic to just test the returned value here:

Suggested change:
-            if needs_bounds(dimension_variable) is True:
+            if needs_bounds(dimension_variable):

ref: https://peps.python.org/pep-0008/#programming-recommendations

Collaborator (author):

Thanks for attaching the reference! I see that one there.

                min_and_max_bounds = create_bounds(datasets, dimension_name)
owenlittlejohns marked this conversation as resolved.
                write_bounds(datasets, dimension_variable, min_and_max_bounds)

                logger.info('Artificial bounds added for dimension variable: '
                            f'{dimension_name}')


def needs_bounds(dimension: VariableFromDmr) -> bool:
owenlittlejohns marked this conversation as resolved.
""" Check if a dimension variable needs a bounds variable.
This will be the case when dimension cells are edge-aligned
and bounds for that dimension do not already exist.

"""

return dimension.attributes['cell_alignment'] == 'edge' and dimension.references.get('bounds') == None
owenlittlejohns marked this conversation as resolved.


def create_bounds(dimension_dataset: Dataset,
                  dimension_path: str) -> np.ndarray:
    """ Create an array containing the minimum and maximum bounds
        for a given dimension.
Member:

Again - great level of description. If I was being a pedant (which I usually am), then it might be good to specifically say these are the minimum and maximum bounds for each pixel in a given dimension.


        The minimum and maximum values are determined under the assumption
        that the dimension data is monotonically increasing and contiguous.
        So for every bounds pair but the last, the bounds are simply extracted
        from the dimension dataset.

        The final bounds pair must be calculated with the assumption that
        the last data cell is edge-aligned and thus has a value that does
        not account for the cell length. So, the final bound is determined
        by taking the median of all the resolutions in the dataset to obtain
        a resolution that can be added to the final data value.

        Ex: Input dataset with resolution of 3 degrees: [ ... , 81, 84, 87]

            Minimum | Maximum
             <...>     <...>
              81        84
              84        87
              87        ? -> 87 + median resolution -> 87 + 3 -> 90
Member:

Brilliant example here.

Member:

I have a dumb question about edge-alignment.

Is it always the case that when something is edge aligned, the values are such that they represent the left side of the gridcell? or in the case of a vertical y-positive axis dimension, the bottom?

I'm probably confusing this with something else, but I can't find the reference at all.

Member:

I don't know that it's always the case, but I think that it's what we've seen. I tried to find a quick reference, but couldn't.

Collaborator:

For ATL16, the origin is bottom left. That's why. Most products are top left.
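(For anyone following along, a rough illustration of the two alignments for a hypothetical 3-degree grid; values are illustrative only:

# Center-aligned: values are cell midpoints:          [1.5, 4.5, 7.5] -> cells [0, 3], [3, 6], [6, 9]
# Edge-aligned: values are lower/left cell edges:     [0, 3, 6]       -> cells [0, 3], [3, 6], [6, 9]
)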


"""
# Access the dimension variable's data using the variable's full path.
dimension_data = dimension_dataset[dimension_path][:]

# Determine the dimension's resolution by taking the median value
# of the differences between each ascending data value.
dimension_array = np.array(dimension_data)
Member:

I'm a little confused - isn't dimension_array exactly the same thing as dimension_data? I might be missing something but I thought using netcdf4_variable[:] just returned you the numpy.ndarray of the data stored in the variable.

Collaborator (author):

You appear to be correct. Removing!


    # Build array.
    size = dimension_array.size
Member:

This feels like another redundant variable - just use dimension_array.size where you need it. (Also dimension_array.size is a bit more descriptive than just size)

    min_max_pairs = [[dimension_array[idx],
                      dimension_array[idx+1]]
                     for idx in range(0, size-1)]
Member:

It feels like you could avoid using lists at all and just go straight for a numpy.array:

Something like:

median_resolution = np.median(np.diff(dimension_array))

# Array is transpose of what is required for easier assignment of values (indices are [row, column]):
cell_bounds = np.zeros(shape=(2, dimension_array.size), dtype=dimension_array.dtype)

# Minimum values are equal to dimension pixel values (left pixel alignment):
cell_bounds[0] = dimension_array[:]

# Maximum values are the next dimension pixel values (left pixel alignment):
cell_bounds[1][:-1] = dimension_array[1:]

# Last maximum value is last pixel value (minimum) plus median resolution:
cell_bounds[1][-1] = dimension_array[-1] + median_resolution

# Return transpose of array to get correct shape:
return cell_bounds.T

The transpose fun-and-games is a bit of a faff, but I think it makes the initial assignment easier, but then you need to swap back to the expected shape of (N, 2), rather than (2, N).

Collaborator (author):

I took a moment to review this and test it out, and it looks like a simple, good approach. Neat!
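(As a quick check of the sketch above against the docstring example, with dimension values [81, 84, 87] at a 3-degree resolution, it produces the expected pairs:

import numpy as np

dimension_array = np.array([81, 84, 87])
median_resolution = np.median(np.diff(dimension_array))  # 3.0

cell_bounds = np.zeros(shape=(2, dimension_array.size),
                       dtype=dimension_array.dtype)
cell_bounds[0] = dimension_array[:]
cell_bounds[1][:-1] = dimension_array[1:]
cell_bounds[1][-1] = dimension_array[-1] + median_resolution

print(cell_bounds.T)
# [[81 84]
#  [84 87]
#  [87 90]]
)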


    # Calculate final values.
    dim_resolution = np.median(np.diff(dimension_array))
    min_max_pairs.append([dimension_array[size-1],
                          dimension_array[size-1] + dim_resolution])
Contributor:

I assume this is tested, so I'm suspicious of this observation, but... On line 160 above the min_max_pairs are being populated for the range (0 .. size-1). And then, here, an additional value is being populated. Wouldn't that be one beyond what we want to populate? Also on line 159, referencing dimension_array[idx+1] for idx = (size-1) - wouldn't that be one beyond the range of dimension_array?

Collaborator (author) @lyonthefrog, Feb 27, 2024:

Oh right! Python ranges include values up to, but not including, the second argument. So the final idx value in the min_max_pairs loop will be size-2, since the loop uses range(0, size-1).

So if the size of the dimension array is, say, 60, the loop writes pair values up to index 58, and the final value calculation then appends [dimension_array[60-1], dimension_array[60-1] + res], which amounts to a final index of 59.
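(A minimal sketch of that arithmetic, with a hypothetical size of 60:

size = 60
pair_indices = list(range(0, size - 1))  # [0, 1, ..., 58]
# For the largest idx (58), dimension_array[idx + 1] is index 59: in range.
# The append then supplies the pair for the final index, 59.
)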


    return np.array(min_max_pairs)


def write_bounds(dimension_dataset: Dataset,
Member:

This probably sounds like a nit, but it's something that confused me reading this code over:

Can we use a different variable name to dimension_dataset? Maybe prefetch_dataset? I got myself in a tangle thinking this was specifically a single dimension, rather than the whole NetCDF-4 file retrieved from OPeNDAP. (This change would also apply to create_bounds)

Collaborator (author):

Oh that is more descriptive. Thank you!

                 dimension_variable: VariableFromDmr,
                 min_and_max_bounds: np.ndarray) -> None:
    """ Write the input bounds array to a given dimension dataset.

        First a new dimension is created for the new bounds variable
        to allow the variable to be two-dimensional.

        Then the new bounds variable is created using two dimensions:
        (1) the existing dimension of the dimension dataset, and
        (2) the new bounds variable dimension.

    """
    # Create the second bounds dimension.
    dimension_full_name_path = dimension_variable.full_name_path
owenlittlejohns marked this conversation as resolved.
    dimension_group = '/' + '/'.join(dimension_full_name_path.split('/')[1:-1])
    dimension_name = dimension_full_name_path.split('/')[-1]

    # Consider the special case when the dimension group is the root group.
    # The dimension can't refer to the full path in its name, so it has to be
    # created with respect to the group it will be placed in.
Member:

(Totally just a) Nitpick: Maybe this comment could be truncated a tiny bit, and as it is specific to the first condition in the if/else block, it could probably go in that block itself:

if dimension_group == '/':
    # This is the comment specifically about this block...
    bounds_dim = ...
else:
    bounds_dim = ...

    if dimension_group == '/':
        bounds_dim = dimension_dataset.createDimension(
            dimension_name + '_bnds_dim', 2)
Member:

The standard naming convention is generally something like latv, lonv, or nv (for lat, lon and time dimensions). Probably the best thing would be to call this dimension dimension_name + 'v'.

Member:

Also, because this string is used in a couple of places, you should probably declare it as a variable and then refer to that variable. (Just so that if the definition changes in the future, you only have to update one place)

    else:
        bounds_dim = dimension_dataset[dimension_group].createDimension(
            dimension_name + '_bnds_dim', 2)
Collaborator (author):

It appears the createDimension function in the Python netCDF4 library doesn't recognize full path names, so I placed the dimension in the right group via the group reference in front of the function call. However, this doesn't appear to work when the dimension is in the root group.

There's got to be a better way to do this, but I wrestled with the netCDF4 library for a while to no avail.

Member:

Yeah - this is frustrating. I didn't find a quick alternative. Maybe there's a cleverer way to split the dimension basename off the group with string manipulation above, but it seems like the logic you have here to do one thing for the root group, and another for a nested one, is the right way to go. (It's also really frustrating that dimension_dataset['/'] doesn't work like other groups)
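(One way the special case could be folded into a single lookup; a hypothetical helper, not part of this change, and assuming the netCDF4-python behavior described above, where the root group is only reachable as the Dataset object itself:

def get_group(dataset: Dataset, group_path: str):
    # The root group is the Dataset object itself; nested groups can be
    # retrieved by path.
    return dataset if group_path == '/' else dataset[group_path]

bounds_dim = get_group(dimension_dataset, dimension_group).createDimension(
    dimension_name + '_bnds_dim', 2)
)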


    # Dimension variables only have one dimension - themselves.
    variable_dimension = dimension_dataset[dimension_full_name_path].dimensions[0]

    bounds_data_type = str(dimension_variable.data_type)
    bounds = dimension_dataset.createVariable(dimension_full_name_path + '_bnds',
Member:

Similar comment about reusing the bounds name - you define it on L211 - could you move that declaration up and use it here, too. (Or the base name from that declaration, at least)

Collaborator (author):

Did I update this like you were imagining? Since the last two variable names were independent of each other I wasn't entirely sure.

                                              bounds_data_type,
                                              (variable_dimension,
                                               bounds_dim,))
    # Write data to dataset file.
    size = len(min_and_max_bounds)
    for idx in range(0, size):
        bounds[idx, 0] = min_and_max_bounds[idx][0]
        bounds[idx, 1] = min_and_max_bounds[idx][1]
owenlittlejohns marked this conversation as resolved.
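(As an aside, netCDF4 variables support whole-array slice assignment, so the element-wise loop above could likely be collapsed; a sketch, assuming min_and_max_bounds has shape (N, 2):

bounds[:, :] = min_and_max_bounds
)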

    # Update varinfo attributes and references.
    bounds_name = dimension_name + '_bnds'
    dimension_dataset[dimension_full_name_path].setncatts({'bounds': bounds_name})
    dimension_variable.references['bounds'] = {bounds_name,}
    dimension_variable.attributes['bounds'] = bounds_name
owenlittlejohns marked this conversation as resolved.


def is_dimension_ascending(dimension: MaskedArray) -> bool:
27 changes: 27 additions & 0 deletions hoss/hoss_config.json
@@ -245,6 +245,20 @@
      ],
      "_Description": "Ensure variables in /Soil_Moisture_Retrieval_Data_Polar_PM group point to correct coordinate variables."
    },
    {
      "Applicability": {
        "Mission": "ICESat2",
        "ShortNamePath": "ATL16",
        "Variable_Pattern": ".*_grid_(lat|lon)"
      },
      "Attributes": [
        {
          "Name": "cell_alignment",
          "Value": "edge"
        }
      ],
      "_Description": "ATL16 has edge-aligned grid cells."
    },
    {
      "Applicability": {
        "Mission": "ICESat2",
@@ -357,6 +371,19 @@
      }
    ],
    "_Description": "Ensure the latitude and longitude dimension variables know their associated grid_mapping variable."
  },
  {
    "Applicability": {
      "Mission": "ICESat2",
      "ShortNamePath": "ATL16",
      "Variable_Pattern": "/spolar_(asr_obs_grid|lorate_blowing_snow_freq)"
    },
    "Attributes": [
      {
        "Name": "grid_mapping",
        "Value": "crs_latlon: spolar_grid_lat crs_latlon: spolar_grid_lon"
      }
    ]
Collaborator (author) @lyonthefrog, Feb 27, 2024:

This entry addresses what we think is a mistake in the ATL16 v004 data for the two indicated variables, where the grid_mapping attribute references north polar dimensions instead of south polar dimensions. Without this CF override, all the subsets that include either of these variables would fail due to the existing bug captured in DAS-1804.

Contributor:

For the record, that grid_mapping value is referencing the crs_latlon grid-mapping variable for both horizontal dimensions (lat, lon), here explicitly identified in the grid_mapping reference value. A simpler entry would be, I think, to specify "crs_latlon" as a grid_mapping variable that implicitly applies to all dimensions. This posted override is in keeping with the other ATL16 grid_mapping reference attributes that explicitly call out the dimensional applicability.

Member:

Totally agree that crs_latlon is simpler, but it seems best to match the format of the other entries for grid_mapping that NSIDC is using (which is still a valid format per the CF Conventions).
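(For illustration, the simpler form being discussed would reduce the override attribute to something like the following; a hypothetical alternative, not what was merged:

{
  "Name": "grid_mapping",
  "Value": "crs_latlon"
}

Both forms are valid per the CF Conventions; the extended form used here explicitly ties the grid-mapping variable to each horizontal coordinate variable.)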

Contributor:

I almost wonder if HOSS should be a bit less "diligent" in reading those attributes and instead use other heuristics (likely already in use) for identifying the relevant horizontal dimensions for grid mapping references. They seem superfluous in this case, and then they got them wrong!

Collaborator (author):

Is it HOSS reading those attributes, or varinfo? I saw that varinfo specifically checks the grid_mapping attribute if one exists.

Member:

Good point @lyonthefrog - it is VarInfoFromDmr doing the reading. And I think we want to keep earthdata-varinfo as aligned to the standard (CF Conventions) as possible. This format of representation is valid per the CF Conventions, so I think we're okay. The only intrinsic error here was that the metadata was incorrect in the file. (Quirky Data!)

    }
  ],
  "CF_Supplements": [