DAS-1177: Add functions and CF overrides to create artificial bounds … #5

Merged Mar 14, 2024 (11 commits)
Changes from 6 commits
15 changes: 15 additions & 0 deletions CHANGELOG.md
@@ -1,3 +1,18 @@
+## v1.0.2
+### 2024-02-26
+
+This version of HOSS correctly handles edge-aligned geographic collections by
+adding the attribute `cell_alignment` with the value `edge` to `hoss_config.json`
+for edge-aligned collections (namely, ATL16), and by adding functions that
+create pseudo bounds for edge-aligned collections so that HOSS uses the
+`dimension_utilities.py` function `get_dimension_indices_from_bounds`.
Comment (Member):

Given that these are our public release notes, it might be worth explicitly stating in some way that the pseudo-bounds variables are not returned in the actual HOSS output.


+This change also adds a CF override that addresses an issue with the ATL16
+metadata for the variables `/spolar_asr_obs_grid` and
+`/spolar_lorate_blowing_snow_freq`, where their `grid_mapping` attribute points
+to north polar variables instead of south polar variables. This CF override
+will have to be removed if/when the metadata is corrected.
Comment (Member):

Nit: It's a bit of a strong statement to say it will have to be removed. It can be (and I think it would be good to do so), but it's unlikely that it would break anything by having an override that has the same information as the attribute it's overriding.


## v1.0.1
### 2023-12-19

2 changes: 1 addition & 1 deletion docker/service_version.txt
@@ -1 +1 @@
-1.0.1
+1.0.2
148 changes: 145 additions & 3 deletions hoss/dimension_utilities.py
@@ -12,14 +12,15 @@
from logging import Logger
from typing import Dict, Set, Tuple

+from pathlib import PurePosixPath
from netCDF4 import Dataset
from numpy.ma.core import MaskedArray
import numpy as np

from harmony.message import Message
from harmony.message_utility import rgetattr
from harmony.util import Config
-from varinfo import VarInfoFromDmr
+from varinfo import VarInfoFromDmr, VariableFromDmr

from hoss.bbox_utilities import flatten_list
from hoss.exceptions import InvalidNamedDimension, InvalidRequestedRange
@@ -75,8 +76,149 @@ def prefetch_dimension_variables(opendap_url: str, varinfo: VarInfoFromDmr,

    logger.info('Variables being retrieved in prefetch request: '
                f'{format_variable_set_string(required_dimensions)}')
-    return get_opendap_nc4(opendap_url, required_dimensions, output_dir,
-                           logger, access_token, config)
+
+    required_dimensions_nc4 = get_opendap_nc4(opendap_url,
+                                              required_dimensions, output_dir,
+                                              logger, access_token, config)
+
+    # Create bounds variables if necessary.
+    add_bounds_variables(required_dimensions_nc4, required_dimensions,
+                         varinfo, logger)
+
+    return required_dimensions_nc4


+def add_bounds_variables(dimensions_nc4: str,
+                         required_dimensions: Set[str],
+                         varinfo: VarInfoFromDmr,
+                         logger: Logger) -> None:
+    """ Augment a NetCDF4 file with artificial bounds variables for each
+        dimension variable that is edge-aligned and does not already
+        have bounds variables.
Comment (Member):

Total nitpick, but it might be worth adding to this first paragraph that the need for adding artificial bounds variables is identified by the earthdata-varinfo configuration file.

Comment (Collaborator Author):

Do you mean that I should say we're making these bounds by adding the CF override to hoss_config.json, or something more directly in earthdata-varinfo?

Comment (Contributor):

I think the latter - i.e., "Augment a NetCDF4 file with artificial bounds variables for each dimension variable that is edge-aligned and does not already have bounds variables." plus: "edge-aligned" and "has-bounds" are defined by earthdata-varinfo attributes, which come from the netcdf4 variable attributes, or if necessary by overrides in the configuration file.

Comment (Member):

I might slightly disagree with some of the text above. "edge-aligned" doesn't appear in any NetCDF-4 file, because it's something we've coined (and really wish was in the NetCDF-4 file). Also (yet another nit) the metadata attribute for bounds is just `bounds`, not `has-bounds`.

Comment (Collaborator Author):

But the cell_alignment/edge-aligned attribute only exists in hoss_config.json, right? Does that make it an earthdata-varinfo attribute? Like,

"Augment a NetCDF4 file with artificial bounds variables for each dimension variable that has been identified by hoss_config.json to have an edge-aligned attribute"?

Or instead of hoss_config.json, we say "earthdata-varinfo configuration file" as a catch-all for all applications?

(I may be over-complicating this, I'm sorry!)


+        For each dimension variable:
+        (1) Check if the variable needs a bounds variable.
+        (2) If so, create a bounds array from within the `write_bounds` function.
+        (3) Then write the bounds variable to the NetCDF4 URL.
+
+    """
+    with Dataset(dimensions_nc4, 'r+') as prefetch_dataset:
+        for dimension_name in required_dimensions:
+            dimension_variable = varinfo.get_variable(dimension_name)
+            if needs_bounds(dimension_variable):
+                write_bounds(prefetch_dataset, dimension_variable)
+
+                logger.info('Artificial bounds added for dimension variable: '
+                            f'{dimension_name}')


+def needs_bounds(dimension: VariableFromDmr) -> bool:
+    """ Check if a dimension variable needs a bounds variable.
+        This will be the case when dimension cells are edge-aligned
+        and bounds for that dimension do not already exist.
+
+    """
+    return (
+        dimension.attributes.get('cell_alignment') == 'edge'
+        and dimension.references.get('bounds') is None
+    )


+def get_bounds_array(prefetch_dataset: Dataset,
+                     dimension_path: str) -> np.ndarray:
+    """ Create an array containing the minimum and maximum bounds
+        for each pixel in a given dimension.
+
+        The minimum and maximum values are determined under the assumption
+        that the dimension data is monotonically increasing and contiguous.
+        So for every bounds but the last, the bounds are simply extracted
+        from the dimension dataset.
+
+        The final bounds must be calculated with the assumption that
+        the last data cell is edge-aligned and thus has a value that does
+        not account for the cell length. So, the final bound is determined
+        by taking the median of all the resolutions in the dataset to obtain
+        a resolution that can be added to the final data value.
+
+        Ex: Input dataset with resolution of 3 degrees: [ ... , 81, 84, 87]
+
+        Minimum | Maximum
+         <...>  |  <...>
+          81    |   84
+          84    |   87
+          87    |   ? -> 87 + median resolution -> 87 + 3 -> 90
Comment (Member):
Brilliant example here.

Comment (Member):

I have a dumb question about edge-alignment.

Is it always the case that when something is edge aligned, the values are such that they represent the left side of the gridcell? Or, in the case of a vertical y-positive axis dimension, the bottom?

I'm probably confusing this with something else, but I can't find the reference at all.

Comment (Member):

I don't know that it's always the case, but I think that it's what we've seen. I tried to find a quick reference, but couldn't.

Comment (Collaborator):

For ATL16, the origin is bottom left. That's why. Most products are top left.
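For concreteness, a small sketch (hypothetical grid values, not from this PR) of edge alignment with a lower-left origin versus centre alignment, for a 3-degree grid:

```python
import numpy as np

# Hypothetical 3-degree grid spanning 0 to 9 degrees.
cell_edges = np.array([0.0, 3.0, 6.0, 9.0])

# Edge-aligned with a lower-left origin: each dimension value is the
# lower edge of its grid cell.
edge_aligned = cell_edges[:-1]          # [0.0, 3.0, 6.0]

# Centre-aligned: each dimension value is the midpoint of its grid cell.
centre_aligned = cell_edges[:-1] + 1.5  # [1.5, 4.5, 7.5]
```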


"""
# Access the dimension variable's data using the variable's full path.
dimension_array = prefetch_dataset[dimension_path][:]

median_resolution = np.median(np.diff(dimension_array))
Comment (Member):
Doesn't the dimension array have constant differences between cells?
i.e. Value_2 - Value_1 == Value_{N+1} - Value_N for all N?

If so, this is probably doing more math than we need; we could just compute the difference between any two adjacent cells. If not, then this won't be precisely accurate. Is this "good enough" for the subsetting we're going for?

Comment (Collaborator Author, @lyonthefrog, Feb 28, 2024):

> Is it always the case that when something is edge aligned, the values are such that they represent the left side of the gridcell?

No, I'm pretty sure any type of edge alignment is possible. We decided to stick with handling the lower left alignment case since that's ATL16's alignment and accounting for all the possible alignments could balloon very quickly. I think we're thinking/hoping that this is the most common edge-alignment too. I might have just made some of that up though, what say you @owenlittlejohns?

All that being said, this should probably be documented in the code, huh.

Comment (Collaborator Author):

> Doesn't the dimension array have constant differences between cells?

@autydp and @owenlittlejohns looked over this calculation with me, and they figured the cell differences might not always be constant. So yes, what I currently have here is "good enough". They can weigh in though!

Comment (Member):

I might still argue that the (N-2, N-1) difference is a better predictor of the (N-1, N) difference than the entire median, but I will not push it.

Comment (Member):

I think the main thing here is that there can be a bit of floating-point level variation. Maybe that's not super important for the very last pixel, but it's bitten us before at times for requests with bounding boxes hitting the exact boundary between pixels. My preference for the median diff comes from a combination of that, and that it's still all a pretty straightforward one-liner to get it with numpy.
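A standalone sketch (hypothetical values) of the floating-point variation described above, showing why the median of the differences is a robust estimate of the nominal resolution:

```python
import numpy as np

# A nominally 3-degree grid with floating-point jitter in its stored values.
dimension_array = np.array([81.0, 84.0, 86.9999999, 90.0000001])

differences = np.diff(dimension_array)  # [3.0, 2.9999999, 3.0000002]

# Any single difference may carry the jitter:
last_difference = differences[-1]  # 3.0000002

# The median recovers the nominal resolution:
median_resolution = np.median(differences)  # 3.0
```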


+    # This array is the transpose of what is required, just for easier assignment
+    # of values (indices are [row, column]) during the bounds calculations:
+    cell_bounds = np.zeros(shape=(2, dimension_array.size), dtype=dimension_array.dtype)
+
+    # Minimum values are equal to the dimension pixel values (for lower left pixel alignment):
+    cell_bounds[0] = dimension_array[:]
+
+    # Maximum values are the next dimension pixel values (for lower left pixel alignment),
+    # so these values almost mirror the minimum values but start at the second pixel
+    # instead of the first. Here we calculate each bound except for the very last one.
+    cell_bounds[1][:-1] = dimension_array[1:]
+
+    # Last maximum value is the last pixel value (minimum) plus the median resolution:
+    cell_bounds[1][-1] = dimension_array[-1] + median_resolution
+
+    # Return transpose of array to get correct shape:
+    return cell_bounds.T
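As a self-contained check of this logic, the docstring's 3-degree example can be reproduced on a plain array (the real function reads `dimension_array` from the prefetch `Dataset`):

```python
import numpy as np

dimension_array = np.array([81.0, 84.0, 87.0])
median_resolution = np.median(np.diff(dimension_array))  # 3.0

cell_bounds = np.zeros(shape=(2, dimension_array.size), dtype=dimension_array.dtype)
cell_bounds[0] = dimension_array[:]        # minima: [81., 84., 87.]
cell_bounds[1][:-1] = dimension_array[1:]  # maxima (all but last): [84., 87.]
cell_bounds[1][-1] = dimension_array[-1] + median_resolution  # final maximum: 90.

print(cell_bounds.T)
# [[81. 84.]
#  [84. 87.]
#  [87. 90.]]
```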


+def write_bounds(prefetch_dataset: Dataset,
+                 dimension_variable: VariableFromDmr) -> None:
+    """ Write the input bounds array to a given dimension dataset.
+
+        First a new dimension is created for the new bounds variable
+        to allow the variable to be two-dimensional.
+
+        Then the new bounds variable is created using two dimensions:
+        (1) the existing dimension of the dimension dataset, and
+        (2) the new bounds variable dimension.
+
+    """
+    # Create the bounds array.
Comment (Member):

Nitpick: Feels like a superfluous comment.

+    bounds_array = get_bounds_array(prefetch_dataset,
+                                    dimension_variable.full_name_path)
+
+    # Create the second bounds dimension.
+    dimension_name = str(PurePosixPath(dimension_variable.full_name_path).name)
+    dimension_group = str(PurePosixPath(dimension_variable.full_name_path).parent)
+    bounds_name = dimension_name + '_bnds'
+    bounds_full_path_name = dimension_variable.full_name_path + '_bnds'
+    bounds_dimension_name = dimension_name + 'v'
+    # Consider the special case when the dimension group is the root directory.
+    # The dimension can't refer to the full path in the name itself, so we have
+    # to create it with respect to the group we want to place it in.
Comment (Member):

(Totally just a) Nitpick: Maybe this comment could be truncated a tiny bit, and as it is specific to the first condition in the if/else block, it could probably go in that block itself:

if dimension_group == '/':
    # This is the comment specifically about this block...
    bounds_dim = ...
else:
    bounds_dim = ...

+    if dimension_group == '/':
+        bounds_dim = prefetch_dataset.createDimension(bounds_dimension_name, 2)
+    else:
+        bounds_dim = prefetch_dataset[dimension_group].createDimension(bounds_dimension_name, 2)

+    # Dimension variables only have one dimension - themselves.
+    variable_dimension = prefetch_dataset[dimension_variable.full_name_path].dimensions[0]
+
+    bounds_data_type = str(dimension_variable.data_type)
+    bounds = prefetch_dataset.createVariable(bounds_full_path_name,
+                                             bounds_data_type,
+                                             (variable_dimension,
+                                              bounds_dim,))
+
+    # Write data to the new variable in the prefetch dataset.
+    bounds[:] = bounds_array[:]
+
+    # Update varinfo attributes and references.
+    prefetch_dataset[dimension_variable.full_name_path].setncatts({'bounds': bounds_name})
+    dimension_variable.references['bounds'] = {bounds_name, }
+    dimension_variable.attributes['bounds'] = bounds_name
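A usage sketch for inspecting the result; the file name and dimension variable here are hypothetical stand-ins, not names taken from this change:

```python
from netCDF4 import Dataset

# 'prefetch.nc4' and '/npolar_grid_lat' are hypothetical examples.
with Dataset('prefetch.nc4', 'r') as dataset:
    dimension = dataset['/npolar_grid_lat']
    print(dimension.getncattr('bounds'))           # npolar_grid_lat_bnds
    print(dataset['/npolar_grid_lat_bnds'].shape)  # (<dimension size>, 2)
```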


def is_dimension_ascending(dimension: MaskedArray) -> bool:
27 changes: 27 additions & 0 deletions hoss/hoss_config.json
@@ -245,6 +245,20 @@
      ],
      "_Description": "Ensure variables in /Soil_Moisture_Retrieval_Data_Polar_PM group point to correct coordinate variables."
    },
+    {
+      "Applicability": {
+        "Mission": "ICESat2",
+        "ShortNamePath": "ATL16",
+        "Variable_Pattern": ".*_grid_(lat|lon)"
+      },
+      "Attributes": [
+        {
+          "Name": "cell_alignment",
+          "Value": "edge"
+        }
+      ],
+      "_Description": "ATL16 has edge-aligned grid cells."
+    },
    {
      "Applicability": {
        "Mission": "ICESat2",

@@ -357,6 +371,19 @@
        }
      ],
      "_Description": "Ensure the latitude and longitude dimension variables know their associated grid_mapping variable."
    },
+    {
+      "Applicability": {
+        "Mission": "ICESat2",
+        "ShortNamePath": "ATL16",
+        "Variable_Pattern": "/spolar_(asr_obs_grid|lorate_blowing_snow_freq)"
+      },
+      "Attributes": [
+        {
+          "Name": "grid_mapping",
+          "Value": "crs_latlon: spolar_grid_lat crs_latlon: spolar_grid_lon"
+        }
+      ]
Comment (Collaborator Author, @lyonthefrog, Feb 27, 2024):

This entry addresses what we think is a mistake in the ATL16 v004 data for the two indicated variables, where the grid_mapping attribute references north polar dimensions instead of south polar dimensions. Without this CF override, all the subsets that include either of these variables would fail due to the existing bug captured in DAS-1804.

Comment (Contributor):

For the record, that grid_mapping value is referencing the crs_latlon grid-mapping variable for both horizontal dimensions (lat, lon), here explicitly identified in the grid_mapping reference value. A simpler entry would be, I think, to specify "crs_latlon" as a grid_mapping variable that implicitly applies to all dimensions. This posted override is in keeping with the other ATL16 grid_mapping reference attributes that explicitly call out the dimensional applicability.

Comment (Member):

Totally agree that crs_latlon is simpler, but it seems best to match the format of the other entries for grid_mapping that NSIDC is using (which is still a valid format per the CF Conventions).

Comment (Contributor):

I almost wonder if HOSS should be a bit less "diligent" in reading those attributes and instead use other heuristics (likely already in use) for identifying the relevant horizontal dimensions for grid mapping references. They seem superfluous in this case, and then they got them wrong!

Comment (Collaborator Author):

Is it HOSS reading those attributes, or varinfo? I saw that varinfo specifically checks the grid_mapping attribute if one exists.

Comment (Member):

Good point @lyonthefrog - it is VarInfoFromDmr doing the reading. And I think we want to keep earthdata-varinfo as aligned to the standard (CF Conventions) as possible. This format of representation is valid per the CF Conventions, so I think we're okay. The only intrinsic error here was that the metadata was incorrect in the file. (Quirky Data!)
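To illustrate the two CF-valid forms discussed above, a hypothetical netCDF4 sketch (neither line is part of this change; the extended form matches the override's value):

```python
from netCDF4 import Dataset

with Dataset('ATL16_example.nc4', 'r+') as dataset:  # hypothetical file
    variable = dataset['spolar_asr_obs_grid']

    # Simple form: name the grid-mapping variable only.
    variable.setncattr('grid_mapping', 'crs_latlon')

    # Extended form (as in the override above): each horizontal coordinate
    # is listed explicitly against the grid-mapping variable.
    variable.setncattr(
        'grid_mapping',
        'crs_latlon: spolar_grid_lat crs_latlon: spolar_grid_lon')
```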

+    }
  ],
  "CF_Supplements": [