NFC normalize strings? #1379

ChrisBarker-NOAA · 2024-10-28T06:04:10Z

The NUG indicates that strings (dimension and variable names, anyway) should be NFC normalized.

"""
... names are normalized according to Unicode NFC normalization rules during encoding as UTF-8 for storing in the file header. This is necessary to ensure that gratuitous differences in the representation of Unicode names do not cause anomalies in comparing files and querying data objects by name.
"""

(And the next CF release will specify NFC normalization for all text.)

But as far as I can tell, netCDF4 isn't doing that. It probably should.

I think it may be as easy as adding:

import unicodedata
pystr = unicodedata.normalize('NFC', pystr)

to _strencode()

Granted -- this does mean that users may get something slightly different back when they round-trip a name through netcdf.

If that's a concern, then you could call unicodedata.is_normalized, and raise an error instead.
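To make the two options concrete, here is a minimal sketch using only the standard library. The helper name `prepare_name` and its `strict` flag are hypothetical, made up for illustration; this is not the actual code of `_strencode()` or anything in netCDF4.

```python
import unicodedata


def prepare_name(name: str, strict: bool = False) -> str:
    """Return an NFC-normalized name, or raise if strict and not normalized.

    Hypothetical helper for illustration only; not part of netCDF4.
    """
    if unicodedata.is_normalized("NFC", name):
        return name
    if strict:
        raise ValueError(f"name {name!r} is not NFC normalized")
    return unicodedata.normalize("NFC", name)


decomposed = "separate\u0065\u0301"  # 'e' followed by a combining acute accent
composed = prepare_name(decomposed)

print(composed == decomposed)          # False: the round-tripped name differs
print(len(decomposed), len(composed))  # 10 9
```

With `strict=True` the same input raises `ValueError` instead of being changed silently, which is the "raise an error" variant. Note that `unicodedata.is_normalized` requires Python 3.8 or later.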


jswhit · 2024-10-29T00:00:50Z

That section of the NUG only applies to netcdf classic, not HDF5. Plus, I read that as meaning that the library does that for you (so the python layer doesn't need to).

ChrisBarker-NOAA · 2024-10-29T00:40:52Z

Hmm -- I'm pretty sure that all variable and dimension names are supposed to be NFC normalized. That section of the NUG does talk about the header, so yes, it's probably only vital for netcdf classic. But it's still a good idea, and CF will be requiring it anyway.

The search on the NUG is broken, so I'm having a hard time finding what I'm looking for :-(

"The library does that for you"

I doubt it -- but worth a look. It would be great if it did.

I'll try to poke into it.

ChrisBarker-NOAA · 2024-10-29T01:35:15Z

OK -- I've poked into it, and you are completely correct -- the netCDF C lib is NFC normalizing variable names. Here's an experiment with netCDF4:

import netCDF4
import unicodedata


normal_name = "composed\u00E9"

non_normal_name = "separate\u0065\u0301"

with netCDF4.Dataset("nfc-norm.nc", 'w') as ds:
    dim = ds.createDimension("a_dim", 10)
    var1 = ds.createVariable(normal_name, float, ("a_dim",))
    var2 = ds.createVariable(non_normal_name, float, ("a_dim",))
    var1[:] = range(10)
    var2[:] = range(10)


with netCDF4.Dataset("nfc-norm.nc", 'r') as ds:
    # get the vars from their original names
    try:
        norm = ds[normal_name]
        print(f"{normal_name} worked")
    except IndexError:
        print(f"{normal_name} didn't work")

    try:
        non_norm = ds[non_normal_name]
        print(f"{non_normal_name} worked")
    except IndexError:
        print(f"{non_normal_name} didn't work")
        non_norm = ds[unicodedata.normalize('NFC', non_normal_name)]
        print("But it  did once normalized!")

    for name in ds.variables.keys():
        assert unicodedata.is_normalized('NFC', name)
    print("All variable names are normalized")

And when run:

In [54]: run nfc_norm.py
composedé worked
separateé didn't work
But it  did once normalized!
All variable names are normalized

So indeed, the C lib is doing it for you -- nothing to be done here.

Except maybe a note in the docs ...

ChrisBarker-NOAA · 2024-10-29T18:41:55Z

Another potential issue -- not sure if this is something that should be built in to the lib:

The next version of CF will specify that attributes should be NFC normalized. This is because a number of CF attributes reference variable names, so they really need to be exact / compare equally.

I just tested, and string attributes are not being normalized.

So the netCDF4 lib could normalize attributes too.

(so could the C lib, but I'm guessing they won't want to go there -- it's not critical to netcdf itself)
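As a rough sketch of what normalizing attributes could look like on the Python side, assuming it is done before the value is handed to the library (for example before a call such as `setncattr`): the helper name `normalize_attr` is hypothetical and does not exist in netCDF4, and the `coordinates` value is just an example of a CF attribute that references variable names.

```python
import unicodedata


def normalize_attr(value):
    """NFC-normalize str values (and str items in lists/tuples).

    Hypothetical helper for illustration only; non-string values
    (numbers, arrays, ...) are passed through unchanged.
    """
    if isinstance(value, str):
        return unicodedata.normalize("NFC", value)
    if isinstance(value, (list, tuple)):
        return type(value)(normalize_attr(v) for v in value)
    return value


# The variable name as the C lib stores it (NFC normalized)
var_name = unicodedata.normalize("NFC", "separate\u0065\u0301")

# An attribute that references that variable, written in decomposed form
coordinates = "lat lon separate\u0065\u0301"

print(var_name in coordinates.split())                  # False
print(var_name in normalize_attr(coordinates).split())  # True
```

This shows the mismatch: the stored variable name is normalized, the attribute that points at it is not, so a plain string comparison fails until the attribute is normalized too.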

ChrisBarker-NOAA closed this as completed Oct 29, 2024

ChrisBarker-NOAA reopened this Oct 29, 2024
