NFC normalize strings? #1379

Open
ChrisBarker-NOAA opened this issue Oct 28, 2024 · 4 comments

@ChrisBarker-NOAA

The NUG indicates that strings (dimension and variable names, anyway) should be NFC normalized.

"""
... names are normalized according to Unicode NFC normalization rules during encoding as UTF-8 for storing in the file header. This is necessary to ensure that gratuitous differences in the representation of Unicode names do not cause anomalies in comparing files and querying data objects by name.
"""

(and the next CF release will specify NFC normalization for all text)

But as far as I can tell, netCDF4 isn't doing that. It probably should.

I think it may be as easy as adding:

import unicodedata
pystr = unicodedata.normalize('NFC', pystr)

to _strencode()

Granted -- this does mean that users may get something slightly different back when they round-trip a name through netcdf.

If that's a concern, then you could call unicodedata.is_normalized, and raise an error instead.
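The two options above (normalize silently, or reject non-normalized names) could be sketched roughly like this. `nfc_name` and its `strict` flag are hypothetical names for illustration and are not part of netCDF4; only the `unicodedata` calls are real (`is_normalized` requires Python 3.8+).

```python
import unicodedata


def nfc_name(name: str, strict: bool = False) -> str:
    """Return `name` in NFC form (hypothetical helper, not a netCDF4 API).

    With strict=True, a non-normalized name raises instead of being changed.
    """
    if unicodedata.is_normalized("NFC", name):
        return name
    if strict:
        raise ValueError(f"name {name!r} is not NFC normalized")
    return unicodedata.normalize("NFC", name)


# 'e' followed by a combining acute accent (U+0301)
decomposed = "separate\u0065\u0301"
composed = nfc_name(decomposed)

# The normalized form is one code point shorter.
print(len(decomposed), len(composed))  # 10 9
```

The strict variant avoids surprising users with a name that differs from what they passed in, at the cost of pushing the normalization step onto them.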

@jswhit

jswhit commented Oct 29, 2024

That section of the NUG only applies to netcdf classic, not HDF5. Plus, I read that as meaning that the library does that for you (so the python layer doesn't need to).

@ChrisBarker-NOAA

Hmm -- I'm pretty sure that all variable and dimension names are supposed to be NFC normalized. That section of the NUG does talk about the header, so yes, it's probably only vital for netcdf classic. But it's still a good idea, and CF will be requiring it anyway.

The search on the NUG is broken, so I'm having a hard time finding what I'm looking for :-(

"""
The library does that for you,
"""

I doubt it -- but worth a look. It would be great if it did.

I'll try to poke into it.

@ChrisBarker-NOAA

OK -- I've poked into it, and you are completely correct -- the netCDF C lib is NFC normalizing variable names. Here's an experiment with netCDF4:

import unicodedata

import netCDF4

# Two names ending in the same visible "é":
# one precomposed (U+00E9), one as 'e' plus a combining acute accent (U+0301).
normal_name = "composed\u00E9"
non_normal_name = "separate\u0065\u0301"

with netCDF4.Dataset("nfc-norm.nc", 'w') as ds:
    dim = ds.createDimension("a_dim", 10)
    # note the trailing comma: ("a_dim") is just a string, ("a_dim",) is a tuple
    var1 = ds.createVariable(normal_name, float, ("a_dim",))
    var2 = ds.createVariable(non_normal_name, float, ("a_dim",))
    var1[:] = range(10)
    var2[:] = range(10)


with netCDF4.Dataset("nfc-norm.nc", 'r') as ds:
    # get the vars from their original names
    try:
        norm = ds[normal_name]
        print(f"{normal_name} worked")
    except IndexError:
        print(f"{normal_name} didn't work")

    try:
        non_norm = ds[non_normal_name]
        print(f"{non_normal_name} worked")
    except IndexError:
        print(f"{non_normal_name} didn't work")
        # retry the lookup with the NFC-normalized form of the name
        non_norm = ds[unicodedata.normalize('NFC', non_normal_name)]
        print("But it  did once normalized!")

    # every name stored in the file should come back NFC normalized
    for name in ds.variables.keys():
        assert unicodedata.is_normalized('NFC', name)
    print("All variable names are normalized")

And when run:

In [54]: run nfc_norm.py
composedé worked
separateé didn't work
But it  did once normalized!
All variable names are normalized

So indeed, the C lib is doing it for you -- nothing to be done here.

Except maybe a note in the docs ...
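For a docs note, the reason the un-normalized lookup fails can be shown without netCDF at all. This is only a sketch: the plain dict `stored` is a stand-in for `ds.variables` (an assumption for illustration), with its key normalized the way the experiment above shows the stored names come back.

```python
import unicodedata

composed = "caf\u00e9"     # precomposed é
decomposed = "cafe\u0301"  # 'e' + combining acute accent

# The two spellings render identically but are different code point sequences,
# so Python compares them as unequal.
assert composed != decomposed

# Stand-in for ds.variables: the stored key is the NFC form.
stored = {unicodedata.normalize("NFC", decomposed): "data"}

found_raw = decomposed in stored
found_normalized = unicodedata.normalize("NFC", decomposed) in stored
print(found_raw, found_normalized)  # False True
```

So a user who created a variable with a decomposed name would need to normalize that name before looking it up again.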

@ChrisBarker-NOAA

Another potential issue -- not sure if this is something that should be built in to the lib:

The next version of CF will specify that attributes should be NFC normalized. This is because a number of CF attributes reference variable names, so they really need to be exact / compare equally.

I just tested, and string attributes are not being normalized.

So the netCDF4 lib could normalize attributes too.

(so could the C lib, but I'm guessing they won't want to go there -- it's not critical to netcdf itself)
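Until something like that is built in, string attributes can be normalized on the user side before they are written. A minimal sketch, assuming attributes are collected in a plain dict first; `nfc_attrs` is a hypothetical helper, not part of netCDF4, and it leaves non-string values untouched.

```python
import unicodedata


def nfc_attrs(attrs):
    """Return a copy of `attrs` with string values NFC normalized.

    Hypothetical helper: strings, and strings inside lists or tuples,
    are normalized; everything else (numbers, arrays) passes through.
    """
    def norm(value):
        if isinstance(value, str):
            return unicodedata.normalize("NFC", value)
        if isinstance(value, (list, tuple)):
            return type(value)(norm(v) for v in value)
        return value

    return {key: norm(value) for key, value in attrs.items()}


# A CF-style attribute that references a variable name spelled in decomposed form.
attrs = {"coordinates": "lon late\u0301", "scale_factor": 0.5}
clean = nfc_attrs(attrs)
```

The cleaned dict could then be passed to `setncatts()` on a Dataset or Variable, so that an attribute like `coordinates` compares equal to the variable name as stored in the file.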
