Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update ESM1.5 driver file naming. #93

Merged
merged 14 commits into from
Sep 23, 2024
Merged

Conversation

blimlim
Copy link
Collaborator

@blimlim blimlim commented Aug 29, 2024

This pr attempts to close #91.

It updates the naming of the output netCDF files to include a YYYYMM date string, and a suffix stating the data frequency in a file. E.g. the fields file aiihca.paa1jan gets converted to aiihca.pa-010101_mon.nc following the convention in the p73 archive.

Getting this information involves two steps:

  1. Reading the file date from the fields file header, via mule.FixedLengthHeader.from_file
  2. Converting the unit key in the file name (the 9th character) to a frequency suffix, which should be one of mon,dai,6hr, or 3hr.

Originally setting the output filepath was handled by get_nc_write_path(), which was called from inside convert_fields_file_list() and just glued .nc onto the name:

for fields_file_path in fields_file_paths:
nc_write_path = get_nc_write_path(fields_file_path, nc_write_dir)

I thought adding changes 1 and 2 to the naming here would involve burying more I/O which might make convert_fields_file_list harder to test, and also would make it the function a bit inflexible (e.g. the same function couldn't be used with a different naming convention).

I tried moving the file naming out from the conversion function up to the higher level convert_esm1p5_output_dir function. It uses an updated get_nc_write_path() to create a list of (input_path, output_path) pairs, which gets fed into convert_fields_file_list

input_output_pairs = [
(ff_path, get_nc_write_path(ff_path, nc_write_dir))
for ff_path in atm_dir_fields_files
]
succeeded, failed = convert_fields_file_list(input_output_pairs)

get_nc_write_path then performs steps 1 and 2, and I tried to make it so that it could be a target for more "mid level" testing.

I went back and forth a bit on the best way to structure the changes and am still not completely sure, so any suggestions would be welcome!

@truth-quark
Copy link
Collaborator

There's a few things here:

  • The side by side diff looks more complicated as find_matching_fields_files() moved position. I reordered locally & the diff was easier to parse.
  • an existing function signature has changed
  • the tests need to mock get_ff_date(), which signifies an unwanted dependency on I/O or external thing

A few changes can be made to simplify the code & touch less of the existing interface. The mule/paths/dates I/O can be moved higher, so the paths can be sorted out at a high level, only passing data to lower level funcs. The fixes change this loop:

input_output_pairs = [
(ff_path, get_nc_write_path(ff_path, nc_write_dir))
for ff_path in atm_dir_fields_files
]

I'd recommend the following changes:

  • Loop over atm_dir_fields_files for paths for creating input_output_paths
  • Move the get_ff_date() call into the loop to get dates for each file
  • Tweak get_nc_write_path(fields_file_path, nc_write_dir) to get_nc_write_path(fields_file_path, nc_write_dir, date=None)
    • If the date input is None, default to the prior behaviour of adding .nc
    • Otherwise return the new date structured filename
  • Call get_nc_write_path() in this top loop to create input_output_paths
  • See if removing the test mocks works

Do you want to see if those changes help simplify the naming feature & tests?

@truth-quark
Copy link
Collaborator

Ah, there's a video about software architecture called "Boundaries", also known as "Functional Core, Imperative Shell" (FCIS). This is the sort of approach guiding um2nc dev. I think the latest talk is this one:
https://www.youtube.com/watch?v=yTkzNHF6rMs

@blimlim
Copy link
Collaborator Author

blimlim commented Sep 12, 2024

Sweet, I've had a go at putting in these changes.

The side by side diff looks more complicated as find_matching_fields_files() moved position. I reordered locally & the diff was easier to parse.

Oops sorry about that! This should be fixed in aff66bd and hopefully it's a bit easier to read the changes.

I've moved the get_ff_date function up into the loop, which then lets us remove that extra mock in the tests.

I have a couple of extra questions/comments, which I've added over the code.

Copy link
Collaborator

@truth-quark truth-quark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've listed a few first round fixes (somehow they got bundled into a review instead of individual comments).

The biggest current problem is the failing CI, once that passes we can go over other steps.

umpost/conversion_driver_esm1p5.py Outdated Show resolved Hide resolved
umpost/conversion_driver_esm1p5.py Show resolved Hide resolved
umpost/conversion_driver_esm1p5.py Outdated Show resolved Hide resolved
umpost/conversion_driver_esm1p5.py Outdated Show resolved Hide resolved
Copy link
Collaborator

@truth-quark truth-quark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a review in the event comments are blocked without one.

Copy link
Collaborator

@truth-quark truth-quark left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be good to go!

@blimlim
Copy link
Collaborator Author

blimlim commented Sep 19, 2024

@marc-white, I'm just pinging you for a second pair of eyes to look over the changes. Any suggestions or feedback you have are welcome!

Comment on lines +145 to +146
stem = fields_file_name[0:FF_UNIT_INDEX + 1]
unit = fields_file_name[FF_UNIT_INDEX]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This suffices for this use case, but you may want to consider adding a TODO to switch over to regular expressions. They're more flexible (especially if this conversion ever needs to be done for other models/filename patterns), and I think they're also a bit clearer as to what you're doing (i.e., it would be more explicit what you're pulling out of the filename pattern and from where).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also forgot to mention that REs would give immediate feedback on whether or not fields_file_name conforms to the expected pattern.

Copy link
Collaborator Author

@blimlim blimlim Sep 20, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good idea. I've added a # TODO comment in b014029 and will make a new issue for this.


assert nc_write_path == Path("/test/path/NetCDF/fields_123.file.nc")
assert nc_write_path == expected_nc_write_path

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a need to test against what happens if a totally invalid filename gets fired into get_nc_write_path? Is that behaviour even defined/consistent?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question. Currently the get_nc_write_path is only being provided file names that match some sort of expected format provided by the find_matching_fields_files function, however it probably makes sense to more explicitly deal with invalid file names being supplied. I think if we swap to using regex following your suggestion, this should be easier!

Copy link

@marc-white marc-white left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is ready to go as-is; comments added are things to consider/possible TODO items to add.

@blimlim blimlim merged commit 0ee303a into develop Sep 23, 2024
4 checks passed
@blimlim blimlim deleted the 91/update-esm-netcdf-names branch September 23, 2024 03:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants