Update ESM1.5 driver file naming. #93

blimlim · 2024-08-29T07:35:09Z

This pr attempts to close #91.

It updates the naming of the output netCDF files to include a YYYYMM date string, and a suffix stating the data frequency in a file. E.g. the fields file aiihca.paa1jan gets converted to aiihca.pa-010101_mon.nc following the convention in the p73 archive.

Getting this information involves two steps:

Reading the file date from the fields file header, via mule.FixedLengthHeader.from_file
Converting the unit key in the file name (the 9th character) to a frequency suffix, which should be one of mon,dai,6hr, or 3hr.

Originally setting the output filepath was handled by get_nc_write_path(), which was called from inside convert_fields_file_list() and just glued .nc onto the name:

um2nc-standalone/umpost/conversion_driver_esm1p5.py

Lines 141 to 143 in 70a890c

    
           for fields_file_path in fields_file_paths: 
        
               nc_write_path = get_nc_write_path(fields_file_path, nc_write_dir)

I thought adding changes 1 and 2 to the naming here would involve burying more I/O which might make convert_fields_file_list harder to test, and also would make it the function a bit inflexible (e.g. the same function couldn't be used with a different naming convention).

I tried moving the file naming out from the conversion function up to the higher level convert_esm1p5_output_dir function. It uses an updated get_nc_write_path() to create a list of (input_path, output_path) pairs, which gets fed into convert_fields_file_list

um2nc-standalone/umpost/conversion_driver_esm1p5.py

Lines 343 to 347 in 2c5fcea

    
           input_output_pairs = [ 
        
               (ff_path, get_nc_write_path(ff_path, nc_write_dir)) 
        
               for ff_path in atm_dir_fields_files 
        
           ] 
        
           succeeded, failed = convert_fields_file_list(input_output_pairs)

get_nc_write_path then performs steps 1 and 2, and I tried to make it so that it could be a target for more "mid level" testing.

I went back and forth a bit on the best way to structure the changes and am still not completely sure, so any suggestions would be welcome!

truth-quark · 2024-08-30T06:06:08Z

There's a few things here:

The side by side diff looks more complicated as find_matching_fields_files() moved position. I reordered locally & the diff was easier to parse.
an existing function signature has changed
the tests need to mock get_ff_date(), which signifies an unwanted dependency on I/O or external thing

A few changes can be made to simplify the code & touch less of the existing interface. The mule/paths/dates I/O can be moved higher, so the paths can be sorted out at a high level, only passing data to lower level funcs. The fixes change this loop:

um2nc-standalone/umpost/conversion_driver_esm1p5.py

Lines 343 to 346 in 2c5fcea

    
           input_output_pairs = [ 
        
               (ff_path, get_nc_write_path(ff_path, nc_write_dir)) 
        
               for ff_path in atm_dir_fields_files 
        
           ]

I'd recommend the following changes:

Loop over atm_dir_fields_files for paths for creating input_output_paths
Move the get_ff_date() call into the loop to get dates for each file
Tweak get_nc_write_path(fields_file_path, nc_write_dir) to get_nc_write_path(fields_file_path, nc_write_dir, date=None)
- If the date input is None, default to the prior behaviour of adding .nc
- Otherwise return the new date structured filename
Call get_nc_write_path() in this top loop to create input_output_paths
See if removing the test mocks works

Do you want to see if those changes help simplify the naming feature & tests?

truth-quark · 2024-08-30T06:48:11Z

Ah, there's a video about software architecture called "Boundaries", also known as "Functional Core, Imperative Shell" (FCIS). This is the sort of approach guiding um2nc dev. I think the latest talk is this one:
https://www.youtube.com/watch?v=yTkzNHF6rMs

blimlim · 2024-09-12T05:15:07Z

Sweet, I've had a go at putting in these changes.

The side by side diff looks more complicated as find_matching_fields_files() moved position. I reordered locally & the diff was easier to parse.

Oops sorry about that! This should be fixed in aff66bd and hopefully it's a bit easier to read the changes.

I've moved the get_ff_date function up into the loop, which then lets us remove that extra mock in the tests.

I have a couple of extra questions/comments, which I've added over the code.

umpost/conversion_driver_esm1p5.py

truth-quark

I've listed a few first round fixes (somehow they got bundled into a review instead of individual comments).

The biggest current problem is the failing CI, once that passes we can go over other steps.

umpost/conversion_driver_esm1p5.py

… 91/update-esm-netcdf-names

…df-names

umpost/conversion_driver_esm1p5.py

truth-quark

Adding a review in the event comments are blocked without one.

truth-quark

This should be good to go!

blimlim · 2024-09-19T23:34:15Z

@marc-white, I'm just pinging you for a second pair of eyes to look over the changes. Any suggestions or feedback you have are welcome!

marc-white · 2024-09-20T02:45:24Z

umpost/conversion_driver_esm1p5.py

+    stem = fields_file_name[0:FF_UNIT_INDEX + 1]
+    unit = fields_file_name[FF_UNIT_INDEX]


This suffices for this use case, but you may want to consider adding a TODO to switch over to regular expressions. They're more flexible (especially if this conversion ever needs to be done for other models/filename patterns), and I think they're also a bit clearer as to what you're doing (i.e., it would be more explicit what you're pulling out of the filename pattern and from where).

Also forgot to mention that REs would give immediate feedback on whether or not fields_file_name conforms to the expected pattern.

That's a good idea. I've added a # TODO comment in b014029 and will make a new issue for this.

marc-white · 2024-09-20T03:21:57Z

test/test_conversion_driver_esm1p5.py


-    assert nc_write_path == Path("/test/path/NetCDF/fields_123.file.nc")
+    assert nc_write_path == expected_nc_write_path



Is there a need to test against what happens if a totally invalid filename gets fired into get_nc_write_path? Is that behaviour even defined/consistent?

Good question. Currently the get_nc_write_path is only being provided file names that match some sort of expected format provided by the find_matching_fields_files function, however it probably makes sense to more explicitly deal with invalid file names being supplied. I think if we swap to using regex following your suggestion, this should be easier!

marc-white

I think this is ready to go as-is; comments added are things to consider/possible TODO items to add.

blimlim added 5 commits August 28, 2024 15:30

fix netCDF capitalisation

27c7111

Add new filename functions and pull netCDF naming to higher level

4b9c9e6

Add tests of netcdf renaming

4fddab4

Cleaning comments and names

37a7a36

Add missing nc filepath, update naming for unknown keys

2c5fcea

blimlim requested a review from truth-quark August 29, 2024 07:35

blimlim added 3 commits September 11, 2024 14:49

Return find_matching_fields_files to original position

eb4df67

Suggestions from review

d3369b1

Suggestions from review, move get_ff_date to higher level

aff66bd

blimlim commented Sep 12, 2024

View reviewed changes

umpost/conversion_driver_esm1p5.py Show resolved Hide resolved

blimlim commented Sep 12, 2024

View reviewed changes

umpost/conversion_driver_esm1p5.py Outdated Show resolved Hide resolved

truth-quark reviewed Sep 16, 2024

View reviewed changes

umpost/conversion_driver_esm1p5.py Outdated Show resolved Hide resolved

truth-quark reviewed Sep 16, 2024

View reviewed changes

umpost/conversion_driver_esm1p5.py Outdated Show resolved Hide resolved

umpost/conversion_driver_esm1p5.py Show resolved Hide resolved

umpost/conversion_driver_esm1p5.py Outdated Show resolved Hide resolved

umpost/conversion_driver_esm1p5.py Outdated Show resolved Hide resolved

blimlim added 4 commits September 17, 2024 16:10

Merge remote-tracking branch 'origin/91/update-esm-netcdf-names' into…

fb8445d

… 91/update-esm-netcdf-names

Review style changes, pylint fix

c040780

Merge remote-tracking branch 'origin/develop' into 91/update-esm-netc…

e039ff7

…df-names

Docstring clarity, return None date test

e9d0d79

truth-quark reviewed Sep 18, 2024

View reviewed changes

umpost/conversion_driver_esm1p5.py Outdated Show resolved Hide resolved

truth-quark reviewed Sep 18, 2024

View reviewed changes

umpost/conversion_driver_esm1p5.py Outdated Show resolved Hide resolved

truth-quark reviewed Sep 19, 2024

View reviewed changes

umpost/conversion_driver_esm1p5.py Outdated Show resolved Hide resolved

truth-quark reviewed Sep 19, 2024

View reviewed changes

umpost/conversion_driver_esm1p5.py Outdated Show resolved Hide resolved

truth-quark reviewed Sep 19, 2024

View reviewed changes

Review suggestions

797a824

truth-quark approved these changes Sep 19, 2024

View reviewed changes

blimlim requested a review from marc-white September 19, 2024 23:33

blimlim mentioned this pull request Sep 19, 2024

91/update esm netcdf names noqa test #103

Closed

marc-white reviewed Sep 20, 2024

View reviewed changes

marc-white approved these changes Sep 20, 2024

View reviewed changes

Add TODO for filename regex

b014029

blimlim merged commit 0ee303a into develop Sep 23, 2024
4 checks passed

blimlim deleted the 91/update-esm-netcdf-names branch September 23, 2024 03:56

This was referenced Sep 23, 2024

Merge updates from develop into main #105

Merged

ESM1.5 driver output file names #91

Closed

blimlim mentioned this pull request Oct 3, 2024

ESM1.5 driver: guard against file name collisions #124

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update ESM1.5 driver file naming. #93

Update ESM1.5 driver file naming. #93

blimlim commented Aug 29, 2024

truth-quark commented Aug 30, 2024

truth-quark commented Aug 30, 2024

blimlim commented Sep 12, 2024

truth-quark left a comment

truth-quark left a comment

truth-quark left a comment

blimlim commented Sep 19, 2024

marc-white Sep 20, 2024

marc-white Sep 20, 2024

blimlim Sep 20, 2024 •

edited

Loading

marc-white Sep 20, 2024

blimlim Sep 20, 2024

marc-white left a comment

	for fields_file_path in fields_file_paths:

	nc_write_path = get_nc_write_path(fields_file_path, nc_write_dir)

	input_output_pairs = [
	(ff_path, get_nc_write_path(ff_path, nc_write_dir))
	for ff_path in atm_dir_fields_files
	]
	succeeded, failed = convert_fields_file_list(input_output_pairs)

		stem = fields_file_name[0:FF_UNIT_INDEX + 1]
		unit = fields_file_name[FF_UNIT_INDEX]


		assert nc_write_path == Path("/test/path/NetCDF/fields_123.file.nc")
		assert nc_write_path == expected_nc_write_path

Update ESM1.5 driver file naming. #93

Update ESM1.5 driver file naming. #93

Conversation

blimlim commented Aug 29, 2024

truth-quark commented Aug 30, 2024

truth-quark commented Aug 30, 2024

blimlim commented Sep 12, 2024

truth-quark left a comment

Choose a reason for hiding this comment

truth-quark left a comment

Choose a reason for hiding this comment

truth-quark left a comment

Choose a reason for hiding this comment

blimlim commented Sep 19, 2024

marc-white Sep 20, 2024

Choose a reason for hiding this comment

marc-white Sep 20, 2024

Choose a reason for hiding this comment

blimlim Sep 20, 2024 • edited Loading

Choose a reason for hiding this comment

marc-white Sep 20, 2024

Choose a reason for hiding this comment

blimlim Sep 20, 2024

Choose a reason for hiding this comment

marc-white left a comment

Choose a reason for hiding this comment

blimlim Sep 20, 2024 •

edited

Loading