Get GLM output into netCDF DSG format #31

Open
hcorson-dosch-usgs opened this issue Feb 28, 2022 · 10 comments
Labels
enhancement New feature or request

Comments

@hcorson-dosch-usgs
Contributor

Currently the GLM output is being stored in feather files, with one feather file per lake-gcm combo (6 files per lake). For sharing on sciencebase, we are currently (per #20) zipping these feather files together by tile number (4 zip files in total).

Per Jordan's comments,

I'd like to propose a new data release format that uses netcdf discrete sampling geom to put all of the lakes in a single file, like we did with Jared's EA-LSTM data release. To do so, we'd need to address one challenge with depth but there are options for that.

we'd like to move to storing the output in netCDF DSG format. As GLM generates temperature profiles, this would mean adding another dimension for depth. That is not currently supported by the write_timeseries_dsg() function of ncdfgeom, but I will submit an issue there to see if it would be within scope of that function to add that functionality.

@hcorson-dosch-usgs
Contributor Author

Issue submitted here asking if it would be within scope of write_timeseries_dsg() to add an option for a 3rd netCDF dimension (in our case, depth beneath the lake surface)

@hcorson-dosch-usgs
Contributor Author

hcorson-dosch-usgs commented Apr 5, 2022

Starting to think about this again.

Variables:

  • Water temperature (degC)
  • ice (would not have same dims....)

Coordinates:

  • latitude of lake centroid
  • longitude of lake centroid

Dims:

  • Site id (per approach used in lake surface temperature netCDFs)
  • time
  • depth
  • GCM? @jread-usgs were you thinking we'd again have a single netCDF for each GCM, or would it be preferable to have all of the predictions, across all GCMs, in a single netCDF?

@lindsayplatt added the enhancement label Apr 7, 2022

@hcorson-dosch-usgs
Contributor Author

hcorson-dosch-usgs commented May 9, 2022

Okay I regrouped w/ @jread-usgs on this, as it is again a priority. Here are some notes:

Re: how to group the output into netCDFs

  • netCDF advantage = smaller file and easier to wield than zips (assuming coding competence)
  • Have results for all GCMs in a single netCDF
  • The spatial tiles were useful for scaling jobs, and packing up driver data, but aren't meaningful for packaging up results
    • If we bin results spatially, would want to move to large regional groupings for data delivery = different data aggregation than used for driver data
      • Use approach similar to Jared's work, with regional partitions (Jared's files were around 1 GB apiece; for this, 2-3 GB netCDF files would be good).
        • See lake-surf-data-release repo for how split was made - predict_id_groups (generate group rect)
    • For now, put all results (for MN) into single netCDF
  • If we run into issues w/ file size, we can revisit this, and raise w/ the temperature collaborators
    • Easiest way to split, if we need to, would be to split by GCM

Re: resolution of data storage

  • Plan to compress netCDF files
    • Known issue: compression not working on windows (see issue)
      • Leave for now -- can test by sending Jordan file to quick compress
  • Have separate dimension for depth
    • use same maximum depth for all lakes, and fill with NAs beyond max depth of lake
  • Eventually, don't store predictions for every unique depth
    • resample hypolimnion, since there is less information the closer you get to the max depth (e.g., c(0, 0.5, 1, 2, 5, 10, 20 m), but this isn't resolved enough, just an example).
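
The "same maximum depth for all lakes, fill with NAs" idea can be sketched in a few lines. This is only an illustration in plain Python, not the pipeline code: the lake names, temperatures, and the `pad_profile` helper are hypothetical, and NaN stands in for NA.

```python
import math

def pad_profile(temps, n_depths):
    """Pad one lake's temperature profile with NaN up to a shared number of depth levels."""
    if len(temps) > n_depths:
        raise ValueError("profile is deeper than the shared depth dimension")
    return list(temps) + [math.nan] * (n_depths - len(temps))

# Hypothetical profiles at 0.5 m spacing (values are made up for illustration)
profiles = {
    "lake_a": [22.1, 21.8, 20.5],              # shallow lake: 3 depth levels
    "lake_b": [23.0, 22.7, 21.9, 18.4, 12.2],  # deeper lake: 5 depth levels
}

# The shared depth dimension is set by the deepest lake
n_depths = max(len(p) for p in profiles.values())
padded = {site: pad_profile(p, n_depths) for site, p in profiles.items()}

# Fraction of stored cells that are fill values rather than predictions
n_cells = len(padded) * n_depths
na_fraction = sum(math.isnan(v) for p in padded.values() for v in p) / n_cells
print(f"{na_fraction:.0%} of cells are NA")
```

With a wide spread of lake depths, that NA fraction grows quickly, which is the motivation for thinning the depth dimension.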

Re: inclusion of ice flags alongside temperature predictions

  • With all GCMs in single netCDF, probably best to include ice flags (0 or 1).
    • Raise question of ice flags vs. ice thickness/height w/ collaborators later on (could be integer in cm [coarse estimate] to reduce size of data)

Plan for development

Working locally w/ subset of 1-5 lakes...

  1. Write ice flags for a single GCM to netCDF using write_timeseries_dsg() (2D)
  2. Modify created netCDF to add a new variable, temperature, with one additional dimension: depth (based on max depth of all lakes)
  3. Write remaining temperature predictions (for all depths and other GCMs) to netCDF
  4. Assess likely size if scaled
  5. Send draft netCDF to Jordan for compression
  6. Based on compressed netCDF size, re-assess approach. If it seems too big:
  7. Explore reducing resolution of predictions at depth
  8. Send draft netCDF to Jordan for compression
  9. Based on size, determine if worth pursuing adding GCM dimension.....
  10. Explore adding GCM dimension
  11. Evaluate approach based on compressed netCDF size

Then if all is working, test w/ all MN predictions
* Based on size, evaluate options for scaling up to full project footprint

@hcorson-dosch-usgs
Contributor Author

hcorson-dosch-usgs commented Jul 15, 2022

Ok - @lindsayplatt, @jread-usgs. I wanted to provide an update here on my progress before I'm out for two weeks. This has been a back-burner item for some months now, but I did make some significant progress when other sprint tasks were completed or blocked.

For both the GCM and NLDAS predictions, I've completed steps 1 - 4, 6, and 7 (see previous comment). I skipped step 5 b/c it was immediately apparent that the file size was going to be too large without some reduction in the resolution of predictions at depth.

All of my code is in this branch on my fork.

Here's a summary:

NetCDF dimensions

The code generates 3D netCDF files. The ice flags are stored as a 2D TimeSeries variable (with dimensions of site_id and time), while the temperature predictions are stored as a 3D TimeSeriesProfile variable (with dimensions of site_id, time and depth):
image

Reduction of resolution of predictions at depth.

The GLM output predictions are at 0.5m intervals. If we store all predictions at all depths, the netCDF depth dimension becomes very long, and we store many many NA values for shallow lakes. Currently I am reducing the resolution of predictions at depth prior to packaging the predictions in the netCDF. For example, for the NLDAS netCDF, the depths are defined (in a somewhat hacky way for now) here, based on Andy's depths. The predictions are then subset to those depths here.
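
As a rough illustration of the subsetting step (not the code on the branch), the sketch below thins a 0.5 m profile down to a short list of retained depths. The retained depths are the example set floated earlier in this thread, not Andy's actual depths, and the temperatures and the `subset_to_depths` helper are hypothetical.

```python
def subset_to_depths(profile, keep_depths, tol=1e-6):
    """Keep only the (depth, temperature) pairs whose depth matches a retained depth."""
    return [(d, t) for d, t in profile
            if any(abs(d - k) <= tol for k in keep_depths)]

# Hypothetical GLM-style profile: one prediction every 0.5 m from 0 to 20 m
full_depths = [i * 0.5 for i in range(41)]
profile = [(d, round(25.0 - 0.6 * d, 2)) for d in full_depths]

# Illustrative retained depths (assumption, for demonstration only)
keep = [0, 0.5, 1, 2, 5, 10, 20]

reduced = subset_to_depths(profile, keep)
print(len(profile), "->", len(reduced), "depth levels")
```

Matching with a small tolerance rather than exact equality is a defensive choice in case depths arrive with floating-point noise.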

Testing netCDF build on Tallgrass

I went so far as to test the generation of the GCM netCDFs and NLDAS netCDF on Tallgrass with a subset of 1000 sites e.g., for NLDAS. Uncompressed, the NLDAS netCDF (with ice flag and temperature predictions for 1000 sites, at a restricted set of depths) is 3.8gb. The GCM netCDFs are each 5.4gb.

Testing netCDF compression on Tallgrass

Jordan noted that the nco and netcdf packages might be available as modules on Tallgrass, and they are. I loaded those modules alongside singularity and slurm modules, and then tried building the NLDAS netcdf with compression (switched this arg to TRUE, commented out these lines), but I got this error, which suggests that the system commands weren't able to be called by R within the container. I then tried running the system commands (ncks -h --fl_fmt=netcdf4 --cnk_plc=g3d --cnk_dmn time,10 --ppc temp=.2#ice=1 GLM_NLDAS_uncompressed.nc GLM_NLDAS.nc) directly on an allocated node (but NOT in the singularity container) and was able to compress the netCDF. So it seems like it should be possible, just may take some troubleshooting to ensure that nco and netCDF are accessible by R in the container. The compressed NLDAS netCDF file (w/ ice flags and preds for 1000 sites) is 297mb. One of the GCM netCDFs (w/ ice flags and preds for 1000 sites) is 422mb when compressed.
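
One way to narrow down the "system commands can't be called from inside the container" problem is to check whether `ncks` is on the PATH before trying to run it. The sketch below does that in Python purely as an illustration (the pipeline itself is in R); the command arguments are copied from the call above, while the `build_ncks_command` and `compress` helpers are hypothetical names.

```python
import shutil
import subprocess

def build_ncks_command(src, dst, time_chunk=10, ppc="temp=.2#ice=1"):
    """Assemble the ncks call quoted in this comment as an argument list."""
    return ["ncks", "-h", "--fl_fmt=netcdf4", "--cnk_plc=g3d",
            "--cnk_dmn", f"time,{time_chunk}", "--ppc", ppc, src, dst]

def compress(src, dst):
    """Run the compression, failing early with a clear message if ncks is not visible."""
    if shutil.which("ncks") is None:
        raise RuntimeError("ncks not found on PATH; is the nco module loaded in this environment?")
    subprocess.run(build_ncks_command(src, dst), check=True)

cmd = build_ncks_command("GLM_NLDAS_uncompressed.nc", "GLM_NLDAS.nc")
print(" ".join(cmd))
```

Running the same PATH check inside and outside the container should show whether the modules loaded on the node are actually visible to processes started in the container.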

Testing extracting the predictions from the netCDF file

I did modify Dave's read_timeseries_dsg() code so that I could extract results from the 3D netCDF files. That code is in a script here - detached from the pipeline for now. The code runs for the netCDF files I generated locally w/ a small # of sites, but I just tested it for the NLDAS netCDF I generated for 1000 sites on Tallgrass and the ncmeta::nc_meta() function returned an error 😕, so I'll have to troubleshoot that when I return.

When I'm back I'd be happy to test building and compressing a full NLDAS netCDF with predictions for all of the sites (at restricted depths).

@lindsayplatt
Contributor

Great summary for capturing the current state of this work. Looking forward to chatting when you get back 🌴

@hcorson-dosch-usgs
Contributor Author

Quick update - Anthony was briefly interested in this netCDF code a couple of months ago, so I re-ran my test scripts locally to refresh my memory. It turned out that the nc_meta error I was getting when testing my extraction code was just an issue with the package, and it is fixed if you install the development version with devtools.

@lindsayplatt
Contributor

lindsayplatt commented Jan 6, 2023

I am testing the scaling of this code and approach beyond 1000 lakes on Tallgrass:

tar_make(p3_nldas_glm_uncalibrated_nc) took 8.8 min to build 1000 lakes and the nc file was 3.56 GB (pre-compression, which is a manual step). When I scaled up to all 12,688 lakes, it took XX min to build and the resulting nc file was XX GB.

The job failed after 12.6 hrs with these messages BUT when I try to see the ones that failed, I get nothing

Error:
! problem with the pipeline.
Execution halted
srun: error: dl-0001: task 2: Exited with exit code 1
• built target p3_nldas_glm_uncalibrated_nc
• end pipeline: 12.6 hours
Warning message:
In data.table::fread(file = database$path, sep = database_sep_outer,  :
  Stopped early on line 54797. Expected 18 fields but found 35. Consider fill=TRUE and comment.char=. First discarded non-empty line: <<p2_nldas_glm_uncalibrated_runs_5ec408cf|branch|d19943e89d4ddbbf|9eb9eb029eb5dc2d|abe35dc056b2b213|347759575||t19230.0415046296s|710c613aa5de3636|608|rds|local|vector|p2_nldas_glm_uncalibrated_runs||7.95|Custom path to GLM executable set via GLM_PATH environment variable as usrlocalbinGLMglm. Custom path to GLM executable set via GLM_PATH environment variable as usrlocalbinGLMglm|p2_nldas_glm_uncalibrated_runs_5ec408cf|branch|d19943e89d4ddbbf|9eb9eb029eb5dc2d|abe35dc056b2b213|347759575||t19230.04>>

Check the ones that errored:

> tar_meta(fields = error, complete_only = TRUE)
# A tibble: 0 × 2
# … with 2 variables: name <chr>, error <lgl>
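
The fread warning above ("Expected 18 fields but found 35") suggests that two records were fused onto one line of the metadata file, which might also explain why the error query comes back empty. A minimal sketch for locating such lines follows, assuming the metadata is a plain-text file with one pipe-delimited record per line. It is written in Python for illustration, the `find_malformed_rows` helper is hypothetical, and it does not handle quoted separators.

```python
import os
import tempfile

def find_malformed_rows(path, sep="|"):
    """Return the header's field count and (line number, field count) for rows that differ."""
    bad = []
    with open(path, encoding="utf-8") as fh:
        expected = fh.readline().rstrip("\n").count(sep) + 1
        for lineno, line in enumerate(fh, start=2):
            line = line.rstrip("\n")
            if not line:
                continue  # skip blank lines
            n_fields = line.count(sep) + 1
            if n_fields != expected:
                bad.append((lineno, n_fields))
    return expected, bad

# Tiny made-up file: line 3 has two records fused together
demo = "name|type|data\na|branch|1\nb|branch|2b|branch|3\n"
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as tmp:
    tmp.write(demo)
    demo_path = tmp.name

expected, bad = find_malformed_rows(demo_path)
os.remove(demo_path)
print(expected, bad)
```

Pointing the function at a copy of the real metadata file would list the line numbers worth inspecting by hand, starting with the line named in the warning.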

@hcorson-dosch-usgs
Contributor Author

Ok the NLDAS netcdf for 5k lakes built in 1.5 hours and is 17.4 gb uncompressed. After compression it is 1.7gb.

@hcorson-dosch-usgs
Contributor Author

Ok the NLDAS netCDF for 10k lakes built in 5.3 hours and is 37.8 gb uncompressed. After compression it is 3.4gb.

@lindsayplatt
Contributor

That's a lot of hours but 3.4 gb is great! Will likely need to talk with Andy about splitting his 63k up, though.
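
For a back-of-the-envelope sense of what 63k would mean, the two NLDAS runs reported above can be extrapolated. This sketch assumes that "63k" is a lake count, that compressed size scales linearly with the number of lakes, and that files of about 3 GB (the upper end of the 2-3 GB range mentioned earlier in this thread) are the target. None of that has been tested at that scale.

```python
import math

# (lakes, uncompressed GB, compressed GB) as reported in the comments above
runs = [(5_000, 17.4, 1.7), (10_000, 37.8, 3.4)]

# Compressed size per lake and compression ratio for each run
per_lake_gb = [compressed / n for n, _, compressed in runs]
ratios = [uncompressed / compressed for _, uncompressed, compressed in runs]

# Linear extrapolation (assumption) to 63k lakes, using the larger per-lake figure
n_target = 63_000
est_compressed_gb = max(per_lake_gb) * n_target

# Number of files needed at a hypothetical ~3 GB per file
target_file_gb = 3.0
n_files = math.ceil(est_compressed_gb / target_file_gb)

print(f"compression ratios: {[round(r, 1) for r in ratios]}")
print(f"estimated compressed size: {est_compressed_gb:.1f} GB in about {n_files} files")
```

Build time is a separate question: going from 5k to 10k lakes took the build from 1.5 to 5.3 hours, so it does not look linear and is not extrapolated here.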


2 participants