
NetCDF4: Size difference between single write and iterative writes #2862

Open

@abhibaruah opened this issue Feb 8, 2024 · 5 comments

NetCDF version : 4.9.2
HDF5 version : 1.10.11
OS: Linux

I create two netCDF4 files (file1.nc and file2.nc) with the same properties:
Dimensions: 1000000 x 30
Datatype: NC_FLOAT
Compression: Deflate, Level 9
Chunking: Default

The data I write is also the same (all values = 1).

However, I create them in two different ways (you can find my repro code below):

  1. file1.nc :
    Written using nc_put_var_float (single call) to write the whole dataset (1000000 x 30) at once.
    Size of file = 123 KB.

  2. file2.nc :
    Written by calling nc_put_vara_float in a for-loop.
    10000 x 30 elements are written in each iteration (the file is opened and closed on every iteration).
    Size of the file = 935 KB

** If chunksize is {10000, 30}, file1.nc is 195 KB whereas file2.nc is 201 KB.
** If we open and close file2.nc only once (before and after the for loop), file2.nc size is 123 KB (same as file1.nc).

I compared that the contents and the dimensions of the variables are the same.

file1.nc (default chunking) -> 123 KB
file1.nc ({10000, 30} chunking) -> 195 KB

file2.nc (default chunking -> 935 KB)
file2.nc ({10000, 30} chunking) -> 201 KB
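
(For reference, the chunk sizes the library actually chose can be queried with nc_inq_var_chunking; a minimal standalone sketch, assuming the file and variable names from the repro code below:)

#include <stdio.h>
#include <netcdf.h>

int main() {
    int ncid, varid, storage;
    size_t chunks[2];
    nc_open("file1.nc", NC_NOWRITE, &ncid);
    nc_inq_varid(ncid, "data", &varid);
    nc_inq_var_chunking(ncid, varid, &storage, chunks);
    if (storage == NC_CHUNKED)
        printf("chunks: %zu x %zu\n", chunks[0], chunks[1]);
    else
        printf("contiguous storage\n");
    nc_close(ncid);
    return 0;
}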

  1. I know that (2) is not the ideal way to write to the variable, but can someone help me understand why there is such a big difference in file sizes, even though the content is the same? If I change the data to random values instead of 1s, file1.nc is around 100 MB whereas file2.nc bloats to almost 1.5 GB. And why is the file size the same when, for case (2), the file is opened and closed only once?

  2. Why does the size of file2.nc drop from 935 KB to 201 KB just by changing the default chunking parameters to {10000, 30}?

I am attaching my code below. Let me know if you have any questions.
Thanks.

/*
file1.nc -> 
1000000 x 30 NC_FLOAT dataset with all values as 1
Deflate Level 9
Default chunking
Written using nc_put_var_float (single call)
Size = 123 KB

file2.nc ->
1000000 x 30 NC_FLOAT dataset with all values as 1
Deflate Level 9
Default chunking
Written by calling nc_put_vara_float in a for-loop.
10000 x 30 elements written in each for-loop iteration (file is opened and closed during every for loop iteration)
SIZE = 935 KB

** If chunksize is {10000, 30}, file1.nc is 195 KB whereas file2.nc is 201 KB.
** If we open and close file2.nc only once (before and after the for loop), file2.nc size is 123 KB (same as file1.nc).
*/
#include <stdio.h>
#include <stdlib.h>    /* for exit() in the ERR macro */
#include <netcdf.h>
#include <iostream>

/* NetCDF file names */
#define FILE_NAME1 "file1.nc"
#define FILE_NAME2 "file2.nc"


/* Test with 2D data */
#define NDIMS 2
#define NX 1000000
#define NY 30

/* Chunk sizes for the (commented-out) explicit-chunking calls below,
   matching the {10000, 30} experiment described above */
#define CHUNKX 10000
#define CHUNKY 30

#define ERRCODE 2
#define ERR(e) {printf("Error: %s\n", nc_strerror(e)); exit(ERRCODE);}


int main() {

    int retval;
    int status;
    int ncid;
    int varid;
    int dimids[NDIMS];
    int x_dimid, y_dimid;
    const size_t chunksize[NDIMS] = { CHUNKX, CHUNKY };  /* used only by the commented-out chunking calls */

    /////////////// FILE 1 ////////////////

    // Dynamically allocate the NX x NY array and fill it with 1s
    float* arr = new float[NX * NY];
    for (int i = 0; i < NX * NY; i++)
        arr[i] = 1.0f;


    /* Create a netCDF-4/HDF5 file. */
    if ((retval = nc_create(FILE_NAME1, NC_NETCDF4, &ncid)))
        ERR(retval);

    /* Define dimensions */
    if ((retval = nc_def_dim(ncid, "y", NY, &y_dimid)))
        ERR(retval);
    if ((retval = nc_def_dim(ncid, "x", NX, &x_dimid)))
        ERR(retval);

    dimids[0] = x_dimid;
    dimids[1] = y_dimid;

    /* Define the variable */
    if ((retval = nc_def_var(ncid, "data", NC_FLOAT, NDIMS,
        dimids, &varid)))
        ERR(retval);

    /* CHUNKING: uncomment to use explicit {CHUNKX, CHUNKY} chunks
       instead of the library defaults */
    //if ((retval = nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunksize)))
    //    ERR(retval);

    /* shuffle off, deflate on, deflate level 9 */
    if ((retval = nc_def_var_deflate(ncid, varid, 0, 1, 9)))
        ERR(retval);

    /* Write the whole dataset in a single call. */
    if ((retval = nc_put_var_float(ncid, varid, arr)))
        ERR(retval);

    if ((retval = nc_close(ncid)))
        ERR(retval);
    ///////////////////////////////////////////



    /////////////// FILE 2 ////////////////

    int nx = 10000;                // number of rows to be written in each iteration of the for loop
    int ncid2, varid2;
    int size_idx = NX/nx;          // Total number of iterations of the for loop
    size_t start[2];
    size_t count[2];
    
    float* arr2 = new float[nx * NY];  // one nx x NY slab, filled with 1s
    for (int i = 0; i < nx * NY; i++)
        arr2[i] = 1.0f;

    if ((retval = nc_create(FILE_NAME2, NC_NETCDF4, &ncid)))
        ERR(retval);

    /* Define dimensions */
    if ((retval = nc_def_dim(ncid, "y", NY, &y_dimid)))
        ERR(retval);
    if ((retval = nc_def_dim(ncid, "x", NX, &x_dimid)))
        ERR(retval);
    
    /* Define the variable */
    if ((retval = nc_def_var(ncid, "data", NC_FLOAT, NDIMS,
        dimids, &varid)))
        ERR(retval);

    /* CHUNKING: uncomment to use explicit {CHUNKX, CHUNKY} chunks
       instead of the library defaults */
    //if ((retval = nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunksize)))
    //    ERR(retval);

    /* shuffle off, deflate on, deflate level 9 */
    if ((retval = nc_def_var_deflate(ncid, varid, 0, 1, 9)))
        ERR(retval);

    nc_close(ncid);

    int printidx = 0;        // Used only for printing

    for (int k = 0; k < size_idx; k++) {
        /* Open and close the file around every slab write. */
        status = nc_open(FILE_NAME2, NC_WRITE, &ncid2);
        status = nc_inq_varid(ncid2, "data", &varid2);
        start[0] = k * nx;
        start[1] = 0;
        count[0] = nx;
        count[1] = NY;

        status = nc_put_vara_float(ncid2, varid2, start, count, arr2);
        nc_close(ncid2);

        std::cout << printidx++ << std::endl;
    }

    ///////////////////////////////////////////
    
    delete[] arr;
    delete[] arr2;
    
    return 0;
    
}
@abhibaruah (Author)

Hello all,
I was wondering if anyone got a chance to look at this issue.
Let me know if you need any more information regarding the issue.

@edwardhartnett (Contributor)

You are using HDF5 1.10, which is old. Can you try with the most recent release?

Also, can you turn compression off and see what happens? You are using a deflate level of 9, which is never a good idea; use 1 instead. Also, turning on the shuffle filter will make compression work better.
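
Applied to the repro's nc_def_var_deflate call, that suggestion would look roughly like this:

    /* shuffle on (1), deflate on (1), level 1 instead of 9 */
    if ((retval = nc_def_var_deflate(ncid, varid, 1, 1, 1)))
        ERR(retval);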

If I read correctly, you find that writing a 1000000 x 30 field produces different file sizes for different write patterns, but that if you adjust the chunk sizes it behaves better. Is that correct? That makes sense.

Consider that this is a very long, thin array, and the defaults are set up to handle more square-shaped arrays, like lat/lon arrays, where the dimensions are of similar magnitude. Whenever you have long, thin arrays, I would not be surprised to see the default chunk sizes behave badly.
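
With the repro's commented-out chunking call enabled and the chunks shaped like the writes, that would look roughly like this:

    /* one chunk per 10000 x 30 slab, matching the write pattern */
    const size_t chunksize[2] = { 10000, 30 };
    if ((retval = nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunksize)))
        ERR(retval);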

@abhibaruah (Author) commented Feb 20, 2024

Hi @edwardhartnett,

Unfortunately, it is not straightforward for us to update HDF5 to a more recent version at this time, because of some internal bugs. If you have a setup ready, can you try executing the code above to see if you can reproduce the issue?

Without compression, both files are the same size: 117,194 KB (which is essentially the raw data: 1,000,000 × 30 floats × 4 bytes ≈ 117,188 KB, plus a few KB of metadata).

I changed the compression parameters to turn shuffle on and use a deflate level of 1:
file1.nc -> 523 KB
file2.nc -> 5,525 KB
The difference in file sizes actually increased.

I also modified the code so that, for the second case, the nc_open, nc_inq_varid, and nc_close calls are outside the for-loop. In that case, both files are the same size.
It appears that the size difference occurs only when the file is opened and closed during every for-loop iteration.
Does this help narrow down the cause?

    status = nc_open(FILE_NAME2, NC_WRITE, &ncid2);
    status = nc_inq_varid(ncid2, "data", &varid2);
    for (int k = 0; k < size_idx; k++) {
        start[0] = k * nx;
        start[1] = 0;
        count[0] = nx;
        count[1] = NY;

        status = nc_put_vara_float(ncid2, varid2, start, count, arr2);
        std::cout << printidx++ << std::endl;
    }
    nc_close(ncid2);

@abhibaruah (Author)

Hi @edwardhartnett,
I was wondering if you got a chance to look at my responses to your questions earlier.
Let me know if you need any other information from my side.

Thanks,
Abhi

@edwardhartnett (Contributor)

What I'm seeing is that you get different file sizes with different write patterns. This is expected with HDF5.

If you think this is a bug in HDF5, then you should construct an HDF5 test case and submit it to the HDF5 team.

I don't think this is a netCDF bug.
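
For anyone taking that route, a pure-HDF5 test case mirroring the repro might look roughly like this (the file and dataset names are placeholders, and the chunk sizes are a stand-in rather than netCDF's actual defaults):

#include <hdf5.h>
#include <stdlib.h>

#define NX 1000000
#define NY 30
#define SLAB 10000

int main() {
    hsize_t dims[2] = { NX, NY };
    hsize_t chunk[2] = { 1000, NY };   /* stand-in chunk sizes */
    float* buf = (float*)malloc(sizeof(float) * SLAB * NY);
    for (int i = 0; i < SLAB * NY; i++) buf[i] = 1.0f;

    /* Create a chunked, deflate-9 dataset, then close the file. */
    hid_t file = H5Fcreate("repro.h5", H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
    hid_t space = H5Screate_simple(2, dims, NULL);
    hid_t dcpl = H5Pcreate(H5P_DATASET_CREATE);
    H5Pset_chunk(dcpl, 2, chunk);
    H5Pset_deflate(dcpl, 9);
    hid_t dset = H5Dcreate2(file, "data", H5T_NATIVE_FLOAT, space,
                            H5P_DEFAULT, dcpl, H5P_DEFAULT);
    H5Dclose(dset); H5Pclose(dcpl); H5Sclose(space); H5Fclose(file);

    /* Write one SLAB x NY hyperslab per open/close cycle, as in file2.nc. */
    for (hsize_t k = 0; k < NX / SLAB; k++) {
        hsize_t start[2] = { k * SLAB, 0 };
        hsize_t count[2] = { SLAB, NY };
        file = H5Fopen("repro.h5", H5F_ACC_RDWR, H5P_DEFAULT);
        dset = H5Dopen2(file, "data", H5P_DEFAULT);
        hid_t fspace = H5Dget_space(dset);
        H5Sselect_hyperslab(fspace, H5S_SELECT_SET, start, NULL, count, NULL);
        hid_t mspace = H5Screate_simple(2, count, NULL);
        H5Dwrite(dset, H5T_NATIVE_FLOAT, mspace, fspace, H5P_DEFAULT, buf);
        H5Sclose(mspace); H5Sclose(fspace); H5Dclose(dset); H5Fclose(file);
    }
    free(buf);
    return 0;
}

If this shows the same growth, comparing the output of h5dump -p -H (which prints per-dataset storage information) between the single-open and per-iteration variants should localize where the extra space goes.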
