Are there non-obvious best practices for saving memory and operations #1482
-
I've been running some GPU simulations that are very limited by the GPU's memory. Although it seems like we're getting close to being able to run multi-GPU simulations (thanks, @ali-ramadhan!), I think it's good practice to try to save memory in any case, and also to save on operations (i.e., to run as few operations as possible). I know there are obvious ways to do that (like running smaller simulations, limiting the number of tracers, etc.), but what are the non-obvious ways? For example, I noticed that in computed fields you can specify the … Thanks!
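For context on how tight the budget gets, here is a rough back-of-envelope estimate in plain Julia (the grid size and field count are illustrative assumptions, not Oceananigans internals) of how much memory a simulation's 3D fields occupy, and how the element type changes it:

```julia
# Back-of-envelope GPU memory estimate: Nx*Ny*Nz values per field
# (halos ignored), times an illustrative count of ~15 3D fields
# (velocities, tracers, tendencies, pressures).
Nx = Ny = Nz = 256
nfields = 15

field_gib(T) = Nx * Ny * Nz * nfields * sizeof(T) / 2^30

println("Float64: ", round(field_gib(Float64), digits=2), " GiB")
println("Float32: ", round(field_gib(Float32), digits=2), " GiB")
```

On a 256³ grid this already approaches 2 GiB at `Float64`, so halving the element size or dropping even one field is noticeable.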
-
I think `IncompressibleModel{MultiGPU}` might still be a little far off since we need distributed FFT support for GPUs from PencilFFTs.jl. Could happen soon but no ETA. `ShallowWaterModel{MultiGPU}` does work thanks to @francispoulin but might need some profiling to find bottlenecks.

I think you touched on the obvious ways! I guess another way is to find a bigger GPU. Some of the higher-end NVIDIA GPUs have 32 GB of memory, but I'm not sure there are any common ones with more.

A riskier way is to use `Float32` to halve your memory footprint, but then you might end up having to manage truncation errors as discussed in #1410.

I assume you're already using `advection = WENO5()`, but you could use a higher…

A dirty trick Oceananigans.jl used to do was to use … as scratch space. A cleaner solution is to do what @glwagner did for LESbrary.jl: use one scratch field per location, e.g. https://github.com/CliMA/LESbrary.jl/blob/cf31b0ec20219d5ad698af334811d448c27213b0/examples/three_layer_constant_fluxes.jl#L380-L385
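The general idea behind the scratch-field trick, sketched in plain Julia (no Oceananigans; the `compute!` helper below is hypothetical and only illustrates why reusing one preallocated buffer per location saves memory):

```julia
# Each diagnostic broadcasts its result into the same preallocated
# scratch array instead of allocating fresh storage per diagnostic.
scratch = zeros(Float64, 32, 32, 32)

# Hypothetical helper: broadcast an elementwise function `f` of some
# input arrays into the shared scratch buffer.
compute!(scratch, f, inputs...) = (scratch .= f.(inputs...); scratch)

u = rand(32, 32, 32)
v = rand(32, 32, 32)

ke = compute!(scratch, (u, v) -> (u^2 + v^2) / 2, u, v)  # kinetic energy
# `ke` aliases `scratch`: write it out before reusing the buffer for
# the next diagnostic at the same location.
```

In Oceananigans terms, "one scratch field per location" means one such buffer for each staggered-grid location, shared by every computed field living at that location, as in the linked LESbrary lines.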
-
Apart from OP @tomchor's recommendation, I think @ali-ramadhan mentioned the most important memory-saving technique for simulations with lots of diagnostics (using a single scratch space for …). I'll just say that there are two other possibly important techniques: 1) eliminating the hydrostatic pressure as an auxiliary variable, as discussed in #1443 (which reduces Oceananigans' memory footprint by one field), and 2) figuring out how to use … We could also implement a …
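For scale on point 1, a quick plain-Julia estimate (the grid size is illustrative) of what eliminating one auxiliary 3D field buys:

```julia
# One full 3D Float64 field on a 512^3 grid (halos ignored):
Nx = Ny = Nz = 512
one_field_bytes = Nx * Ny * Nz * sizeof(Float64)
println("One 3D field: ", one_field_bytes / 2^30, " GiB")  # exactly 1 GiB
```

So on large grids, each auxiliary field dropped frees on the order of a gigabyte of GPU memory.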