Move dimensions to allocator #85
Conversation
The main change here is introducing new member variables and removing some of them, involving ...
There is also a new member ...
Here I focused entirely on SZ.
backend%yblocks = dim3(globs%n_groups_y, 1, 1)
backend%zthreads = dim3(SZ, 1, 1)
backend%zblocks = dim3(globs%n_groups_z, 1, 1)
sz = allocator%sz
For example, here I would assign allocator%sz into a new sz member of backend_t if we want to avoid using the SZ parameter from m_omp_common/m_cuda_common. In all the use cases of field_t%sz the PR introduces, we can instead do self%sz.
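A minimal sketch of that suggestion, assuming a heavily simplified module and type layout (only the sz member and the assignment from allocator%sz come from the comment itself):

module m_backend_sketch
   ! Hypothetical module, purely for illustrating the suggestion above.
   implicit none

   type :: allocator_t
      integer :: sz
   end type allocator_t

   type :: backend_t
      integer :: sz
   end type backend_t

contains

   subroutine init_backend(backend, allocator)
      type(backend_t), intent(out) :: backend
      type(allocator_t), intent(in) :: allocator

      ! freeze sz once, at backend instantiation
      backend%sz = allocator%sz
   end subroutine init_backend

end module m_backend_sketch

Every place the PR currently reads field_t%sz inside a backend procedure could then read self%sz instead.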
However, I think using the SZ parameter directly from m_omp_common/m_cuda_common is a better idea, but I don't have a strong opinion on this. If we're already under omp/ or cuda/, it's fine to rely on the corresponding SZ parameter directly; otherwise we use sz from allocator_t to set a very critical value for us.
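For reference, a tiny sketch of the direct-parameter route; the module name is taken from the diffs below, but the value 32 is only an illustrative assumption:

module m_omp_common
   implicit none
   ! compile-time constant; 32 is an assumed example value
   integer, parameter :: SZ = 32
end module m_omp_common

Code under omp/ (or cuda/ with m_cuda_common) would then do "use m_omp_common, only: SZ" and rely on SZ directly in loop bounds and launch configurations.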
@@ -360,7 +365,7 @@ subroutine tds_solve_cuda(self, du, u, dirps, tdsops)
     error stop 'DIR mismatch between fields and dirps in tds_solve.'
   end if

-  blocks = dim3(dirps%n_blocks, 1, 1); threads = dim3(SZ, 1, 1)
+  blocks = dim3(u%n_groups, 1, 1); threads = dim3(u%sz, 1, 1)
For example, here sz doesn't depend on the field u. It is always the same SZ, which is effectively frozen when the backend is instantiated.
self%nx = xdirps%n; self%ny = ydirps%n; self%nz = zdirps%n
sz = allocator%sz
With #83 the Poisson solver doesn't require SZ any more; just wanted to give an update.
@@ -2,7 +2,6 @@ module m_omp_exec_dist
   use mpi

   use m_common, only: dp
-  use m_omp_common, only: SZ
Not being able to access SZ as a parameter will probably make performance worse, because the compiler won't be certain at compile time whether the simd loops can be vectorised.
@@ -127,15 +128,15 @@ subroutine exec_dist_transeq_compact( &
   ! Handle dud by locally generating u*v
   do j = 1, n
     !$omp simd
-    do i = 1, SZ
+    do i = 1, sz
For example, here we basically tell the compiler to do its best to vectorise the loop. But if the loop goes from 1 to 7, let's say, it's impossible for the compiler to vectorise it. And if sz here is a variable rather than a parameter, the compiler can't be sure at compile time that it is a good integer for vectorisation. In the best-case scenario it'll generate multiple versions of the assembly and will have to decide which one to execute only at runtime, which is not ideal. It may altogether refuse to generate vectorised assembly; this will depend on the compiler. So here I recommend using SZ as a parameter directly from m_omp_common.
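A sketch of the pattern being discussed, assuming the parameter route (the loop body is illustrative only, and this sketch reuses the assumed m_omp_common module from above):

subroutine simd_sketch(du, u, n)
   use m_omp_common, only: SZ
   implicit none
   integer, intent(in) :: n
   real, intent(in) :: u(SZ, n)
   real, intent(inout) :: du(SZ, n)
   integer :: i, j

   do j = 1, n
      !$omp simd
      do i = 1, SZ   ! trip count is a compile-time constant, so the
                     ! compiler can emit a single, fully vectorised version
         du(i, j) = du(i, j) + u(i, j)
      end do
   end do
end subroutine simd_sketch

With a runtime sz the same loop would need a runtime check or multi-versioned assembly, exactly as described above.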
-  allocate (ud_recv_e(SZ, n_halo))
-  allocate (ud_recv_s(SZ, n_halo))
+  allocate (ud(sz, n))
+  allocate (ud_recv_e(sz, n_halo))
Another minor example: if we have both sz and n_halo as parameters, it'll probably be better for performance, as the size of the allocation can be determined at compile time. n_halo is not a parameter yet, but we can actually make it one; it is always a constant value in the backends and there are places where we hardcode things based on n_halo = 4.
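A sketch of that idea; promoting n_halo to a parameter with the value 4 is the suggestion above, and the buffer names mirror the diff (the subroutine itself is hypothetical):

subroutine halo_buffers_sketch(n)
   use m_omp_common, only: SZ
   implicit none
   integer, parameter :: n_halo = 4   ! always 4 in the backends today
   integer, intent(in) :: n
   ! with both SZ and n_halo as parameters, these shapes are fixed at compile time
   real :: ud_recv_s(SZ, n_halo), ud_recv_e(SZ, n_halo)
   real, allocatable :: ud(:, :)

   allocate (ud(SZ, n))   ! n remains a runtime quantity
end subroutine halo_buffers_sketch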
@@ -177,7 +178,7 @@ subroutine exec_dist_transeq_compact( &

   do j = 1, n
     !$omp simd
-    do i = 1, SZ
+    do i = 1, sz
Just another example where a parameter SZ will make performance better.
I'm starting a new review specifically for discussing field_t%n.
I tried to avoid inferring, from the domain shape, the actual size the tridiagonal solver will make use of. This is mainly because of the difference between the velocity grid and the pressure grid.
I'll try to give an example:
We call tds_solve in various settings, and often we pass in velocity-grid data and get back pressure-grid data, such as:

Lines 284 to 285 in ce2efbc

call self%backend%tds_solve(du_x, u, self%xdirps, &
                            self%xdirps%stagder_v2p)
In this case, it's very important to use the correct n in the tds_solve.

Line 369 in ce2efbc

subroutine tds_solve_dist(self, du, u, dirps, tdsops, blocks, threads)
The tdsops class we have is the key here. It has an n member where the size of the tridiagonal system is defined. It is always the correct value, either the n of velocity or the n of pressure, because it is literally the class that defines the operation; this is determined based on whether the operation is a _p2v, a _v2p, and so on. We pass this down to exec_dist via tdsops_dev in #L409 below.

Lines 404 to 411 in ce2efbc
call exec_dist_tds_compact( &
   du_dev, u_dev, &
   self%u_recv_s_dev, self%u_recv_e_dev, &
   self%du_send_s_dev, self%du_send_e_dev, &
   self%du_recv_s_dev, self%du_recv_e_dev, &
   tdsops_dev, dirps%nproc, dirps%pprev, dirps%pnext, &
   blocks, threads &
   )
And eventually we call the kernel with tdsops%n below in #L43.

Lines 40 to 44 in ce2efbc

call der_univ_dist<<<blocks, threads>>>( & !&
   du, du_send_s, du_send_e, u, u_recv_s, u_recv_e, &
   tdsops%coeffs_s_dev, tdsops%coeffs_e_dev, tdsops%coeffs_dev, &
   tdsops%n, tdsops%dist_fw_dev, tdsops%dist_bw_dev, tdsops%dist_af_dev &
   )
Halo exchange

The actual n of the field is also important in halo exchanges, because if we can't determine the actual n, whether it's pressure or velocity, we can't send the right last 4 slices to the neighbouring ranks.
Lines 395 to 396 in ce2efbc

call copy_into_buffers(self%u_send_s_dev, self%u_send_e_dev, u_dev, &
                       tdsops_dev%n)
That's why we need to pass an n into copy_into_buffers, so that the last 4 layers are accessed correctly based on whether it is pressure or velocity data. See below.

Line 412 in ce2efbc

u_send_e(i, j, k) = u(i, n - n_halo + j, k)
Because of this we need to pass an n value to copy_into_buffers, and we can't assume it's field_t%n.
I think having a field_t%n is not a good idea because of all this. It'll be misleading, and the n in a given direction can be stored in dirps_t%n. When it's safe to do so we can use it, and when necessary we'll have to access tdsops_t%n.
-   u_send_e(i, j, k) = u(i, n - n_halo + j, k)
+   do i = 1, u%sz
+     u_send_s(i, j, k) = u%data(i, j, k)
+     u_send_e(i, j, k) = u%data(i, u%n - n_halo + j, k)
In particular here, we can't have field_t%n. We have to pass an n value into this subroutine, determined from the corresponding tdsops, when copy_into_buffers is called in tds_solve. tds_solve always has a tdsops, and passing tdsops%n down to copy_into_buffers is necessary. Otherwise we'll end up sending the wrong set of slices.
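A simplified sketch of that contract; the argument list is trimmed compared to the real subroutine, and sz and n_groups are passed explicitly here just to keep the example self-contained:

subroutine copy_into_buffers_sketch(u_send_s, u_send_e, u, n, sz, n_groups)
   implicit none
   integer, intent(in) :: n, sz, n_groups   ! n comes from tdsops%n at the call site
   integer, parameter :: n_halo = 4
   real, intent(in) :: u(sz, n, n_groups)
   real, intent(out) :: u_send_s(sz, n_halo, n_groups)
   real, intent(out) :: u_send_e(sz, n_halo, n_groups)
   integer :: i, j, k

   do k = 1, n_groups
      do j = 1, n_halo
         do i = 1, sz
            u_send_s(i, j, k) = u(i, j, k)
            ! the last n_halo slices are only correct if n matches the
            ! operation (pressure vs velocity), hence tdsops%n, not field_t%n
            u_send_e(i, j, k) = u(i, n - n_halo + j, k)
         end do
      end do
   end do
end subroutine copy_into_buffers_sketch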
handle%n = -1
handle%SZ = -1
handle%n_padded = -1
handle%n_groups = -1
I think this also suggests that we shouldn't store these variables in field_t. field_t is effectively a scratch space and sometimes these variables don't have a meaning. These assignments in particular can cause annoying bugs.
do k = 1, u%n_groups
  do j = 1, u%n
    do i = 1, u%sz
To simplify reorder_omp, how about introducing a variable n_groups(3) in the allocator_t class, and:

do k = 1, self%allocator%n_groups(u%dir)
  do j = 1, self%allocator%cdims(u%dir)
    do i = 1, self%sz   ! self%sz comes from backend%sz, or SZ directly

It doesn't require us to pass anything extra and removes all the select case stuff above.
Seems like there are a few different definitions of ...
So, questions to quell my quonfusion:
FWIW it feels like having this information about the field within the field is sensible, but I agree with @semi-h that it can be misleading. I wonder if there's a better naming scheme we can use to distinguish between the above kinds of ...
I think we should discuss ...
I think there are two different questions here: where these variables are stored, and how they are called. Concerning the naming conventions, I agree this needs to be clarified/improved. One of the reasons it isn't really clear is because ... As we mentioned in the related issue, the plan was to remove the ... For ...
If ...

Passing an extra argument when calling ... Also, if we require the user to specify whether a field is pressure or velocity, this is an extra responsibility we give to the user. For example, to get a derivative on the staggered grid, the user will have to take two distinct but correlated actions to get a single desired output: first specifying that the field is pressure or velocity, then calling ... I think we need to focus on where exactly in the codebase we use the proposed members.
Going via another route (#91), so closing this PR.
closes #77