Move dimensions to allocator #85
Conversation
The main change here is introducing new member variables and removing some of them, involving ...
There is also a new member ...
Here I focused entirely on SZ.
backend%yblocks = dim3(globs%n_groups_y, 1, 1)
backend%zthreads = dim3(SZ, 1, 1)
backend%zblocks = dim3(globs%n_groups_z, 1, 1)
sz = allocator%sz
For example, here I would assign allocator%sz into a new sz member of backend_t if we want to avoid using the SZ parameter from m_omp_common/m_cuda_common. In all the use cases of field_t%sz the PR introduces, we can instead do self%sz.
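A minimal sketch of that suggestion, assuming a heavily simplified module and type layout (only the sz member and the assignment from allocator%sz come from the comment itself):

module m_backend_sketch
   ! Hypothetical module, purely for illustrating the suggestion above.
   implicit none

   type :: allocator_t
      integer :: sz
   end type allocator_t

   type :: backend_t
      integer :: sz
   end type backend_t

contains

   subroutine init_backend(backend, allocator)
      type(backend_t), intent(out) :: backend
      type(allocator_t), intent(in) :: allocator

      ! freeze sz once, at backend instantiation
      backend%sz = allocator%sz
   end subroutine init_backend

end module m_backend_sketch

Every place the PR currently reads field_t%sz inside a backend procedure could then read self%sz instead.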
However, I think using the SZ parameter directly from m_omp_common/m_cuda_common is a better idea, but I don't have a strong opinion on this. If we're already under omp/ or cuda/, it's fine to rely on the corresponding SZ parameter directly; otherwise we use sz from allocator_t to set a very critical value for us.
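For reference, a tiny sketch of the direct-parameter route; the module name is taken from the diffs below, but the value 32 is only an illustrative assumption:

module m_omp_common
   implicit none
   ! compile-time constant; 32 is an assumed example value
   integer, parameter :: SZ = 32
end module m_omp_common

Code under omp/ (or cuda/ with m_cuda_common) would then do "use m_omp_common, only: SZ" and rely on SZ directly in loop bounds and launch configurations.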
@@ -360,7 +365,7 @@ subroutine tds_solve_cuda(self, du, u, dirps, tdsops)
     error stop 'DIR mismatch between fields and dirps in tds_solve.'
   end if

-  blocks = dim3(dirps%n_blocks, 1, 1); threads = dim3(SZ, 1, 1)
+  blocks = dim3(u%n_groups, 1, 1); threads = dim3(u%sz, 1, 1)
For example, here sz doesn't depend on the field u. It is always the same SZ, which is effectively frozen when the backend is instantiated.
self%nx = xdirps%n; self%ny = ydirps%n; self%nz = zdirps%n
sz = allocator%sz
With #83 the Poisson solver doesn't require SZ any more; just wanted to give an update.
@@ -2,7 +2,6 @@ module m_omp_exec_dist
   use mpi

   use m_common, only: dp
-  use m_omp_common, only: SZ
Not being able to access SZ as a parameter will probably make performance worse, because the compiler won't be certain at compile time whether the simd loops can be vectorised.
@@ -127,15 +128,15 @@ subroutine exec_dist_transeq_compact( &
   ! Handle dud by locally generating u*v
   do j = 1, n
     !$omp simd
-    do i = 1, SZ
+    do i = 1, sz
For example, here we basically tell the compiler to do its best to vectorise the loop. But if the loop goes from 1 to 7, let's say, it's impossible for the compiler to vectorise it. And if sz here is a variable rather than a parameter, the compiler can't be sure at compile time that it is a good integer for vectorisation. In the best-case scenario it'll generate multiple versions of the assembly and will have to decide which one to execute only at runtime, which is not ideal. It may altogether refuse to generate vectorised assembly; this will depend on the compiler. So here I recommend using SZ as a parameter directly from m_omp_common.
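A sketch of the pattern being discussed, assuming the parameter route (the loop body is illustrative only, and this sketch reuses the assumed m_omp_common module from above):

subroutine simd_sketch(du, u, n)
   use m_omp_common, only: SZ
   implicit none
   integer, intent(in) :: n
   real, intent(in) :: u(SZ, n)
   real, intent(inout) :: du(SZ, n)
   integer :: i, j

   do j = 1, n
      !$omp simd
      do i = 1, SZ   ! trip count is a compile-time constant, so the
                     ! compiler can emit a single, fully vectorised version
         du(i, j) = du(i, j) + u(i, j)
      end do
   end do
end subroutine simd_sketch

With a runtime sz the same loop would need a runtime check or multi-versioned assembly, exactly as described above.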
-  allocate (ud_recv_e(SZ, n_halo))
-  allocate (ud_recv_s(SZ, n_halo))
+  allocate (ud(sz, n))
+  allocate (ud_recv_e(sz, n_halo))
Another minor example: if we have both sz and n_halo as parameters, it'll probably be better for performance, as the size of the allocation can be determined at compile time. n_halo is not a parameter yet, but we can actually make it one; it is always a constant value in the backends and there are places where we hardcode things based on n_halo = 4.
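A sketch of that idea; promoting n_halo to a parameter with the value 4 is the suggestion above, and the buffer names mirror the diff (the subroutine itself is hypothetical):

subroutine halo_buffers_sketch(n)
   use m_omp_common, only: SZ
   implicit none
   integer, parameter :: n_halo = 4   ! always 4 in the backends today
   integer, intent(in) :: n
   ! with both SZ and n_halo as parameters, these shapes are fixed at compile time
   real :: ud_recv_s(SZ, n_halo), ud_recv_e(SZ, n_halo)
   real, allocatable :: ud(:, :)

   allocate (ud(SZ, n))   ! n remains a runtime quantity
end subroutine halo_buffers_sketch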
@@ -177,7 +178,7 @@ subroutine exec_dist_transeq_compact( &

   do j = 1, n
     !$omp simd
-    do i = 1, SZ
+    do i = 1, sz
Just another example where a parameter SZ will make performance better.
I'm starting a new review specifically for discussing field_t%n.
I tried to avoid inferring, from the domain shape, the actual size the tridiagonal solver will make use of. This is mainly because of the difference between the velocity grid and the pressure grid.
I'll try to give an example:
We call tds_solve in various settings, and often we pass in velocity-grid data and get back pressure-grid data, such as:

Lines 284 to 285 in ce2efbc

call self%backend%tds_solve(du_x, u, self%xdirps, &
                            self%xdirps%stagder_v2p)
In this case, it's very important to use the correct n in the tds_solve.

Line 369 in ce2efbc

subroutine tds_solve_dist(self, du, u, dirps, tdsops, blocks, threads)
The tdsops class we have is the key here. It has an n member where the size of the tridiagonal system is defined. It is always the correct value, either the n of velocity or the n of pressure, because it is literally the class that defines the operation; this is determined based on whether the operation is a _p2v, a _v2p, and so on. We pass this down to exec_dist via tdsops_dev in #L409 below.

Lines 404 to 411 in ce2efbc
call exec_dist_tds_compact( &
   du_dev, u_dev, &
   self%u_recv_s_dev, self%u_recv_e_dev, &
   self%du_send_s_dev, self%du_send_e_dev, &
   self%du_recv_s_dev, self%du_recv_e_dev, &
   tdsops_dev, dirps%nproc, dirps%pprev, dirps%pnext, &
   blocks, threads &
   )
And eventually we call the kernel with tdsops%n below in #L43.

Lines 40 to 44 in ce2efbc

call der_univ_dist<<<blocks, threads>>>( & !&
   du, du_send_s, du_send_e, u, u_recv_s, u_recv_e, &
   tdsops%coeffs_s_dev, tdsops%coeffs_e_dev, tdsops%coeffs_dev, &
   tdsops%n, tdsops%dist_fw_dev, tdsops%dist_bw_dev, tdsops%dist_af_dev &
   )
Halo exchange

The actual n of the field is also important in halo exchanges, because if we can't determine the actual n, whether it's pressure or velocity, we can't send the right last 4 slices to the neighbouring ranks.
Lines 395 to 396 in ce2efbc

call copy_into_buffers(self%u_send_s_dev, self%u_send_e_dev, u_dev, &
                       tdsops_dev%n)
That's why we need to pass an n into copy_into_buffers, so that the last 4 layers are accessed correctly based on whether it is pressure or velocity data. See below.

Line 412 in ce2efbc

u_send_e(i, j, k) = u(i, n - n_halo + j, k)
Because of this we need to pass an n value to copy_into_buffers, and we can't assume it's field_t%n.
I think having a field_t%n is not a good idea because of all this. It'll be misleading, and the n in a given direction can be stored in dirps_t%n. When it's safe to do so we can use it, and when necessary we'll have to access tdsops_t%n.
-   u_send_e(i, j, k) = u(i, n - n_halo + j, k)
+   do i = 1, u%sz
+     u_send_s(i, j, k) = u%data(i, j, k)
+     u_send_e(i, j, k) = u%data(i, u%n - n_halo + j, k)
In particular here, we can't have field_t%n. We have to pass an n value into this subroutine, determined from the corresponding tdsops, when copy_into_buffers is called in tds_solve. tds_solve always has a tdsops, and passing tdsops%n down to copy_into_buffers is necessary. Otherwise we'll end up sending the wrong set of slices.
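A simplified sketch of that contract; the argument list is trimmed compared to the real subroutine, and sz and n_groups are passed explicitly here just to keep the example self-contained:

subroutine copy_into_buffers_sketch(u_send_s, u_send_e, u, n, sz, n_groups)
   implicit none
   integer, intent(in) :: n, sz, n_groups   ! n comes from tdsops%n at the call site
   integer, parameter :: n_halo = 4
   real, intent(in) :: u(sz, n, n_groups)
   real, intent(out) :: u_send_s(sz, n_halo, n_groups)
   real, intent(out) :: u_send_e(sz, n_halo, n_groups)
   integer :: i, j, k

   do k = 1, n_groups
      do j = 1, n_halo
         do i = 1, sz
            u_send_s(i, j, k) = u(i, j, k)
            ! the last n_halo slices are only correct if n matches the
            ! operation (pressure vs velocity), hence tdsops%n, not field_t%n
            u_send_e(i, j, k) = u(i, n - n_halo + j, k)
         end do
      end do
   end do
end subroutine copy_into_buffers_sketch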
handle%n = -1
handle%SZ = -1
handle%n_padded = -1
handle%n_groups = -1
I think this also suggests that we shouldn't store these variables in field_t. field_t is effectively a scratch space and sometimes these variables don't have a meaning. These assignments in particular can cause annoying bugs.
do k = 1, u%n_groups
  do j = 1, u%n
    do i = 1, u%sz
To simplify reorder_omp, how about introducing a variable n_groups(3) in the allocator_t class, and:

do k = 1, self%allocator%n_groups(u%dir)
  do j = 1, self%allocator%cdims(u%dir)
    do i = 1, self%sz   ! self%sz comes from backend%sz, or SZ directly

It doesn't require us to pass anything extra and removes all the select case stuff above.
Seems like there are a few different definitions of ...
So, questions to quell my quonfusion:
FWIW it feels like having this information about the field within the field is sensible, but I agree with @semi-h that it can be misleading. I wonder if there's a better naming scheme we can use to distinguish between the above kinds of ...
I think we should discuss ...
I think there are two different questions here: where these variables are stored, and how they are called. Concerning the naming conventions, I agree this needs to be clarified/improved. One of the reasons it isn't really clear is because ... As we mentioned in the related issue, the plan was to remove the ... For ...
If ...

Passing an extra argument when calling ... Also, if we require the user to specify whether a field is pressure or velocity, this is an extra responsibility we give to the user. For example, to get a derivative on the staggered grid, the user will have to take two distinct but correlated actions to get a single desired output: first specifying that the field is pressure or velocity, then calling ... I think we need to focus on where exactly in the codebase we use the proposed members.
Going via another route (#91), so closing this PR.
closes #77