Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow tunring off DBCSR ACC with env variable #801

Merged
merged 3 commits into from
Jun 12, 2024
Merged

Conversation

abussy
Copy link
Contributor

@abussy abussy commented May 30, 2024

This PR allows turning off DBCSR GPU acceleration at initialization, even if the library is compiled with the -D__DBCSR_ACC flag. This can be done by setting the environment variable DBCSR_TURN_OFF_ACC=1, or by passing a logical to the dbcsr_set_config subroutine (this will be useful for a future CP2K keyword).

Most changes take place in the dbcsr_config.F file. There, a new function called use_acc() is defined. It returns a logical, and replaces the has_acc variable used in the rest of the code.

The PR was validated by successfully running the CP2K regtests on a GPU machine, with and without the DBCSR_TURN_OFF_ACC=1 environment variable.

Generally, the ability of turning off GPU acceleration at runtime should help with debugging, and with edge cases where DBCSR ACC under-performs (see #795).

@jenkins-cscs
Copy link
Collaborator

Can one of the admins verify this patch?

@hfp
Copy link
Member

hfp commented May 30, 2024

With the environment variable DBCSR_N_STACKS=0 existing already, does it make this PR superfluous or do you still want it?

@alazzaro
Copy link
Member

We still need it since we still move data to the gpu and run transpose kernel.

Thanks @abussy , I will review the PR next week.

@abussy
Copy link
Contributor Author

abussy commented Jun 11, 2024

Gentle ping to review this PR :) It would be nice if it makes it to CP2K's next release

@alazzaro
Copy link
Member

thanks for the PR.
I will start the review, but first let me put here some general comments for my understanding of the proposed changes:

  1. we are replacing has_acc boolean parameter with use_acc function
  2. has_acc is used within the use_acc and when setting the default for mm_dense
  3. We use the new conf flag TURN_OFF_ACC to disable the GPU when DBCSR is compiled for GPU

OK, the first "logical" problem is TURN_OFF_ACC has to be TRUE to disable the GPU. I would prefer the opposite meaning, so we can use RUN_ON_GPU so it is TRUE by default whenever has_acc is TRUE and people can set to FALSE if you don't want to run on the GPU (we are not really "turning off" the GPU, simply we are not running on it...).

In terms of output, I will keep has_acc when we output the configuration.

The last problem is that it must forbidden to change this value multiple times within a DBCSR run, otherwise will get configuration problems (we are replacing a macro at compile time with something at runtime). This functionality is not available in the library, I need to think about (in any case this is not relevant to your PR).

@abussy
Copy link
Contributor Author

abussy commented Jun 11, 2024

Sure, using a positive boolean like RUN_ON_GPU makes sense, I agree.

Concerning the danger of changing this value multiple times during a run, I think it should be safe as it is. The use_acc() function is defined during the DBCSR configuration step.

CALL dbcsr_cfg%mm_stack_size%set(mm_stack_size)
END IF

IF (.NOT. PRESENT(mm_stack_size) .AND. dbcsr_cfg%turn_off_acc%val) THEN
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be mm_dense here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same consideration about resetting the value...

src/core/dbcsr_config.F Outdated Show resolved Hide resolved
src/core/dbcsr_config.F Outdated Show resolved Hide resolved
src/core/dbcsr_config.F Outdated Show resolved Hide resolved
@alazzaro
Copy link
Member

Concerning the danger of changing this value multiple times during a run, I think it should be safe as it is. The use_acc() function is defined during the DBCSR configuration step.

The dbcsr_set_config is a public function, so people can call it after the init. Actually, this is what CP2K does. That means we can call it once with GPU_RUN disabled, then run a multiplication, then call set_config and set GPU_RUN to true and at this point I'm not sure what it will happen... We need to protect for this case, as said this is not part of this PR (I will open an issue as a reminder).

@abussy
Copy link
Contributor Author

abussy commented Jun 12, 2024

I implemented the vast majority of the suggested changes:

  • the environment variable is now called RUN_ON_GPU
  • has_acc is used to control the printing of ACC paramters, and not use_acc()
  • the information about whether GPU support is enabled or not is printed with the rest of ACC info
  • no more hardcoded CPU defaults (e.g. mm_stack_size of 1000) with the introduction of new paramter variables (e.g. mm_stack_default_size_cpu)

@alazzaro alazzaro merged commit 0f47720 into cp2k:develop Jun 12, 2024
20 of 22 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants