Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vertical loop fusion and demotion of temporaries #374

Merged
merged 15 commits into from
Oct 14, 2024

Conversation

MichaelSt98
Copy link
Collaborator

@MichaelSt98 MichaelSt98 commented Sep 10, 2024

Vertical loop fusion and demotion of temporaries in the vertical dimension if possible based on
pragmas in the input source code and heuristics.

This is intended to be a starting point for a more automated and sophisticated approach (including dataflow analysis) as well as further optimisations that require this as a first step.

To test with CLOUDSC, checkout branch nams-vertical-loop-fusion.
This results in the following performance improvements (executed with 1 262144 128):

VARIANT ORIG/WITHOUT WITH NEW TRAFO
SCC ca. 200 GF/s ca. 460 GF/s
SCC-HOIST ca. 320 GF/s ca. 500 GF/s
SCC-STACK ca. 320 GF/s ca. 520 GF/s

Copy link

Documentation for this branch can be viewed at https://sites.ecmwf.int/docs/loki/374/index.html

Copy link

codecov bot commented Sep 10, 2024

Codecov Report

Attention: Patch coverage is 98.48943% with 5 lines in your changes missing coverage. Please review.

Project coverage is 95.56%. Comparing base (1d95896) to head (7a65ebe).
Report is 36 commits behind head on main.

Files with missing lines Patch % Lines
loki/transformations/transform_loop.py 96.29% 3 Missing ⚠️
...ormations/single_column/tests/test_scc_vertical.py 97.97% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #374      +/-   ##
==========================================
+ Coverage   95.54%   95.56%   +0.02%     
==========================================
  Files         188      190       +2     
  Lines       39468    39691     +223     
==========================================
+ Hits        37709    37930     +221     
- Misses       1759     1761       +2     
Flag Coverage Δ
lint_rules 96.39% <ø> (ø)
loki 95.54% <98.48%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@MichaelSt98 MichaelSt98 marked this pull request as ready for review September 10, 2024 15:51
Copy link
Collaborator

@mlange05 mlange05 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks really great, and is a first step towards some very advanced optimisations. I've left a few comments, but they are mostly cosmetic. Once addressed, I think this should be good to go.

The direct inclusion in SCC also ok for now, as it will basically become a no-op unless any pragma markers are inserted. In the future, we might want to add this stuff to a more generic non-SCC loop-change transformation that can be prepended to SCC in config, or run in it's own right (eg. for CPU opts). But that is beyond this PR.

routine : :any:`Subroutine`
The subroutine in the vertical loops should be fused and
temporaries be demoted.
horizontal : :any:`Dimension`
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the horizontal is not actually used here.


.. note::
This transfomation currently relies on pragmas being inserted in the input
source files.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be good to briefly explain the pragma annotations needed. For example, listing and linking loop-interchange and loop-fusion notation, as well as the custom "-init" and "ignore" notations.

loop_var_map = {}
with pragmas_attached(routine, ir.Loop):
for loop in FindNodes(ir.Loop).visit(routine.body):
if loop.variable.name.lower() == self.vertical.index.lower():
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is the .lower() really needed here? Are there edge-cases where expression-string comparison not trigger?

parameters = get_pragma_parameters(loop.pragma, starts_with='fused-loop')
group = parameters.get('group', 'default')
if group == 'ignore':
ignore = True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of tracking ignore, I think you could simply continue here, no?

if group == 'ignore':
ignore = True
if not ignore:
loop_var_map[loop] = as_tuple(var.name.lower() for var in FindVariables().visit(loop.body)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would building the inverse of this map not make the logic much simple? As in var_loop_map[var.name] = loop would allow you to then simply check for len(var_loop_map[var.name]) > 1 in the next part? Or am I misreading this?

Also, a CaseInsensitiveDict might be useful here for var_loop_map.

# look for 'loki k-caching ignore(var1, var2, ...)' pragmas within routine and ignore those vars
for pragma in pragmas:
if is_loki_pragma(pragma, starts_with='k-caching'):
pragma_ignore = get_pragma_parameters(pragma, starts_with='k-caching').get('ignore', None)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nitpick: I wonder if the is_loki_pragma and get_pragma_parameters can be combined to something like if pragma_ignore := get_pragma_parameters(pragma, starts_with='k-caching').get(...). Not strictly necessary, just wondering...


def correct_init_of_multilevel_arrays(self, routine, multilevel_local_arrays):
"""
Possibly handle initaliztion of those multilevel local arrays via
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: "initialization"

new_pragmas = [pragma.clone(content=pragma.content.replace('-init', '')) if '-init'
in pragma.content else pragma for pragma in pragmas]
loop_map[loop] = (ir.Comment('! Loki generated loop for init ...'),
Transformer(node_map_init).visit(loop.clone(\
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two inline Transformer invocations makes this very hard to read. Can we apply them in sequence to build the tuple, before putting them into the loop_map?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean.
I need to apply the Transformer twice with two different maps since the key would be overwritten otherwise ...
If you propose to introduce temporary variables: sure this is possible!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I just meant the cosmetics of it - the use of two transformers is fine, it just gets really hard to read if everything is applied in a single line. If you build the tuple incrementally, each Transformer.visit(..) invocation could be a single line - but this is really just a cosmetic suggestion and not strictly necessary.

loki/transformations/transform_loop.py Show resolved Hide resolved
@MichaelSt98
Copy link
Collaborator Author

Thanks for reviewing @mlange05. I addressed all your requested changes.
I am thinking about introducing a flag to activate or deactivate that transformation. It is not completely true that this is a no-op unless any pragma markers are inserted. Without pragma markers no loop-interchange and loop-fusion is performed, however, it would still look for temporaries that could be demoted (so in principle this transformation could optimise even without inserted pragmas if some temporaries unnecessarily have a vertical dimension). I don't know how much time it would take to execute this performance for e.g., ecwam or ecphys and/or whether the result is correct. Therefore, I should at least check whether it works with ecwam/ecphys and/or introduce a flag to activate or deactivate. What do you think?

@mlange05
Copy link
Collaborator

mlange05 commented Oct 6, 2024

Ok, nitpicks and code stye comments have all been addressed. Many thanks for this and over to @reuterbal for final approval.

As for the on/off switch, I see you argument, and I think having a safety switch built in is a good idea. Would you mind adding this please to the constructor? I'll check offline if having it enabled affects EC-physics. Once that switch is added, I think this is ready.

On a wider note, I think that once we refactor this a little, we can have the loop-pragma steps as a separate "pre-processing Trafo" that can be added to the pipeline in the config files, rather than be part of SCC. Demotion of the vertical would then simply require additional parameters to the SCCDemote step - but all of this if beyond this initial PR.

Copy link
Collaborator

@reuterbal reuterbal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for a very comprehensive and nicely structured implementation and the round of review and fixes already performed! This looks already very good to me. I've left a few (minor) questions and suggestions, and a comment for one potential source of problems.

Comment on lines +72 to +81
1. loop interchange to expose vertical loops
2. fuse vertical loops (possibly into multiple groups)
3. find local arrays to be demoted and apply heuristics to check whether this is safe
4. demote those arrays which are safe to be demoted
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice docstring! 👏

loki/transformations/single_column/vertical.py Outdated Show resolved Hide resolved
loki/transformations/transform_loop.py Show resolved Hide resolved
loki/transformations/transform_loop.py Outdated Show resolved Hide resolved
loki/transformations/transform_loop.py Show resolved Hide resolved
Copy link
Collaborator

@reuterbal reuterbal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice, many thanks! :shipit:

@reuterbal reuterbal added the ready for merge This PR has been approved and is ready to be merged label Oct 8, 2024
@mlange05
Copy link
Collaborator

Ok, sorry for the delay. I've just run a locally rebased version over this with ecphs-regression, and I can confirm that it works and is performance neutral. I'll trigger one more rebase, and then GTG from me.

Copy link
Collaborator

@mlange05 mlange05 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking great now, and confirmed safe and fast with ecphys-regression. GTG. :shipit:

@reuterbal reuterbal merged commit b6eb817 into main Oct 14, 2024
13 checks passed
@reuterbal reuterbal deleted the nams-vertical-loop-fusion branch October 14, 2024 14:45
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready for merge This PR has been approved and is ready to be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants