Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Split reads and writes for certain accumulation patterns #329

Merged
merged 4 commits into from
Aug 14, 2024

Conversation

awnawab
Copy link
Contributor

@awnawab awnawab commented Jun 12, 2024

Lukas M. demonstrated that for accumulation patterns of the type:

do jlon=1,nproma
   var(jlon, n1) = var(jlon, n1) + 1.
   var(jlon, n2) = var(jlon, n2) + 1.
enddo

a compiler cannot rule out the possibility that n1 and n2 do not in fact alias the same location. In such a case, it is unable to run the loads and stores out-of-order.

This PR contributes a pragma assisted transformation to split the above into separate reads and writes, thereby removing the dependency between subsequent loads and stores and allowing the compiler to optimise more effectively.

Copy link

Documentation for this branch can be viewed at https://sites.ecmwf.int/docs/loki/329/index.html

Copy link

codecov bot commented Jun 12, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 95.38%. Comparing base (a0dade4) to head (ad1d51a).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #329      +/-   ##
==========================================
+ Coverage   95.36%   95.38%   +0.01%     
==========================================
  Files         173      175       +2     
  Lines       36914    37039     +125     
==========================================
+ Hits        35204    35330     +126     
+ Misses       1710     1709       -1     
Flag Coverage Δ
lint_rules 96.39% <ø> (ø)
loki 95.36% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@awnawab awnawab force-pushed the naan-split-read-write branch 4 times, most recently from 5eb2d3e to 4974f6b Compare June 13, 2024 09:21
Copy link
Collaborator

@reuterbal reuterbal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this is really neat (and many thanks to @lukasm91 for pointing this out in the first place!). However, the implementation appears a bit too complicated and costly and I'm suggesting an alternative in the comments. Please have a look if that makes sense.


.. code-block:: fortran

!$loki split read-write
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd argue the "command" should maybe be one word, as this makes for a more intuitive grammar for the directives. I.e., in this case maybe use split-read-write?

for region in FindNodes(ir.PragmaRegion).visit(routine.body):
if is_loki_pragma(region.pragma, starts_with='split read-write'):

# find assignments inside pragma region
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel like this requires quite a lot of IR traversals for something that should be doable in a single pass. For small regions that shouldn't make a big difference, but if there are many then it might pile up. I'm counting at least 6 passes over the region body (in a quick eyeball-estimate).

How about a single tree walk that in-place updates the assignments that need splitting, and collects all assignments at the end?

Drafting here a bit of pseudo code to illustrate what I mean:

class SplitReadWriteWalk(Transformer):

    def __init__(self, **kwargs):
        self.write_assignments = []
        kwargs['inplace'] = True
        super().__init__(**kwargs)

    def visit_Loop(self, o, **kwargs):
        # This collects the relevant loop dimensions in kwargs['dim_nest'] during the
        # tree walk
        dim = [d for d in self.dimensions if d.index == o.variable]
        dim_nest = kwargs.pop('dim_nest', [])
        return super().visit_Node(o, dim_nest=dim_nest + dim, **kwargs)

    def visit_Assignment(self, o, **kwargs):
        if isinstance(o.lhs, Array) and a.lhs.name in o.rhs:
            parent_loop_dims = kwargs.pop('dim_nest', [])
            # <do the logic for shape etc here>
            tmp_var = <...>
            self.write_assignments += [ir.Assignment(lhs=o.lhs, rhs=tmp_var)
            o._update(lhs=tmp_var)
        return o

for region in ...:
    transformer = SplitReadWriteWalk()
    region.body = transformer.visit(region.body)
    if transformer.write_assignments:
        region.body.append(transformer.write_assignments)

Not sure if I'm overlooking something, but it feels to me like that should be sufficient and much more efficient.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, tree-walk is definitely the way to do it! The use of FindScopes to discover a parent loop nest felt a bit clunky to me too but the idea of implementing this as a visitor never occured to me. Many thanks for the suggestion 🙏

@awnawab awnawab force-pushed the naan-split-read-write branch 2 times, most recently from 57068e3 to f4f1496 Compare August 7, 2024 15:30
@awnawab
Copy link
Contributor Author

awnawab commented Aug 7, 2024

Many thanks again @reuterbal for the great tip! I've now implemented the SplitReadWriteWalk that you suggested. I've retained a second pass to insert the newly split write assignments. These have to be nested correctly within the control flow, and the simplest way I could think of achieving this was to copy the entire PragmaRegion and amend the leaf nodes as necessary.

@mlange05 could you also please review the PR? Thanks!

Copy link
Collaborator

@mlange05 mlange05 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great utility and very neatly done with the single-pass Transformer. I've only left a few small nit-pick comments about docstrings (and an "out of interest" question), but otherwise this is ready, I think. Well done, and many thanks to both, @awnawab and @reuterbal

"""
A :any:`Transformer` class to traverse the IR, in-place replace read-write
assignments with reads, and build a transformer map for the corresponding writes.
"""
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor nitpick: Could we please add the constructor argument to the docstring? This ones takes a lot of arguments, and their meaning is not immediately obvious without context.

item_filter = (ProcedureItem,)

def __init__(self, dimensions):
self.dimensions = as_tuple(dimensions)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same here - could we please add the dimension arg to the docstring? It's not immediately obvious why this is needed, and what dimension is required here from users.

@@ -186,7 +186,9 @@ def convert(
mode = mode.replace('-', '_') # Sanitize mode string

# Write out all modified source files into the build directory
file_write_trafo = FileWriteTransformation(builddir=build, mode=mode)
file_write_trafo = scheduler.config.transformations.get('FileWriteTransformation', None)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[no action] Out of interest, what corner case makes this necessary? Or is this just a piggy-backed clean-up step?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For ecWAM I need to enable the include_module_var_imports option in the FileWriteTransformation. I'm assuming global_var_offload will eventually be removed as an argument from convert as we further simplify loki_transform.py, so I thought I might as well make the FileWriteTransformation also configurable.

@awnawab
Copy link
Contributor Author

awnawab commented Aug 13, 2024

Many thanks @mlange05 for the review and the feedback 😄 The constructor arguments were indeed unclear, and I've now added documentation for them.

@mlange05
Copy link
Collaborator

Thanks, this looks great now. Going in...

Copy link
Collaborator

@mlange05 mlange05 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking good now, going in.

@mlange05 mlange05 added the ready for merge This PR has been approved and is ready to be merged label Aug 13, 2024
@mlange05 mlange05 merged commit cb2d5d5 into main Aug 14, 2024
13 checks passed
@mlange05 mlange05 deleted the naan-split-read-write branch August 14, 2024 08:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready for merge This PR has been approved and is ready to be merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants