Submitted in tensorflow/tensorflow@9f89ac6
This outlines a new proposed approach for the transposition emitter in XLA:GPU.
Motivation
Transposes are known to be a “weak point” of the XLA:GPU compiler. Many fusions containing transposes run at 1-10% of the throughput capacity, often becoming the “long pole” of the HLO module.
Moreover, the work on layout normalization is blocked by the current transposition emitter, which inherently supports only physical, but not logical, transpositions.
Current State
Currently, XLA relies on a fragile heuristic that only applies the dedicated transposition emitter to fusions satisfying two properties, (i) and (ii).
In practice, property (ii) is ~never satisfied by non-trivial benchmarks, and turning off the transposition emitter entirely has ~no effect on XLA benchmarks. Consequently, fusions containing transposes run very poorly relative to the roofline throughput capacity.
Previous Approaches
One approach heuristically switched the iteration order of the elementwise emitter from logical to physical, which can speed up the generated code when the input and output have the same physical layout but different logical ones. That sped up the motivating example, but had detrimental effects on other nets and added complexity to the emitter. The approach was removed in tensorflow/tensorflow@893e64a.
A different approach only allowed the fusion of transposes that satisfy properties (i) and (ii), and cut fusions otherwise. That showed large performance gains on many nets, but slowed down others. Considering that (ii) is a very narrow property (broken by any bitcast, broadcast, or reshape), it is not surprising that the change negatively affected legitimate fusions.
Proposed Design
We can handle transpositions effectively the same way we handle reductions: form a kInput fusion when we see a transposing operation, do not allow other transposing operations inside it, and keep track of the shared memory budget. We additionally add a new pass to "sink" the kCopy operation as far down as possible to form larger fusions. During code generation, we use the tiled emitter for transpositions: accesses to parameters are replaced with accesses to tiles, which perform the 0-2-1 transposition into shared memory first.
Preliminary Results
Many benchmarks show a ~15% gain from the change; investigation is still ongoing.