Submitted in tensorflow/tensorflow@9f89ac6
This outlines a new proposed approach for the transposition emitter in XLA:GPU.
Motivation
Transposes are known to be a “weak point” of the XLA:GPU compiler. Many fusions containing transposes run at 1-10% of the throughput capacity, often becoming the “long pole” of the HLO module.
Moreover, the work on layout normalization is blocked by the current transposition emitter, which inherently supports only physical, but not logical, transpositions.
Current State
Currently, XLA relies on a fragile heuristic that only applies the dedicated transposition emitter to fusions satisfying two properties, (i) and (ii).
In practice, property (ii) is ~never satisfied by non-trivial benchmarks, and turning off the transposition emitter entirely has ~no effect on XLA benchmarks. Consequently, fusions containing transposes run very poorly relative to the roofline throughput capacity.
Previous Approaches
One approach heuristically switched the iteration order of the elementwise emitter from logical to physical, which can speed up the generated code when the input and output have the same physical layout but different logical ones. That sped up the motivating example, but had detrimental effects on other nets and added complexity to the emitter. The approach was removed in tensorflow/tensorflow@893e64a.
A different approach only allowed the fusion of transposes that satisfy properties (i) and (ii), and cut fusions otherwise. That showed large performance gains on many nets, but slowed down others. Considering that (ii) is a very narrow property (broken by any bitcast, broadcast, or reshape), it is not surprising that the change negatively affected legitimate fusions.
Proposed Design
We can handle transpositions effectively the same way we handle reductions: form a kInput fusion when we see a transposing operation, do not allow other transposing operations inside it, and keep track of the shared memory budget. We additionally add a new pass to "sink" the kCopy operation as far down as possible to form larger fusions. During code generation, we use the tiled emitter for transpositions: accesses to parameters are replaced with accesses to tiles, which perform the 0-2-1 transposition into shared memory first.
Preliminary Results
Many benchmarks show a ~15% gain from the change; investigation is still ongoing.