[SME] Utilize predication in fp32 matmul and conv2d schedules #17054
Prior to this commit, the matmul and conv2d schedules required padding of the inputs to some multiple of vscale and a final "unpadding" stage.
Instead, we can leverage predicated operations to avoid the requirement for padding. Both the transpose interleave and outer product fp32 intrinsics are updated to use predication. The `get_active_lane_mask` intrinsic is utilized to generate a variably sized mask of active lanes depending on the global position the tensor intrinsic is operating on.

For now this relies on using `offset_of` and `stride` information from the tensor we're predicating an access on. Likely we will want to build on this in the future with a more intuitive API for determining the current tile location.

Support for batched conv2d was removed since it causes numerical issues, which are suspected to be due to how the current tile is determined (see the paragraph above).
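
To illustrate the idea behind the predication, here is a minimal pure-Python sketch. It is not the TVM schedule or the SME intrinsics themselves. The names `active_lane_mask`, `tile_origin` and `predicated_matmul` are hypothetical and exist only for this example. The mask follows the usual active-lane-mask semantics (lane `i` is active when `base + i < limit`), and the `divmod`-based recovery of the tile position from an element offset and a row stride is an assumption about a dense row-major layout, not a description of what this PR does internally.

```python
def active_lane_mask(base, limit, lanes):
    """Lane i is active when base + i is still inside the tensor extent."""
    return [base + i < limit for i in range(lanes)]


def tile_origin(elem_offset, row_stride):
    """Hypothetical helper: recover (row, col) of a tile's first element,
    assuming a dense row-major buffer with the given row stride."""
    return divmod(elem_offset, row_stride)


def predicated_matmul(a, b, tile):
    """Tiled matmul whose edge tiles are masked instead of padded."""
    m, k, n = len(a), len(b), len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for row0 in range(0, m, tile):
        row_mask = active_lane_mask(row0, m, tile)
        for col0 in range(0, n, tile):
            col_mask = active_lane_mask(col0, n, tile)
            for kk in range(k):
                # Outer product of a column slice of A and a row slice of B,
                # accumulated only where both predicates are true.
                for i, row_on in enumerate(row_mask):
                    if not row_on:
                        continue
                    for j, col_on in enumerate(col_mask):
                        if col_on:
                            c[row0 + i][col0 + j] += a[row0 + i][kk] * b[kk][col0 + j]
    return c
```

With a tile size of 4 and a 3x5 output, the edge tiles simply have fewer active lanes, so neither the inputs nor the output need padding or a final "unpadding" stage.
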
Note: this should not be merged until after #17048.

cc @ekalda @Anndrey24