Use the fact that the plan splits each matrix into regular squares to replace the stored (memorized) plan with inline computation.
The loops that compute the big things (the number of gemms in a phase and the list of gemms in a phase) are dim^3 in the worst case; to be exact, their cost is the number of gemms in each phase.
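
To illustrate the idea, here is a minimal sketch of computing a phase's gemms on demand instead of storing them. It is not the code in this PR: the names (`gemms_in_phase`, `num_gemms_in_phase`, `block`) are hypothetical, and it assumes a dense problem where a phase is a cube of `block` x `block` x `block` tiles in the (m, n, k) iteration space, with phases numbered in row-major order.

```python
from itertools import product


def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def phase_ranges(phase, mt, nt, kt, block):
    """Tile ranges (m, n, k) covered by a phase.

    Assumption: phases are numbered row-major over a grid of
    ceil(mt/block) x ceil(nt/block) x ceil(kt/block) cubes.
    """
    phases_n = ceil_div(nt, block)
    phases_k = ceil_div(kt, block)
    pm, rest = divmod(phase, phases_n * phases_k)
    pn, pk = divmod(rest, phases_k)
    return tuple(
        range(p * block, min((p + 1) * block, extent))
        for p, extent in ((pm, mt), (pn, nt), (pk, kt))
    )


def gemms_in_phase(phase, mt, nt, kt, block):
    """Yield the (m, n, k) tile coordinates of each gemm in a phase.

    Nothing is stored: the coordinates are recomputed from the phase
    index every time they are needed.
    """
    rows, cols, inner = phase_ranges(phase, mt, nt, kt, block)
    yield from product(rows, cols, inner)


def num_gemms_in_phase(phase, mt, nt, kt, block):
    """Count the gemms in a phase by walking them (cost = number of gemms)."""
    return sum(1 for _ in gemms_in_phase(phase, mt, nt, kt, block))
```

With 3 x 2 x 2 tiles and `block = 2`, this sketch produces two phases of 8 and 4 gemms. The trade-off is the one described above: memory for the plan is exchanged for recomputation whose cost is proportional to the number of gemms in the phase.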
This is not ideal yet: previously, we built the communication plan at the same time as the computational plan. We can now skip building the computational plan, but we still need to build the communication plan. Each task needs to know exactly which other tasks it passes data to, and because tasks are named by plan index, the communication tasks need to remember which communication phase is connected to which computation phase.
The stored communication plan is much smaller, though, and its objects don't need to be sorted or ordered.
Hopefully this already reduces the time spent building the plan significantly. I am still working on removing the plan building altogether.