Implement SplitK and StreamK algorithm for Intel PVC #132
base: sycl-develop
…tlass-fork into intel_pvc_streamk
* Need to fix splitK for batch > 1
Could we also add the collective builder struct specialization for StreamK/SplitK as part of this PR?
include/cutlass/gemm/kernel/intel_pvc_persistent_tile_scheduler_params_streamk.hpp
include/cutlass/gemm/kernel/intel_pvc_tile_scheduler_streamk.hpp
…tlass-fork into intel_pvc_streamk
* Instantiate new accumulator registers per iteration
In light of #138, could you rename the newly added
BlockStripedReduceT::store(reduction_workspace_array, *accumulator_array, barrier_group_thread_idx);
}
else {
  // Wait until the preceding split added its accumulators
Why does this need to wait on the preceding split even in the non-deterministic case?
BarrierManager::wait_eq(barrier_idx, lock_workspace, barrier_group_thread_idx, lock_idx, work_tile_info.K_idx);

// Perform reduction in workspace
BlockStripedReduceT::reduce(reduction_workspace_array, *accumulator_array, barrier_group_thread_idx);
For the deterministic case, why does this need to be reduce (atomic_add)? Isn't load_add sufficient? My understanding is that in the deterministic case only a single K_idx adds at a time, so an atomic shouldn't be needed.
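To make the premise of this question concrete, here is a minimal single-threaded model of a reduction that is serialized by a turn counter. It is only a sketch: `serialized_reduce`, `partials`, and `arrival_order` are hypothetical names for illustration and are not the PR's `BarrierManager` or `BlockStripedReduceT`. It assumes `arrival_order` is a permutation of the split indices, and it models `wait_eq` as "a split may add only when the counter equals its K index".

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model (not the PR's code): `lock` counts how many splits have
// already added their partial sum into the shared workspace.
// Precondition: arrival_order is a permutation of 0 .. partials.size() - 1.
inline float serialized_reduce(const std::vector<float>& partials,
                               const std::vector<std::size_t>& arrival_order) {
  assert(partials.size() == arrival_order.size());
  float workspace = 0.0f;
  std::size_t lock = 0;
  // Keep polling the splits in their arrival order until every one has added.
  while (lock < partials.size()) {
    for (std::size_t k_idx : arrival_order) {
      // Models wait_eq: a split proceeds only when it is exactly its turn.
      if (lock != k_idx) continue;
      workspace += partials[k_idx];  // plain load-add; one writer at a time
      ++lock;                        // hand the turn to the next split
    }
  }
  return workspace;
}
```

In this model the additions always happen in ascending K index order, whatever order the splits arrive in, so a plain add produces the same floating-point result every time. Whether that also holds for the kernel in this PR depends on how the barrier and the workspace are actually used there.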
include/cutlass/gemm/kernel/intel_pvc_tile_scheduler_streamk.hpp
…tlass-fork into intel_pvc_streamk
Done.
* Removed l2 workspace alignment
This PR adds the required code for implementing the SplitK and StreamK work distribution algorithms for Intel PVC.
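For readers unfamiliar with the two schemes, the following is a rough sketch of how each one partitions work. It is an illustration under simplifying assumptions, not the scheduler added in this PR: `KRange`, `IterRange`, `splitk_range`, and `streamk_range` are hypothetical names, and a real Stream-K scheduler typically layers further logic on top of this basic partitioning. Split-K divides the K iterations of each output tile among a fixed number of splits, while the basic Stream-K idea divides the linearized (tile, K iteration) space evenly among work-groups, so one work-group may cross tile boundaries.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical helper types for illustration only.
struct KRange { int begin; int end; };            // half-open [begin, end)
struct IterRange { int64_t begin; int64_t end; }; // half-open [begin, end)

// Split-K: K iterations of ONE output tile are divided among `splits` units.
// The first (k_iters % splits) units receive one extra iteration.
inline KRange splitk_range(int k_iters, int splits, int split_idx) {
  assert(splits > 0 && split_idx >= 0 && split_idx < splits);
  int base = k_iters / splits;
  int rem = k_iters % splits;
  int begin = split_idx * base + std::min(split_idx, rem);
  int len = base + (split_idx < rem ? 1 : 0);
  return {begin, begin + len};
}

// Stream-K (basic form): ALL tiles' K iterations are linearized as
// iter = tile * k_iters + k, then divided evenly among `groups` work-groups.
inline IterRange streamk_range(int64_t tiles, int64_t k_iters,
                               int groups, int group_idx) {
  assert(groups > 0 && group_idx >= 0 && group_idx < groups);
  int64_t total = tiles * k_iters;
  int64_t base = total / groups;
  int64_t rem = total % groups;
  int64_t begin = group_idx * base + std::min<int64_t>(group_idx, rem);
  int64_t len = base + (group_idx < rem ? 1 : 0);
  return {begin, begin + len};
}
```

With 3 tiles of 4 K iterations each and 5 work-groups, `streamk_range(3, 4, 5, 1)` covers linear iterations 3 to 5, that is, the last iteration of tile 0 and the first two of tile 1. Partial results for a tile shared this way have to be combined afterwards, which is the role of the workspace reduction discussed in the review comments.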