Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement SplitK and StreamK algorithm for Intel PVC #132

Open
wants to merge 34 commits into
base: sycl-develop
Choose a base branch
from

Conversation

muhammad-tanvir-1211
Copy link
Collaborator

This PR adds the required code for implementing the SplitK and StreamK work distribution algorithms for Intel PVC.

@muhammad-tanvir-1211 muhammad-tanvir-1211 marked this pull request as draft September 2, 2024 16:25
@muhammad-tanvir-1211 muhammad-tanvir-1211 force-pushed the intel_pvc_streamk branch 2 times, most recently from cef3fec to e4fefc2 Compare September 4, 2024 11:59
@AD2605
Copy link
Collaborator

AD2605 commented Sep 6, 2024

Could we also add the collective builder struct specialization for streamK/ splitK as a part of this PR ?

@AD2605
Copy link
Collaborator

AD2605 commented Sep 11, 2024

In light of #138 , could you rename the newly added intel_pvc_* files to xe_* ?, it would be one less change later on

BlockStripedReduceT::store(reduction_workspace_array, *accumulator_array, barrier_group_thread_idx);
}
else {
// Wait until the preceding split added its accumulators
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does this need to wait on the preceding one even for non-deterministic?

BarrierManager::wait_eq(barrier_idx, lock_workspace, barrier_group_thread_idx, lock_idx, work_tile_info.K_idx);

// Perform reduction in workspace
BlockStripedReduceT::reduce(reduction_workspace_array, *accumulator_array, barrier_group_thread_idx);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for the deterministic case, why does that need to be reduce(/atomic_add) and load_add isn't sufficient? My understanding is that for deterministic case only a single K_idx is adding at the same time and if so atomic shouldn't be needed.

@muhammad-tanvir-1211
Copy link
Collaborator Author

In light of #138 , could you rename the newly added intel_pvc_* files to xe_* ?, it would be one less change later on

Done.

* Removed l2 workspace alignment
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants