Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Single-pass scan kernel template #1320

Merged
merged 78 commits into from
Apr 24, 2024
Merged

Single-pass scan kernel template #1320

merged 78 commits into from
Apr 24, 2024

Conversation

adamfidel
Copy link
Contributor

@adamfidel adamfidel commented Dec 18, 2023

This PR provides an implementation of the single-pass scan algorithm as a kernel template.

@timmiesmith timmiesmith modified the milestone: 2022.5.0 Jan 8, 2024
@adamfidel adamfidel modified the milestones: 2022.4.0, 2022.5.0 Jan 8, 2024
Copy link
Contributor

@dmitriy-sobolev dmitriy-sobolev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've left some minor comments.
Meanwhile, I'm looking at the kernel internals and will be back with the comments soon if any.

test/kt/single_pass_scan.cpp Outdated Show resolved Hide resolved
test/kt/single_pass_scan.cpp Outdated Show resolved Hide resolved
include/oneapi/dpl/experimental/kt/single_pass_scan.h Outdated Show resolved Hide resolved
Comment on lines +167 to +171
// Group reduction produces wrong results with multiplication of 64-bit for certain driver versions
// TODO: When a driver fix is provided to resolve this issue, consider altering this macro or checking the driver version at runtime
// of the underlying sycl::device to determine whether to include or exclude 64-bit type tests.
#define _PSTL_GROUP_REDUCTION_MULT_INT64_BROKEN 1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this for the same problem as the macro right above (which is also documented as a known issue), or for a different one?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought it was a different problem, but looking at it more, I think it's the same issue that the macros _PSTL_ICPX_TEST_RED_BY_SEG_BROKEN_64BIT_TYPES and ONEDPL_WORKAROUND_FOR_IGPU_64BIT_REDUCTION are meant for.

I know the internal ticket number for this issue, but it's hard to correlate that this is the same issue that these other macros are for because we do not (rightfully so) associate our internal issue tracking with these comments.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After an offline discussion with @mmichel11, we found that _PSTL_ICPX_TEST_RED_BY_SEG_BROKEN_64BIT_TYPES is for a separate issue.

The macro ONEDPL_WORKAROUND_FOR_IGPU_64BIT_REDUCTION seems to be from the same root issue, but I feel that we should still use a separate macro for the following reasons:

  1. The ONEDPL_WORKAROUND_FOR_IGPU_64BIT_REDUCTION macro is expected to be defined externally through CMake rather than defined in test_config.h with the other macros
  2. The macro's name suggests that when defined, it will apply some workaround for 64-bit group reductions with multiplies, whereas in this case, we want to explictly disable the tests cases with 64-bit types and the std::multiplies binary operator

Copy link
Contributor

@dmitriy-sobolev dmitriy-sobolev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've checked everything besides atomics, especially the specifics of the used memory ordering. Everything what I've checked looks good to me.

Copy link
Contributor

@danhoeflinger danhoeflinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM. I checked it far more thoroughly earlier on, but I have been reviewing delta commits and following the comments, and they look good to me.

My outstanding issues have been:

  1. init value, which we have decided to address in a later PR
  2. Memory footprint in SLM of joint algorithms, and how to communicate memory footprint requirements to the end user to provide them the proper guidance for selecting kernel params. I think this remains somewhat unresolved (as the sycl feature is a black box), but I think we can make a GH issue, and improve on this in the future (if we feel unsatisfied with our guidance in the docs with the current code). I don't think it is worth holding back this PR for this.

It would probably be good to get another approval from one of the others who have been working on this.

Copy link
Contributor

@dmitriy-sobolev dmitriy-sobolev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. I've reviewed the atomic operations and found no race conditions.

One more suggestions to consider doing in a separate PR: use regular (non-atomic) variables for values.

@adamfidel adamfidel merged commit e625bf0 into main Apr 24, 2024
20 checks passed
@adamfidel adamfidel deleted the dev/adamfidel/scan_kt branch April 24, 2024 13:06
@adamfidel
Copy link
Contributor Author

Thanks all for the reviews!

@danhoeflinger and @dmitriy-sobolev, I will create GH issues to address the next steps that you have mentioned in your approvals.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants