Add local memory CUDA support to Kernel #126
Conversation
Hello @adityapb! Thanks for updating the PR.
Comment last updated on October 24, 2018 at 10:47 UTC
Looks like we should discuss this a bit. In the meantime, can you address the comments I have made?
pysph/cpy/translator.py
Outdated
@@ -255,14 +259,17 @@ def get_struct_from_instance(self, obj):
        helper = CStructHelper(obj)
        return helper.get_code() + '\n'

-   def parse(self, obj):
+   def parse(self, obj, local_decl=None):
I don't like this too much and I am not sure why you need to do this. Can you explain why the signature of the object method does not have the necessary information?
Also, if this is going to be added, we should have a proper docstring, as the context here is totally non-obvious. For the case of parse itself, obj is somewhat obvious, but local_decl is not, as it seems very CUDA specific.
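If the parameter stays, a docstring along these lines would make the context clear. This is a hypothetical sketch, not the PR's actual code; the class name and the body are stand-ins, only the signature follows the diff:

```python
class CUDAConverterSketch:
    """Hypothetical stand-in class; only the parse signature follows the PR."""

    def parse(self, obj, local_decl=None):
        """Parse the given object and return the translated C source.

        Parameters
        ----------
        obj : object
            Function, method, or class instance to translate.
        local_decl : dict, optional
            Mapping of argument name to C dtype for arguments that should
            be placed in CUDA local (shared) memory. Only meaningful for
            the CUDA backend; ignored elsewhere.
        """
        self._local_decl = local_decl
        return ''  # a real implementation would emit the translated code
```

Documenting the CUDA-only meaning up front would address the "non-obvious context" concern without changing the signature.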
pysph/cpy/translator.py
Outdated
        else:
            raise TypeError('Unsupported type to wrap: %s' % obj_type)
        self._local_decl = None
Shouldn't this be set even when the type error is raised?
pysph/cpy/translator.py
Outdated
@@ -532,6 +539,26 @@ def visit_FunctionDef(self, node):
        args = self._get_function_args(node)
        body = '\n'.join(self._indent_block(self.visit(item))
                         for item in self._remove_docstring(node.body))
        local_decl = ''
        if self._local_decl:
            decls = ['extern LOCAL_MEM float shared_buff[];']
This is very strange, why is it declared as a float?
I guess I am missing something with the way you define and use local memory in CUDA but I think this needs to be properly explained.
pysph/cpy/translator.py
Outdated
        if self._local_decl:
            decls = ['extern LOCAL_MEM float shared_buff[];']
            for arg, dtype in self._local_decl.items():
                if len(decls) == 1:
But len(decls) will always be 1 according to the above!
pysph/cpy/translator.py
Outdated
                        '&%(prev_arg)s[size_%(prev_arg)s];')
                    local_decl = local_decl % {'dtype': dtype, 'arg': arg,
                                               'prev_arg': prev_arg}
                    decls.append(local_decl)
This whole block of code is not at all intuitive, and it is very difficult to see what you are trying to do. I think it really needs to be refactored.
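One way the block could be refactored into a single, self-contained helper. This is a sketch, not the PR's code; it assumes, as the diff does, that a `size_<arg>` variable for each local array exists in the generated kernel:

```python
def get_local_declarations(local_decl):
    """Build CUDA shared-memory declaration strings from an {arg: dtype} map.

    CUDA permits only one dynamically sized ``extern`` shared array per
    kernel, so a single float buffer is declared and every argument is
    carved out of it by casting a pointer at a running offset.
    """
    decls = ['extern LOCAL_MEM float shared_buff[];']
    prev_arg = None
    for arg, dtype in local_decl.items():
        if prev_arg is None:
            # The first array starts at the beginning of the shared buffer.
            decls.append('%s* %s = (%s*) shared_buff;' % (dtype, arg, dtype))
        else:
            # Each later array starts where the previous one ends; size_<arg>
            # is assumed to be defined in the generated kernel.
            decls.append('%s* %s = (%s*) &%s[size_%s];' %
                         (dtype, arg, dtype, prev_arg, prev_arg))
        prev_arg = arg
    return decls
```

Pulling the logic into one pure function with a docstring makes the intent (one buffer, back-to-back casts) readable at a glance and trivially unit-testable.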
pysph/cpy/translator.py
Outdated
@@ -693,3 +694,141 @@ def __init__(self, detect_type=ocl_detect_type, known_types=None):

    def _get_self_type(self):
        return KnownType('GLOBAL_MEM %s*' % self._class_name)


class CUDAConverter(CConverter):
Please try to reuse as much code as possible; derive this from OpenCLConverter so we don't have to update things in two places.
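The suggested reuse could look roughly like this. A minimal sketch with stubbed-out classes (KnownType and the `_get_local_arg` hook here are simplified stand-ins, not the real pysph code), assuming OpenCLConverter carries all the shared behaviour:

```python
class KnownType:
    """Stub standing in for pysph.cpy's KnownType."""
    def __init__(self, type_str):
        self.type = type_str


class OpenCLConverter:
    """Stub of the existing converter; holds the behaviour common to both."""
    _class_name = 'Particle'

    def _get_self_type(self):
        # Shared by OpenCL and CUDA: both use GLOBAL_MEM for self.
        return KnownType('GLOBAL_MEM %s*' % self._class_name)


class CUDAConverter(OpenCLConverter):
    """Derive from OpenCLConverter and override only the CUDA specifics."""

    def _get_local_arg(self, name, dtype):
        # Hypothetical CUDA-only hook: pass the local array's size as an
        # extra integer kernel argument instead of a LOCAL_MEM pointer.
        return 'size_%s' % name, 'int'
```

With this shape, `_get_self_type` and everything else common lives in one place, and a fix to the OpenCL path automatically reaches the CUDA path.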
pysph/cpy/translator.py
Outdated
    def _get_self_type(self):
        return KnownType('GLOBAL_MEM %s*' % self._class_name)

    def _get_function_args(self, node):
Again, please try to refactor to minimize cut/pasting.
I like that this is cleaner, but I do want no cut/pasting of code, as it makes it very hard to maintain. Please make the suggested changes and, if possible, squash your unnecessary commits.
pysph/cpy/translator.py
Outdated
        return code

    def visit_FunctionDef(self, node):
        assert node.args.vararg is None, \
Again, a lot of similar code that is cut-and-pasted. Any cut/pasting makes it very hard to update things later. I don't mind adding an empty method to the base class that is only used for CUDA, but you need to think about reducing cut/pasting and try to avoid sending PRs with any of it. :) Also, please measure coverage for just the translator alone and ensure that the translator has almost 100% coverage.
Looks much better. I am sorry to ask for more changes, but the alignment can be a cause of subtle issues and errors, so I think it should be handled cleanly. You do have the local size information, so would it be possible to compute a 4-byte-aligned quantity and use that for the size? At the very least this should be clearly documented.
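The suggested rounding is a one-liner. A sketch, assuming sizes are given in bytes and the host-side code (which already has the local size information) applies it before launching the kernel:

```python
def aligned_size(nbytes, alignment=4):
    """Round nbytes up to the next multiple of alignment.

    Used to pad each local array's byte size so that the next array
    carved out of the shared buffer starts on an aligned boundary.
    """
    return (nbytes + alignment - 1) // alignment * alignment
```

For example, a 10-byte array would reserve 12 bytes, so whatever is laid out after it in the shared buffer starts 4-byte aligned.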
                    prev_arg = arg
                else:
                    local_decl = ('%(dtype)s* %(arg)s = (%(dtype)s*) '
                                  '&%(prev_arg)s[size_%(prev_arg)s];')
There is one big problem here if the sizes are not aligned to 4 bytes, as discussed in the CUDA programming guide. So this should perhaps be carefully documented, or a warning given if there is an alignment problem.
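A defensive check along the suggested lines could warn at code-generation time. A sketch: the function name and the (name, nbytes) layout description are hypothetical, and it assumes the byte size of each local array is known up front:

```python
import warnings


def check_local_alignment(arrays, alignment=4):
    """Warn for every local array that would start at a misaligned offset.

    ``arrays`` is a list of (name, nbytes) pairs laid out back to back in
    the shared buffer, in declaration order.
    """
    offset = 0
    for name, nbytes in arrays:
        if offset % alignment:
            warnings.warn('local array %r starts at misaligned offset %d; '
                          'accesses may fail or be slow on CUDA'
                          % (name, offset))
        offset += nbytes
```

Emitting the warning while translating, rather than failing at kernel launch, gives the user a chance to reorder or pad the arrays.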
I've filed issue #130 for you to address. I will merge this for now. Thanks!
No description provided.