
[Feature] Parallelize kwavearray functions #523

Open
faberno opened this issue Dec 1, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@faberno (Contributor) commented Dec 1, 2024

Is your feature request related to a problem? Please describe.
I was wondering if it's possible to parallelize get_array_binary_mask and combine_sensor_data in the kWaveArray class.
For simulations with many array elements these functions are a major bottleneck.

Describe the solution you'd like
Currently these functions contain a loop that iterates over every element. The iterations are independent of each other, so in theory they could be distributed across multiple workers. This would probably require some refactoring to avoid copying the kWaveArray and kGrid objects for every worker.
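The per-element loop described above is embarrassingly parallel: each element's mask can be computed independently and the partial masks OR-reduced at the end. A minimal sketch, using stand-in names (compute_element_mask is hypothetical, not the real kWaveArray API) and a thread pool for simplicity; a real speedup would need process-based parallelism or GIL-releasing kernels:

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import numpy as np

def compute_element_mask(element_index, grid_shape):
    """Stand-in for per-element off-grid weight computation.

    Returns a boolean mask of the grid points touched by one element
    (here: a few deterministic pseudo-random diagonal points).
    """
    rng = np.random.default_rng(element_index)  # deterministic per element
    mask = np.zeros(grid_shape, dtype=bool)
    idx = rng.integers(0, grid_shape[0], size=5)
    mask[idx, idx] = True
    return mask

def array_binary_mask_parallel(n_elements, grid_shape, max_workers=4):
    # Compute every element's mask concurrently, then OR-reduce the results.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        masks = pool.map(lambda i: compute_element_mask(i, grid_shape),
                         range(n_elements))
        return reduce(np.bitwise_or, masks, np.zeros(grid_shape, dtype=bool))

mask = array_binary_mask_parallel(8, (64, 64))
```

Because the reduction (bitwise OR) is associative and commutative, the order in which workers finish does not affect the final mask.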

@faberno faberno added the enhancement New feature or request label Dec 1, 2024
@waltsims (Owner) commented Dec 1, 2024 via email

@djps (Collaborator) commented Dec 3, 2024

What would be the best approach? There are a few things that could accelerate the code: refactoring with list comprehensions to remove loops; joblib; JIT compilation with Numba; CuPy.
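Of the options above, plain vectorization is often the cheapest win. A generic illustration (not k-Wave code) of replacing a per-element Python loop with a single NumPy expression:

```python
import numpy as np

def weights_loop(points, sigma=1.0):
    # Slow: one Python-level iteration per point.
    out = np.empty(len(points))
    for i, p in enumerate(points):
        out[i] = np.exp(-p**2 / (2 * sigma**2))
    return out

def weights_vectorized(points, sigma=1.0):
    # Fast: one array expression over all points at once.
    points = np.asarray(points)
    return np.exp(-points**2 / (2 * sigma**2))

pts = np.linspace(-3.0, 3.0, 1000)
assert np.allclose(weights_loop(pts), weights_vectorized(pts))
```

Numba, joblib, or CuPy could then be layered on top if the vectorized form is still too slow, but they add dependencies that vectorization alone avoids.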

@faberno (Contributor, Author) commented Dec 3, 2024

I just looked into get_array_binary_mask, and with only a bit of refactoring I could reduce the runtime for 10,000 elements from 190 s to 8 s.

This is the original loop:

# OR each element's off-grid weights into the accumulated binary mask
for ind in range(self.number_elements):
    grid_weights = self.get_off_grid_points(kgrid, ind, True)
    mask = np.bitwise_or(np.squeeze(mask), grid_weights)

In self.get_off_grid_points we first calculate the integration points (fast) and then the grid weights (slow).
So I first computed the integration points for all elements (~1 second), stacked them into one array, and passed them to off_grid_points(...) all at once. This saves many unnecessary calls to off_grid_points(...), and we no longer need to OR the masks together.
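The batching idea described above can be sketched as follows. This is a hedged, self-contained illustration: integration_points and off_grid_weights are stand-ins for the real k-Wave functions, chosen so that one call over all stacked points gives the same mask as per-element calls ORed together:

```python
import numpy as np

def integration_points(element_index, n_points=50):
    # Stand-in: cheap per-element geometry (points along a line segment).
    t = np.linspace(0.0, 1.0, n_points)
    return np.column_stack([t + element_index, t])

def off_grid_weights(points, grid_shape):
    # Stand-in for the expensive weight computation, done once for ALL points.
    mask = np.zeros(grid_shape, dtype=bool)
    ij = np.clip(points.astype(int), 0, np.array(grid_shape) - 1)
    mask[ij[:, 0], ij[:, 1]] = True
    return mask

def array_binary_mask_batched(n_elements, grid_shape):
    # 1) Cheap loop: collect every element's integration points.
    all_points = np.vstack([integration_points(i) for i in range(n_elements)])
    # 2) One expensive call instead of n_elements calls, and no per-element OR.
    return off_grid_weights(all_points, grid_shape)

mask = array_binary_mask_batched(10, (32, 32))
```

The win comes from moving the fixed per-call overhead of the expensive step outside the loop; the result is identical to the original per-element version.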

@faberno (Contributor, Author) commented Dec 5, 2024

I could also reduce combine_sensor_data from ~600 s to 37 s (for 10,000 elements), without any major changes or parallelization.
Should I open a draft PR with these changes, where we can discuss them and think about further optimizations?

@waltsims (Owner) commented Dec 6, 2024

That would be great. Thanks @faberno!

@waltsims (Owner) commented:

@faberno should we try to get these updates into v0.4.1 in the new year?

@faberno (Contributor, Author) commented Dec 24, 2024

That would be great. I'll open the promised PR tomorrow.
