Optimize SplineEvaluator #499

Closed
blegouix opened this issue Jun 24, 2024 · 5 comments

Comments

@blegouix
Collaborator

blegouix commented Jun 24, 2024

At the line

KOKKOS_CLASS_LAMBDA(typename batch_domain_type::discrete_element_type const j) {

replacing:

ddc::parallel_for_each(
        exec_space(),
        batch_domain,
        KOKKOS_CLASS_LAMBDA(typename batch_domain_type::discrete_element_type const j) {
            const auto spline_eval_1D = spline_eval[j];
            const auto coords_eval_1D = coords_eval[j];
            const auto spline_coef_1D = spline_coef[j];
            for (auto const i : evaluation_domain) {
                spline_eval_1D(i) = eval(coords_eval_1D(i), spline_coef_1D);
            }
        });

With:

ddc::parallel_for_each(
        exec_space(),
        spline_eval.domain(),
        KOKKOS_CLASS_LAMBDA(typename batched_evaluation_domain_type::discrete_element_type const e) {
            typename evaluation_domain_type::discrete_element_type const i(e);
            typename batch_domain_type::discrete_element_type const j(e);

            spline_eval(i, j) = eval(coords_eval(i, j), spline_coef[j]);
        });

This reduces the run time from 1.59us to 0.56us (for a benchmark with nx=1000, ny=100000).
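For readers without the ddc sources at hand, the two loop structures can be sketched in plain serial C++. This is only an illustration under assumptions: `eval` below is a hypothetical polynomial stand-in for the spline evaluation, the arrays are assumed to be stored with the batch index `j` as the slow index, and the outer loops stand in for `ddc::parallel_for_each`. The sketch shows that both variants compute the same values and differ only in how much work each parallel work item receives; it says nothing about which one is faster on a given GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the evaluator's eval(): evaluates the polynomial
// coef[0] + coef[1]*x + ... using Horner's scheme. The real SplineEvaluator
// evaluates B-splines; this placeholder only gives the loops something to call.
double eval(double x, double const* coef, std::size_t n_coef)
{
    double result = 0.0;
    for (std::size_t k = n_coef; k-- > 0;) {
        result = result * x + coef[k];
    }
    return result;
}

// Variant A: one work item per batch index j, serial inner loop over i.
// Assumed layout: coords[j * nx + i], coef[j * n_coef + k].
std::vector<double> evaluate_per_batch(
        std::vector<double> const& coords,
        std::vector<double> const& coef,
        std::size_t nx,
        std::size_t ny,
        std::size_t n_coef)
{
    assert(coords.size() == nx * ny && coef.size() == n_coef * ny);
    std::vector<double> out(nx * ny);
    for (std::size_t j = 0; j < ny; ++j) { // stands in for the parallel loop
        double const* coef_1d = coef.data() + j * n_coef;
        for (std::size_t i = 0; i < nx; ++i) {
            out[j * nx + i] = eval(coords[j * nx + i], coef_1d, n_coef);
        }
    }
    return out;
}

// Variant B: one work item per (i, j) pair; the flat index e is split
// into its evaluation part i and its batch part j inside the loop body.
std::vector<double> evaluate_per_point(
        std::vector<double> const& coords,
        std::vector<double> const& coef,
        std::size_t nx,
        std::size_t ny,
        std::size_t n_coef)
{
    assert(coords.size() == nx * ny && coef.size() == n_coef * ny);
    std::vector<double> out(nx * ny);
    for (std::size_t e = 0; e < nx * ny; ++e) { // stands in for the parallel loop
        std::size_t const j = e / nx;
        std::size_t const i = e % nx;
        out[j * nx + i] = eval(coords[j * nx + i], coef.data() + j * n_coef, n_coef);
    }
    return out;
}
```

Variant A exposes `ny` independent work items, variant B exposes `nx * ny`, so the balance between them depends on the problem size and on the memory layout.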

The optimal solution may require hierarchical parallelism in ddc (#396), though, and possibly a transposition of spline_coef to make spline_coef[j] contiguous.
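As a rough idea of what such a transposition means, here is a minimal plain C++ sketch. It assumes, hypothetically, that the coefficients are stored with the batch index varying fastest (`coef[k * ny + j]`), so that the coefficients of a single batch element `j` are `ny` entries apart. The function name `transpose_coef` and the layout are assumptions made for the illustration, not the ddc implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Input layout (assumed):  coef[k * ny + j], coefficient index k slow, batch index j fast.
// Output layout:           coef_t[j * n_coef + k], so that the n_coef coefficients
//                          belonging to one batch element j sit next to each other.
std::vector<double> transpose_coef(
        std::vector<double> const& coef,
        std::size_t n_coef,
        std::size_t ny)
{
    assert(coef.size() == n_coef * ny);
    std::vector<double> coef_t(coef.size());
    for (std::size_t k = 0; k < n_coef; ++k) {
        for (std::size_t j = 0; j < ny; ++j) {
            coef_t[j * n_coef + k] = coef[k * ny + j];
        }
    }
    return coef_t;
}
```

After the transposition, the slice for a fixed `j` starts at `coef_t.data() + j * n_coef` and is contiguous. Whether this pays off depends on how threads are mapped to `i` and `j`, and the transposition itself has a cost.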

@tpadioleau should I address this?

@tpadioleau
Member

Do you have a profile of a spline interpolation on GPU?

@blegouix
Collaborator Author

I have this (for the current status):
image

(This shows 1.36us instead of 1.59us, but that is just fluctuation; the case is the same.)

I can provide .ncu-rep files if you want, or any other information you need.

@tpadioleau
Member

I would be interested to see a time trace like the one provided by nsys or by the Kokkos simple kernel timer.

I am skeptical about the figures. I think they are on the order of magnitude of the latency of launching a GPU kernel, see https://developer.nvidia.com/blog/understanding-the-visualization-of-overhead-and-latency-in-nsight-systems/#nsight_systems_overhead. Did you mean milliseconds?

@blegouix
Collaborator Author

Yeah, that's milliseconds, sorry.

@blegouix
Collaborator Author

Anyway, there is in fact another problem with this table: it is for ny=10000, not 100000. And I realize the patch I proposed is actually slower with ny=100000, so I am closing the issue.
