Optimize SplineEvaluator #499

blegouix · 2024-06-24T09:32:55Z

At ddc/include/ddc/kernels/splines/spline_evaluator.hpp, line 163 in 9f9292d:

    KOKKOS_CLASS_LAMBDA(typename batch_domain_type::discrete_element_type const j) {

replacing:

    ddc::parallel_for_each(
            exec_space(),
            batch_domain,
            KOKKOS_CLASS_LAMBDA(typename batch_domain_type::discrete_element_type const j) {
                const auto spline_eval_1D = spline_eval[j];
                const auto coords_eval_1D = coords_eval[j];
                const auto spline_coef_1D = spline_coef[j];
                for (auto const i : evaluation_domain) {
                    spline_eval_1D(i) = eval(coords_eval_1D(i), spline_coef_1D);
                }
            });

with:

    ddc::parallel_for_each(
            exec_space(),
            spline_eval.domain(),
            KOKKOS_CLASS_LAMBDA(typename batched_evaluation_domain_type::discrete_element_type const e) {
                typename evaluation_domain_type::discrete_element_type const i(e);
                typename batch_domain_type::discrete_element_type const j(e);

                spline_eval(i, j) = eval(coords_eval(i, j), spline_coef[j]);
            });

makes the run time drop from 1.59us to 0.56us (for a benchmark with nx=1000, ny=100000).
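To make the two loop structures easier to compare, here is a minimal host-only sketch in plain C++. It uses neither ddc nor Kokkos: `eval_stub`, `evaluate_per_batch`, `evaluate_per_element` and the flat row-major layout are all hypothetical stand-ins chosen for illustration, and the polynomial evaluation is not the real spline `eval`. It only shows that iterating over batches with an inner loop and iterating over every (i, j) element compute the same values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for eval(): evaluates the polynomial
// coef[0] + coef[1]*x + ... with Horner's rule. Not the ddc spline evaluation.
double eval_stub(double x, const double* coef, std::size_t n_coef)
{
    double acc = 0.0;
    for (std::size_t k = n_coef; k-- > 0;) {
        acc = acc * x + coef[k];
    }
    return acc;
}

// Variant A: one work item per batch index j, serial inner loop over i.
// Assumed layout: coords[j * n_eval + i], coef[j * n_coef + k].
std::vector<double> evaluate_per_batch(
        const std::vector<double>& coords,
        const std::vector<double>& coef,
        std::size_t n_batch,
        std::size_t n_eval,
        std::size_t n_coef)
{
    assert(coords.size() == n_batch * n_eval);
    assert(coef.size() == n_batch * n_coef);
    std::vector<double> out(n_batch * n_eval);
    for (std::size_t j = 0; j < n_batch; ++j) { // would be the parallel loop
        const double* coef_1D = &coef[j * n_coef];
        for (std::size_t i = 0; i < n_eval; ++i) {
            out[j * n_eval + i] = eval_stub(coords[j * n_eval + i], coef_1D, n_coef);
        }
    }
    return out;
}

// Variant B: one work item per element e of the full (i, j) domain.
std::vector<double> evaluate_per_element(
        const std::vector<double>& coords,
        const std::vector<double>& coef,
        std::size_t n_batch,
        std::size_t n_eval,
        std::size_t n_coef)
{
    assert(coords.size() == n_batch * n_eval);
    assert(coef.size() == n_batch * n_coef);
    std::vector<double> out(n_batch * n_eval);
    for (std::size_t e = 0; e < n_batch * n_eval; ++e) { // would be the parallel loop
        const std::size_t i = e % n_eval; // evaluation index
        const std::size_t j = e / n_eval; // batch index
        out[e] = eval_stub(coords[j * n_eval + i], &coef[j * n_coef], n_coef);
    }
    return out;
}
```

Variant B exposes n_batch * n_eval independent work items instead of n_batch, but every work item looks up its coefficient slice again. Which variant is faster on a given device depends on the problem sizes and memory layout, so this sketch makes no performance claim.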

The optimal solution may require hierarchical parallelism in ddc (#396), though, and possibly a transposition of spline_coef to make spline_coef[j] contiguous.
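As an illustration of what such a transposition could mean, here is a small plain C++ sketch. It assumes, purely for the example, that the coefficients are stored coefficient-major (`in[k * n_batch + j]`), so that the slice for one batch index j is strided by n_batch. The actual layout of spline_coef in ddc is not stated in this issue, and `transpose_to_batch_major` is a hypothetical helper, not a ddc function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copies coefficients from an assumed coefficient-major
// layout, in[k * n_batch + j], to a batch-major layout, out[j * n_coef + k],
// in which the n_coef coefficients of one batch index j are contiguous.
std::vector<double> transpose_to_batch_major(
        const std::vector<double>& in,
        std::size_t n_coef,
        std::size_t n_batch)
{
    assert(in.size() == n_coef * n_batch);
    std::vector<double> out(in.size());
    for (std::size_t k = 0; k < n_coef; ++k) {
        for (std::size_t j = 0; j < n_batch; ++j) {
            out[j * n_coef + k] = in[k * n_batch + j];
        }
    }
    return out;
}
```

After the copy, the slice for batch j starts at `&out[j * n_coef]` and spans n_coef consecutive values. Whether this pays off is a separate question: the transposition costs an extra pass over the data, and on a GPU the best layout also depends on how neighbouring threads access memory.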

@tpadioleau should I address this?


tpadioleau · 2024-06-26T14:32:40Z

Do you have a profile of a spline interpolation on GPU?

blegouix · 2024-06-26T15:32:26Z

I have this (for the current status):

(This shows 1.36us instead of 1.59us, but that is just run-to-run fluctuation; the case is the same.)

I can provide the .ncu-rep files if you want, or any other information you need.

tpadioleau · 2024-06-26T15:59:04Z

I would be interested to see a time trace like the one provided by nsys or by the Kokkos simple kernel timer.

About the figures, I am skeptical. I think they are of the same order of magnitude as the latency of launching a GPU kernel, see https://developer.nvidia.com/blog/understanding-the-visualization-of-overhead-and-latency-in-nsight-systems/#nsight_systems_overhead. Did you mean milliseconds?

blegouix · 2024-06-26T16:13:40Z

Yeah, that's milliseconds, sorry.

blegouix · 2024-06-26T18:14:50Z

Anyway, there is in fact another problem with this table: it is for ny=10000, not 100000. And I realize the patch I propose is actually slower with ny=100000. So I am closing the issue.

blegouix closed this as completed Jun 26, 2024
