
NormalEstimationOMP: use dynamic scheduling for faster computation #5775

Merged

merged 1 commit into PointCloudLibrary:master from normal3d_dynamic_omp on Aug 7, 2023

Conversation

mvieth
Member

@mvieth mvieth commented Jul 30, 2023


So far, no scheduling was specified, which seems to result in a behaviour similar to static scheduling. However, this is suboptimal, as the workload is not balanced well between the threads, especially when using radius search.
With dynamic scheduling (default chunk size of 256), the speedup (ratio of computation time of NormalEstimation and NormalEstimationOMP) is better.
The speedup for organized datasets is slightly higher than for unorganized datasets, possibly because FLANN (used for unorganized datasets) already uses some parallelization, while OrganizedNeighbor does not.

Laptop 1 (6 physical cores, 12 logical cores, number of threads set to 6):

dataset | organized?  | search   | k / radius (mm) | speedup before | speedup after
--------|-------------|----------|-----------------|----------------|--------------
mug  | organized   | radius   | 10   | 3.4857         | 5.2508
mug  | organized   | radius   | 20   | 3.3441         | 5.1059
mug  | organized   | nearestk | 50   | 4.7033         | 5.0594
mug  | organized   | nearestk | 100  | 4.5808         | 4.9751
mug  | unorganized | radius   | 10   | 3.3374         | 4.8992
mug  | unorganized | radius   | 20   | 3.0206         | 4.7978
mug  | unorganized | nearestk | 50   | 4.5841         | 4.9189
mug  | unorganized | nearestk | 100  | 4.7062         | 4.8844
milk | organized   | radius   | 10   | 3.5140         | 5.1686
milk | organized   | radius   | 20   | 3.2605         | 5.1719
milk | organized   | nearestk | 50   | 4.3245         | 4.9924
milk | organized   | nearestk | 100  | 4.4170         | 4.9207
milk | unorganized | radius   | 10   | 3.4451         | 4.8029
milk | unorganized | radius   | 20   | 3.1887         | 4.8810
milk | unorganized | nearestk | 50   | 4.3789         | 4.6894
milk | unorganized | nearestk | 100  | 4.2717         | 4.7473

Laptop 2 (4 physical cores, 8 logical cores, number of threads set to 4):

dataset | organized?  | search   | k / radius (mm) | speedup before | speedup after
--------|-------------|----------|-----------------|----------------|--------------
mug  | organized   | radius   | 10   | 2.3783         | 3.9812
mug  | organized   | radius   | 20   | 2.3080         | 3.9753
mug  | organized   | nearestk | 50   | 3.6190         | 3.9595
mug  | organized   | nearestk | 100  | 3.6100         | 3.9590
mug  | unorganized | radius   | 10   | 2.4181         | 3.7466
mug  | unorganized | radius   | 20   | 2.2157         | 3.8890
mug  | unorganized | nearestk | 50   | 3.4894         | 3.6551
mug  | unorganized | nearestk | 100  | 3.4293         | 3.7825
milk | organized   | radius   | 10   | 2.8174         | 3.8209
milk | organized   | radius   | 20   | 2.6911         | 3.9722
milk | organized   | nearestk | 50   | 3.3346         | 3.9433
milk | organized   | nearestk | 100  | 3.3275         | 3.9798
milk | unorganized | radius   | 10   | 2.8815         | 3.5443
milk | unorganized | radius   | 20   | 2.6467         | 3.7990
milk | unorganized | nearestk | 50   | 3.1602         | 3.6469
milk | unorganized | nearestk | 100  | 3.6460         | 3.7981
@mvieth mvieth added changelog: enhancement Meta-information for changelog generation module: features labels Jul 30, 2023
@larshg
Contributor

larshg commented Aug 3, 2023

Looks good 👍

I also read that there are guided as well as runtime (set via the OMP_SCHEDULE environment variable) schedules.

Btw, should we make a directory for all these "compare" programs that get created? I assume it would be nice to have them when working on improvements.

Or do you use the output of e.g. the already added google benchmarks and then do the math (speedup factor calculations) elsewhere?

@mvieth
Member Author

mvieth commented Aug 3, 2023

> I also read that there are guided as well as runtime (set via the OMP_SCHEDULE environment variable) schedules.

Yes, that's true. Technically, the guided schedule should have less overhead than the dynamic schedule. However, I read that the guided schedule is implemented poorly in some OpenMP implementations: the first chunk is too large, so the work is again unbalanced between the threads. If I remember correctly, I tested the guided schedule some time ago and it was worse than the dynamic schedule for the normal estimation.

> Btw, should we make a directory for all these "compare" programs that get created? I assume it would be nice to have them when working on improvements.

> Or do you use the output of e.g. the already added google benchmarks and then do the math (speedup factor calculations) elsewhere?

I wrote a quick Python script (see below) that reads from a json file, created by the google benchmark, and computes the speedup overview. But I don't think it is nice enough to put it into the repo permanently. I did however extend our google benchmark for the normal estimation, maybe I can make a pull request to add that sometime.

#!/usr/bin/env python3
# Reads a google benchmark json file (path given as first argument) and
# prints, per configuration: the mean times without/with OMP, the speedup
# (non-OMP real time / OMP real time), and the parallelization factor
# (OMP cpu time / OMP real time).
import json
import sys

with open(sys.argv[1]) as json_data:
    data = json.load(json_data)

average_speedup = 0
average_parallelization = 0
for dataset in ["mug", "milk"]:
    for typ in ["organized", "unorganized"]:
        # (label for the printout, name part in the benchmark, parameters)
        for search, name_part, params in [("radius", "radius", [10, 20]),
                                          ("nearestk", "nearest_k", [50, 100])]:
            for param in params:
                time_w_omp = 1
                time_wo_omp = 1
                time_w_omp_cpu = 1
                for benchmark in data["benchmarks"]:
                    if benchmark["name"] == "BM_NormalEstimation_" + dataset + "_" + typ + "_" + name_part + "/" + str(param) + "/iterations:5/repeats:3_mean":
                        time_wo_omp = benchmark["real_time"]
                    if benchmark["name"] == "BM_NormalEstimationOMP_" + dataset + "_" + typ + "_" + name_part + "/" + str(param) + "/iterations:10/repeats:3/process_time/real_time_mean":
                        time_w_omp = benchmark["real_time"]
                        time_w_omp_cpu = benchmark["cpu_time"]
                print(dataset, typ, search, param, int(time_wo_omp + 0.5), "/", int(time_w_omp + 0.5), time_wo_omp / time_w_omp, time_w_omp_cpu / time_w_omp)
                average_speedup += time_wo_omp / time_w_omp
                average_parallelization += time_w_omp_cpu / time_w_omp
# 16 configurations: 2 datasets x 2 types x 4 search parameters
print("average speedup=", average_speedup / 16)
print("average parallelization=", average_parallelization / 16)

@larshg
Contributor

larshg commented Aug 3, 2023

I guess other OMP-parallelized algorithms could use this setting as well, since most of them also require a search for neighbors, which can vary a lot and hence vary the computation time between iterations?

@mvieth mvieth merged commit 97ef1b7 into PointCloudLibrary:master Aug 7, 2023
13 checks passed
@mvieth mvieth deleted the normal3d_dynamic_omp branch August 7, 2023 12:24
mvieth added a commit to mvieth/pcl that referenced this pull request Oct 19, 2024
Each iteration does a radius search, which does not take the same amount of time for each point. Specifying no schedule usually results in a static schedule.
Related to PointCloudLibrary#5775

Benchmarks with table_scene_mug_stereo_textured.pcd (NaN points removed before convolution) on an Intel Core i7-9850H:

GCC:
threads | 1    | 2    | 3    | 4    | 5    | 6
--------|------|------|------|------|------|-----
before  | 2267 | 1725 | 1283 | 1039 |  863 |  744
dynamic | 2269 | 1155 |  795 |  611 |  497 |  427

MSVC 2022 (release configuration):
threads | 1    | 2    | 3    | 4    | 5    | 6
--------|------|------|------|------|------|-----
before  | 2400 | 1886 | 1478 | 1176 |  972 |  857
dynamic | 2501 | 1281 |  919 |  704 |  593 |  537
mvieth added a commit that referenced this pull request Oct 20, 2024