
Optimize multi-threaded CKKS bootstrapping #908

Open
yspolyakov opened this issue Nov 19, 2024 · 2 comments
Comments

@yspolyakov
Contributor

Currently, we see a very modest speed-up. For example, when we run on a machine with 8 cores, the speed-up is only 2.2x over single-threaded execution, versus the theoretical maximum of 8x.

@yspolyakov yspolyakov added the optimization Improves performance label Nov 19, 2024
@yspolyakov yspolyakov added this to the Release 1.3.0 milestone Nov 19, 2024
@yspolyakov yspolyakov changed the title Optimized multi-threaded CKKS bootstrapping Optimize multi-threaded CKKS bootstrapping Nov 19, 2024
@yspolyakov
Contributor Author

I ran some experiments for CKKS bootstrapping.

Results for 8 threads (on a 16-core machine):

CKKS scheme is using ring dimension 65536

Input: (0.25, 0.5, 0.75, 1, 2, 3, 4, 5,  ... ); Estimated precision: 59 bits

Initial number of levels remaining: 1
Encoding time: 5764 ms
Reduce time: 11974 ms
Decode time: 2695 ms
Total time:     22573 ms
Number of levels remaining after bootstrapping: 10

Output after bootstrapping
        (0.249857, 0.499905, 0.749801, 1, 2.00029, 2.99998, 4.00012, 5.00001,  ... ); Estimated precision: 13 bits

Results for 1 thread:

CKKS scheme is using ring dimension 65536

Input: (0.25, 0.5, 0.75, 1, 2, 3, 4, 5,  ... ); Estimated precision: 59 bits

Initial number of levels remaining: 1
Encoding time: 11847 ms
Reduce time: 25371 ms
Decode time: 5278 ms
Total time:     47302 ms
Number of levels remaining after bootstrapping: 10

Output after bootstrapping
        (0.249984, 0.499862, 0.749969, 1.00005, 2.00009, 3.00001, 4.00006, 5.00008,  ... ); Estimated precision: 13 bits

So I observed a speed-up of only 2.1x (47302 ms vs. 22573 ms).

@yspolyakov
Contributor Author

yspolyakov commented Nov 27, 2024

In most places we rely on OMP parallelization at the DCRTPoly level. I ran a polynomial benchmark (for N = 64K, the same ring dimension as in CKKS bootstrapping; see the table below) and again observed a modest speed-up (8 towers vs. 1):

Add: 2.82x
Mul: 3.51x
NTT: 3.47x
iNTT: 3.34x

The low speed-up at the DCRTPoly level (and CKKS bootstrapping involves many other operations on top of these) translates into a small overall speed-up for CKKS bootstrapping.

Running bin/benchmark/poly-benchmark-64k
Run on (16 X 2496 MHz CPU s)
Load Average: 0.52, 0.58, 0.59
----------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations
----------------------------------------------------------------------------------
Native_AddEq                                  39.9 us         40.1 us        17920
DCRT_AddEq/towers:1                           40.5 us         40.8 us        17231
DCRT_AddEq/towers:2                           44.0 us         44.5 us        15448
DCRT_AddEq/towers:4                           70.0 us         69.8 us         8960
DCRT_AddEq/towers:8                            113 us          112 us         6400
DCRT_AddEq/towers:16                           241 us          240 us         2800
Native_SubEq                                   203 us          204 us         3446
DCRT_SubEq/towers:1                            201 us          200 us         3446
DCRT_SubEq/towers:2                            216 us          218 us         3446
DCRT_SubEq/towers:4                            257 us          256 us         2987
DCRT_SubEq/towers:8                            327 us          330 us         2133
DCRT_SubEq/towers:16                           652 us          645 us          896
Native_MulEq                                   131 us          132 us         4978
DCRT_MulEq/towers:1                            130 us          131 us         5600
DCRT_MulEq/towers:2                            147 us          148 us         4978
DCRT_MulEq/towers:4                            179 us          180 us         3733
DCRT_MulEq/towers:8                            299 us          298 us         2358
DCRT_MulEq/towers:16                           598 us          600 us         1120
Native_ntt                                     984 us          977 us          640
DCRT_ntt/towers:1                              982 us          983 us          747
DCRT_ntt/towers:2                             1101 us         1099 us          640
DCRT_ntt/towers:4                             1425 us         1412 us          498
DCRT_ntt/towers:8                             2268 us         2247 us          299
DCRT_ntt/towers:16                            4767 us         4719 us          149
Native_intt                                    970 us          952 us          640
DCRT_intt/towers:1                             987 us          952 us          640
DCRT_intt/towers:2                            1041 us         1050 us          640
DCRT_intt/towers:4                            1457 us         1443 us          498
DCRT_intt/towers:8                            2323 us         2299 us          299
DCRT_intt/towers:16                           4814 us         4849 us          145
