
Optimize multi-threaded CKKS bootstrapping #908

Open
yspolyakov opened this issue Nov 19, 2024 · 2 comments
Comments

@yspolyakov
Contributor

Currently, we see a very modest speed-up. For example, when we run on a machine with 8 cores, the speed-up is only 2.2x over single-threaded execution, versus the theoretical maximum of 8x.

@yspolyakov yspolyakov added the optimization Improves performance label Nov 19, 2024
@yspolyakov yspolyakov added this to the Release 1.3.0 milestone Nov 19, 2024
@yspolyakov yspolyakov changed the title Optimized multi-threaded CKKS bootstrapping Optimize multi-threaded CKKS bootstrapping Nov 19, 2024
@yspolyakov
Contributor Author

I ran some experiments for CKKS bootstrapping.

Results for 8 threads (on a 16-core machine):

CKKS scheme is using ring dimension 65536

Input: (0.25, 0.5, 0.75, 1, 2, 3, 4, 5,  ... ); Estimated precision: 59 bits

Initial number of levels remaining: 1
Encoding time: 5764 ms
Reduce time: 11974 ms
Decode time: 2695 ms
Total time:     22573 ms
Number of levels remaining after bootstrapping: 10

Output after bootstrapping
        (0.249857, 0.499905, 0.749801, 1, 2.00029, 2.99998, 4.00012, 5.00001,  ... ); Estimated precision: 13 bits

Results for 1 thread:

CKKS scheme is using ring dimension 65536

Input: (0.25, 0.5, 0.75, 1, 2, 3, 4, 5,  ... ); Estimated precision: 59 bits

Initial number of levels remaining: 1
Encoding time: 11847 ms
Reduce time: 25371 ms
Decode time: 5278 ms
Total time:     47302 ms
Number of levels remaining after bootstrapping: 10

Output after bootstrapping
        (0.249984, 0.499862, 0.749969, 1.00005, 2.00009, 3.00001, 4.00006, 5.00008,  ... ); Estimated precision: 13 bits

So I observed a speed-up of only 2.1x (47302 ms vs. 22573 ms).

@yspolyakov
Contributor Author

yspolyakov commented Nov 27, 2024

In most places we rely on OMP parallelization at the DCRTPoly level. I ran a polynomial benchmark (for N = 64K, the same ring dimension as in CKKS bootstrapping; see the table below) and again observed a modest speed-up (8 towers vs. 1):

Add: 2.82x
Mul: 3.51x
NTT: 3.47x
iNTT: 3.34x

The low speed-up at the DCRTPoly level (and CKKS bootstrapping involves many other operations on top of these) translates into a small overall speed-up for CKKS bootstrapping.

Running bin/benchmark/poly-benchmark-64k
Run on (16 X 2496 MHz CPU s)
Load Average: 0.52, 0.58, 0.59
----------------------------------------------------------------------------------
Benchmark                                        Time             CPU   Iterations
----------------------------------------------------------------------------------
Native_AddEq                                  39.9 us         40.1 us        17920
DCRT_AddEq/towers:1                           40.5 us         40.8 us        17231
DCRT_AddEq/towers:2                           44.0 us         44.5 us        15448
DCRT_AddEq/towers:4                           70.0 us         69.8 us         8960
DCRT_AddEq/towers:8                            113 us          112 us         6400
DCRT_AddEq/towers:16                           241 us          240 us         2800
Native_SubEq                                   203 us          204 us         3446
DCRT_SubEq/towers:1                            201 us          200 us         3446
DCRT_SubEq/towers:2                            216 us          218 us         3446
DCRT_SubEq/towers:4                            257 us          256 us         2987
DCRT_SubEq/towers:8                            327 us          330 us         2133
DCRT_SubEq/towers:16                           652 us          645 us          896
Native_MulEq                                   131 us          132 us         4978
DCRT_MulEq/towers:1                            130 us          131 us         5600
DCRT_MulEq/towers:2                            147 us          148 us         4978
DCRT_MulEq/towers:4                            179 us          180 us         3733
DCRT_MulEq/towers:8                            299 us          298 us         2358
DCRT_MulEq/towers:16                           598 us          600 us         1120
Native_ntt                                     984 us          977 us          640
DCRT_ntt/towers:1                              982 us          983 us          747
DCRT_ntt/towers:2                             1101 us         1099 us          640
DCRT_ntt/towers:4                             1425 us         1412 us          498
DCRT_ntt/towers:8                             2268 us         2247 us          299
DCRT_ntt/towers:16                            4767 us         4719 us          149
Native_intt                                    970 us          952 us          640
DCRT_intt/towers:1                             987 us          952 us          640
DCRT_intt/towers:2                            1041 us         1050 us          640
DCRT_intt/towers:4                            1457 us         1443 us          498
DCRT_intt/towers:8                            2323 us         2299 us          299
DCRT_intt/towers:16                           4814 us         4849 us          145
