
low performance in network not in localhost #57

Open
mortezaataiy opened this issue Apr 4, 2022 · 12 comments

@mortezaataiy

Hello
I am testing caml-crush performance to support a very high request rate (for example, 1000 requests per second).
Using this proxy locally works well for me: 1000 requests complete in 4 seconds. But it is not good over the network.
I tested the REST API on my network and the 1000 requests finish in 9 seconds; this increases to 18 seconds when caml-crush is used as the proxy.
What changes when I use caml-crush on localhost vs. over the network?
I tried disabling all filter features and also changing the "netplex.service.workload_manager" settings, but none of them changed the performance.
Can anyone help me, please?

@rben-dev
Contributor

rben-dev commented Apr 5, 2022

Hi,

When you say "tested locally", do you mean through a Unix socket or through a TCP socket on localhost?

If your test is on the localhost network interface, the performance loss is indeed odd... You can use, for instance, tcpdump or Wireshark on your network interface to check, for each request, the average processing time of the proxy (i.e. between a request and its response), and then determine whether the proxy or something else in the network setup is to blame.
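For example, a minimal capture (assuming the proxy listens on TCP port 4444 and the traffic goes through eth0; adjust both to your setup):

    # capture only the proxy traffic for later inspection in Wireshark
    tcpdump -i eth0 -w proxy.pcap tcp port 4444

In Wireshark you can then compare, per request, the timestamp of the last request byte with that of the first response byte, which separates the proxy's processing time from the network round trip.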

Regards,

@calderonth
Contributor

Hello @mortezaataiy, you have not provided enough information for us to understand where the performance issues you are observing are coming from.
One potential area depends on how you are running your test.
If you perform each operation via an external program that loads/unloads the PKCS#11 library, you incur a new connection to the proxy for every run, which is costly even on localhost (see the sketch below).
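As an illustration only (the module path, PIN and key id are placeholders, not your actual setup), this pattern pays the connection and C_Initialize cost 1000 times:

    # each pkcs11-tool run loads the caml-crush client library, opens a new
    # connection to pkcs11proxyd, signs once, then tears everything down
    for i in $(seq 1000); do
        pkcs11-tool --module /usr/local/lib/libp11client.so \
                    --sign --mechanism RSA-PKCS --id 06 --pin 1234 \
                    --input-file data.bin --output-file sig_$i.bin
    done

A long-lived process that initializes the library once and signs in a loop avoids that per-request setup cost.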

In essence, if you provide more details we might be able to help.

@mortezaataiy
Author

mortezaataiy commented Apr 6, 2022

TCP socket on localhost:
one machine: SoftHSM <-> caml-crush-server <-tcp,127.0.0.1:4444-> caml-crush-client <-> DSS
speed: 1000 signs in 4 seconds

TCP socket on the network:
machine1: SoftHSM <-> caml-crush-server
machine2: caml-crush-client <-> DSS
machine1 <-tcp,192.168.9.99:4444-> machine2
speed: 1000 signs in 18 seconds
ping 192.168.9.99: 190 ms

Caml-crush uses the default configs, except for the following changes:
1: --with-client-socket=tcp,192.168.9.99:4444
2: wrapping_format_key uncommented but not changed
3: forbidden_mechanisms commented out
4: in pkcs11proxyd.conf, netplex.service.protocol.address.bind = "192.168.9.99:4444"; (see the excerpt below)
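For reference, a minimal sketch of how that bind setting nests in pkcs11proxyd.conf (only the relevant blocks are shown; the shipped file has more fields):

    netplex {
      service {
        protocol {
          address {
            type = "internet";
            bind = "192.168.9.99:4444";
          };
        };
      };
    }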

The DSS technology is .NET Core and it uses Net.Pkcs11Interop.X509Store and System.Security.Cryptography.
It works as a singleton: it loads the store and logs in once at startup, then uses that for all requests.

Does the proxy handle calls in parallel or serially?
I think the network delay compounds when we use caml-crush.

I will try to check it with wireshark or tcpdump 👍

Thank you both @calderonth and @rb-anssi 🌹

@mortezaataiy
Author

mortezaataiy commented Apr 10, 2022

[attached image: Untitled Diagram]
Do I need to provide more information? @calderonth

########
Edited:
The first result (localhost) time is 4 seconds.
The second result (network) time is 18 seconds.
(For 1000 requests.)

@rben-dev
Contributor

Hi @mortezaataiy

Without caml-crush, how does DSS handle parallel requests with the original PKCS#11 library?
Also, can you provide more information about how DSS handles parallelism for the parallel REST requests: does it fork multiple threads or multiple processes on the host?

caml-crush is based on Netplex, which should spawn a new process for each new PKCS#11 session, so for multiple PKCS#11 sessions parallelism should not be an issue. It would also help to know what kind of PKCS#11 requests the REST API / DSS generates.

Finally, to check whether the network and/or parallelism are to blame, an insightful test would be to measure the time taken by 1000 sequential requests in both scenarios (localhost and through the network), for instance with a simple timing loop like the one below.
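A minimal sketch (the endpoint, port and payload are hypothetical placeholders for whatever your REST API actually expects):

    # 1000 strictly sequential sign requests; run once against the localhost
    # deployment and once against the remote one, then compare wall-clock times
    time for i in $(seq 1000); do
        curl -s -o /dev/null -X POST --data-binary @data.bin http://dss-host:8080/sign
    done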

Regards,

@mortezaataiy
Author

Hi @rb-anssi

.NET Core uses multiple threads by default and I didn't change that.
Net.Pkcs11Interop.X509Store handles the original PKCS#11 library and I did not customize it.

Netplex has some configs. I tried changing them but nothing changed. I improved my network and its ping is now 1 ms, but caml-crush still spends 7 seconds on 1000 signs.

local parallel:
   without caml-crush: 1.5s
   with caml-crush: 4s
local sequential:
   without caml-crush: 7s
   with caml-crush: 7s

new network (1ms ping)
parallel in network:
   with caml-crush: 7s   ** It's still a lot
sequential in network:
   with caml-crush: 12s

Log for one sign:

0x1 : ****************************** 2022-04-11 11:13:43 ***
0x1 : Calling C_OpenSession
0x1 : Input
0x1 :  slotID: 1813751771
0x1 :  flags: 4
0x1 :   CKF_RW_SESSION: FALSE
0x1 :   CKF_SERIAL_SESSION: TRUE
0x1 :  pApplication: (nil)
0x1 :  Notify: (nil)
0x1 :  phSession: 0x7fdee6b855b0
0x1 :  *phSession: 140595330307504
0x1 : Output
0x1 :  phSession: 0x7fdee6b855b0
0x1 :  *phSession: 15
0x1 : Returning 0 (CKR_OK)
0x1 : ****************************** 2022-04-11 11:13:43 ***
0x1 : Calling C_SignInit
0x1 : Input
0x1 :  hSession: 15
0x1 :  pMechanism: 0x7fdee6b854f8
0x1 :   mechanism: 1 (CKM_RSA_PKCS)
0x1 :   pParameter: (nil)
0x1 :   ulParameterLen: (nil)
0x1 :  hKey: 6
0x1 : Returning 0 (CKR_OK)
0x1 : ****************************** 2022-04-11 11:13:43 ***
0x1 : Calling C_Sign
0x1 : Input
0x1 :  hSession: 15
0x1 :  pData: 0x7fdce0036b88
0x1 :  *pData: HEX(303130A7338F913C)
0x1 :  ulDataLen: 51
0x1 :  pSignature: (nil)
0x1 :  pulSignatureLen: 0x7fdee6b854e8
0x1 :  *pulSignatureLen: 0
0x1 : Output
0x1 :  pSignature: (nil)
0x1 :  pulSignatureLen: 0x7fdee6b854e8
0x1 :  *pulSignatureLen: 256
0x1 : Returning 0 (CKR_OK)
0x1 : ****************************** 2022-04-11 11:13:43 ***
0x1 : Calling C_Sign
0x1 : Input
0x1 :  hSession: 15
0x1 :  pData: 0x7fdce0036b88
0x1 :  *pData: HEX(303130A7338F913C)
0x1 :  ulDataLen: 51
0x1 :  pSignature: 0x7fdce0036df0
0x1 :  pulSignatureLen: 0x7fdee6b854e8
0x1 :  *pulSignatureLen: 256
0x1 : Output
0x1 :  pSignature: 0x7fdce0036df0
0x1 :  pulSignatureLen: 0x7fdee6b854e8
0x1 :  *pSignature: HEX(975A6BB492DEF9BB0F84F8A61E2FCD23A51DD3F123B437EB2CD8CF31EAC740498966AE99B3477DD38784C89B815BBA9E37E081A38E5005FC8A1B39F99E7D89AB660BF902D6DF414B881108DC4E97AC8390)
0x1 :  *pulSignatureLen: 256
0x1 : Returning 0 (CKR_OK)
0x1 : ****************************** 2022-04-11 11:13:43 ***
0x1 : Calling C_CloseSession
0x1 : Input
0x1 :  hSession: 15
0x1 : Returning 0 (CKR_OK)

So Net.Pkcs11Interop.X509Store opens a session for each request.
Thank you for your attention

@rben-dev
Contributor

Hi,

Thanks for the figures and the feedback. If I read them correctly, processing with caml-crush goes from 4s to 7s in parallel between local and remote, while it goes from 1.5s to 4s without caml-crush (the 4s is taken from your first post). It seems to me that we have roughly the same multiplicative scaling factor between these scenarios, i.e. a factor of about 2 between caml-crush and the native implementation, which seems in line with the workload the proxy handles.

Can you please confirm that all the PKCS#11 sessions are opened and handled in different processes / TCP connections by Net.Pkcs11Interop.X509Store? (This induces a new fork of the proxy on the server side, which can be costly.) Tracking the number of forks / processes would be helpful here; a quick way to do that is sketched below.
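For instance (assuming the server daemon runs under the stock pkcs11proxyd name; adjust the process name if yours differs):

    # sample the number of proxy worker processes once per second while the test runs
    watch -n 1 'pgrep -c pkcs11proxyd'

If the count follows the number of concurrent client sessions, Netplex is forking one worker per session as expected.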

Regards,

@mortezaataiy
Author

The number of our active DSS threads is 4 by default (equal to the number of CPU cores). Each request is handled in a separate thread. When I increase the number of active threads, performance decreases.
So yes, all sessions are opened and handled in different processes.

I tried to call many sign functions in parallel without DSS. I ran the pkcs11-tool --sign command, but it was very slow (about 1 sign per second).
It also produced some errors:

Error RPC with C_SetupArch
Unsupported architecture error EXITING
caml-crush: C_SetupArch: failed detecting architecture

error: PKCS11 function C_Initialize failed: rv = CKR_DEVICE_ERROR (0x30)

How did you test the performance of Caml-crush? It is mentioned in this PDF: https://eprint.iacr.org/2015/063.pdf

Are these LIMITATIONS related to this discussion?
https://github.com/caml-pkcs11/caml-crush/blob/master/ISSUES.md#handling-synchronization-ocaml-client-library-
https://github.com/caml-pkcs11/caml-crush/blob/master/ISSUES.md#handling-synchronization-c-client-library

Also, about ping time: the DSSs can't be near the HSM, so the ping can be more than 100 ms.
I'm coming to the conclusion that I should stop using caml-crush :(

Thanks

@calderonth
Contributor

calderonth commented Apr 24, 2022 via email

@mortezaataiy
Author

So why does it work well with DSS? :)
With DSS it does 50 signs per second.

I'm grateful

@mortezaataiy
Author

I'm very curious to know how you checked caml-crush's performance.
I would be very thankful if you could share it, @calderonth @rb-anssi

@calderonth
Contributor

Hello @mortezaataiy ,

I now believe what you're observing is totally expected. If you have a high-latency network, this will have a significant impact when using Caml-Crush client/server (as with any other client/server that handles some sort of synchronous RPCs).
You might want to read more on the impact of latency/RTT on such network applications.
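As a rough illustration (assuming each PKCS#11 call in the log you posted is one synchronous round trip through the proxy): that log shows 5 calls per signature (C_OpenSession, C_SignInit, two C_Sign calls and C_CloseSession), so with your original 190 ms RTT that is roughly 5 x 190 ms ≈ 0.95 s of pure network wait per signature; the aggregate rate then depends almost entirely on how many sessions run in parallel.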

When looking at the other scenario you describe as performing better, I believe it is not an appropriate comparison.
My understanding is that you send the data to be signed in bulk via a REST API; it is very likely that this connection is multiplexed and the API performs much better in high-latency scenarios (throughput can be maxed out once the appropriate TCP settings are tuned by the TCP stack). Once the data is local to the CamlCrush client/HSM, performing the signatures is fast.

I have performed a performance comparison between a client/server on the same host (TCP) and a client/server in different Docker containers (same LAN). I can confirm that the performance difference between TCP/localhost and TCP/LAN is negligible.

It does sound like, if you can't improve your RTT/latency and/or tweak the TCP settings to better cope with it, then CamlCrush might not be a good fit here.
