Add scmul for WBW Montgomery #847
base: master
Thanks, that was quick! Let me benchmark to confirm the new code is actually faster than repeated additions. |
Unfortunately, in my benchmarking, these scmul routines are slower than repeated additions. fiat_p224_add takes about 12ns; tripling and quadrupling take about 25ns; and octupling takes about 38ns. The generated scmul routines all take 50+ ns. |
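For readers following the timings, here is a minimal sketch of what "repeated additions" means: tripling and quadrupling each cost two modular additions, and octupling costs three. The `addMod` helper and the toy modulus are assumptions made for illustration; they stand in for `fiat_p224_add` and the P-224 prime and are not the generated code.

```go
package main

import "fmt"

// p is a toy modulus standing in for the P-224 prime (assumption: any
// modulus below 2^63 keeps the additions below from overflowing).
const p uint64 = 1000003

// addMod is a hypothetical stand-in for fiat_p224_add: it returns
// (a + b) mod p for inputs that are already reduced below p.
func addMod(a, b uint64) uint64 {
	s := a + b // cannot overflow because a, b < p < 2^63
	if s >= p {
		s -= p
	}
	return s
}

// triple computes 3*x mod p with two additions.
func triple(x uint64) uint64 { return addMod(addMod(x, x), x) }

// quadruple computes 4*x mod p with two doublings.
func quadruple(x uint64) uint64 {
	d := addMod(x, x)
	return addMod(d, d)
}

// octuple computes 8*x mod p with three doublings.
func octuple(x uint64) uint64 {
	q := quadruple(x)
	return addMod(q, q)
}

func main() {
	x := uint64(999999)
	fmt.Println(triple(x), quadruple(x), octuple(x))
}
```

The addition counts (two, two, three) line up with the roughly 25ns, 25ns, and 38ns figures quoted above for a 12ns addition.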
Hmm, that's unfortunate. I wonder if we can speed things up with rewrite rules that turn multiplications into shifts and masks. Thoughts @davidben @andres-erbsen ? @mdempsky I take it the bitshifting code in Go is even faster than repeated additions? Could you share a link to it? |
@JasonGross I haven't benchmarked it specifically, but I assume the bitshifting is faster if only because it's fewer instructions. I can measure though if it would be helpful. Here's an example of "delta = 8 * beta" being computed: https://github.com/golang/go/blob/e88ea87e7b886815cfdadc4cd3d70bf5ef833bd7/src/crypto/elliptic/p224.go#L618-L620 In my fiat port, I wrote that instead as:
Here's code that computes "t = 3 * t": https://github.com/golang/go/blob/e88ea87e7b886815cfdadc4cd3d70bf5ef833bd7/src/crypto/elliptic/p224.go#L600-L602 I rewrote that to:
Edit: Note that p224.go uses [8]uint32 with radix-2^28, so they've got some extra headroom for doing shifts and delaying carries. |
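To illustrate the headroom point, here is a rough sketch (not the code from p224.go) of multiplying by 8 in an `[8]uint32`, radix-2^28 layout: each limb is shifted left by 3 bits without overflowing, and carries are propagated afterwards. The function names are hypothetical, and the sketch deliberately stops before any modular reduction.

```go
package main

import "fmt"

const limbBits = 28
const limbMask = 1<<limbBits - 1

// shiftLimbs multiplies the value by 8 by shifting every limb left by
// 3 bits. Assumption: every input limb is below 2^28, so each shifted
// limb is below 2^31 and still fits in a uint32.
func shiftLimbs(a [8]uint32) [8]uint32 {
	var out [8]uint32
	for i, v := range a {
		out[i] = v << 3
	}
	return out
}

// carry brings every limb back below 2^28 and returns the carry out of
// the top limb (the part of the value at or above 2^224).
func carry(a [8]uint32) ([8]uint32, uint32) {
	var c uint32
	for i := range a {
		v := a[i] + c // at most (2^31 - 8) + 7, so no overflow
		a[i] = v & limbMask
		c = v >> limbBits
	}
	return a, c
}

func main() {
	x := [8]uint32{1, 2, 3, 4, 5, 6, 7, limbMask}
	r, c := carry(shiftLimbs(x))
	fmt.Println(r, c)
}
```

With saturated limbs (no spare bits), the same shift would lose the top 3 bits of every limb, which is why the delayed-carry trick does not transfer directly.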
In Montgomery, there is no separate carrying, and limbs are always saturated. It sounds like the current code uses an unsaturated Solinas representation? If you want me to generate scmul that just inlines a bunch of double+add, I can do that, though it'll take some time, and I think it's probably not worth it to generate such code at this point? |
I'm not hip with the lingo, but that sounds right.
Agreed that I don't think it's worth putting much effort into right now. I just knew there was the fixed-multiplication in curve25519, and thought this might be worth looking into if it was easy and provided a win. It seems like it doesn't at the moment. |
Here's a 64-bit scmul_8 implementation I came up with:
This implementation is barely slower than fiat_p224_add. The same idea with different initial rotations should work for scmul_4; and with slightly extra complexity, I think scmul_3 should be doable too. Caveat: Experimentally, on random inputs, it seems to match the output of three consecutive fiat_p224_add calls; but I haven't spent that long convincing myself that it doesn't have any corner cases that aren't handled correctly. Edit: I also think with some extra cleverness, the "add z * 2^96" and "sub z * (2^224+1)" steps could be fused. I haven't spent too much time thinking about it yet though. |
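The arithmetic behind the "add z * 2^96" and "sub z * (2^224+1)" steps can be checked with arbitrary-precision integers. The sketch below is not the 64-bit limb implementation from the comment above (that code is not reproduced here); it only restates the reduction under the assumption that the P-224 prime is 2^224 - 2^96 + 1 and that the input is already reduced. `mul8` and `naiveMul8` are hypothetical names.

```go
package main

import (
	"fmt"
	"math/big"
)

// p224 returns the P-224 prime 2^224 - 2^96 + 1.
func p224() *big.Int {
	one := big.NewInt(1)
	p := new(big.Int).Lsh(one, 224)
	p.Sub(p, new(big.Int).Lsh(one, 96))
	return p.Add(p, one)
}

// mul8 returns 8*x mod p for 0 <= x < p by shifting and folding the
// overflow z back in, instead of using a general modular reduction.
func mul8(x *big.Int) *big.Int {
	t := new(big.Int).Lsh(x, 3)                         // t = 8x < 2^227
	z := new(big.Int).Rsh(t, 224)                       // overflow, 0..7
	lo := new(big.Int).Sub(t, new(big.Int).Lsh(z, 224)) // low 224 bits
	// 2^224 = 2^96 - 1 (mod p), so z*2^224 folds to z*2^96 - z.
	r := new(big.Int).Add(lo, new(big.Int).Lsh(z, 96))
	r.Sub(r, z)
	// r < 2^224 + 7*2^96 < 2p, so one conditional subtraction suffices.
	if p := p224(); r.Cmp(p) >= 0 {
		r.Sub(r, p)
	}
	return r
}

// naiveMul8 is the reference: 8*x reduced with a general Mod.
func naiveMul8(x *big.Int) *big.Int {
	return new(big.Int).Mod(new(big.Int).Lsh(x, 3), p224())
}

func main() {
	x := new(big.Int).Sub(p224(), big.NewInt(1)) // largest reduced input
	fmt.Println(mul8(x).Cmp(naiveMul8(x)) == 0)
}
```

Because multiplying by a small constant commutes with the Montgomery factor, the same fold applies whether or not the input is in Montgomery form. A limb-level version would additionally need constant-time handling of the final subtraction.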
Yeah, replacing the first few assignments with this seems to work as scmul_3, and is still measurably faster than two additions:
|
Note that this isn't the kind of code we can write for Montgomery, because for the Montgomery code, we only have access to the prime as an integer, not to its representation as a sum/difference of taps. If you want to try to come up with a template that generates this sort of code for any prime, given just the integer representation and the bitwidth, I'm happy to integrate such a template and try to prove it correct. |
Does this help?
|
Alternatively:
|
Here it is as 32-bit/64-bit limb sequences (in little endian):
|
My reading of #847 (comment) (the example implementation above) is that it does saturated Solinas multiplication followed by canonicalization, ignoring the assumption that the input is in Montgomery form but simplifying under the assumptions that [...] As for how to access the Solinas form of primes in Montgomery code, I think it would be just fine to pass them in when available or to infer them using code like @mdempsky posts above. |
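One possible way to infer the taps from just the integer value of the prime is to compute its non-adjacent form (a signed binary expansion with no two adjacent nonzero digits). This is a sketch of that idea only; it is not necessarily the approach in the snippets posted above, and the `Tap` type and `solinasTaps` function are hypothetical names.

```go
package main

import (
	"fmt"
	"math/big"
)

// Tap is one signed power of two: Sign * 2^Exp.
type Tap struct {
	Sign int // +1 or -1
	Exp  int
}

// p224 returns the P-224 prime 2^224 - 2^96 + 1.
func p224() *big.Int {
	one := big.NewInt(1)
	p := new(big.Int).Lsh(one, 224)
	p.Sub(p, new(big.Int).Lsh(one, 96))
	return p.Add(p, one)
}

// solinasTaps returns the non-adjacent form of n (n > 0) as signed
// powers of two, lowest exponent first.
func solinasTaps(n *big.Int) []Tap {
	var taps []Tap
	one := big.NewInt(1)
	m := new(big.Int).Set(n)
	for exp := 0; m.Sign() > 0; exp++ {
		if m.Bit(0) == 1 {
			if m.Bit(1) == 1 {
				// m = 3 (mod 4): emit -2^exp and round m up.
				taps = append(taps, Tap{-1, exp})
				m.Add(m, one)
			} else {
				// m = 1 (mod 4): emit +2^exp and round m down.
				taps = append(taps, Tap{+1, exp})
				m.Sub(m, one)
			}
		}
		m.Rsh(m, 1)
	}
	return taps
}

func main() {
	for _, t := range solinasTaps(p224()) {
		fmt.Printf("%+d * 2^%d\n", t.Sign, t.Exp)
	}
}
```

For a prime that is not of Solinas shape, the same routine returns a long list of taps, so a generator would presumably fall back to the generic path when the list exceeds some small bound.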
As per
#706 (comment)
cc @mdempsky