
Wasm-friendly Field #2638

Draft · wants to merge 19 commits into develop

Conversation

mitschabaude
Contributor

@mitschabaude mitschabaude commented Oct 1, 2024

WIP, experiments towards #2634

define a "wasm-friendly" Field and test it with Poseidon

TODO: Poseidon results don't agree yet, need unit tests

Performance looks promising though: close to a 4x improvement. To reproduce, run

./bench-wasm.sh --bench poseidon_bench

@volhovm volhovm self-assigned this Oct 3, 2024
Base automatically changed from perf/poseidon-wasm to develop October 11, 2024 16:26
@sebastiencs

Hi @mitschabaude, this is great work.
I've tried to integrate this implementation in openmina, but mul_assign does not give the correct result:
openmina@3f8ed1f
When we multiply a number by 1, the result is zero.

@mitschabaude
Contributor Author

mitschabaude commented Oct 14, 2024

> Hi @mitschabaude, this is great work. I've tried to integrate this implementation in openmina, but mul_assign does not give correct result: openmina@3f8ed1f When we multiply a number by 1, the result is zero

@sebastiencs thanks for testing it! Yeah, these low-level algorithms are finicky; it will need some debugging to get right. I have an equivalent implementation here that works perfectly, though, so I'm sure there's just some small, fixable mistake.

I won't be able to finish this PR btw, as I'm no longer working at o1Labs -- sorry for leaving this here in an unsatisfying state :)

@volhovm
Member

volhovm commented Oct 14, 2024

@sebastiencs I'll look into this PR sometime soon, perhaps tomorrow, and will sync on the plan for continuing.

@volhovm volhovm force-pushed the perf/wasm-friendly-field branch from d727c5f to a2422ff on November 19, 2024 12:05
@volhovm
Member

volhovm commented Nov 21, 2024

I'm trying to see if we should proceed with it. Ran some benchmarks.

Benchmark results in native Rust:

  1. A native multiplication is 15ns, while an Fp9 multiplication is 25ns.
  2. A native multiplication in WASM is 81ns, which is 5.4x more expensive than native.

Benchmark results in WASM:

  1. A Poseidon hash takes 122µs in Fp vs 36µs in Fp9 (3.38x improvement).
  2. Converting a field element from Fp to Fp9 takes 54ns.
     • For a batch of 2^16 elements, conversion costs 3.5ms, including the cost of creating a new vector and copying everything into it.
     • So for 16 columns, that's 56ms to convert everything.
     • This can also be easily parallelised, and thus sped up several times.
  3. A single multiplication takes 81ns in Fp vs 36ns in Fp9 (2.25x improvement).
  4. A single Fp9 multiplication plus two conversions takes 120ns.
  5. Two multiplications in Fp take 157ns (z = x * y; z = z * x), while two conversions from Fp plus two Fp9 multiplications take 147ns.
     • With a batch of 4 multiplications, it's 315ns vs 198ns, so 1.59x.
     • And so on, approaching 2.25x.
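
For intuition, the batch trend above can be captured in a rough cost model. This is only a sketch using the single-op constants measured above (81ns Fp mul, 36ns Fp9 mul, 54ns per conversion); the measured 147ns/198ns batch numbers imply a somewhat cheaper effective conversion, so treat the constants as illustrative:

```rust
// Rough cost model: n dependent multiplications done in plain Fp,
// vs. converting both operands to Fp9 once and multiplying there.
// Constants are the single-op timings from the benchmarks (nanoseconds).
const T_FP_MUL: f64 = 81.0;
const T_FP9_MUL: f64 = 36.0;
const T_CONV: f64 = 54.0;

/// Speedup of the Fp9 route over plain Fp for n multiplications,
/// paying for two Fp -> Fp9 conversions up front.
fn fp9_speedup(n: u32) -> f64 {
    let fp_cost = T_FP_MUL * n as f64;
    let fp9_cost = 2.0 * T_CONV + T_FP9_MUL * n as f64;
    fp_cost / fp9_cost
}
```

Under this model the break-even point is 3 multiplications per conversion pair, and the speedup approaches 81/36 = 2.25x as n grows, matching the trend in the list above.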

@mitschabaude two questions, maybe you know?

  • Why is Poseidon 3.38x faster while multiplications are only 2.25x faster -- is Poseidon using something like doubles instead of just multiplications? Or add-in-place?
  • Do you have any numbers for other functions, e.g. how much would a 2^16 MSM cost?
    • My motivation: it seems that if we do at least one multiplication per conversion, we're faster using Fp9. The question would be: if converting the representation of 2^16 elements takes 3.5ms and is parallelisable, and the native MSM takes several hundred milliseconds, how big a factor can we save? It seems plausible that for MSM the conversions are negligibly cheap (under 1%), but I'd like to verify it somehow.

@mitschabaude
Contributor Author

Two important comments first:

  • I wouldn't trust any of the Fp9 numbers yet, because of Wasm-friendly Field #2638 (comment).
    • Also, there might be a small additional amount of optimization possible.
  • There are two different kinds of Fp <-> Fp9 conversions to consider, depending on whether we convert because we're using both kinds of arithmetic on the same field element, OR we only use Fp9-style arithmetic and the conversion is just done inside multiplication to match the 4 x 64-bit layout.
    • In the first case, conversion needs to move between different Montgomery representations, i.e. from $2^N x$ to $2^M x$. Right now this probably uses two separate ff multiplications, but it could be done with one multiplication. This is what you benchmarked, @volhovm.
    • In the second case, the field element is already in the right Montgomery representation, and we only read out its bytes in a slightly shifted way. This is much cheaper and feasible to do within a single multiplication.
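
To make the second (cheap) kind concrete, here is a hypothetical sketch of the pure layout change: repacking a 4 x u64 little-endian bignum into 9 limbs of 29 bits (29 * 9 = 261 >= 256). No Montgomery arithmetic happens here, which is why this flavour of conversion can live inside the multiplication itself. This is an illustration, not the PR's actual code:

```rust
// Repack a 256-bit value, given as 4 little-endian u64 limbs, into
// 9 little-endian limbs of 29 bits each (the Fp9-style layout).
fn to_limbs_29(x: [u64; 4]) -> [u32; 9] {
    const MASK: u128 = (1 << 29) - 1;
    let mut out = [0u32; 9];
    let mut acc: u128 = 0; // bit buffer holding not-yet-emitted bits
    let mut acc_bits: u32 = 0;
    let mut i = 0;
    for w in x {
        acc |= (w as u128) << acc_bits;
        acc_bits += 64;
        // emit full 29-bit limbs while enough bits are buffered
        while acc_bits >= 29 && i < 8 {
            out[i] = (acc & MASK) as u32;
            acc >>= 29;
            acc_bits -= 29;
            i += 1;
        }
    }
    // the remaining high bits (256 - 8 * 29 = 24) form the last limb
    out[8] = acc as u32;
    out
}
```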

@mitschabaude
Contributor Author

In a concrete scenario: if we're moving to a different representation just for the MSM, then we'll do on the order of 50-100 multiplications per conversion, so at that point the conversion is negligible.

@mitschabaude
Contributor Author

mitschabaude commented Nov 21, 2024

> Do you have any numbers for other functions, e.g. how much would it cost to do a 2^16 MSM?

I have detailed benchmarks for a Wasm Pasta MSM of my own implementation that uses the Fp9 algorithm.

Just to throw out a number, on my machine a 2^16 Pasta MSM using 16 threads takes about 80ms.
(I can show you how to run those benchmarks on your machine as well)

I know of another, better optimized project that probably does it in 50-60ms. (Using yet another field representation that I prototyped as well, that comes with some additional complications)

What I sadly don't have is detailed benchmarks for the Kimchi arkworks MSM. What I can offer is the table in this README, which compares single-threaded MSM running in arkworks with an older version of my code, on a 384-bit curve. In that comparison, my code showed a 7x speedup: https://github.com/mitschabaude/montgomery/blob/main/doc/zprize22.md

Caveat: Some of that 7x is not due to different field arithmetic but better high-level MSM algorithms, and I don't know exactly what contributed what.

@mitschabaude
Contributor Author

@volhovm here's how to run my Wasm Pasta MSM on your machine:

git clone [email protected]:mitschabaude/montgomery.git
cd montgomery
npm i
npm run evaluate

It will run the MSM 10 times and display the timings, average, deviation, and more fine-grained details of the timing for one particular run.

@mitschabaude
Contributor Author

I also have raw multiplication benchmarks that match up well with your numbers:

npm run benchmark

for the fp9 representation I get

multiply montgomery      37ns

which matches your 36ns very well.

for the more complicated "51x5" representation I even get down to

multiply 51x5            27ns

Here, a major caveat is that this is only that fast if you do two multiplications at once (because that's the only known efficient way to leverage SIMD).

@mitschabaude
Contributor Author

> Why is poseidon 3.38x faster, while multiplications are only 2.25x faster -- is poseidon using something like doubles instead of just multiplications? or add-in-place?

Fundamentally, I think the multiplication benchmarks are sound and show the real picture. There's no extra speedup to be expected in Poseidon (it's all field multiplication). So my guess is that the extra speed-up is not real, and due to something getting zeroed out because of a bug, which makes it faster.

@sebastiencs

@mitschabaude I've tested the javascript implementation (Pallas/Fp), but it seems that it gives different results than mul_assign from algebra here.

When I call mul_assign([1, 0, 0, 0], [1, 0, 0, 0]) in algebra, the result is:
[ 14933811601259063721, 12438865661331504215, 8127635680764926565, 2445952672640170197, ]
Or in [u32; 9]:
[ 329738665, 435975226, 503256563, 527750272, 161897161, 348878820, 55952172, 535020623, 2224580, ]

When I run the same operation in js (multiply 1 by 1), I get a different result:
[ 10705750787161406286, 5046208377184785134, 12359664413395797203, 3391085346764690374, ]
Or in [u32; 9]:
[ 312294222, 76791028, 214219685, 259494129, 72168544, 212229055, 253406745, 83828258, 3084174, ]

I am probably missing something

@mitschabaude
Contributor Author

@sebastiencs amazing, thanks for finding the bug!!

> I've tested the javascript implementation (Pallas/Fp), but it seems that it gives different results than mul_assign from algebra here.

They use different Montgomery representations; you probably have to account for that in testing. For example, a from_bigint() at the beginning of each test on every input, and a to_bigint() at the end, would do the necessary conversion.
(the tests here do that as well: https://github.com/mitschabaude/montgomery/blob/main/src/field.test.ts)
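
To illustrate why raw outputs differ even when both implementations are correct, here is a toy sketch with a small prime standing in for the Pasta modulus (a hypothetical example, not the actual library code): the same value x has different raw forms under different Montgomery radices R, and the two only agree again after converting back to canonical form.

```rust
// Toy Montgomery-form illustration with a small prime modulus.
const P: u64 = 4_294_967_291; // largest prime below 2^32

// b^e mod P by square-and-multiply (P is prime, so this gives inverses
// via Fermat: r^(P-2) = r^-1 mod P).
fn pow_mod(mut b: u64, mut e: u64) -> u64 {
    let mut acc: u64 = 1;
    b %= P;
    while e > 0 {
        if e & 1 == 1 {
            acc = (acc as u128 * b as u128 % P as u128) as u64;
        }
        b = (b as u128 * b as u128 % P as u128) as u64;
        e >>= 1;
    }
    acc
}

// Montgomery form of x with radix r: x * r mod P.
fn to_mont(x: u64, r: u64) -> u64 {
    (x as u128 * r as u128 % P as u128) as u64
}

// Back to canonical form: multiply by r^-1 mod P.
fn from_mont(xm: u64, r: u64) -> u64 {
    (xm as u128 * pow_mod(r, P - 2) as u128 % P as u128) as u64
}
```

With two different radices (say R1 = 2^16 and R2 = 2^20), to_mont(x, R1) and to_mont(x, R2) differ for the same x, but both round-trip back to x, which is exactly what the from_bigint/to_bigint wrapping achieves in the real tests.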

@mitschabaude
Contributor Author

After applying @sebastiencs's bug fix, for me Poseidon is down to a 2.6x improvement. Still pretty nice! Results still don't agree for full Poseidon though.

@mitschabaude
Contributor Author

mitschabaude commented Nov 25, 2024

Also, the Fp9 mul benchmark is now slower for me (52ns) than the JS version (37ns). However, it comes down to about the same as the JS version (34ns) when I refactor the benchmark to look like this:

use criterion::{black_box, Criterion};

pub fn bench_basic_ops(c: &mut Criterion) {
    let mut group = c.benchmark_group("Basic ops");
    let x0: Fp = rand::random();
    let x: Fp = x0;
    let mut z: Fp = x0;

    group.bench_function("Native multiplication in Fp (single)", |b| {
        b.iter(|| {
            // chain the result through `z` so iterations stay dependent,
            // and black_box it so the loop can't be optimized away
            z = black_box(z * x);
        });
    });

    let x_fp9: Fp9 = x0.into();
    let mut z_fp9: Fp9 = x0.into();

    group.bench_function("Multiplication in Fp9 (single)", |b| {
        b.iter(|| {
            z_fp9 = black_box(z_fp9 * x_fp9);
        });
    });

    group.finish();
}

@volhovm AFAIK, in JS hosts there is no very accurate timing available, so I would make sure to structure benchmarks so that a lot of individual operations are done within a single timing. I'm not sure whether your current benchmark, which seems to create new field elements for every multiplication, allows a completely accurate measurement.
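
The batching principle is independent of criterion; as a sketch in plain Rust (with a toy modular multiplication standing in for the field mul), one can time a single long dependent chain and divide by the iteration count, so that timer granularity is amortized over many operations:

```rust
use std::hint::black_box;
use std::time::Instant;

// Average nanoseconds per multiplication, measured over one long
// dependent chain rather than per-operation timings.
fn ns_per_mul(iters: u64) -> f64 {
    const P: u128 = (1 << 61) - 1; // toy Mersenne modulus, not a real field
    let x: u128 = 1_234_567_891_011;
    let mut z: u128 = 987_654_321;
    let start = Instant::now();
    for _ in 0..iters {
        // each multiply consumes the previous result, so the loop
        // can't be collapsed or reordered by the optimizer
        z = z * x % P;
    }
    black_box(z);
    start.elapsed().as_nanos() as f64 / iters as f64
}
```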

Co-authored-by: Sebastien Chapuis <[email protected]>
@sebastiencs

sebastiencs commented Nov 27, 2024

@mitschabaude Thanks, I've tested that implementation in our Webnode, and Poseidon hashing is now ~40% faster:
Hashing the full ledger (testnet) now takes 15 seconds instead of 25 seconds.

openmina/openmina#939
The implementation is here: https://github.com/openmina/algebra/blob/6ca5fe8940ebbfe710601615655242ae89e17b0f/ff/src/fields/models/webnode_new.rs#L386

I see that your JS implementation also has a square implementation; do you think it's worth porting to Rust? Is it faster?
Tests for that square are failing for Pallas/Fq, so I'm not sure if I can use it in our WebNode.

Lastly, are there other parts of your repository worth porting that would make our WebNode faster?

Thanks again for your work !

@mitschabaude
Contributor Author

> @mitschabaude Thanks, I've tested that implementation in our Webnode, and we get ~40% faster in poseidon hashing:
> Hashing the full ledger (testnet) now takes 15 seconds, instead of 25 seconds.

@sebastiencs that's awesome!

> I see that your js implementation also has a square implementation, do you think it's worth porting to Rust ? Is it faster ?
> Tests for that square are failing for Pallas/Fq, so not sure if I can use that in our WebNode.

Yes, square should absolutely be ported as well. Dedicated squaring is much faster than multiplication, like 28ns vs 37ns on my machine, and Poseidon uses 2 squarings per state element for the computation of x^7. Not sure what you mean by tests are failing.
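
For reference, the x^7 in Poseidon's S-box can be computed with exactly two squarings and two multiplications. A minimal sketch, with a small Mersenne prime standing in for the real field modulus:

```rust
// Toy modulus; real code would use the Pasta field and its dedicated
// (cheaper) squaring routine instead of mul(a, a).
const P: u128 = (1 << 31) - 1; // 2^31 - 1

fn mul(a: u128, b: u128) -> u128 {
    a * b % P
}

fn square(a: u128) -> u128 {
    // stands in for the specialized, faster squaring
    mul(a, a)
}

fn pow7(x: u128) -> u128 {
    let x2 = square(x);  // x^2  (squaring #1)
    let x4 = square(x2); // x^4  (squaring #2)
    mul(mul(x4, x2), x)  // x^4 * x^2 * x = x^7 (two multiplications)
}
```

Replacing two of the four multiplications in the naive chain with squarings is where the per-hash savings come from.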

@sebastiencs

sebastiencs commented Nov 27, 2024

@mitschabaude Indeed, I just ported it, and we can hash the ledger in 14 seconds with this dedicated square, so the total improvement is 44%.

> not sure what you mean by tests are failing

Hmm, it's actually the test for sqrt that's failing, not square (using npm run test):

test at build/src/field.test.js:17:15
✖ pastaFq w=29 (147.170791ms)
  Error: pastaFq w=29 sqrt

  Failed - error during test execution:
  pastaFq w=29 sqrt: Expect equal results

  Deep equality failed:

  actual:   13019741092779036328468870678170431412962196708002643505427026797581769551489n
  expected: 11692013098647223345629478661730264157247460343809n

@volhovm
Member

volhovm commented Dec 2, 2024

@sebastiencs Can you comment a bit on how hard it would be to port the changes you introduced in openmina into kimchi/curves/pasta/wasm-friendly? It seems quite similar, but it's hard to judge whether your implementation relies on anything openmina-specific, or mostly on arkworks.

Another question, or another way to put it: how big is the gap between the MinimalField from this PR and a full Field implementation?

@volhovm
Copy link
Member

volhovm commented Dec 5, 2024

Status update: I've been trying to estimate how complicated it would be to bring in an alternative field and plug it into the kimchi tests. With a lot of stubs I managed to do it: the most intense parts are implementing bignum for 32 bits and Field (and perhaps also PrimeField).

On the branch (name):

My hypothesis is that we could indeed continue porting it, and we'd perhaps need about 3-5k lines of code (primarily the bigint and Field implementations), fairly isolated in proof-systems/curves. I don't see how this would significantly affect anything above that crate, since once the KimchiCurve is defined, we use it pretty much as a black box.

Any comments? Anything I'm missing in terms of how this could be harder to port than what I just described? @mitschabaude @sebastiencs

@mitschabaude
Contributor Author

> With a lot of stubs I managed to do it

Nice!

> Any comments? Anything I'm missing in terms of how this can be harder to port than what I just described?

Nothing that I can think of, your description matches what I would've thought!
