add no_std support #123

Merged
merged 3 commits into main from no-std on Apr 10, 2024

Conversation

tjjfvi (Contributor) commented Apr 10, 2024

tjjfvi requested a review from enricozb on April 10, 2024, 17:24
@HigherOrderBot (Collaborator)

Perf run for e7e4246:

file              mode        main          e7e4246e2162
========================================================
merge_sort        intr-singl       7.660 s       7.527 s
                  intr-multi       4.032 s       3.718 s
--------------------------------------------------------
boom              intr-singl       2.272 s       2.274 s
                  intr-multi       2.608 s       3.006 s
--------------------------------------------------------
bitonic_sort_lam  intr-singl      12.153 s      11.347 s
                  intr-multi       5.888 s       5.580 s
--------------------------------------------------------
radix_sort_lam    intr-singl       9.715 s       9.342 s
                  intr-multi       4.915 s       4.644 s
--------------------------------------------------------
sum_tree          intr-singl       9.574 s       9.089 s
                  intr-multi       4.435 s       4.456 s
--------------------------------------------------------
c2                intr-singl       0.000 s       0.000 s
                  intr-multi       0.001 s       0.001 s
--------------------------------------------------------
sum_tail          intr-singl       1.327 s       1.186 s
                  intr-multi       1.196 s       1.148 s
--------------------------------------------------------
sum_rec           intr-singl      14.628 s      13.996 s
                  intr-multi       7.586 s       7.265 s

tjjfvi added this pull request to the merge queue on Apr 10, 2024
Merged via the queue into main with commit 7626300 on Apr 10, 2024
5 checks passed
tjjfvi deleted the no-std branch on April 10, 2024, 20:33
NaoEhSavio pushed a commit to NaoEhSavio/HVM2 that referenced this pull request Apr 30, 2024
This commit implements compilation of HVM terms. The idea, as explained
in an earlier commit, is to modify the `@F ~ X` deref rule so that,
instead of unrolling `@F` and letting reductions happen normally, we
pass `X` to a compiled procedure that attempts to perform some local
reductions BEFORE allocating `@F`. For example, the HVM term:

add = λa λb (+ a b)

is compiled to HVM as:

@add = (a (b r))
& #1 ~ <a <b r>>

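Concretely, the modified deref rule becomes a dispatch: when `@add`
meets its argument tree, the runtime jumps to a compiled procedure
instead of unrolling `@add`'s net. A minimal sketch, with hypothetical
names (`F_ADD`, `expand`); the real generated procedure, `F_add`, is
shown at the end:

fn deref(&mut self, ptr: Ptr, x: Ptr) {
  match ptr.val() {
    // compiled fast path: try local reductions before allocating
    F_ADD => { self.F_add(ptr, x); }
    // interpreted fallback (hypothetical name): unroll the definition
    // and let reductions happen normally
    _ => self.expand(ptr, x),
  }
}
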
So, if we apply `(add 123 100)`, we will have the following net:

    (#123 (#100 R)) ~ ⟦(a (b r))⟧
    ⟦#1⟧            ~ ⟦<a <b r>>⟧

Notice that `(#123 (#100 R))` is a dynamic net, and everything inside
these ⟦brackets⟧ is part of the "source code" of `@add`, i.e., these
are static nets. As such, once we send `X` to the compiled `add()`, we
can immediately detect that `X` isn't an aux port, but is actually the
main port of the `(#123 (#100 R))` tree. Furthermore, since the root
of `@add` is also two CON nodes (representing `λa` and `λb`), we can
immediately substitute `a <~ #123`, `b <~ #100`, and `r <~ R`,
performing two "local annihilations" before allocating the body of
`@add`. As a result, we'll have the following net:

    R    ~ ⟦r⟧
    ⟦#1⟧ ~ ⟦<#123 <#100 r>>⟧
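
Each such local annihilation is just a tag check on a value that is
already in a register. A minimal sketch of this "fast apply" case,
using the same hypothetical `Ptr`/heap helpers as the generated code
at the end:

// `x` is the main port of a CON node (tag CT0), so read its aux ports
// directly and free it, instead of allocating the static lambda node
// and pushing a redex for the interpreter to process later.
if x.tag() == CT0 {
  let arg  = self.heap.get(x.val(), P1); // e.g. a <~ #123
  let cont = self.heap.get(x.val(), P2); // e.g. the (#100 R) subtree
  self.heap.free(x.val());
}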

Now, we have an OP2 node connected to the number #1. Normally, that
would require 4 rewrites to reduce to normal form:

    R  ~ r
    #1 ~ <#123 <#100 r>>
    -------------------- OP2
    R    ~ r
    #123 ~ <#1 <#100 r>>
    -------------------- OP1
    R     ~ r
    #+123 ~ <#100 r>
    ---------------- OP2
    R    ~ r
    #100 ~ <#+123 r>
    ---------------- OP1
    R    ~ r
    #223 ~ r
    -------- subst
    #223 ~ R

Yet, the compiled `add()` function can see, in its local registers,
that `op = #1`, `a = #123`, and `b = #100`. As such, it doesn't need
to allocate any OP2 node, and can shortcut the reduction directly to:

    R    ~ ⟦r⟧
    ⟦#1⟧ ~ ⟦<#123 <#100 r>>⟧
    ------------------------ OP2 + OP1 + OP2 + OP1
    R ~ #223

This bypasses the runtime entirely, saving several allocations and
redex pushing/popping/matching.
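
This shortcut is the "fast op" branch of the generated code at the
end. A minimal sketch of the check, with the same hypothetical
helpers and the register names from above (`op`, `a`, `b`):

// The operator and both operands are already concrete numbers in
// local registers, so fold the OP2 + OP1 + OP2 + OP1 rewrites into
// native arithmetic and link the result without allocating anything.
if op.is_num() && a.is_num() && b.is_num() {
  let res = Ptr::new(NUM, self.op(self.op(op.val(), a.val()), b.val()));
  self.link(res, ret); // R ~ #223, directly
}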

Sadly, Rust functions, unlike interaction nets, obey an evaluation
order. As such, keeping a mini "local interaction net runtime" in
registers would be impractical. Instead, we make a choice on the order
in which we traverse the "static net": specifically, we first traverse
the root tree, then the redex trees, in order. This is relevant
because it means the order matters for which optimizations are used.
For example, in this case, if we first traversed the redex trees, we'd
have:

    ⟦#1⟧            ~ ⟦<a <b r>>⟧
    (#123 (#100 R)) ~ ⟦(a (b r))⟧
    ----------------------------- alloc `<a <b r>>`
    #1              ~ <a <b r>>
    (#123 (#100 R)) ~ ⟦(a (b r))⟧
    ----------------------------- alloc `(a (b r))`
    #1              ~ <a <b r>>
    (#123 (#100 R)) ~ (a (b r))
    ---------------------------
    ... proceed reduction as normal

I.e., when traversing `#1 ~ <a <b r>>`, the compiled `add()` function
would see `a` and `b` (i.e., aux ports) instead of concrete numbers
and, as such, it would be forced to allocate the 2 OP2 nodes of
`<a <b r>>`, and the optimization would fail, falling back to the
interpreted speed. As such, it is important for tools emitting HVMC
code to sort redexes in a way that allows optimizations to be
performed more often. If redexes are sorted respecting the
corresponding "strict evaluation" order, then functions compiled from
classical paradigms should always hit the optimization case.

For illustration, here is the compiled `add()` procedure:

pub fn F_add(&mut self, ptr: Ptr, x: Ptr) -> bool {
  let xx : Ptr;
  let xy : Ptr;
  // fast apply
  if x.tag() == CT0 {
    self.anni += 1;
    xx = self.heap.get(x.val(), P1);
    xy = self.heap.get(x.val(), P2);
    self.heap.free(x.val());
  } else {
    let k1 = self.heap.alloc(1);
    xx = Ptr::new(VR1, k1);
    xy = Ptr::new(VR2, k1);
    self.link(Ptr::new(CT0, k1), x);
  }
  let xyx : Ptr;
  let xyy : Ptr;
  // fast apply
  if xy.tag() == CT0 {
    self.anni += 1;
    xyx = self.heap.get(xy.val(), P1);
    xyy = self.heap.get(xy.val(), P2);
    self.heap.free(xy.val());
  } else {
    let k2 = self.heap.alloc(1);
    xyx = Ptr::new(VR1, k2);
    xyy = Ptr::new(VR2, k2);
    self.link(Ptr::new(CT0, k2), xy);
  }
  let _k3 = Ptr::new(NUM, 0x1);
  let k4 : Ptr;
  // fast op
  if _k3.is_num() && xx.is_num() && xyx.is_num() {
    self.oper += 4;
    k4 = Ptr::new(NUM, self.op(self.op(_k3.val(),xx.val()),xyx.val()));
  } else {
    let k5 = self.heap.alloc(1);
    let k6 = self.heap.alloc(1);
    self.heap.set(k5, P2, Ptr::new(OP2, k6));
    self.link(Ptr::new(VR1,k5), xx);
    self.link(Ptr::new(VR1,k6), xyx);
    self.link(Ptr::new(OP2,k5), _k3);
    k4 = Ptr::new(VR2, k6);
  }
  self.link(k4, xyy);
  return true;
}

Each optimization branch is labelled with a comment. The more
optimization branches are hit, the faster your program will be.

This commit results in a 1.55x speedup in the 'burn' benchmark (the
one that decrements λ-encoded bits in parallel), a 2.94x speedup in a
tree-recursive sum, and a 5.64x speedup in a tail-recursive sum. Note
that tail recursion has NOT been implemented yet, and there are still
some allocations that can be skipped. With better codegen, the maximum
theoretical speedup should be around 36x, which is what we obtain by
manually polishing the generated functions.