add no_std support #123
Merged
Conversation
enricozb approved these changes on Apr 10, 2024.
NaoEhSavio pushed a commit to NaoEhSavio/HVM2 that referenced this pull request on Apr 30, 2024:
This commit implements compilation of HVM terms. The idea, as explained in an earlier commit, is to modify the `@F ~ X` deref rule so that, instead of unrolling `@F` and letting reductions happen normally, we pass `X` to a compiled procedure that attempts to perform some local reductions BEFORE allocating `@F`. For example, the HVM term:

```
add = λa λb (+ a b)
```

is compiled to HVM as:

```
@add = (a (b r))
& #1 ~ <a <b r>>
```

So, if we apply `(add 123 100)`, we will have the following net:

```
(#123 (#100 R)) ~ ⟦(a (b r))⟧
⟦#1⟧ ~ ⟦<a <b r>>⟧
```

Notice that `(#123 (#100 R))` is a dynamic net, and everything inside the ⟦double brackets⟧ is part of the "source code" of `@add`, i.e., these are static nets. As such, once we send `X` to the compiled `add()`, we can immediately detect that `X` isn't an aux port, but is actually the main port of the `(#123 (#100 R))` tree. Furthermore, since the root of `@add` is also two CON nodes (representing `λa` and `λb`), we can immediately substitute `a <~ #123`, `b <~ #100`, and `r <~ R`, performing two "local annihilations" before allocating the body of `@add`. As a result, we'll have the following net:

```
R ~ ⟦r⟧
⟦#1⟧ ~ ⟦<#123 <#100 r>>⟧
```

Now, we have an OP2 node connected to the number #1. Normally, that would require 4 rewrites to reach normal form:

```
R ~ r
#1 ~ <#123 <#100 r>>
-------------------- OP2
R ~ r
#123 ~ <#1 <#100 r>>
-------------------- OP1
R ~ r
#+123 ~ <#100 r>
---------------- OP2
R ~ r
#100 ~ <#+123 r>
---------------- OP1
R ~ r
#223 ~ r
-------- subst
#223 ~ R
```

Yet, the compiled `add()` function can see, in its local registers, that `op = #1`, `a = #123`, and `b = #100`. As such, it doesn't need to allocate any OP2 node, and can shortcut the reduction directly to:

```
R ~ ⟦r⟧
⟦#1⟧ ~ ⟦<#123 <#100 r>>⟧
------------------------ OP2 + OP1 + OP2 + OP1
R ~ #223
```

This bypasses the runtime entirely, saving several allocations and redex pushes/pops/matches.

Sadly, Rust functions, unlike interaction nets, obey an evaluation order, so keeping a mini "local interaction net runtime" in registers would be impractical. Instead, we fix the order in which we traverse the "static net": specifically, we first traverse the root tree, then the redex trees, in order. This matters because the traversal order determines which optimizations are used. For example, in this case, if we traversed the redex trees first, we'd have:

```
⟦#1⟧ ~ ⟦<a <b r>>⟧
(#123 (#100 R)) ~ ⟦(a (b r))⟧
----------------------------- alloc `<a <b r>>`
#1 ~ <a <b r>>
(#123 (#100 R)) ~ ⟦(a (b r))⟧
----------------------------- alloc `(a (b r))`
#1 ~ <a <b r>>
(#123 (#100 R)) ~ (a (b r))
--------------------------- ... proceed reduction as normal
```

I.e., when traversing `#1 ~ <a <b r>>`, the compiled `add()` function would see `a` and `b` (i.e., aux ports) instead of concrete numbers, so it would be forced to allocate the 2 OP2 nodes of `<a <b r>>`, the optimization would fail, and it would fall back to interpreted speed.
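To make the `@F ~ X` dispatch described above concrete, here is a minimal sketch of what the modified deref rule could look like, assuming a runtime method that looks up a compiled procedure per definition; the names `call_compiled` and `expand` are illustrative, not the actual HVM2 API:

```rust
// Hypothetical sketch of the modified `@F ~ X` deref rule; only the shape
// is meant to match the commit's description, none of these names come
// from the HVM2 codebase.
fn deref(&mut self, f: Ptr, x: Ptr) {
  // Fast path: hand `x` to the compiled procedure for `@F` (e.g. the
  // `F_add` shown below), which tries local reductions before allocating
  // `@F`'s body. Assume it returns `false` when no compiled procedure
  // exists for `f`.
  if self.call_compiled(f, x) {
    return;
  }
  // Slow path: unroll the definition of `@F` into fresh nodes and let the
  // interpreter perform the reductions normally.
  self.expand(f, x);
}
```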
Because the traversal order determines which branches fire, it is important for tools emitting HVMC code to sort redexes in a way that allows the optimizations to apply more often. If redexes are sorted respecting the corresponding "strict evaluation" order, then functions compiled from the classical paradigm should always hit the optimized case. For illustration, here is the compiled `add()` procedure:

```rust
pub fn F_add(&mut self, ptr: Ptr, x: Ptr) -> bool {
  let xx : Ptr;
  let xy : Ptr;
  // fast apply
  if x.tag() == CT0 {
    self.anni += 1;
    xx = self.heap.get(x.val(), P1);
    xy = self.heap.get(x.val(), P2);
    self.heap.free(x.val());
  } else {
    let k1 = self.heap.alloc(1);
    xx = Ptr::new(VR1, k1);
    xy = Ptr::new(VR2, k1);
    self.link(Ptr::new(CT0, k1), x);
  }
  let xyx : Ptr;
  let xyy : Ptr;
  // fast apply
  if xy.tag() == CT0 {
    self.anni += 1;
    xyx = self.heap.get(xy.val(), P1);
    xyy = self.heap.get(xy.val(), P2);
    self.heap.free(xy.val());
  } else {
    let k2 = self.heap.alloc(1);
    xyx = Ptr::new(VR1, k2);
    xyy = Ptr::new(VR2, k2);
    self.link(Ptr::new(CT0, k2), xy);
  }
  let _k3 = Ptr::new(NUM, 0x1);
  let k4 : Ptr;
  // fast op
  if _k3.is_num() && xx.is_num() && xyx.is_num() {
    self.oper += 4;
    k4 = Ptr::new(NUM, self.op(self.op(_k3.val(), xx.val()), xyx.val()));
  } else {
    let k5 = self.heap.alloc(1);
    let k6 = self.heap.alloc(1);
    self.heap.set(k5, P2, Ptr::new(OP2, k6));
    self.link(Ptr::new(VR1, k5), xx);
    self.link(Ptr::new(VR1, k6), xyx);
    self.link(Ptr::new(OP2, k5), _k3);
    k4 = Ptr::new(VR2, k6);
  }
  self.link(k4, xyy);
  return true;
}
```

Each optimization branch is labelled with a comment. The more optimization branches are hit, the faster your program will be. This commit results in a 1.55x speedup in the 'burn' benchmark (the one that decrements λ-encoded bits in parallel), a 2.94x speedup in a tree-recursive sum, and a 5.64x speedup in a tail-recursive sum. Note that tail recursion has NOT been implemented yet, and there are still some allocations that can be skipped. With better codegen, the maximum theoretical speedup should be around 36x, which is what we obtain by manually polishing the generated functions.
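As a rough sketch of the redex-ordering advice above, a tool emitting HVMC could sort each definition's redexes by their distance from the root tree before serializing them; the `Redex` struct and its `depth` field here are hypothetical, not part of any actual HVMC emitter:

```rust
// Hypothetical emitter-side helper. `depth` is assumed to be the redex's
// distance from the root tree, computed while the emitter builds the net.
// Emitting shallow redexes first approximates strict evaluation order, so
// a compiled procedure that traverses the root tree and then the redexes
// in order sees concrete operands and hits the fast branches more often.
struct Redex {
  depth: u32,
  lhs: String,
  rhs: String,
}

fn sort_redexes(redexes: &mut Vec<Redex>) {
  // Shallow-first: redexes nearest the root are reduced (and thus become
  // concrete) before the deeper redexes that consume their results.
  redexes.sort_by_key(|r| r.depth);
}
```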
cc @Janiczek