add no_std support #123

Merged
merged 3 commits into main from no-std on Apr 10, 2024

Conversation

tjjfvi (Contributor) commented Apr 10, 2024

tjjfvi requested a review from enricozb on April 10, 2024, 17:24
@HigherOrderBot (Collaborator)

Perf run for e7e4246:

file              mode        main          e7e4246e2162
========================================================
merge_sort        intr-singl       7.660 s       7.527 s
                  intr-multi       4.032 s       3.718 s
--------------------------------------------------------
boom              intr-singl       2.272 s       2.274 s
                  intr-multi       2.608 s       3.006 s
--------------------------------------------------------
bitonic_sort_lam  intr-singl      12.153 s      11.347 s
                  intr-multi       5.888 s       5.580 s
--------------------------------------------------------
radix_sort_lam    intr-singl       9.715 s       9.342 s
                  intr-multi       4.915 s       4.644 s
--------------------------------------------------------
sum_tree          intr-singl       9.574 s       9.089 s
                  intr-multi       4.435 s       4.456 s
--------------------------------------------------------
c2                intr-singl       0.000 s       0.000 s
                  intr-multi       0.001 s       0.001 s
--------------------------------------------------------
sum_tail          intr-singl       1.327 s       1.186 s
                  intr-multi       1.196 s       1.148 s
--------------------------------------------------------
sum_rec           intr-singl      14.628 s      13.996 s
                  intr-multi       7.586 s       7.265 s

tjjfvi added this pull request to the merge queue on Apr 10, 2024
Merged via the queue into main with commit 7626300 on Apr 10, 2024
5 checks passed
tjjfvi deleted the no-std branch on April 10, 2024, 20:33
NaoEhSavio pushed a commit to NaoEhSavio/HVM2 that referenced this pull request Apr 30, 2024
This commit implements compilation of HVM terms. The idea, as explained
in an earlier commit, is to modify the `@F ~ X` deref rule so that,
instead of unrolling `@F` and letting reductions happen normally, we
pass `X` to a compiled procedure that attempts to perform some local
reductions BEFORE allocating `@F`. For example, the HVM term:

add = λa λb (+ a b)

is compiled to HVM as:

@add = (a (b r))
& #1 ~ <a <b r>>

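Concretely, the modified deref rule becomes a dispatch: when `@add`
meets its argument tree, the runtime jumps to a compiled procedure
instead of unrolling `@add`'s net. A minimal sketch, with hypothetical
names (`F_ADD`, `expand`); the real generated procedure, `F_add`, is
shown at the end:

fn deref(&mut self, ptr: Ptr, x: Ptr) {
  match ptr.val() {
    // compiled fast path: try local reductions before allocating
    F_ADD => { self.F_add(ptr, x); }
    // interpreted fallback (hypothetical name): unroll the definition
    // and let reductions happen normally
    _ => self.expand(ptr, x),
  }
}
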
So, if we apply `(add 123 100)`, we will have the following net:

    (#123 (#100 R)) ~ ⟦(a (b r))⟧
    ⟦#1⟧            ~ ⟦<a <b r>>⟧

Notice that `(#123 (#100 R))` is a dynamic net, and everything inside
these ⟦brackets⟧ is part of the "source code" of `@add`, i.e., these
are static nets. As such, once we send `X` to the compiled `add()`, we
can immediately detect that `X` isn't an aux port, but is actually the
main port of the `(#123 (#100 R))` tree. Furthermore, since the root
of `@add` is also two CON nodes (representing `λa` and `λb`), we can
immediately substitute `a <~ #123`, `b <~ #100`, and `r <~ R`,
performing two "local annihilations" before allocating the body of
`@add`. As a result, we'll have the following net:

    R    ~ ⟦r⟧
    ⟦#1⟧ ~ ⟦<#123 <#100 r>>⟧
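
Each such local annihilation is just a tag check on a value that is
already in a register. A minimal sketch of this "fast apply" case,
using the same hypothetical `Ptr`/heap helpers as the generated code
at the end:

// `x` is the main port of a CON node (tag CT0), so read its aux ports
// directly and free it, instead of allocating the static lambda node
// and pushing a redex for the interpreter to process later.
if x.tag() == CT0 {
  let arg  = self.heap.get(x.val(), P1); // e.g. a <~ #123
  let cont = self.heap.get(x.val(), P2); // e.g. the (#100 R) subtree
  self.heap.free(x.val());
}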

Now, we have an OP2 node connected to the number #1. Normally, that
would require 4 rewrites to reduce to normal form:

    R  ~ r
    #1 ~ <#123 <#100 r>>
    -------------------- OP2
    R    ~ r
    #123 ~ <#1 <#100 r>>
    -------------------- OP1
    R     ~ r
    #+123 ~ <#100 r>
    ---------------- OP2
    R    ~ r
    #100 ~ <#+123 r>
    ---------------- OP1
    R    ~ r
    #223 ~ r
    -------- subst
    #223 ~ R

Yet, the compiled `add()` function can see, in its local registers,
that `op = #1`, `a = #123`, and `b = #100`. As such, it doesn't need
to allocate any OP2 node, and can shortcut the reduction directly to:

    R    ~ ⟦r⟧
    ⟦#1⟧ ~ ⟦<#123 <#100 r>>⟧
    ------------------------ OP2 + OP1 + OP2 + OP1
    R ~ #223

This bypasses the runtime entirely, saving several allocations and
redex pushing/popping/matching.
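
This shortcut is the "fast op" branch of the generated code at the
end. A minimal sketch of the check, with the same hypothetical
helpers and the register names from above (`op`, `a`, `b`):

// The operator and both operands are already concrete numbers in
// local registers, so fold the OP2 + OP1 + OP2 + OP1 rewrites into
// native arithmetic and link the result without allocating anything.
if op.is_num() && a.is_num() && b.is_num() {
  let res = Ptr::new(NUM, self.op(self.op(op.val(), a.val()), b.val()));
  self.link(res, ret); // R ~ #223, directly
}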

Sadly, Rust functions, unlike interaction nets, obey an evaluation
order. As such, keeping a mini "local interaction net runtime" in
registers would be impractical. Instead, we make a choice on the order
in which we traverse the "static net": specifically, we first traverse
the root tree, then the redex trees, in order. This is relevant
because it means the order matters for which optimizations are used.
For example, in this case, if we first traversed the redex trees, we'd
have:

    ⟦#1⟧            ~ ⟦<a <b r>>⟧
    (#123 (#100 R)) ~ ⟦(a (b r))⟧
    ----------------------------- alloc `<a <b r>>`
    #1              ~ <a <b r>>
    (#123 (#100 R)) ~ ⟦(a (b r))⟧
    ----------------------------- alloc `(a (b r))`
    #1              ~ <a <b r>>
    (#123 (#100 R)) ~ (a (b r))
    ---------------------------
    ... proceed reduction as normal

I.e., when traversing `#1 ~ <a <b r>>`, the compiled `add()` function
would see `a` and `b` (i.e., aux ports) instead of concrete numbers
and, as such, it would be forced to allocate the 2 OP2 nodes of
`<a <b r>>`, and the optimization would fail, falling back to the
interpreted speed. As such, it is important for tools emitting HVMC
code to sort redexes in a way that allows optimizations to be
performed more often. If redexes are sorted respecting the
corresponding "strict evaluation" order, then functions compiled from
classical paradigms should always hit the optimization case.

For illustration, here is the compiled `add()` procedure:

pub fn F_add(&mut self, ptr: Ptr, x: Ptr) -> bool {
  let xx : Ptr;
  let xy : Ptr;
  // fast apply
  if x.tag() == CT0 {
    self.anni += 1;
    xx = self.heap.get(x.val(), P1);
    xy = self.heap.get(x.val(), P2);
    self.heap.free(x.val());
  } else {
    let k1 = self.heap.alloc(1);
    xx = Ptr::new(VR1, k1);
    xy = Ptr::new(VR2, k1);
    self.link(Ptr::new(CT0, k1), x);
  }
  let xyx : Ptr;
  let xyy : Ptr;
  // fast apply
  if xy.tag() == CT0 {
    self.anni += 1;
    xyx = self.heap.get(xy.val(), P1);
    xyy = self.heap.get(xy.val(), P2);
    self.heap.free(xy.val());
  } else {
    let k2 = self.heap.alloc(1);
    xyx = Ptr::new(VR1, k2);
    xyy = Ptr::new(VR2, k2);
    self.link(Ptr::new(CT0, k2), xy);
  }
  let _k3 = Ptr::new(NUM, 0x1);
  let k4 : Ptr;
  // fast op
  if _k3.is_num() && xx.is_num() && xyx.is_num() {
    self.oper += 4;
    k4 = Ptr::new(NUM, self.op(self.op(_k3.val(),xx.val()),xyx.val()));
  } else {
    let k5 = self.heap.alloc(1);
    let k6 = self.heap.alloc(1);
    self.heap.set(k5, P2, Ptr::new(OP2, k6));
    self.link(Ptr::new(VR1,k5), xx);
    self.link(Ptr::new(VR1,k6), xyx);
    self.link(Ptr::new(OP2,k5), _k3);
    k4 = Ptr::new(VR2, k6);
  }
  self.link(k4, xyy);
  return true;
}

Each optimization branch is labelled with a comment. The more
optimization branches are hit, the faster your program will be.

This commit results in a 1.55x speedup in the 'burn' benchmark (the
one that decrements λ-encoded bits in parallel), a 2.94x speedup in a
tree-recursive sum, and a 5.64x speedup in a tail-recursive sum. Note
that tail recursion has NOT been implemented yet, and there are still
some allocations that can be skipped. With better codegen, the maximum
theoretical speedup should be around 36x, which is what we obtain by
manually polishing the generated functions.