Some optimizations #90

Open
wants to merge 3 commits into
base: fast-exec

grebe
Contributor

@grebe grebe commented Aug 25, 2017

I got some speedup by doing things in more of a Java way. Some of the things I did:

  • Used Array instead of Seq
  • Put values in an integer array, not in a case class
  • Eliminated as many hash table lookups as possible
  • Stopped using a case class for assignment operators; used two arrays instead
  • Used while loops where performance might matter

Obviously, this throws away a lot of the things that make Scala nice and maintainable. I see a pretty nice speedup, though, so it may be worth it. Strangely, when I run Concrete and Executable together, I don't see the speedup, but when I comment out Executable I see it go from low/mid 5 MHz to just over 6.
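To make the list above concrete, here is a minimal sketch of the array-based style, not the code in this PR. It assumes a flat `Int` array for all signal values, two parallel arrays for assignments (destination slot and right-hand side), and a `while` loop for the per-cycle update. `ArrayExecSketch`, `IntExpr`, `Executor`, and the slot layout are all hypothetical names chosen for illustration.

```scala
// Illustrative sketch only: all names here are hypothetical, not the PR's code.
object ArrayExecSketch {
  // An expression reads the flat state array and returns an unboxed Int.
  trait IntExpr { def apply(state: Array[Int]): Int }

  final class Executor(
      val state: Array[Int],  // one slot per signal, addressed by index (no hash lookup)
      dests: Array[Int],      // dests(i): slot written by assignment i
      exprs: Array[IntExpr]   // exprs(i): right-hand side of assignment i
  ) {
    // A while loop avoids the iterator and closure overhead of foreach.
    def step(): Unit = {
      var i = 0
      while (i < dests.length) {
        state(dests(i)) = exprs(i)(state)
        i += 1
      }
    }
  }

  // Slot layout for a subtractive GCD: 0 = x, 1 = y, 2 = next x, 3 = next y.
  def gcd(a: Int, b: Int): Int = {
    require(a > 0 && b > 0, "inputs must be positive")
    val nextX = new IntExpr {
      def apply(s: Array[Int]): Int = if (s(0) > s(1)) s(0) - s(1) else s(0)
    }
    val nextY = new IntExpr {
      def apply(s: Array[Int]): Int = if (s(0) > s(1)) s(1) else s(1) - s(0)
    }
    val copyX = new IntExpr { def apply(s: Array[Int]): Int = s(2) }
    val copyY = new IntExpr { def apply(s: Array[Int]): Int = s(3) }
    val exec = new Executor(
      Array(a, b, 0, 0),
      Array(2, 3, 0, 1), // compute both next values first, then commit them
      Array[IntExpr](nextX, nextY, copyX, copyY)
    )
    while (exec.state(1) != 0) exec.step() // one step() is one simulated cycle
    exec.state(0)
  }

  def main(args: Array[String]): Unit = println(gcd(48, 18))
}
```

The custom `IntExpr` trait is used instead of a Scala function type so that results stay unboxed; whether that matters in practice would need measuring.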

@chick
Contributor

chick commented Aug 25, 2017

Interesting. It's funny that the array method worked for you, when I did not see much difference. I started using an array method myself. Maybe I made the indexing method too complicated.

BTW, I did the verilator test and some superluminal tests. So here is the current summary for GCD (add in your numbers if convenient). I am a little surprised that verilator is so slow, but GCD has a lot of IO for a relatively small circuit, and polling valid over IPC is probably pretty expensive for it. The superluminals are not so far away.

I do think cycles and MHz are not a great way of evaluating this. What really matters for an interpreter is the number of nodes being processed, so we should be tracking that. The results below are a little clouded because the GCD used is slightly different between the fast tests and the interpreter and verilator versions. But it's good to have some idea where we stand.

| Method | Cycles | Run Time (s) | MHz |
|---|---|---|---|
| Interpreter | 3861276 | 255.034728 | 0.015 |
| Verilator | 3861276 | 50.425607 | 0.077 |
| Fast BigInt | 33179980 | 13.461831 | 2.465 |
| Fast Int | 33179980 | 5.544332 | 5.984 |
| Pure Scala | 32179980 | 0.349653 | 92.034 |
| Pure C | 33179980 | 5.571678 | 241.705 |
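For reference, the MHz column is just cycles divided by run time in seconds, divided by one million. The sketch below shows that arithmetic plus one possible node-based metric of the kind suggested above. `Throughput`, `mhz`, `nodesPerSecond`, and the `nodesPerCycle` parameter are hypothetical; no node counts were reported in this thread.

```scala
// Illustrative helpers; the node-based metric is an assumption, not a measured quantity.
object Throughput {
  // Simulated clock rate in MHz: cycles per second divided by one million.
  def mhz(cycles: Long, seconds: Double): Double = cycles / seconds / 1e6

  // Hypothetical alternative: expression nodes evaluated per second,
  // assuming every cycle evaluates the same number of nodes.
  def nodesPerSecond(cycles: Long, nodesPerCycle: Long, seconds: Double): Double =
    cycles.toDouble * nodesPerCycle / seconds

  def main(args: Array[String]): Unit = {
    // The Fast Int row from the table above.
    println(f"${mhz(33179980L, 5.544332)}%.3f MHz")
  }
}
```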

@grebe
Contributor Author

grebe commented Aug 25, 2017

Interesting that verilator is so slow. Also interesting that the superluminals aren't that insanely fast (of course, FPGA probably destroys everything by another order of magnitude), but the fast versions are an order of magnitude closer to the superluminals than they are to the current interpreter, so that's something.

Here's what I see on my laptop, which evidently isn't quite as beefy as yours.

| Method | Cycles | Run Time (s) | MHz |
|---|---|---|---|
| fast-exec Concrete | 33179980 | 7.42 | 4.47 |
| fast-exec Executable | 33179980 | 7.66 | 4.33 |
| My fast-exec | 33179980 | 5.48 | 6.06 |

For whatever reason, when I run my fast exec followed by the Executable (as it is on github), it drops to ~5MHz (and isn't really faster than the Executable one), which is weird. Not really sure what's going on there.

@grebe
Contributor Author

grebe commented Aug 27, 2017

OK, my latest push is a little crazy. First, some numbers to excite you:

| Method | Cycles | Run Time (s) | MHz |
|---|---|---|---|
| Jasmin | 741548392 | 9.47 | 78.3 |

So, that looks pretty incredible, but it's kind of cheating. I wrote some code that takes the AST and generates Jasmin assembly (Jasmin is an assembler for Java bytecode). I put the generated code into a stub I had written, put that in a jar in lib/, and holy crap is it fast.

The "compiler" is super unsophisticated. The JVM is just a stack machine, so I do a traversal of the tree and let the stack grow. I'm not sure if there is a limit where the JVM gets mad or stops performing as well; in this instance, it didn't seem to matter.
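A rough sketch of what such a traversal can look like, not the generator from this PR: a post-order walk that emits code for each operand and then a call to the operator, plus a helper that computes the deepest the operand stack gets (the number Jasmin's `.limit stack` directive needs). The `Expr` types, the `gcd/Ops` class holding the operations, and the use of local-variable slots for inputs are all assumptions made for illustration.

```scala
// Illustrative sketch only: the AST shape and the gcd/Ops target class are hypothetical.
object StackCodegenSketch {
  sealed trait Expr
  case class Const(value: Int) extends Expr
  case class Local(slot: Int) extends Expr                    // value held in a JVM local variable
  case class Op(name: String, args: List[Expr]) extends Expr  // e.g. "sub", "gt", "mux"

  // Post-order traversal: push every operand, then invoke the operator,
  // which pops its arguments and pushes one Int result.
  def emit(e: Expr): List[String] = e match {
    case Const(v) => List(s"ldc $v")
    case Local(s) => List(s"iload $s")
    case Op(name, args) =>
      val descriptor = "(" + "I" * args.length + ")I"
      args.flatMap(emit) :+ s"invokestatic gcd/Ops/$name$descriptor"
  }

  // Deepest operand stack reached while evaluating e: while argument i is
  // being computed, the i earlier results are already sitting on the stack.
  def maxStack(e: Expr): Int = e match {
    case Op(_, args) =>
      val during = args.zipWithIndex.map { case (a, i) => i + maxStack(a) }
      (1 :: during).max
    case _ => 1
  }

  def main(args: Array[String]): Unit = {
    val x = Local(0)
    val y = Local(1)
    val nextX = Op("mux", List(Op("gt", List(x, y)), Op("sub", List(x, y)), x))
    println(s".limit stack ${maxStack(nextX)}")
    emit(nextX).foreach(println)
  }
}
```

On the question of limits: the class-file format caps a single method's bytecode at 64 KB, so a much larger circuit would presumably have to be split across several methods rather than emitted as one.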

I didn't implement the actual operations (add, mux, etc.) in assembly; those are in Scala, and their names are hard-coded in the assembly. Adding BigInt shouldn't be too much of a problem. Also, it isn't doing any width checking/wrapping yet, which will ding the performance but shouldn't make this approach harder to do. Each intermediate operation will just need a width in its node, and the operations will need an extra argument for the width.
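As a sketch of what width-aware operations might look like (not code from this PR), each operation could take the result width as an extra argument and mask the result down to that many bits. This assumes unsigned values that fit in an `Int` and widths between 1 and 32; `WidthOps` and its method names are hypothetical.

```scala
// Illustrative sketch only: assumes unsigned semantics and 1 <= width <= 32.
object WidthOps {
  // Mask that keeps the low `width` bits. The 32-bit case is handled
  // separately because (1 << 32) wraps around to 1 on the JVM.
  def mask(width: Int): Int = if (width >= 32) -1 else (1 << width) - 1

  // Each operation computes at full Int precision, then wraps to the width.
  def add(a: Int, b: Int, width: Int): Int = (a + b) & mask(width)
  def sub(a: Int, b: Int, width: Int): Int = (a - b) & mask(width)
}
```

With this shape, the generated assembly would only need to push one more integer constant before each call.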

Is this better than graal? At the moment, I'm doing this in a janky way where changing the circuit requires that I do some copy/pasting, invoke commands, and restart the jvm. I think there is a cleaner way to do this- there are some libraries for generating java bytecode and dynamically loading it like objectweb asm, cglib, or apache bcel. I'm not sure how painful it is to get java to generate the same output as my little assembly generator, but I think it should be doable.

This has the huge benefit that it isn't experimental like graal. Unlike graal, we are "compiling" everything ahead of time, but firrtl traverses the graph a bunch of times. As long as the java bytecode generation process isn't too expensive, another traversal shouldn't be that bad.

I'd be really curious to see how this scales for a larger design. I imagine this is small enough that the jvm just jits everything. I'm not sure what level of granularity the jvm jits things on, so it's possible that an enormo-function wouldn't perform well and we'd need to be a little smarter.

Appendix: commands to build the assembly and include it

jasmin gcd.j
# check that the java classloader won't instantly barf on what we made
java gcd.GCDImpl
jar cvf gcd.jar gcd
mv gcd.jar lib

@chick
Contributor

chick commented Aug 27, 2017

That's an awesome looking number. It does sound a little complicated. I've been making slow progress on getting the AST compiler code working. We'll need that to test more complicated circuits like rocket. It would be intriguing to see how that goes, but we'll have to build things out more and get the bit ops working. Must be the eclipse, but it's the 3rd time this month I've been looking at a project on SourceForge. I hadn't done that in the couple of years before this month.
