Some optimizations #90

grebe · 2017-08-25T16:04:46Z

I got some speedup by doing things in more of a Java way. Some of the things I did:

Use Array instead of Seq
Store values in an Int array, not in a case class
Eliminate as many hash table lookups as possible
Store assignment operators in two arrays instead of a case class
Use while loops where performance might matter
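
A minimal sketch of the flat-array layout listed above, written in Java for illustration and not taken from the PR. The names `FlatEval`, `values`, `dst`, and `src` are hypothetical, and the sketch assumes each assignment simply copies one value slot into another.

```java
public class FlatEval {
    // Hypothetical layout: every signal value lives in one int array, indexed by slot.
    public final int[] values;
    // Assignments kept as two parallel arrays (destination slot, source slot)
    // instead of one object per assignment.
    public final int[] dst;
    public final int[] src;

    public FlatEval(int slots, int[] dst, int[] src) {
        this.values = new int[slots];
        this.dst = dst;
        this.src = src;
    }

    // One pass over all assignments with a plain index loop:
    // no iterator, no boxing, no hash lookup by signal name.
    public void step() {
        int i = 0;
        while (i < dst.length) {
            values[dst[i]] = values[src[i]];
            i++;
        }
    }

    public static void main(String[] args) {
        // slot 0 -> slot 1, then slot 1 -> slot 2
        FlatEval e = new FlatEval(3, new int[] {1, 2}, new int[] {0, 1});
        e.values[0] = 42;
        e.step();
        System.out.println(e.values[2]);
    }
}
```

The point of the layout is that the inner loop touches only primitive arrays, which is roughly what the list above trades Scala's collections and case classes for.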

Obviously, this throws away a lot of the things that make Scala nice and maintainable. I see a pretty nice speedup, though, so it may be worth it. Strangely, when I run Concrete and Executable together, I don't see the speedup, but when I comment out Executable I see it go from low/mid 5 MHz to just over 6 MHz.

chick · 2017-08-25T17:26:32Z

Interesting. It's funny that the array method worked for you, when I did not see much difference. I started using an array method myself. Maybe I made the indexing method too complicated.

BTW, I did the Verilator test and some superluminal tests. So here is the current summary for GCD (add in your numbers if convenient). I am a little surprised that Verilator is so slow, but GCD has a lot of IO for a relatively small circuit, and polling valid over IPC is probably pretty expensive for it. The superluminals are not so far away.

I do think cycles and MHz are not a great way of evaluating this. What really matters for an interpreter is the number of nodes being processed, so we should be tracking that. The results below are a little clouded because the GCD used is slightly different between the fast tests and the interpreter and Verilator versions. But it's good to have some idea where we stand.

Method	Cycles	Run Time (s)	MHz
Interpreter	3861276	255.034728	0.015
Verilator	3861276	50.425607	0.077
Fast BigInt	33179980	13.461831	2.465
Fast Int	33179980	5.544332	5.984
Pure Scala	32179980	0.349653	92.034
Pure C	33179980	5.571678	241.705
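
For reference, the MHz column is cycles divided by run time in seconds, scaled by 10^6. Below is a small Java sketch of that arithmetic plus the nodes-per-second figure suggested above. The names `Throughput`, `mhz`, and `nodesPerSecond` are hypothetical, and the nodes metric assumes every node is evaluated once per cycle, which may not hold for the real interpreter.

```java
public class Throughput {
    // Simulated clock rate in MHz: cycles per second divided by one million.
    public static double mhz(long cycles, double seconds) {
        return cycles / seconds / 1e6;
    }

    // Hypothetical alternative metric: nodes evaluated per second, assuming
    // each of nodesPerCycle nodes is evaluated exactly once per cycle.
    public static double nodesPerSecond(long cycles, long nodesPerCycle, double seconds) {
        return (double) cycles * nodesPerCycle / seconds;
    }

    public static void main(String[] args) {
        // "Fast Int" row from the table above.
        System.out.println(mhz(33179980L, 5.544332));
    }
}
```

A nodes-per-second number would make runs of slightly different GCD variants easier to compare, since it does not depend on how many cycles each variant needs.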

grebe · 2017-08-25T22:41:50Z

Interesting that Verilator is so slow. Also interesting that the superluminals aren't that insanely fast (of course, an FPGA probably destroys everything by another order of magnitude), but the fast versions are an order of magnitude closer to the superluminals than they are to the current interpreter, so that's something.

Here's what I see on my laptop, which evidently isn't quite as beefy as yours.

Method	Cycles	Run Time (s)	MHz
fast-exec Concrete	33179980	7.42	4.47
fast-exec Executable	33179980	7.66	4.33
My fast-exec	33179980	5.48	6.06

For whatever reason, when I run my fast exec followed by the Executable (as it is on GitHub), it drops to ~5 MHz (and isn't really faster than the Executable one), which is weird. Not really sure what's going on there.

grebe · 2017-08-27T00:21:21Z

OK, my latest push is a little crazy. First, some numbers to excite you:

Method	Cycles	Run Time (s)	MHz
Jasmin	741548392	9.47	78.3

So, that looks pretty incredible. It's kind of cheating. I wrote some code that took the AST and generated Jasmin assembly (Jasmin is an assembler for Java bytecode). I put that code into a stub I had written, put it in a jar in lib/, and holy crap is it fast.

The "compiler" is super unsophisticated. The JVM is just a stack machine, so I do a traversal of the tree and let the stack grow. I'm not sure if there is a limit where the JVM gets mad or stops performing as well; in this instance, it didn't seem to matter.
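
To make the traversal concrete, here is a hedged Java sketch of a post-order walk that emits Jasmin-style instructions and tracks how deep the operand stack gets. It is not the generator from this PR: the AST classes, the `gcd/Ops` class name, and the int-only `(II)I` signatures are assumptions for illustration. On the limit question, the class file format stores a method's maximum stack depth in a 16-bit field, so 65535 is a hard ceiling on depth, and Jasmin expects the depth to be declared with `.limit stack`.

```java
import java.util.ArrayList;
import java.util.List;

public class StackCodegen {
    // Hypothetical AST: a node is either an input held in a local variable
    // slot, or a binary operation implemented as a static method elsewhere.
    public interface Node {}

    public static final class Input implements Node {
        final int slot;
        public Input(int slot) { this.slot = slot; }
    }

    public static final class BinOp implements Node {
        final String method; // e.g. "add"; the name is hard-coded into the output
        final Node left, right;
        public BinOp(String method, Node left, Node right) {
            this.method = method;
            this.left = left;
            this.right = right;
        }
    }

    // Post-order traversal: emit code for both operands, then the call that
    // pops them and pushes the result.
    public static void emit(Node n, List<String> out) {
        if (n instanceof Input) {
            out.add("iload " + ((Input) n).slot);
        } else {
            BinOp b = (BinOp) n;
            emit(b.left, out);
            emit(b.right, out);
            out.add("invokestatic gcd/Ops/" + b.method + "(II)I");
        }
    }

    // Peak operand-stack depth of the code emit() produces for this tree.
    public static int maxStack(Node n) {
        if (n instanceof Input) return 1;
        BinOp b = (BinOp) n;
        // The left result stays on the stack while the right side is evaluated.
        return Math.max(maxStack(b.left), 1 + maxStack(b.right));
    }

    public static void main(String[] args) {
        Node tree = new BinOp("add", new Input(0),
                new BinOp("sub", new Input(1), new Input(2)));
        List<String> out = new ArrayList<>();
        emit(tree, out);
        out.forEach(System.out::println);
        System.out.println(".limit stack " + maxStack(tree));
    }
}
```

With this scheme the stack only grows with the nesting depth of the right-hand operands, so a wide but shallow circuit should stay far below the ceiling.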

I didn't implement the actual operations (add, mux, etc.) in assembly; those are in Scala. Their names are hard-coded in the assembly. Adding BigInt shouldn't be too much of a problem. Also, it isn't doing any width checking/wrapping yet, which will ding the performance but shouldn't make this approach harder to do. Each intermediate operation will just need a width stored in its node, and the operations will need an extra argument for that width.
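
A rough Java sketch of what a width-aware operation could look like once the width is passed as an extra argument. The class `WidthOps` and the wrapping rule (keep only the low `width` bits of an unsigned value that fits in an int) are assumptions for illustration, not the semantics implemented in this PR.

```java
public class WidthOps {
    // Mask with the low `width` bits set; intended for 1 <= width <= 32.
    public static int mask(int width) {
        return width >= 32 ? -1 : (1 << width) - 1;
    }

    // Add, then wrap the result to `width` bits.
    public static int add(int a, int b, int width) {
        return (a + b) & mask(width);
    }

    // Mux: pick a when cond is non-zero, else b, then wrap to `width` bits.
    public static int mux(int cond, int a, int b, int width) {
        return (cond != 0 ? a : b) & mask(width);
    }

    public static void main(String[] args) {
        // 4-bit add: 15 + 1 wraps around.
        System.out.println(add(15, 1, 4));
    }
}
```

The extra cost per operation is one mask computation and one AND, which is the performance ding mentioned above.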

Is this better than Graal? At the moment, I'm doing this in a janky way where changing the circuit requires that I do some copy/pasting, invoke commands, and restart the JVM. I think there is a cleaner way to do this: there are some libraries for generating Java bytecode and dynamically loading it, like ObjectWeb ASM, cglib, or Apache BCEL. I'm not sure how painful it is to get Java to generate the same output as my little assembly generator, but I think it should be doable.
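
A sketch of the no-restart idea using only the standard library: build class bytes in memory and hand them to a ClassLoader. To stay dependency-free it hand-assembles a tiny class file, which is a stand-in for what ASM, cglib, or BCEL would produce in a real version. The class name `Gen`, its method `sub`, and the helper names are all hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;

public class InMemoryClass {
    // Bytes of a minimal class `Gen` with one method:
    //   public static int sub(int a, int b) { return a - b; }
    public static byte[] build() throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream o = new DataOutputStream(buf);
        o.writeInt(0xCAFEBABE);                          // magic
        o.writeShort(0); o.writeShort(50);               // minor, major (Java 6)
        o.writeShort(8);                                 // constant pool count = entries + 1
        o.writeByte(1); o.writeUTF("Gen");               // #1 Utf8
        o.writeByte(7); o.writeShort(1);                 // #2 Class Gen
        o.writeByte(1); o.writeUTF("java/lang/Object");  // #3 Utf8
        o.writeByte(7); o.writeShort(3);                 // #4 Class Object
        o.writeByte(1); o.writeUTF("sub");               // #5 method name
        o.writeByte(1); o.writeUTF("(II)I");             // #6 method descriptor
        o.writeByte(1); o.writeUTF("Code");              // #7 attribute name
        o.writeShort(0x0021);                            // ACC_PUBLIC | ACC_SUPER
        o.writeShort(2); o.writeShort(4);                // this_class, super_class
        o.writeShort(0); o.writeShort(0);                // no interfaces, no fields
        o.writeShort(1);                                 // one method
        o.writeShort(0x0009);                            // ACC_PUBLIC | ACC_STATIC
        o.writeShort(5); o.writeShort(6);                // name, descriptor
        o.writeShort(1);                                 // one attribute: Code
        byte[] code = {0x1a, 0x1b, 0x64, (byte) 0xac};   // iload_0, iload_1, isub, ireturn
        o.writeShort(7);                                 // attribute name index
        o.writeInt(12 + code.length);                    // attribute length
        o.writeShort(2); o.writeShort(2);                // max_stack, max_locals
        o.writeInt(code.length);
        o.write(code);
        o.writeShort(0); o.writeShort(0);                // no exception table, no attributes
        o.writeShort(0);                                 // no class attributes
        return buf.toByteArray();
    }

    // defineClass is protected, so expose it through a throwaway subclass.
    static final class Loader extends ClassLoader {
        Class<?> define(String name, byte[] b) {
            return defineClass(name, b, 0, b.length);
        }
    }

    // Defines a fresh copy of Gen and calls Gen.sub(a, b) reflectively.
    public static int callSub(int a, int b) {
        try {
            Class<?> gen = new Loader().define("Gen", build());
            return (Integer) gen.getMethod("sub", int.class, int.class).invoke(null, a, b);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(callSub(48, 18));
    }
}
```

Because each call creates a new loader, a regenerated circuit could be defined again under the same class name without restarting the JVM.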

This has the huge benefit that it isn't experimental like Graal. Unlike Graal, we are "compiling" everything ahead of time, but FIRRTL already traverses the graph a bunch of times. As long as the Java bytecode generation process isn't too expensive, another traversal shouldn't be that bad.

I'd be really curious to see how this scales for a larger design. I imagine this is small enough that the JVM just JITs everything. I'm not sure what level of granularity the JVM JITs things on, so it's possible that an enormo-function wouldn't perform well and we'd need to be a little smarter.

Appendix: commands to build the assembly and include it

jasmin gcd.j
# check that the java classloader won't instantly barf on what we made
java gcd.GCDImpl
jar cvf gcd.jar gcd
mv gcd.jar lib

chick · 2017-08-27T04:57:55Z

That's an awesome-looking number. It does sound a little complicated. I've been making slow progress on getting the AST compiler code working. We'll need that to test more complicated circuits like rocket. It would be intriguing to see how that goes, but we'll have to build things out more and get the bit ops working. Must be the eclipse, but it's the 3rd time this month I've been looking at a project on SourceForge. I hadn't done it before this month in the last couple of years.

grebe added 2 commits August 24, 2017 23:19

Checkpoint

f06f726

Some speedup

d0a7a08

Try Jasmin

c712bb9

chick added the DO NOT MERGE label Nov 19, 2020
