The emulator is not as fast as it's advertised. :-P #16

Closed
mooskagh opened this issue Jul 3, 2021 · 6 comments

@mooskagh

mooskagh commented Jul 3, 2021

Sorry for the provocative issue title. This is not really a bug, just a piece of feedback. :-)

I've checked ~10 Z80 emulation libraries, and most of them claim to be "fast", but it doesn't look like any performance comparison was made for any of them. Possibly anything faster than the original Z80 is considered "fast", but I believe that bar would be too low.

https://github.com/floooh/chips/blob/master/chips/z80.h is an example of something faster than this library (in my benchmarks it's 2.5x faster). But even that, on a modern CPU, is only ~600 times faster than the real Z80. In terms of CPU cycles that is impressive ("works as if z80 clock was 2.0 GHz"), but given that Z80 instructions took many more cycles than instructions on modern CPUs, I think there may be room to explore ways to make it faster.
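
The "effective clock" figure is simply the speedup over real time multiplied by the original clock rate. A minimal sketch, assuming the 3.5 MHz clock of the ZX Spectrum's Z80 (the comment does not say which clock was used) and an illustrative function name:

```cpp
// Effective emulated clock = speedup over real time * original clock rate.
// The 3.5 MHz default is an assumption (ZX Spectrum); other Z80 machines
// ran at different rates.
constexpr double effective_clock_hz(double speedup,
                                    double real_clock_hz = 3.5e6) {
    return speedup * real_clock_hz;
}
```

Under that assumption, a 600x speedup corresponds to roughly 2.1 GHz, which is in line with the "2.0 GHz" ballpark quoted above.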

I did run a profiler on this library in my experiments (with clang -O3 and g++ -O3), and as far as I remember and understood the results, the main slowdown seemed to be due to lots of nested function calls, including calling self() just to get this of the correct type, during every instruction decode. One might think that, as all function calls are static, the compiler would be clever enough to inline them or optimize them out, but that happened with neither clang nor g++ (both with -O3).
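
For readers unfamiliar with the pattern being discussed, here is a generic sketch of static dispatch through self() (the curiously recurring template pattern). It is not the library's actual code, which the thread does not show; all class and handler names below are hypothetical.

```cpp
#include <cstdint>

// Hypothetical CRTP base: handlers are resolved at compile time via self().
template<typename Derived>
class cpu_base {
public:
    // Casts this to the most-derived type, so handlers shadowed in Derived
    // are picked up without any virtual calls.
    Derived &self() { return static_cast<Derived&>(*this); }

    // Default handler; a derived class may shadow it.
    std::uint8_t on_read(std::uint16_t) { return 0x00; }

    // Goes through self(), so Derived::on_read is used when it exists.
    std::uint8_t fetch(std::uint16_t pc) { return self().on_read(pc); }
};

// Illustrative machine that supplies its own memory.
class my_machine : public cpu_base<my_machine> {
public:
    std::uint8_t memory[0x10000] = {};
    std::uint8_t on_read(std::uint16_t addr) { return memory[addr]; }
};

// Illustrative machine that keeps the default handlers.
class default_machine : public cpu_base<default_machine> {};
```

Since self() is a trivial static_cast, an optimizing compiler would normally be expected to inline it; whether that actually happens in a given build is exactly what the profile and the disassembly would need to confirm.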

Unfortunately, I didn't keep the profiler stats, but I can try to recreate them if needed.

As a side note not related to this project, I personally am in search of a really fast emulator, which doesn't need to have precise timings. Even going as far as using memcpy() when decoding LDIR (and checking time till interrupt, whether BC or HL intersect address 0, or whether they intersect the instruction itself) would be great.
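
A rough sketch of such an LDIR fast path follows. It is an illustration under stated assumptions, not code from any of the emulators mentioned: the ldir_regs struct and fast_ldir() are hypothetical, memory is a flat 64 KiB array, flag and R register updates are omitted, and the timing uses the documented 21 ticks per repeated iteration and 16 for the last one.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical register subset needed for the sketch.
struct ldir_regs { std::uint16_t hl, de, bc, pc; };

// Tries to execute the whole LDIR located at r.pc with a single memcpy().
// Returns the ticks consumed, or 0 if the caller must fall back to the
// regular byte-by-byte path.
inline unsigned long fast_ldir(std::uint8_t *mem, ldir_regs &r,
                               unsigned long ticks_to_interrupt) {
    const std::uint32_t count = r.bc ? r.bc : 0x10000u;  // BC == 0 means 65536
    const std::uint32_t src = r.hl, dst = r.de;

    // The whole block must finish before the next interrupt.
    const unsigned long ticks = 21ul * (count - 1) + 16ul;
    if (ticks > ticks_to_interrupt) return 0;

    // Neither range may wrap around address 0.
    if (src + count > 0x10000u || dst + count > 0x10000u) return 0;

    // Overlapping ranges make LDIR act like a fill, not a plain copy.
    if (src < dst + count && dst < src + count) return 0;

    // The destination must not overwrite the two LDIR opcode bytes.
    const std::uint32_t op0 = r.pc, op1 = (r.pc + 1u) & 0xffffu;
    auto in_dst = [&](std::uint32_t a) { return a >= dst && a < dst + count; };
    if (in_dst(op0) || in_dst(op1)) return 0;

    std::memcpy(mem + dst, mem + src, count);
    r.hl = static_cast<std::uint16_t>(src + count);
    r.de = static_cast<std::uint16_t>(dst + count);
    r.bc = 0;
    r.pc = static_cast<std::uint16_t>(r.pc + 2);
    return ticks;
}
```

Rejecting every overlapping case is deliberately conservative: the common "LDIR as fill" idiom (DE = HL + 1) would need a separate memset-style path.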

@kosarev
Owner

kosarev commented Jul 3, 2021

OK, that sounds like a challenge! :-)

To proceed with this we would need to be a bit more specific. So if you can share more sources and figures, that would help.

As to self(), hmm, somehow I don't see it not being expanded with clang++/g++ -O3. Can you provide a reproducer? The just-committed e21e6b3 adds some means to actually see what's going on at the assembly level. One problem I see playing with that new example is that the state module could do better if we replace the switches with indexed accesses in handlers like on_get_reg() -- I'm about to file a ticket for that (EDIT: see #17).
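
To illustrate the "switches versus indexed accesses" idea, here is a generic sketch. The real on_get_reg() signature and the state module's layout are not shown in this thread, so the types and names below are hypothetical.

```cpp
#include <cstdint>

// Hypothetical register identifiers.
enum class reg : unsigned { b, c, d, e, h, l, a, count };

// Switch-based accessor: one named field per register.
struct switch_state {
    std::uint8_t b = 0, c = 0, d = 0, e = 0, h = 0, l = 0, a = 0;

    std::uint8_t get(reg r) const {
        switch (r) {
        case reg::b: return b;
        case reg::c: return c;
        case reg::d: return d;
        case reg::e: return e;
        case reg::h: return h;
        case reg::l: return l;
        case reg::a: return a;
        default: return 0;
        }
    }
};

// Indexed accessor: the register id is used directly as an array index,
// so no branch or jump table is needed when the id is not a constant.
struct indexed_state {
    std::uint8_t regs[static_cast<unsigned>(reg::count)] = {};

    std::uint8_t get(reg r) const { return regs[static_cast<unsigned>(r)]; }
    void set(reg r, std::uint8_t v) { regs[static_cast<unsigned>(r)] = v; }
};
```

When the register id is a compile-time constant, both forms may well compile to the same load; the indexed form is likely to matter mainly when the id is decoded from the opcode at run time.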

Another piece of evidence of reasonable performance (even if rather indirect and relative) is a comparison with https://github.com/begoon/i8080-core. So here are the numbers I see on my machine when feeding both the emulators with 8080exm.com:

i8080-core (clang -O3 -fstrict-aliasing -fomit-frame-pointer)
23.7764 +- 0.0242 seconds time elapsed  ( +-  0.10% )

z80 (clang++ -O3 -fstrict-aliasing -fomit-frame-pointer -fno-exceptions -fno-rtti -std=c++11)
14.3844 +- 0.0165 seconds time elapsed  ( +-  0.12% )

Then enabling lazy flags (#6, as they currently are in their early and not very polished implementation) adds about 4 more percent to that difference, but that's a different story (and it's not implemented for z80 yet).

Will take a closer look at the implementation you mentioned and try to get some performance numbers for it.

Not tracking ticks is not a problem with this emulator, but then how do you know when to fire interrupts?

Re memcpy() for ldir: yeah, I have had similar thoughts. The complication is that to do better with repetitive instructions like halt and ldir we need to know in advance how many ticks we can spend without being interrupted. That means some API changes. Same for the memory interface.

Overall, I still feel confident that if you are after something very fast, this implementation may fit. Let's troubleshoot. :-)

@kosarev
Owner

kosarev commented Jul 3, 2021

Here's what I got on my machine.

For floooh/chips:

 Performance counter stats for './z80-zex' (10 runs):

         97,714.03 msec task-clock                #    1.000 CPUs utilized            ( +-  0.09% )
             1,709      context-switches          #    0.017 K/sec                    ( +- 79.15% )
                 3      cpu-migrations            #    0.000 K/sec                    ( +- 29.24% )
                92      page-faults               #    0.001 K/sec                    ( +-  0.40% )
   335,530,466,287      cycles                    #    3.434 GHz                      ( +-  0.06% )
   977,904,049,337      instructions              #    2.91  insn per cycle           ( +-  0.00% )
   138,361,133,797      branches                  # 1415.980 M/sec                    ( +-  0.00% )
       598,201,931      branch-misses             #    0.43% of all branches          ( +-  0.11% )

           97.7551 +- 0.0837 seconds time elapsed  ( +-  0.09% )

For z80:

 Performance counter stats for './benchmark z80 zexall.com' (10 runs):

         39,477.72 msec task-clock                #    0.998 CPUs utilized            ( +-  0.07% )
             2,498      context-switches          #    0.063 K/sec                    ( +- 41.03% )
                 2      cpu-migrations            #    0.000 K/sec                    ( +- 30.84% )
               129      page-faults               #    0.003 K/sec                    ( +-  0.48% )
   135,614,489,147      cycles                    #    3.435 GHz                      ( +-  0.07% )
   453,141,577,768      instructions              #    3.34  insn per cycle           ( +-  0.00% )
    82,440,777,347      branches                  # 2088.286 M/sec                    ( +-  0.00% )
       255,921,164      branch-misses             #    0.31% of all branches          ( +-  0.84% )

           39.5462 +- 0.0179 seconds time elapsed  ( +-  0.05% )
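
The two perf runs above can be compared with a few lines of arithmetic. The sketch below only restates numbers from the output; the struct and function names are illustrative.

```cpp
// One perf stat run: wall-clock seconds, retired instructions, cycles.
struct perf_run { double seconds; double instructions; double cycles; };

// How many times faster the candidate is than the baseline (wall clock).
constexpr double speedup(const perf_run &baseline, const perf_run &candidate) {
    return baseline.seconds / candidate.seconds;
}

// Host instructions retired per host cycle.
constexpr double ipc(const perf_run &r) { return r.instructions / r.cycles; }

// Figures copied from the perf output above.
constexpr perf_run chips_run{97.7551, 977904049337.0, 335530466287.0};
constexpr perf_run z80_run{39.5462, 453141577768.0, 135614489147.0};
```

With these figures, z80 finishes about 2.47 times faster than floooh/chips on this test, while executing roughly 2.16 times fewer host instructions at a somewhat higher instructions-per-cycle rate (3.34 versus 2.91).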

Are you sure you compiled z80 with optimisations enabled?

EDIT: It's also interesting to compare code size. After stripping the binaries, it's 84,856 bytes for z80-zex of floooh/chips and 30,880 bytes for benchmark of z80.

chips.zip
z80.zip

@mooskagh
Author

mooskagh commented Jul 3, 2021

I've just retested it (this time using both emulators to walk through the Manic Miner game), and it indeed turned out that your emulator is 30% faster.

I'm really sorry for the noise!

I tried both clang and g++, both -O3 and -O2, it's the same everywhere (actually with -O2 the difference is even larger).

So I should have kept using your emulator for my project rather than switching to another one.

Interestingly, I have commits before and after switching [to floooh's emulator] in my GitHub repository, and after switching it does work faster, which is why I was sure it really was more performant. I'm investigating why that happens, but haven't yet managed to reduce it to a minimal example.

In my project I save/restore the machine state ~3000 times per second, but that should not make any difference, as the memory class is the same for both emulators and the only difference is saving/restoring registers. Doing that 3000 times per second can hardly be the reason for the slowdown.


For context, the project I used it for is finding the fastest possible playthrough of some ZX Spectrum games, Manic Miner and Jet Set Willy, by doing breadth-first search, which requires saving and restoring the state many times per second and running the VM until a breakpoint.
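
The save/restore-driven search described here might look roughly like the sketch below. It is only an illustration: the snapshot struct, the callback signatures, and the function name are hypothetical, and a practical search would also deduplicate equivalent states.

```cpp
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical machine snapshot: memory image plus whatever registers matter.
struct snapshot {
    std::vector<std::uint8_t> memory;
    std::uint16_t pc = 0;
};

// Breadth-first search over input sequences. step(state, input) is expected
// to run the VM until the next breakpoint; is_goal(state) detects the end of
// the level. Returns the number of steps in a shortest playthrough, or -1 if
// none is found within max_depth steps.
template<typename Step, typename Goal>
int shortest_playthrough(const snapshot &start, unsigned num_inputs,
                         unsigned max_depth, Step step, Goal is_goal) {
    if (is_goal(start)) return 0;
    std::deque<snapshot> frontier{start};
    for (unsigned depth = 1; depth <= max_depth; ++depth) {
        std::deque<snapshot> next;
        for (const snapshot &s : frontier) {
            for (unsigned input = 0; input < num_inputs; ++input) {
                snapshot child = s;   // "restore" the saved state by copying
                step(child, input);   // run until the next breakpoint
                if (is_goal(child)) return static_cast<int>(depth);
                next.push_back(std::move(child));  // "save" for the next level
            }
        }
        frontier.swap(next);
    }
    return -1;
}
```

In this shape every expanded node costs one state copy plus one emulation run, which is why both the cost of a snapshot and the raw emulation speed matter for such a project.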

My next project is intended to be a game that involves an emulation of a retro-futuristic "Z80 data center" on a single server. I hope to emulate at least 300-500 Z80 CPUs in parallel in "Z80 realtime", and I'm currently in search of an emulator library (before today I was pretty sure I'd take floooh's library, but now it seems it's going to be this one).

@mooskagh
Author

mooskagh commented Jul 3, 2021

As I have the old code running, here is the profile. Probably doesn't help, but why not.

pprof142780.1.svg.gz

[Attached: two images]

@simonowen
Collaborator

mooskagh wrote:

My next project is intended to be a game that involves an emulation of a retro-futuristic "Z80 data center" on a single server. I hope to emulate at least 300-500 Z80 CPUs in parallel in "Z80 realtime", and I'm currently in search of an emulator library (before today I was pretty sure I'd take floooh's library, but now it seems it's going to be this one).

Coincidentally, my TileMap project runs lots of Z80 cores in parallel to give a playable game map, currently just for ZX Spectrum titles. I've only pushed that as far as 512 screens for Starquake, which means 512 Z80 cores running in parallel. Like you, I didn't care so much about timing accuracy or contention, just that it ran fast enough to maintain normal Spectrum speed. I used a different Z80 core at the time, but I'd be interested in trying this Z80 core in the same project to see how the performance compares.

Are you sure that CPU performance is going to be the bottleneck for you? I think I might have run out of GPU power before CPU, even on my 10-year-old quad-core i7 system. Though I did limit myself to converting the display with a pixel shader to improve system compatibility, and I'm sure a modern compute shader could do a much better job if I was willing to raise the system requirements.

Also, Manic Miner and Jet Set Willy are both very LDIR heavy, so a disproportionate amount of the frame time is spent copying 2/3 of the display from back buffer to screen. That might change with other titles? It does explain why you'd like to accelerate that if possible -- was it Gerton Lunter's Spectrum emulation that had an option to do that maybe? :)

@kosarev
Copy link
Owner

kosarev commented Jul 4, 2021

Wow, that's brilliant, guys. My own motivation for better performance is implementing a time machine for https://github.com/kosarev/zx, so that there's an efficient way to move backward and forward in time within an execution session by means of API calls.

@mooskagh, I wonder what zx would need to have to be suitable for a project like yours.
