Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Long variable names significantly slow down assembly #100

Open
dp111 opened this issue May 25, 2024 · 18 comments
Open

Long variable names significantly slow down assembly #100

dp111 opened this issue May 25, 2024 · 18 comments

Comments

@dp111
Copy link

dp111 commented May 25, 2024

I've been making variable names a little more descriptive in MMFS so instead of MA+&1107 -> channeldata_directory_locked . Assembly has significantly slowed down now.

@mungre
Copy link
Collaborator

mungre commented May 26, 2024

Can you provide a link to before and after commits for comparison?

@mungre
Copy link
Collaborator

mungre commented May 26, 2024

beebasm memoizes every symbol that gets assigned. When a symbol is assigned in a for loop the name is mangled so that the symbol gets a new name each time round the loop. A similar scheme mangles names defined in macro invocations and nested scopes.

In the current release the mangled name is like a file system path. It contains scope IDs for the current scope and the parent scopes. Every symbol lookup reconstructs this path name before doing the lookup. It's not quick.

However, the current scope has an ID that is unique across scopes and invocations. Coupled with the FOR counter this provides a unique ID that can be used instead of the path.

I've implemented this in mungre/simplesuffix. Can you try it out?

This will be quicker, but I don't know if it accounts for a significant part of your slowdown.

@dp111
Copy link
Author

dp111 commented May 27, 2024 via email

@mungre
Copy link
Collaborator

mungre commented May 27, 2024

Cheers. It was a bit of a punt. People always say that you don't know anything until you measure performance problems.

@mungre
Copy link
Collaborator

mungre commented May 30, 2024

I used Windows Performance Analyzer to dig into it and I timed the assembly of bogomandel to test the improvements. For each test I did a couple of runs as warm up and then averaged three more:

6260ms - Original
5650ms - With simplified name mangling (as above)
3875ms - With name mangling removed
3660ms - With IsSymbolDefined fixed
3000ms - Reusing LineParser rather than constructing a new one for each line

So it's considerably faster now. These changes are also in mungre/simplesuffix.

There is still a lot of headroom. I naively assumed that reading the file pointer on a buffered stream would just return a cached value. In fact it queries the system's file pointer and subtracts the remaining buffer space (with a fudge factor for newline conversions that I couldn't be bothered to make sense of). This takes fully 20% of the run time! This may just be a feature of the C-runtime on Windows rather than a cross-platform problem though. A further 20% of the time is spent reading lines from the source file. It's slow because the lines are re-read every time round a for loop, for example. And it's quite complicated keeping track of the file pointers. I might change it to read the whole source file into a vector once.

@dp111
Copy link
Author

dp111 commented May 30, 2024 via email

@dp111
Copy link
Author

dp111 commented May 30, 2024 via email

@mungre
Copy link
Collaborator

mungre commented May 30, 2024

Nice one. I guess that's quicker! Thanks for trying it out.

I did another simple change to read the whole source file in one go. It knocked another third off the run-time.

6260ms - Original
5650ms - With simplified name mangling (as above)
3875ms - With name mangling removed
3660ms - With IsSymbolDefined fixed
3000ms - Reusing LineParser rather than constructing a new one for each line
1925ms - Load whole source file into a string

It also means that MacroInstance and SourceFile are almost identical now because they both pre-load their content. So I'm somewhat tempted to get rid of them and just use the base class SourceCode. This would be simpler, and easier to optimise further. Maybe another time.

@dp111
Copy link
Author

dp111 commented May 30, 2024 via email

@mungre
Copy link
Collaborator

mungre commented May 30, 2024

I tried bogomandel on WSL2. It had a much more modest overall improvement from 4.1s to 2.7s. Faster than Windows to start with but not as fast after the work.

I also tried MMFS (the lastest master branch from your repo) and the overall improvement was quite small, 52s down to 47s, but I suppose it is doing a lot of non-beebasm stuff. Your computer must be quick though!

I haven't directly addressed the original problem. If you notice that adding long symbols names is still making it significantly slower please post an update.

@mungre
Copy link
Collaborator

mungre commented Jun 3, 2024

This is a bit addictive. I did some profiling on Linux. It was spending a surprising amount of time comparing strings and calculating the length of tokens (assembly language instructions, directives, etc.).

Symbols are now looked up in SymbolNameCache and represented by SymbolName, which is just an integer index. This was another big speed up and the lengths of variable names shouldn't have much of an effect on performance now.

The lengths of tokens are now pre-calculated.

Replacing std::map with std::unordered_map in SymbolTable was a bigger win on Linux but made the Windows version ten times slower! I think this is possibly because I'm using a 32-bit compiler on Windows and my hash function wasn't great with a 32-bit hash value. So I haven't included this for now.

More bogomandel times on Windows:

6260ms - Original
5650ms - With simplified name mangling
3875ms - With name mangling removed
3660ms - With IsSymbolDefined fixed
3000ms - Reusing LineParser rather than constructing a new one for each line
1925ms - Load file into a string
1480ms - Use SymbolName and SymbolNameCache (new)
1370ms - Calculate token lengths at compile time (new)

@dp111
Copy link
Author

dp111 commented Jun 3, 2024 via email

@mungre
Copy link
Collaborator

mungre commented Jun 11, 2024

(Note that changes are now in mungre/perf.)

I came up with a hash function that worked on Windows so it's using std::unordered_map rather than std::map. This made the improvement from SymbolName and SymbolNameCache very slight, and it was quite an extensive change, so I've removed it.

bogomandel on Windows:

6260ms - Original
5650ms - With simplified name mangling (as above)
3875ms - With name mangling removed
3660ms - With IsSymbolDefined fixed
3000ms - Reusing LineParser rather than constructing a new one for each line
1925ms - Load file into a string
1480ms - Use SymbolName and SymbolNameCache
1370ms - Calculate token lengths at compile time
1350ms - Calculate all token lengths at compile time
1255ms - Using unordered_map and a better hash function
1225ms - Improved ReadFile
1210ms - Removed redundant whitespace from EatWhitespace
1290ms - Removed SymbolName and SymbolNameCache
1200ms - Replaced isdigit, isalpha, toupper
885ms - New GetLine

Then I went back to Linux and it still seemed ridiculously slow. I profiled the code and the results didn't make any sense. It was like it wasn't being optimised. And then the penny dropped.

bogomandel on Linux:

3.2s originally
1.5s with latest fixes
0.5s with -O3 !!!!!

MMFS on WSL2:
52s originally
40s original and -O3
36s latest and -O3

The latest commit in mungre/perf adds the -O3 flag to the compiler options in CMakeLists.txt. And it's much quicker.

@dp111
Copy link
Author

dp111 commented Jun 11, 2024 via email

@mungre
Copy link
Collaborator

mungre commented Jun 11, 2024

That explains it! I'd been using the cmake stuff thinking I don't remember it being this complicated.

I've rolled back the cmake fix because, apparently, cmake will automatically set the -O3 flag if you hold it right, but I'm yet to figure out how.

More usefully, I've altered src/Makefile to set the NDEBUG flag, which disables asserts. This made bogomandel about 20% quicker because I'd left a bunch of asserts in tight loops.

@dp111
Copy link
Author

dp111 commented Jun 11, 2024 via email

@mungre
Copy link
Collaborator

mungre commented Jun 12, 2024

Thanks for the suggestions. -fanalyzer found a missing check for a failed malloc (now fixed) but the complaints about use-after-free and leaking in expressions.cpp seem to be bogus. To double-check I ran the tests with msvc's debug allocator on Windows and I used valgrind's memcheck on Linux while assembling bogomandel. Neither tool grumbled.

I've fixed nearly all of the performance warnings from cppcheck. The remaining four are templated types which are numeric in practice, so they are fine passed by value. It made a noticeable improvement to bogomandel on both Windows and Linux (about 15% and 20% respectively).

These changes are in mungre/perf again.

I upgraded to gcc-13 (which is the most recent available on my Linux install) from gcc-11 and it was a bit quicker.

On Linux using gcc-13 and src/Makefile (i.e. with -O3 throughout), bogomandel has improved from 1.34s originally to 0.39s now.

Similarly, a full build of MMFS (just the beebasm parts) has improved from 12.7s to 9.2s.

I think the improvement in MMFS is much less marked than the improvement in bogomandel simply because each run of beebasm for MMFS is doing much less but each run has the same constant overhead.

You would probably get a bigger improvement by converting build.sh to a Makefile and running the build in parallel. beebasm can accept string symbols on the command-line now (with -S) so you don't need DEVICE.asm to pass in the device and system name.

@dp111
Copy link
Author

dp111 commented Jun 15, 2024 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants