Long variable names significantly slow down assembly #100
Comments
Can you provide a link to before and after commits for comparison?
beebasm memoizes every symbol that gets assigned. When a symbol is assigned in a for loop the name is mangled so that the symbol gets a new name each time round the loop. A similar scheme mangles names defined in macro invocations and nested scopes.

In the current release the mangled name is like a file system path. It contains scope IDs for the current scope and the parent scopes. Every symbol lookup reconstructs this path name before doing the lookup. It's not quick.

However, the current scope has an ID that is unique across scopes and invocations. Coupled with the FOR counter this provides a unique ID that can be used instead of the path. I've implemented this in mungre/simplesuffix. Can you try it out? This will be quicker, but I don't know if it accounts for a significant part of your slowdown.
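For illustration, the two mangling schemes might look something like this. This is a minimal sketch with invented function names; beebasm's real code differs:

```cpp
#include <string>
#include <vector>

// Old scheme (sketch): rebuild a path of scope IDs on every lookup.
std::string ManglePathStyle( const std::string& symbol,
                             const std::vector<int>& scopeStack,
                             int forCount )
{
    std::string mangled;
    for ( int id : scopeStack )
    {
        mangled += "@" + std::to_string( id );   // current and parent scope IDs
    }
    mangled += "@" + std::to_string( forCount ) + "." + symbol;
    return mangled;
}

// New scheme (sketch): the innermost scope ID is already unique across
// scopes and invocations, so a single suffix plus the FOR counter suffices.
std::string MangleSimpleSuffix( const std::string& symbol,
                                int uniqueScopeId,
                                int forCount )
{
    return symbol + "@" + std::to_string( uniqueScopeId )
                  + "@" + std::to_string( forCount );
}
```

The saving is that the new scheme does no per-lookup walk of the scope stack.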
I've just given it a go. Unfortunately, I can't detect a difference. I'll get my GitHub repo into a position where I can show the difference.
Cheers. It was a bit of a punt. People always say that you don't know anything about a performance problem until you measure it.
I used Windows Performance Analyzer to dig into it and I timed the assembly of bogomandel to test the improvements. For each test I did a couple of runs as warm-up and then averaged three more:

6260ms - Original
5650ms - With simplified name mangling (as above)
3875ms - With name mangling removed
3660ms - With IsSymbolDefined fixed
3000ms - Reusing LineParser rather than constructing a new one for each line

So it's considerably faster now. These changes are also in mungre/simplesuffix.

There is still a lot of headroom. I naively assumed that reading the file pointer on a buffered stream would just return a cached value. In fact it queries the system's file pointer and subtracts the remaining buffer space (with a fudge factor for newline conversions that I couldn't be bothered to make sense of). This takes fully 20% of the run time! This may just be a feature of the C runtime on Windows rather than a cross-platform problem, though.

A further 20% of the time is spent reading lines from the source file. It's slow because the lines are re-read every time round a for loop, for example. And it's quite complicated keeping track of the file pointers. I might change it to read the whole source file into a vector once.
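The file-pointer cost can be sidestepped by tracking the position in user code instead of asking the stream. A minimal sketch (class name invented; it assumes binary mode, so it ignores the newline-translation complication mentioned above):

```cpp
#include <cstdio>

// On some C runtimes, ftell() on a buffered FILE* queries the OS file
// pointer and adjusts for the unread buffer contents, so calling it per
// line is surprisingly expensive. A cheap alternative is to count bytes
// as they are consumed:
class CountingReader
{
public:
    explicit CountingReader( FILE* f ) : m_file( f ), m_pos( 0 ) {}

    int GetChar()
    {
        int c = fgetc( m_file );
        if ( c != EOF )
        {
            ++m_pos;            // track position ourselves
        }
        return c;
    }

    long Tell() const { return m_pos; }   // no ftell() call at all

private:
    FILE* m_file;
    long  m_pos;
};
```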
Wow, amazing work! I'll have a play later.
Thanks for your work. MMFS now builds all versions in 8.5 seconds under WSL2:

real 0m8.576s
user 0m4.949s
sys 0m1.229s
Nice one. I guess that's quicker! Thanks for trying it out.

I did another simple change to read the whole source file in one go. It knocked another third off the run-time.

6260ms - Original
5650ms - With simplified name mangling (as above)
3875ms - With name mangling removed
3660ms - With IsSymbolDefined fixed
3000ms - Reusing LineParser rather than constructing a new one for each line
1925ms - Load whole source file into a string

It also means that MacroInstance and SourceFile are almost identical now because they both pre-load their content. So I'm somewhat tempted to get rid of them and just use the base class SourceCode. This would be simpler, and easier to optimise further. Maybe another time.
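The "load the whole source file up front" change boils down to something like this (a sketch with an invented function name): one read, after which re-reading lines in a FOR loop is an in-memory pointer move instead of file I/O.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Read an entire file into a string in one go. Binary mode avoids the
// newline-translation bookkeeping that made per-line ftell() expensive.
std::string ReadWholeFile( const std::string& path )
{
    std::ifstream in( path, std::ios::binary );
    std::ostringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}
```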
Just tested the latest version; I can't detect any significant change in run time now. So maybe WSL2 is better at caching file access.
I tried bogomandel on WSL2. It had a much more modest overall improvement, from 4.1s to 2.7s: faster than Windows to start with, but not as fast after the work. I also tried MMFS (the latest master branch from your repo) and the overall improvement was quite small, 52s down to 47s, but I suppose it is doing a lot of non-beebasm stuff. Your computer must be quick, though!

I haven't directly addressed the original problem. If you notice that adding long symbol names is still making it significantly slower, please post an update.
This is a bit addictive. I did some profiling on Linux. It was spending a surprising amount of time comparing strings and calculating the length of tokens (assembly language instructions, directives, etc.).

Symbols are now looked up in SymbolNameCache and represented by SymbolName, which is just an integer index. This was another big speed-up, and the lengths of variable names shouldn't have much of an effect on performance now. The lengths of tokens are now pre-calculated.

Replacing std::map with std::unordered_map in SymbolTable was a bigger win on Linux but made the Windows version ten times slower! I think this is possibly because I'm using a 32-bit compiler on Windows and my hash function wasn't great with a 32-bit hash value. So I haven't included this for now.

More bogomandel times on Windows:

6260ms - Original
5650ms - With simplified name mangling
3875ms - With name mangling removed
3660ms - With IsSymbolDefined fixed
3000ms - Reusing LineParser rather than constructing a new one for each line
1925ms - Load file into a string
1480ms - Use SymbolName and SymbolNameCache (new)
1370ms - Calculate token lengths at compile time (new)
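The interning idea behind SymbolName and SymbolNameCache can be sketched like this (an illustrative implementation, not beebasm's actual class): each distinct symbol string is stored once and thereafter handled as a small integer, so symbol comparisons cost the same regardless of name length.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

class SymbolNameCache
{
public:
    // Return the existing index for a name, or assign a new one.
    int Intern( const std::string& name )
    {
        auto it = m_index.find( name );
        if ( it != m_index.end() )
        {
            return it->second;
        }
        int id = static_cast<int>( m_names.size() );
        m_names.push_back( name );
        m_index.emplace( name, id );
        return id;
    }

    // Recover the original spelling when needed (e.g. error messages).
    const std::string& NameOf( int id ) const { return m_names[id]; }

private:
    std::unordered_map<std::string, int> m_index;
    std::vector<std::string> m_names;
};
```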
Sorry for sending you down this path :). I only get about a 15%-20% improvement, but an improvement is an improvement! Thanks.
(Note that changes are now in mungre/perf.)

I came up with a hash function that worked on Windows, so it's using std::unordered_map rather than std::map. This made the improvement from SymbolName and SymbolNameCache very slight, and it was quite an extensive change, so I've removed it.

bogomandel on Windows:

6260ms - Original
5650ms - With simplified name mangling (as above)
3875ms - With name mangling removed
3660ms - With IsSymbolDefined fixed
3000ms - Reusing LineParser rather than constructing a new one for each line
1925ms - Load file into a string
1480ms - Use SymbolName and SymbolNameCache
1370ms - Calculate token lengths at compile time
1350ms - Calculate all token lengths at compile time
1255ms - Using unordered_map and a better hash function
1225ms - Improved ReadFile
1210ms - Removed redundant whitespace from EatWhitespace
1290ms - Removed SymbolName and SymbolNameCache
1200ms - Replaced isdigit, isalpha, toupper
885ms - New GetLine

Then I went back to Linux and it still seemed ridiculously slow. I profiled the code and the results didn't make any sense. It was like it wasn't being optimised. And then the penny dropped.

bogomandel on Linux:

3.2s originally
1.5s with latest fixes
0.5s with -O3 !!!!!

MMFS on WSL2:

52s originally
40s original and -O3
36s latest and -O3

The latest commit in mungre/perf adds the -O3 flag to the compiler options in CMakeLists.txt. And it's much quicker.
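One classic string hash that mixes every byte into the low bits, and so behaves reasonably even when size_t is 32 bits (the suspected cause of the earlier unordered_map slowdown), is FNV-1a. A sketch, not necessarily the hash actually adopted here:

```cpp
#include <cstdint>
#include <string>

// 32-bit FNV-1a: xor each byte into the hash, then multiply by the FNV
// prime. Simple, fast, and reasonable on both 32- and 64-bit targets.
struct Fnv1aHash
{
    std::size_t operator()( const std::string& s ) const
    {
        std::uint32_t h = 2166136261u;      // FNV offset basis
        for ( unsigned char c : s )
        {
            h ^= c;
            h *= 16777619u;                 // FNV prime
        }
        return static_cast<std::size_t>( h );
    }
};
```

It would be plugged in as the third template argument, e.g. `std::unordered_map<std::string, Symbol, Fnv1aHash>`.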
It turns out I have already been using -O3 with the Makefile in the src directory.
That explains it! I'd been using the cmake stuff, thinking "I don't remember it being this complicated". I've rolled back the cmake fix because, apparently, cmake will automatically set the -O3 flag if you hold it right, but I've yet to figure out how.

More usefully, I've altered src/Makefile to set the NDEBUG flag, which disables asserts. This made bogomandel about 20% quicker because I'd left a bunch of asserts in tight loops.
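The NDEBUG effect is easy to see in isolation. assert() expands to nothing when NDEBUG is defined (typically via `g++ -O3 -DNDEBUG`), so a check inside a hot loop costs nothing in release builds but real time in debug builds. An invented example of the kind of check that was being paid for per iteration:

```cpp
#include <cassert>

int SumBytes( const unsigned char* data, int count )
{
    int total = 0;
    for ( int i = 0; i < count; ++i )
    {
        assert( data != nullptr );   // compiled out under -DNDEBUG
        total += data[i];
    }
    return total;
}
```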
No measurable difference for me, but I'm using gcc 14.1, which may have better optimisation. I don't know if you are interested, but -fanalyzer gives some potential warnings which I haven't checked. "cppcheck --enable=all ." has some performance suggestions, mainly parameters being passed by const reference. Again, I don't know if these are real issues.
Thanks for the suggestions. -fanalyzer found a missing check for a failed malloc (now fixed), but the complaints about use-after-free and leaking in expressions.cpp seem to be bogus. To double-check, I ran the tests with msvc's debug allocator on Windows and I used valgrind's memcheck on Linux while assembling bogomandel. Neither tool grumbled.

I've fixed nearly all of the performance warnings from cppcheck. The remaining four are templated types which are numeric in practice, so they are fine passed by value. It made a noticeable improvement to bogomandel on both Windows and Linux (about 15% and 20% respectively). These changes are in mungre/perf again.

I upgraded to gcc-13 (the most recent available on my Linux install) from gcc-11 and it was a bit quicker. On Linux using gcc-13 and src/Makefile (i.e. with -O3 throughout), bogomandel has improved from 1.34s originally to 0.39s now. Similarly, a full build of MMFS (just the beebasm parts) has improved from 12.7s to 9.2s. I think the improvement in MMFS is much less marked than in bogomandel simply because each run of beebasm for MMFS is doing much less, but each run has the same constant overhead.

You would probably get a bigger improvement by converting build.sh to a Makefile and running the build in parallel. beebasm can accept string symbols on the command-line now (with -S), so you don't need DEVICE.asm to pass in the device and system name.
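The cppcheck performance warnings were mostly of this shape: a heavyweight parameter copied by value where a const reference reads the caller's object without copying. An illustrative pair of functions, not real beebasm signatures:

```cpp
#include <string>

// Before: copies the whole string on every call.
bool StartsWithDotByValue( std::string s )
{
    return !s.empty() && s[0] == '.';
}

// After: no copy; the parameter is only read, so a const reference is safe.
bool StartsWithDot( const std::string& s )
{
    return !s.empty() && s[0] == '.';
}
```

For small numeric types (the four remaining warnings here), pass-by-value is just as fast, which is why those were left alone.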
Just had a chance to test the latest changes for MMFS. This takes the total build time down from 8.5 seconds to 7 seconds (NB not all of this time is beebasm), so another useful improvement!
I've been making variable names a little more descriptive in MMFS, so instead of MA+&1107 it's now channeldata_directory_locked. Assembly has slowed down significantly now.