Long variable names significantly slow down assembly #100
Comments
Can you provide a link to before and after commits for comparison?
beebasm memoizes every symbol that gets assigned. When a symbol is assigned in a for loop the name is mangled so that the symbol gets a new name each time round the loop. A similar scheme mangles names defined in macro invocations and nested scopes.

In the current release the mangled name is like a file system path. It contains scope IDs for the current scope and the parent scopes. Every symbol lookup reconstructs this path name before doing the lookup. It's not quick.

However, the current scope has an ID that is unique across scopes and invocations. Coupled with the FOR counter this provides a unique ID that can be used instead of the path. I've implemented this in mungre/simplesuffix. Can you try it out? This will be quicker, but I don't know if it accounts for a significant part of your slowdown.
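For illustration, the two mangling schemes might look something like this. This is a minimal sketch with invented function names; beebasm's real code differs:

```cpp
#include <string>
#include <vector>

// Old scheme (sketch): rebuild a path of scope IDs on every lookup.
std::string ManglePathStyle( const std::string& symbol,
                             const std::vector<int>& scopeStack,
                             int forCount )
{
    std::string mangled;
    for ( int id : scopeStack )
    {
        mangled += "@" + std::to_string( id );   // current and parent scope IDs
    }
    mangled += "@" + std::to_string( forCount ) + "." + symbol;
    return mangled;
}

// New scheme (sketch): the innermost scope ID is already unique across
// scopes and invocations, so a single suffix plus the FOR counter suffices.
std::string MangleSimpleSuffix( const std::string& symbol,
                                int uniqueScopeId,
                                int forCount )
{
    return symbol + "@" + std::to_string( uniqueScopeId )
                  + "@" + std::to_string( forCount );
}
```

The saving is that the new scheme does no per-lookup walk of the scope stack.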
I've just given it a go. Unfortunately, I can't detect a difference. I'll get my GitHub repo into a position where I can show the difference.
Cheers. It was a bit of a punt. People always say that you don't know anything about a performance problem until you measure it.
I used Windows Performance Analyzer to dig into it and I timed the assembly of bogomandel to test the improvements. For each test I did a couple of runs as warm-up and then averaged three more:

6260ms - Original
5650ms - With simplified name mangling (as above)
3875ms - With name mangling removed
3660ms - With IsSymbolDefined fixed
3000ms - Reusing LineParser rather than constructing a new one for each line

So it's considerably faster now. These changes are also in mungre/simplesuffix.

There is still a lot of headroom. I naively assumed that reading the file pointer on a buffered stream would just return a cached value. In fact it queries the system's file pointer and subtracts the remaining buffer space (with a fudge factor for newline conversions that I couldn't be bothered to make sense of). This takes fully 20% of the run time! This may just be a feature of the C runtime on Windows rather than a cross-platform problem, though.

A further 20% of the time is spent reading lines from the source file. It's slow because the lines are re-read every time round a for loop, for example. And it's quite complicated keeping track of the file pointers. I might change it to read the whole source file into a vector once.
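The file-pointer cost can be sidestepped by tracking the position in user code instead of asking the stream. A minimal sketch (class name invented; it assumes binary mode, so it ignores the newline-translation complication mentioned above):

```cpp
#include <cstdio>

// On some C runtimes, ftell() on a buffered FILE* queries the OS file
// pointer and adjusts for the unread buffer contents, so calling it per
// line is surprisingly expensive. A cheap alternative is to count bytes
// as they are consumed:
class CountingReader
{
public:
    explicit CountingReader( FILE* f ) : m_file( f ), m_pos( 0 ) {}

    int GetChar()
    {
        int c = fgetc( m_file );
        if ( c != EOF )
        {
            ++m_pos;            // track position ourselves
        }
        return c;
    }

    long Tell() const { return m_pos; }   // no ftell() call at all

private:
    FILE* m_file;
    long  m_pos;
};
```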
Wow, amazing work! I'll have a play later.
Thanks for your work. MMFS now builds all versions in 8.5 seconds under WSL2:

real 0m8.576s
user 0m4.949s
sys 0m1.229s
Nice one. I guess that's quicker! Thanks for trying it out.

I did another simple change to read the whole source file in one go. It knocked another third off the run-time.

6260ms - Original
5650ms - With simplified name mangling (as above)
3875ms - With name mangling removed
3660ms - With IsSymbolDefined fixed
3000ms - Reusing LineParser rather than constructing a new one for each line
1925ms - Load whole source file into a string

It also means that MacroInstance and SourceFile are almost identical now because they both pre-load their content. So I'm somewhat tempted to get rid of them and just use the base class SourceCode. This would be simpler, and easier to optimise further. Maybe another time.
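The "load the whole source file up front" change boils down to something like this (a sketch with an invented function name): one read, after which re-reading lines in a FOR loop is an in-memory pointer move instead of file I/O.

```cpp
#include <fstream>
#include <sstream>
#include <string>

// Read an entire file into a string in one go. Binary mode avoids the
// newline-translation bookkeeping that made per-line ftell() expensive.
std::string ReadWholeFile( const std::string& path )
{
    std::ifstream in( path, std::ios::binary );
    std::ostringstream buffer;
    buffer << in.rdbuf();
    return buffer.str();
}
```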
Just tested the latest version; I can't detect any significant change in run time now. So maybe WSL2 is better at caching file access.
I tried bogomandel on WSL2. It had a much more modest overall improvement, from 4.1s to 2.7s: faster than Windows to start with, but not as fast after the work. I also tried MMFS (the latest master branch from your repo) and the overall improvement was quite small, 52s down to 47s, but I suppose it is doing a lot of non-beebasm stuff. Your computer must be quick, though!

I haven't directly addressed the original problem. If you notice that adding long symbol names is still making it significantly slower, please post an update.
This is a bit addictive. I did some profiling on Linux. It was spending a surprising amount of time comparing strings and calculating the length of tokens (assembly language instructions, directives, etc.).

Symbols are now looked up in SymbolNameCache and represented by SymbolName, which is just an integer index. This was another big speed-up, and the lengths of variable names shouldn't have much of an effect on performance now. The lengths of tokens are now pre-calculated.

Replacing std::map with std::unordered_map in SymbolTable was a bigger win on Linux but made the Windows version ten times slower! I think this is possibly because I'm using a 32-bit compiler on Windows and my hash function wasn't great with a 32-bit hash value. So I haven't included this for now.

More bogomandel times on Windows:

6260ms - Original
5650ms - With simplified name mangling
3875ms - With name mangling removed
3660ms - With IsSymbolDefined fixed
3000ms - Reusing LineParser rather than constructing a new one for each line
1925ms - Load file into a string
1480ms - Use SymbolName and SymbolNameCache (new)
1370ms - Calculate token lengths at compile time (new)
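The interning idea behind SymbolName and SymbolNameCache can be sketched like this (an illustrative implementation, not beebasm's actual class): each distinct symbol string is stored once and thereafter handled as a small integer, so symbol comparisons cost the same regardless of name length.

```cpp
#include <string>
#include <unordered_map>
#include <vector>

class SymbolNameCache
{
public:
    // Return the existing index for a name, or assign a new one.
    int Intern( const std::string& name )
    {
        auto it = m_index.find( name );
        if ( it != m_index.end() )
        {
            return it->second;
        }
        int id = static_cast<int>( m_names.size() );
        m_names.push_back( name );
        m_index.emplace( name, id );
        return id;
    }

    // Recover the original spelling when needed (e.g. error messages).
    const std::string& NameOf( int id ) const { return m_names[id]; }

private:
    std::unordered_map<std::string, int> m_index;
    std::vector<std::string> m_names;
};
```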
Sorry for sending you down this path :). I only get about a 15%-20% improvement, but an improvement is an improvement! Thanks.
(Note that changes are now in mungre/perf.)

I came up with a hash function that worked on Windows, so it's using std::unordered_map rather than std::map. This made the improvement from SymbolName and SymbolNameCache very slight, and it was quite an extensive change, so I've removed it.

bogomandel on Windows:

6260ms - Original
5650ms - With simplified name mangling (as above)
3875ms - With name mangling removed
3660ms - With IsSymbolDefined fixed
3000ms - Reusing LineParser rather than constructing a new one for each line
1925ms - Load file into a string
1480ms - Use SymbolName and SymbolNameCache
1370ms - Calculate token lengths at compile time
1350ms - Calculate all token lengths at compile time
1255ms - Using unordered_map and a better hash function
1225ms - Improved ReadFile
1210ms - Removed redundant whitespace from EatWhitespace
1290ms - Removed SymbolName and SymbolNameCache
1200ms - Replaced isdigit, isalpha, toupper
885ms - New GetLine

Then I went back to Linux and it still seemed ridiculously slow. I profiled the code and the results didn't make any sense. It was like it wasn't being optimised. And then the penny dropped.

bogomandel on Linux:

3.2s originally
1.5s with latest fixes
0.5s with -O3 !!!!!

MMFS on WSL2:

52s originally
40s original and -O3
36s latest and -O3

The latest commit in mungre/perf adds the -O3 flag to the compiler options in CMakeLists.txt. And it's much quicker.
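One classic string hash that mixes every byte into the low bits, and so behaves reasonably even when size_t is 32 bits (the suspected cause of the earlier unordered_map slowdown), is FNV-1a. A sketch, not necessarily the hash actually adopted here:

```cpp
#include <cstdint>
#include <string>

// 32-bit FNV-1a: xor each byte into the hash, then multiply by the FNV
// prime. Simple, fast, and reasonable on both 32- and 64-bit targets.
struct Fnv1aHash
{
    std::size_t operator()( const std::string& s ) const
    {
        std::uint32_t h = 2166136261u;      // FNV offset basis
        for ( unsigned char c : s )
        {
            h ^= c;
            h *= 16777619u;                 // FNV prime
        }
        return static_cast<std::size_t>( h );
    }
};
```

It would be plugged in as the third template argument, e.g. `std::unordered_map<std::string, Symbol, Fnv1aHash>`.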
It turns out I have already been using -O3 with the Makefile in the src directory.
That explains it! I'd been using the cmake stuff, thinking "I don't remember it being this complicated". I've rolled back the cmake fix because, apparently, cmake will automatically set the -O3 flag if you hold it right, but I've yet to figure out how.

More usefully, I've altered src/Makefile to set the NDEBUG flag, which disables asserts. This made bogomandel about 20% quicker because I'd left a bunch of asserts in tight loops.
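The NDEBUG effect is easy to see in isolation. assert() expands to nothing when NDEBUG is defined (typically via `g++ -O3 -DNDEBUG`), so a check inside a hot loop costs nothing in release builds but real time in debug builds. An invented example of the kind of check that was being paid for per iteration:

```cpp
#include <cassert>

int SumBytes( const unsigned char* data, int count )
{
    int total = 0;
    for ( int i = 0; i < count; ++i )
    {
        assert( data != nullptr );   // compiled out under -DNDEBUG
        total += data[i];
    }
    return total;
}
```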
No measurable difference for me, but I'm using gcc 14.1, which may have better optimisation. I don't know if you are interested, but -fanalyzer gives some potential warnings which I haven't checked. "cppcheck --enable=all ." has some performance suggestions, mainly parameters being passed by const reference. Again, I don't know if these are real issues.
Thanks for the suggestions. -fanalyzer found a missing check for a failed malloc (now fixed), but the complaints about use-after-free and leaking in expressions.cpp seem to be bogus. To double-check, I ran the tests with msvc's debug allocator on Windows and I used valgrind's memcheck on Linux while assembling bogomandel. Neither tool grumbled.

I've fixed nearly all of the performance warnings from cppcheck. The remaining four are templated types which are numeric in practice, so they are fine passed by value. It made a noticeable improvement to bogomandel on both Windows and Linux (about 15% and 20% respectively). These changes are in mungre/perf again.

I upgraded to gcc-13 (the most recent available on my Linux install) from gcc-11 and it was a bit quicker. On Linux using gcc-13 and src/Makefile (i.e. with -O3 throughout), bogomandel has improved from 1.34s originally to 0.39s now. Similarly, a full build of MMFS (just the beebasm parts) has improved from 12.7s to 9.2s. I think the improvement in MMFS is much less marked than in bogomandel simply because each run of beebasm for MMFS is doing much less, but each run has the same constant overhead.

You would probably get a bigger improvement by converting build.sh to a Makefile and running the build in parallel. beebasm can accept string symbols on the command-line now (with -S), so you don't need DEVICE.asm to pass in the device and system name.
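The cppcheck performance warnings were mostly of this shape: a heavyweight parameter copied by value where a const reference reads the caller's object without copying. An illustrative pair of functions, not real beebasm signatures:

```cpp
#include <string>

// Before: copies the whole string on every call.
bool StartsWithDotByValue( std::string s )
{
    return !s.empty() && s[0] == '.';
}

// After: no copy; the parameter is only read, so a const reference is safe.
bool StartsWithDot( const std::string& s )
{
    return !s.empty() && s[0] == '.';
}
```

For small numeric types (the four remaining warnings here), pass-by-value is just as fast, which is why those were left alone.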
Just had a chance to test the latest changes for MMFS. This takes the total build time down from 8.5 seconds to 7 seconds (NB not all of this time is beebasm), so another useful improvement!
I've been making variable names a little more descriptive in MMFS, so instead of MA+&1107 it's now channeldata_directory_locked. Assembly has slowed down significantly now.