-
Notifications
You must be signed in to change notification settings - Fork 110
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update BitBlt support (primarily for 64-bit ARM) #565
Open
bavison
wants to merge
17
commits into
OpenSmalltalk:Cog
Choose a base branch
from
bavison:bitblt_aarch64
base: Cog
Could not load branches
Branch not found: {{ refName }}
Loading
Could not load tags
Nothing to show
Loading
Are you sure you want to change the base?
Some commits from the old base branch may be removed from the timeline,
and old review comments may become outdated.
Conversation
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
ENABLE_FAST_BLT is typically not assigned a value even when it is defined, so "#if" form is tecnically a syntax error.
This is not the case when being called from "fuzz" or "bench" test applications. It may also not be accurate if a fast path has been synthesised from a combination of copyBitsFallback and one or more other fast paths.
In some places, sourceForm and destForm were being compared to determine which code path to follow. However, when being called from fuzz or other test tools, these structures aren't used to pass parameters, so the pointers haven't been initialised and default to 0, so the wrong code path is followed. Detect such cases and initialise them from sourceBits and destBits instead, since these will perform the same under equality tests.
This shortcut is triggered more frequently than it used to be, due to improvements in copyLoop() that avoid buffer overruns.
When classed as "wide" because each line is long enough to warrant pipelined prefetching as we go along, the inner loop is unrolled enough that there is at least one prefetch instruction per iteration. Loading the source image can only be done in atoms of 32 bit words due to big-endian packing, so when destination pixels are 8 times wider (or more) than source pixels, the loads happen less frequently than the store atoms (quadwords) and a conditional branch per subblock is required to decide whether to do a load or not, depending on the skew and the number of pixels remaining to process. The 'x' register is only updated once per loop, so an assembly-time constant derived from the unrolling subblock number needs to be factored in, but since the number of pixels remaining decreases as the subblock number increases, this should have been a subtraction. In practice, since only the least-significant bits of the result matter, addition and subtraction behave the same when the source:destination pixel ratio is 8, so the only operations affected were 1->16bpp, 2->32bpp and 1->32bpp. The exact threshold that counts as "wide" depends on the prefetch distance that was selected empirically, but typically would require an operation that is several hundreds of pixels wide.
In fastPathDepthConv (which combines sourceWord colour-depth conversion with another fast path for another combinationRule at a constant colour depth) and fastPathRightToLeft, it could overflow the temporary buffer and thereby corrupt other local variables if the last chunk of a pixel row was 2048 bytes (or just under). This was most likely to happen with 32bpp destination images and widths of about 512 pixels.
For images that were wide enough to invoke intra-line preloads, there was a register clash between the preload address calculation and one of the registers holding the deskewed source pixels (this only occurred once per destination cacheline).
The halftone array is accessed using a hard-coded multiplier of 4 bytes, therefore the type of each element needs to be 32 bit on every platform. `sqInt` is not appropriate for this use, since it is a 64-bit type on 64-bit platforms. Rather than unilaterally introduce C99 stdint types, use `unsigned int` since this wil be 32-bit on both current fast path binary targets.
Sometimes, colour maps are used such that all entries except the first contain the same value. Combined with the fact that only source colour 0 uses colour map entry 0 (any other colours for which all non-0 bits would otherwise be discarded during index generation are forced to use entry 1 instead), this effectively acts as a 2-entry (or 1bpp) map, depending on whether the source colour is 0 or not. This is far more efficiently coded in any fast path by a test against zero, than by a table lookup - it frees up 2 KB, 16 KB or 128 KB of data cache space, depending on whether a 9-, 12- or 15-bit colour map was used. There is an up-front cost to scanning the colour map to see if its entries are of this nature, however in most "normal" colour maps, this scan will rapidly be aborted.
This runs approx 2.6x faster when benchmarked on Cortex-A72 in AArch64.
…nation This makes better use of existing fast paths, and applies to all platforms.
Great! |
hogoww
referenced
this pull request
in hogoww/opensmalltalk-vm
Dec 25, 2021
… true ] on method [ markWeaklingsAndMarkAndFireEphemerons ]
hogoww
referenced
this pull request
in hogoww/opensmalltalk-vm
Dec 25, 2021
…true ] on method [ markWeaklingsAndMarkAndFireEphemerons ] KILLED by 1/10 test cases.
hogoww
referenced
this pull request
in hogoww/opensmalltalk-vm
Feb 26, 2022
… [ attemptToShrink ] 10 test cases.
hogoww
referenced
this pull request
in hogoww/opensmalltalk-vm
Feb 26, 2022
…[ attemptToShrink ] 7/10 Test Cases are NOT EQUIVALENT
guillep
added a commit
to tesonep/opensmalltalk-vm
that referenced
this pull request
May 12, 2023
…iate-keys Verify ephemeron key is not immediate when marking
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
The accelerated BitBlt framework was initially targeted at the ARM11, running the AArch32 instruction set (which is the only one it fully supported).
More recent ARMs run much faster, which has enabled more comprehensive testing via the BitBlt fuzz test framework (https://github.com/bavison/SqueakBitBltTest). This has detected a handful of bugs in both the AArch32-specific and the architecture-neutral parts of the fast BitBlt framework. First I address these.
Next, I add a number of BitBlt fast paths written in platform-independent C. The 8-to-32bpp conversion routine is as fast as anything I could manage with hand-crafted AArch64 assembly. Others are useful as reference implementations for other architectures, or to fill in gaps in their abilities (for example, while I've introduced a class of fast paths for colour maps that only feature two distinct colours, I haven't retrospectively written any AArch32 fast paths for them, so the C fast path will be used for them on AArch32).
The fast path that handles operations with scalar halftoning and 32bpp destination images is a bit of a special case, in that it acts to extend the capabilities of other fast paths. It thus accelerates both AArch32 and AArch64.
The most significant commit, however, is the last one. This features a collection of fast paths implemented using inline AArch64 assembly, tuned for Cortex-A72 (as found in the Raspberry Pi 4). Based on the results of profiling, this has an emphasis on operations with a 32bpp destination image.
Operations with any source depth, in conjunction with 22 of the possible combinationRules (including the common sourceWord, pixPaint and alphaBlend rules) should all be accelerated, providing you don't use little-endian pixel packing, vector halftoning, or non-standard colour map rules when converting from different colour depths.
There are additional fast paths for alphaBlend for either a constant source colour, or a source image whose colour map only consists of two different colours (i.e. where the source image is effectively used as a 1bpp mask, despite being of a greater depth).