-
Notifications
You must be signed in to change notification settings - Fork 11
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
optimizing register allocation #199
Comments
I like the ideas. I've used push/pop in earlier versions, when I 'just push all at the start' and 'pop'ed all just before The spilling happens here. So let me rephrase what I understand from your earlier paragraph: Example: Choose one of x1, x2, x3, x4 to spill, where the rest of the program looks like this:
In the current situation, CrypOpt would choose x3 or x4 because they are read later. So essentially, beyond only checking when a spillable value is used, add the information on how a value is used. Maybe you can give your code as an example. |
I can believe push/pop and mov saves are pretty much interchangeable given how common both are. I am not sure whether the other idea I proposed can probably be reasonably approximated as a spill-selection model, but perhaps we can do decently well if the model has access to the current register allocation right before the spill. Specifically I wonder whether it would work to rank all spillable registers based on a handwavy estimate of how many instructions of reloads they will require later. Here is a sketch of a heuristic, recursing forward on the remaining program and considering every operation:
The specific behavior I am looking to achieve is that I am not sure whether considering this heuristic during spilling would be enough, though. There's also the question of how |
I see. First, I think there are two things that we're mixing up here. First, which value to put into Did you change the code in the cryptopt-generated file 'accordingly' and see if it actually is faster? That being said, it is already part of the optimizer, which value to load into Re your spilling heuristic. Now I could think of a different ranking system: However, that is easier said than done, because CryptOpt chooses instructions base on where the data (mem or reg) is. And those instructions are selected at the moment when an operation is being implemented. So the look-ahead can only 'see' the operations. It does make sense for the |
Thank you for explaining this circular dependency between instruction selection and register allocation. The model I proposed was substantially different from what you have been describing, and is substantially thwarted by this pass ordering.
Hmm, indeed. And I think I have seen cases where this happens for a two or three instructions, but never not an entire row. Here's the code I have been working with: I started from a 56x~1M cryptopt champion and wrangled mulx/adx usage back into the straightforward alternating pattern, tidying up registers on the fly. The simplified code is 5-10% faster depending on usage pattern. And I am pretty sure I am still using one register too many, needlessly spilling rdi, and missing some more clever optimizations. I agree with your uop-level analysis, but I wonder whether we are overlooking some bottleneck in the instruction chunking or decode phase. I sometimes got the sense that leaving in a redundant instruction made the implementation faster.
No proofs yet, just passes tests. |
Correction: the above cycle counts were not measured on skylake (on which the cryptopt optimization ran), here's actual skylake numbers:
|
I mean, this register allocation trick would be rather easy to implement, if we'd only focus on mulx for now. Let me pseudocode this real quick.
Do you think that could work? I wonder if the distance to the current instruction should be taken into account. The implementation seems to do the job, see below for a snippet from the my current file, after just a handful of evaluations ( SECTION .text
GLOBAL p256_mul
p256_mul:
sub rsp, 264
; rdi contains out1
; rsi contains arg1
; rdx contains arg2
; Operation: x1--x2<-mulx(arg1[0], arg2[0]),
; chooseMulxLoadValue upcoming mulxs: x1-x2, x41-x42, x99-x100, x32-x33, x7-x8, x68-x69, x3-x4, x72-x73, x111-x112, x105-x106, x51-x52, x16-x17, x26-x27, x20-x21, x95-x96, x11-x12, x45-x46, x57-x58, x132-x133, x122-x123, x84-x85, x78-x79, x138-x139, x126-x127 counters {"arg1[0]":4,"arg2[0]":4} only two candidates, and they both have the same count value of 4. returning null.
;-- allocation: , calSv-r12 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, calSv-rbp rbp, calSv-rbx rbx, out1 rdi, arg2 rdx, arg2[0] rdx, arg1 rsi
; fr: rax,r10,r11,rcx,r8,r9,rcx,r8,r9
mov rax, rdx; preserving value of arg2 into a new reg
mov rdx, [ rdx + 0x0 ]; saving arg2[0] in rdx.
; currently considered reusable:
;-- allocation: , calSv-r12 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, arg2 rax, calSv-rbp rbp, calSv-rbx rbx, out1 rdi, arg2[0] rdx, arg1 rsi
; fr: r10,r11,rcx,r8,r9,rcx,r8,r9
;-- allocation: , x1 r10, calSv-r12 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, arg2 rax, calSv-rbp rbp, calSv-rbx rbx, out1 rdi, arg2[0] rdx, arg1 rsi
; fr: r11,rcx,r8,r9,rcx,r8,r9
;-- allocation: , x1 r10, x2 r11, calSv-r12 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, arg2 rax, calSv-rbp rbp, calSv-rbx rbx, out1 rdi, arg2[0] rdx, arg1 rsi
mulx r11, r10, [ rsi + 0x0 ]; hix2, lox1<- arg1[0] * arg2[0]
; Operation: x41--x42<-mulx(0x100000000, x1),
;-- allocation: , x1 r10, x2 r11, calSv-r12 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, arg2 rax, calSv-rbp rbp, calSv-rbx rbx, out1 rdi, arg1 rsi
; fr: rdx,rcx,r8,r9,rcx,r8,r9
mov rdx, 0x100000000 ; moving imm to reg
; currently considered reusable:
;-- allocation: , x1 r10, x2 r11, calSv-r12 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, arg2 rax, calSv-rbp rbp, calSv-rbx rbx, out1 rdi, 0x100000000 rdx, arg1 rsi
; fr: rcx,r8,r9,rcx,r8,r9
;-- allocation: , x1 r10, x2 r11, calSv-r12 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, arg2 rax, calSv-rbp rbp, calSv-rbx rbx, x41 rcx, out1 rdi, 0x100000000 rdx, arg1 rsi
; fr: r8,r9,r8,r9
;-- allocation: , x1 r10, x2 r11, calSv-r12 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, x42 r8, arg2 rax, calSv-rbp rbp, calSv-rbx rbx, x41 rcx, out1 rdi, 0x100000000 rdx, arg1 rsi
mulx r8, rcx, r10; hix42, lox41<- 0x100000000 * x1
; Operation: x99--x100<-mulx(arg1[3], arg2[1]),
; chooseMulxLoadValue upcoming mulxs: x99-x100, x32-x33, x7-x8, x68-x69, x3-x4, x72-x73, x111-x112, x105-x106, x51-x52, x16-x17, x26-x27, x20-x21, x95-x96, x11-x12, x45-x46, x57-x58, x132-x133, x122-x123, x84-x85, x78-x79, x138-x139, x126-x127 counters {"arg1[3]":4,"arg2[1]":4} only two candidates, and they both have the same count value of 4. returning null.
mov rdx, [ rax + 0x8 ]; arg2[1] to rdx
; currently considered reusable:
;-- allocation: , x1 r10, x2 r11, calSv-r12 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, x42 r8, arg2 rax, calSv-rbp rbp, calSv-rbx rbx, x41 rcx, out1 rdi, arg2[1] rdx, arg1 rsi
; fr: r9,r9
;-- allocation: , x1 r10, x2 r11, calSv-r12 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, x42 r8, x99 r9, arg2 rax, calSv-rbp rbp, calSv-rbx rbx, x41 rcx, out1 rdi, arg2[1] rdx, arg1 rsi
; fr: none.
; freeing, i.e. spilling calSv-rbx, because I am out of ideas
; allocs: out1(rdi) ,arg1(rsi) ,arg2(rax) ,calSv-rbx(rbx) ,calSv-rbp(rbp) ,calSv-r12(r12) ,calSv-r13(r13) ,calSv-r14(r14) ,calSv-r15(r15) ,x1(r10) ,x2(r11) ,x41(rcx) ,x42(r8) ,arg2[1](rdx) ,x99(r9)
; clobs "x99, x100, arg1[3], arg1, arg2[1], arg2, x99_0, x99_1, x100_0, x100_1"; will spare: calSv-rbx
; stacking up for calSv-rbx
mov [ rsp - 0x80 ], rbx; spilling calSv-rbx to mem
;-- allocation: , x1 r10, x2 r11, calSv-r12 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, x42 r8, x99 r9, arg2 rax, calSv-rbp rbp, x100 rbx, x41 rcx, out1 rdi, arg2[1] rdx, arg1 rsi
mulx rbx, r9, [ rsi + 0x18 ]; hix100, lox99<- arg1[3] * arg2[1]
; Operation: x32--x33<-mulx(arg1[1], arg2[3]),
; chooseMulxLoadValue upcoming mulxs: x32-x33, x7-x8, x68-x69, x3-x4, x72-x73, x111-x112, x105-x106, x51-x52, x16-x17, x26-x27, x20-x21, x95-x96, x11-x12, x45-x46, x57-x58, x132-x133, x122-x123, x84-x85, x78-x79, x138-x139, x126-x127 counters {"arg1[1]":4,"arg2[3]":4} only two candidates, and they both have the same count value of 4. returning null.
mov rdx, [ rax + 0x18 ]; arg2[3] to rdx
; currently considered reusable:
;-- allocation: , x1 r10, x2 r11, calSv-r12 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, x42 r8, x99 r9, arg2 rax, calSv-rbp rbp, x100 rbx, x41 rcx, out1 rdi, arg2[3] rdx, arg1 rsi
; fr: none.
; freeing, i.e. spilling calSv-rbp, because I am out of ideas
; allocs: out1(rdi) ,arg1(rsi) ,arg2(rax) ,calSv-rbp(rbp) ,calSv-r12(r12) ,calSv-r13(r13) ,calSv-r14(r14) ,calSv-r15(r15) ,x1(r10) ,x2(r11) ,x41(rcx) ,x42(r8) ,x99(r9) ,x100(rbx) ,arg2[3](rdx)
; clobs "x32, x33, arg1[1], arg1, arg2[3], arg2, x32_0, x32_1, x33_0, x33_1"; will spare: calSv-rbp
; stacking up for calSv-rbp
mov [ rsp - 0x78 ], rbp; spilling calSv-rbp to mem
;-- allocation: , x1 r10, x2 r11, calSv-r12 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, x42 r8, x99 r9, arg2 rax, x32 rbp, x100 rbx, x41 rcx, out1 rdi, arg2[3] rdx, arg1 rsi
; fr: none.
; freeing, i.e. spilling calSv-r12, because I am out of ideas
; allocs: out1(rdi) ,arg1(rsi) ,arg2(rax) ,calSv-r12(r12) ,calSv-r13(r13) ,calSv-r14(r14) ,calSv-r15(r15) ,x1(r10) ,x2(r11) ,x41(rcx) ,x42(r8) ,x99(r9) ,x100(rbx) ,arg2[3](rdx) ,x32(rbp)
; clobs "x32, x33, arg1[1], arg1, arg2[3], arg2, x32_0, x32_1, x33_0, x33_1"; will spare: calSv-r12
; stacking up for calSv-r12
mov [ rsp - 0x70 ], r12; spilling calSv-r12 to mem
;-- allocation: , x1 r10, x2 r11, x33 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, x42 r8, x99 r9, arg2 rax, x32 rbp, x100 rbx, x41 rcx, out1 rdi, arg2[3] rdx, arg1 rsi
mulx r12, rbp, [ rsi + 0x8 ]; hix33, lox32<- arg1[1] * arg2[3]
; Operation: x7--x8<-mulx(arg1[0], arg2[2]),
; chooseMulxLoadValue upcoming mulxs: x7-x8, x68-x69, x3-x4, x72-x73, x111-x112, x105-x106, x51-x52, x16-x17, x26-x27, x20-x21, x95-x96, x11-x12, x45-x46, x57-x58, x132-x133, x122-x123, x84-x85, x78-x79, x138-x139, x126-x127 counters {"arg1[0]":3,"arg2[2]":4} choosing arg2[2] because of its higher count value of 4
mov rdx, [ rax + 0x10 ]; arg2[2] to rdx
; currently considered reusable:
;-- allocation: , x1 r10, x2 r11, x33 r12, calSv-r13 r13, calSv-r14 r14, calSv-r15 r15, x42 r8, x99 r9, arg2 rax, x32 rbp, x100 rbx, x41 rcx, out1 rdi, arg2[2] rdx, arg1 rsi
; fr: none.
; freeing, i.e. spilling calSv-r13, because I am out of ideas
; allocs: out1(rdi) ,arg1(rsi) ,arg2(rax) ,calSv-r13(r13) ,calSv-r14(r14) ,calSv-r15(r15) ,x1(r10) ,x2(r11) ,x41(rcx) ,x42(r8) ,x99(r9) ,x100(rbx) ,x32(rbp) ,x33(r12) ,arg2[2](rdx)
; clobs "x7, x8, arg1[0], arg1, arg2[2], arg2, x7_0, x7_1, x8_0, x8_1"; will spare: calSv-r13
; stacking up for calSv-r13
mov [ rsp - 0x68 ], r13; spilling calSv-r13 to mem
;-- allocation: , x1 r10, x2 r11, x33 r12, x7 r13, calSv-r14 r14, calSv-r15 r15, x42 r8, x99 r9, arg2 rax, x32 rbp, x100 rbx, x41 rcx, out1 rdi, arg2[2] rdx, arg1 rsi
; fr: none.
; freeing, i.e. spilling calSv-r14, because I am out of ideas
; allocs: out1(rdi) ,arg1(rsi) ,arg2(rax) ,calSv-r14(r14) ,calSv-r15(r15) ,x1(r10) ,x2(r11) ,x41(rcx) ,x42(r8) ,x99(r9) ,x100(rbx) ,x32(rbp) ,x33(r12) ,arg2[2](rdx) ,x7(r13)
; clobs "x7, x8, arg1[0], arg1, arg2[2], arg2, x7_0, x7_1, x8_0, x8_1"; will spare: calSv-r14
; stacking up for calSv-r14
mov [ rsp - 0x60 ], r14; spilling calSv-r14 to mem
;-- allocation: , x1 r10, x2 r11, x33 r12, x7 r13, x8 r14, calSv-r15 r15, x42 r8, x99 r9, arg2 rax, x32 rbp, x100 rbx, x41 rcx, out1 rdi, arg2[2] rdx, arg1 rsi
mulx r14, r13, [ rsi + 0x0 ]; hix8, lox7<- arg1[0] * arg2[2]
; Operation: x68--x69<-mulx(arg1[2], arg2[0]),
; chooseMulxLoadValue upcoming mulxs: x68-x69, x3-x4, x72-x73, x111-x112, x105-x106, x51-x52, x16-x17, x26-x27, x20-x21, x95-x96, x11-x12, x45-x46, x57-x58, x132-x133, x122-x123, x84-x85, x78-x79, x138-x139, x126-x127 counters {"arg1[2]":4,"arg2[0]":3} choosing arg1[2] because of its higher count value of 4
mov rdx, [ rsi + 0x10 ]; arg1[2] to rdx
; currently considered reusable:
;-- allocation: , x1 r10, x2 r11, x33 r12, x7 r13, x8 r14, calSv-r15 r15, x42 r8, x99 r9, arg2 rax, x32 rbp, x100 rbx, x41 rcx, out1 rdi, arg1[2] rdx, arg1 rsi
; fr: none.
; freeing, i.e. spilling calSv-r15, because I am out of ideas
; allocs: out1(rdi) ,arg1(rsi) ,arg2(rax) ,calSv-r15(r15) ,x1(r10) ,x2(r11) ,x41(rcx) ,x42(r8) ,x99(r9) ,x100(rbx) ,x32(rbp) ,x33(r12) ,x7(r13) ,x8(r14) ,arg1[2](rdx)
; clobs "x68, x69, arg1[2], arg1, arg2[0], arg2, x68_0, x68_1, x69_0, x69_1"; will spare: calSv-r15
; stacking up for calSv-r15
mov [ rsp - 0x58 ], r15; spilling calSv-r15 to mem
;-- allocation: , x1 r10, x2 r11, x33 r12, x7 r13, x8 r14, x68 r15, x42 r8, x99 r9, arg2 rax, x32 rbp, x100 rbx, x41 rcx, out1 rdi, arg1[2] rdx, arg1 rsi
; fr: none.
; freeing, i.e. spilling out1, because I am out of ideas
; allocs: out1(rdi) ,arg1(rsi) ,arg2(rax) ,x1(r10) ,x2(r11) ,x41(rcx) ,x42(r8) ,x99(r9) ,x100(rbx) ,x32(rbp) ,x33(r12) ,x7(r13) ,x8(r14) ,arg1[2](rdx) ,x68(r15)
; clobs "x68, x69, arg1[2], arg1, arg2[0], arg2, x68_0, x68_1, x69_0, x69_1"; will spare: out1
; stacking up for out1
mov [ rsp - 0x50 ], rdi; spilling out1 to mem
;-- allocation: , x1 r10, x2 r11, x33 r12, x7 r13, x8 r14, x68 r15, x42 r8, x99 r9, arg2 rax, x32 rbp, x100 rbx, x41 rcx, x69 rdi, arg1[2] rdx, arg1 rsi
mulx rdi, r15, [ rax + 0x0 ]; hix69, lox68<- arg1[2] * arg2[0]
; Operation: x3--x4<-mulx(arg1[0], arg2[1]),
; chooseMulxLoadValue upcoming mulxs: x3-x4, x72-x73, x111-x112, x105-x106, x51-x52, x16-x17, x26-x27, x20-x21, x95-x96, x11-x12, x45-x46, x57-x58, x132-x133, x122-x123, x84-x85, x78-x79, x138-x139, x126-x127 counters {"arg1[0]":2,"arg2[1]":3} choosing arg2[1] because of its higher count value of 3
mov rdx, [ rax + 0x8 ]; arg2[1] to rdx
; currently considered reusable:
;-- allocation: , x1 r10, x2 r11, x33 r12, x7 r13, x8 r14, x68 r15, x42 r8, x99 r9, arg2 rax, x32 rbp, x100 rbx, x41 rcx, x69 rdi, arg2[1] rdx, arg1 rsi
; fr: none.
; freeing, i.e. spilling x100, because I am out of ideas
; allocs: arg1(rsi) ,arg2(rax) ,x1(r10) ,x2(r11) ,x41(rcx) ,x42(r8) ,x99(r9) ,x100(rbx) ,x32(rbp) ,x33(r12) ,x7(r13) ,x8(r14) ,x68(r15) ,x69(rdi) ,arg2[1](rdx)
; clobs "x3, x4, arg1[0], arg1, arg2[1], arg2, x3_0, x3_1, x4_0, x4_1"; will spare: x100
; stacking up for x100
mov [ rsp - 0x48 ], rbx; spilling x100 to mem
;-- allocation: , x1 r10, x2 r11, x33 r12, x7 r13, x8 r14, x68 r15, x42 r8, x99 r9, arg2 rax, x32 rbp, x3 rbx, x41 rcx, x69 rdi, arg2[1] rdx, arg1 rsi
; fr: none.
; freeing, i.e. spilling x99, because I am out of ideas
; allocs: arg1(rsi) ,arg2(rax) ,x1(r10) ,x2(r11) ,x41(rcx) ,x42(r8) ,x99(r9) ,x32(rbp) ,x33(r12) ,x7(r13) ,x8(r14) ,x68(r15) ,x69(rdi) ,arg2[1](rdx) ,x3(rbx)
; clobs "x3, x4, arg1[0], arg1, arg2[1], arg2, x3_0, x3_1, x4_0, x4_1"; will spare: x99
; stacking up for x99
mov [ rsp - 0x40 ], r9; spilling x99 to mem
;-- allocation: , x1 r10, x2 r11, x33 r12, x7 r13, x8 r14, x68 r15, x42 r8, x4 r9, arg2 rax, x32 rbp, x3 rbx, x41 rcx, x69 rdi, arg2[1] rdx, arg1 rsi
mulx r9, rbx, [ rsi + 0x0 ]; hix4, lox3<- arg1[0] * arg2[1]
|
Thanks, looks good at the first sight. I kicked off a new run. |
This new optimization closed most of the gap. I tried to refine it further, but hit a snag. commit dbc4593a4dba42b6e2bf86a52b24f0b29a8025b8 (HEAD)
Author: Andres Erbsen <[email protected]>
Date: Fri Sep 15 18:01:24 2023 +0000
tweak mulx load value selection
diff --git a/src/model/model.class.ts b/src/model/model.class.ts
index b9e1d7b..f723158 100644
--- a/src/model/model.class.ts
+++ b/src/model/model.class.ts
@@ -219,7 +219,7 @@ export class Model {
}
// Idea is:
- // for each candidate, count how many times it is used as an argument in the upcoming mulx operations
+ // for each candidate, count how many consecutive upcoming mulx operations use it
// initalise counters:
const counters: { [candidateName: string]: number } = candidates.reduce(
@@ -231,11 +231,11 @@ export class Model {
.filter((op) => op.operation == "mulx");
msg += ` upcoming mulxs: ${upcomingMulxOps.map((m) => m.name.join("-")).join(", ")}`;
- upcomingMulxOps.forEach((op) => {
- op.arguments.forEach((arg) => {
- if (arg in counters) {
- counters[arg]++;
- }
+ candidates.forEach((arg) => {
+ upcomingMulxOps.every((op) => {
+ if (!op.arguments.includes(arg)) { return false; }
+ counters[arg]++;
+ return true;
});
});
msg += ` counters ${JSON.stringify(counters)}`;
diff --git a/src/registerAllocator/RegisterAllocator.class.ts b/src/registerAllocator/RegisterAllocator.class.ts
index 9d604a2..ca784ba 100644
--- a/src/registerAllocator/RegisterAllocator.class.ts
+++ b/src/registerAllocator/RegisterAllocator.class.ts
@@ -593,7 +593,7 @@ export class RegisterAllocator {
}
}
- if (caf(AllocationFlags.ONE_IN_MUST_BE_IN_RDX) && !inAllocations.includes(Register.rdx)) {
+ if (caf(AllocationFlags.ONE_IN_MUST_BE_IN_RDX) && this.getVarnameFromStore({ store: Register.rdx }, false) != "" && !inAllocations.includes(this.getVarnameFromStore({ store: Register.rdx }, true))) {
// since in inAllocations there is no rdx (otherwise we wont be in this branch)
// we need to move one of inAllocations to rdx
@@ -751,6 +751,8 @@ export class RegisterAllocator {
(inA, i) =>
isRegister(inA) &&
!this._clobbers.has(this.getVarnameFromStore({ store: inA })) &&
+ // Error: Tried to find the name of data being stored at rdx, but found 0: [] instead of ONE in the current Allocations. TSNH. Abort.
+ // ignoring the error generates incorrect assembly
!Model.hasDependants(allocationReq.in[i]),
) as Register[];
this.addToPreInstructions(Logger.log(`; currently considered reusable: ${reusable.join(", ")}`) ?? ""); Without the middle change, I also encountered the following beauty which is likely also responsible for some missed optimization of mulx/adx chains: ; Operation: x41--x42<-mulx(0x100000000, x1),
;-- allocation: , x14 OF, x16 r10, x17 r11, x68 r12, x5 r13, x4 r14, x9 r15, x2 r8, x12 r9, arg2 rax, x13 rbp, 0x0 rbx, x51 rcx, x8 rdi, x1 rdx, arg1 rsi
; fr: none.
; freeing x2 (r8) no dependants anymore
mov r8, 0x100000000 ; moving imm to reg
; chooseMulxLoadValue upcoming mulxs: x41-x42, x78-x79, x99-x100, x26-x27, x45-x46, x57-x58, x122-x123, x132-x133, x138-x139, x126-x127 counters {"0x100000000":1,"x1":1} only two candidates, and they both have the same count value of 1. returning null.
mov rdx, rdx; x1 to rdx
; currently considered reusable: rdx
;-- allocation: , x14 OF, x16 r10, x17 r11, x68 r12, x5 r13, x4 r14, x9 r15, 0x100000000 r8, x12 r9, arg2 rax, x13 rbp, 0x0 rbx, x51 rcx, x8 rdi, x41 rdx, arg1 rsi
; fr: none.
; freeing x8 (rdi) no dependants anymore
;-- allocation: , x14 OF, x16 r10, x17 r11, x68 r12, x5 r13, x4 r14, x9 r15, 0x100000000 r8, x12 r9, arg2 rax, x13 rbp, 0x0 rbx, x51 rcx, x42 rdi, x41 rdx, arg1 rsi
mulx rdi, rdx, r8; hix42, lox41<- 0x100000000 * x1 |
I can't understand what you're trying to do. |
|
So if I understand you correctly, with 2. you're trying to avoid the if (output.some((o) => o.match(/mov rdx, rdx.*/))) {
throw Error("suspicious mov detected");
} after On a very different note, could |
No, I hadn't started on that yet. The attempt in my change was to avoid redundant loads of in-memory inputs already cached in rdx. As for the redundant move after the constant loading, I am not sure what code generates it. Yes, that mulx could be done using shifts. I have not investigated relative performance. |
This is the scenario that I was trying to optimize:
; Operation: x99--x100<-mulx(arg1[3], arg2[1]),
; chooseMulxLoadValue upcoming mulxs: x99-x100, x105-x106, x51-x52, x84-x85, x11-x12, x78-x79, x41-x42, x45-x46, x57-x58, x132-x133, x122-x123, x138-x139, x126-x127 counters {"arg1[3]":2,"arg2[1]":1} choosing arg1[3] because of its higher count value of 2
mov rdx, [ rsi + 0x18 ]; arg1[3] to rdx
; currently considered reusable:
;-- allocation: , x1 r10, x2 r11, x68 r12, x21 r13, x20 r14, x17 r15, x8 r8, x3 r9, arg2 rax, x69 rbp, x4 rbx, x7 rcx, x16 rdi, arg1[3] rdx, arg1 rsi
; fr: none.
; freeing, i.e. spilling x69, because I am out of ideas
; allocs: arg1(rsi) ,arg2(rax) ,x7(rcx) ,x8(r8) ,x3(r9) ,x4(rbx) ,x20(r14) ,x21(r13) ,x1(r10) ,x2(r11) ,x16(rdi) ,x17(r15) ,x68(r12) ,x69(rbp) ,arg1[3](rdx)
; clobs "x99, x100, arg1[3], arg1, arg2[1], arg2, x99_0, x99_1, x100_0, x100_1"; will spare: x69
; stacking up for x69
mov [ rsp + 0x8 ], rbp; spilling x69 to mem
;-- allocation: , x1 r10, x2 r11, x68 r12, x21 r13, x20 r14, x17 r15, x8 r8, x3 r9, arg2 rax, x99 rbp, x4 rbx, x7 rcx, x16 rdi, arg1[3] rdx, arg1 rsi
; fr: none.
; freeing, i.e. spilling x68, because I am out of ideas
; allocs: arg1(rsi) ,arg2(rax) ,x7(rcx) ,x8(r8) ,x3(r9) ,x4(rbx) ,x20(r14) ,x21(r13) ,x1(r10) ,x2(r11) ,x16(rdi) ,x17(r15) ,x68(r12) ,arg1[3](rdx) ,x99(rbp)
; clobs "x99, x100, arg1[3], arg1, arg2[1], arg2, x99_0, x99_1, x100_0, x100_1"; will spare: x68
; stacking up for x68
mov [ rsp + 0x10 ], r12; spilling x68 to mem
;-- allocation: , x1 r10, x2 r11, x100 r12, x21 r13, x20 r14, x17 r15, x8 r8, x3 r9, arg2 rax, x99 rbp, x4 rbx, x7 rcx, x16 rdi, arg1[3] rdx, arg1 rsi
mulx r12, rbp, [ rax + 0x8 ]; hix100, lox99<- arg1[3] * arg2[1]
; Operation: x105--x106<-mulx(arg1[3], arg2[2]),
; chooseMulxLoadValue upcoming mulxs: x105-x106, x51-x52, x84-x85, x11-x12, x78-x79, x41-x42, x45-x46, x57-x58, x132-x133, x122-x123, x138-x139, x126-x127 counters {"arg1[3]":1,"arg2[2]":1} only two candidates, and they both have the same count value of 1. returning null.
mov rdx, [ rax + 0x10 ]; arg2[2] to rdx
; currently considered reusable:
;-- allocation: , x1 r10, x2 r11, x100 r12, x21 r13, x20 r14, x17 r15, x8 r8, x3 r9, arg2 rax, x99 rbp, x4 rbx, x7 rcx, x16 rdi, arg2[2] rdx, arg1 rsi
; fr: none.
; freeing, i.e. spilling x100, because I am out of ideas
; allocs: arg1(rsi) ,arg2(rax) ,x7(rcx) ,x8(r8) ,x3(r9) ,x4(rbx) ,x20(r14) ,x21(r13) ,x1(r10) ,x2(r11) ,x16(rdi) ,x17(r15) ,x99(rbp) ,x100(r12) ,arg2[2](rdx)
; clobs "x105, x106, arg1[3], arg1, arg2[2], arg2, x105_0, x105_1, x106_0, x106_1"; will spare: x100
; stacking up for x100
mov [ rsp + 0x18 ], r12; spilling x100 to mem
;-- allocation: , x1 r10, x2 r11, x105 r12, x21 r13, x20 r14, x17 r15, x8 r8, x3 r9, arg2 rax, x99 rbp, x4 rbx, x7 rcx, x16 rdi, arg2[2] rdx, arg1 rsi
; fr: none.
; freeing, i.e. spilling x99, because I am out of ideas
; allocs: arg1(rsi) ,arg2(rax) ,x7(rcx) ,x8(r8) ,x3(r9) ,x4(rbx) ,x20(r14) ,x21(r13) ,x1(r10) ,x2(r11) ,x16(rdi) ,x17(r15) ,x99(rbp) ,arg2[2](rdx) ,x105(r12)
; clobs "x105, x106, arg1[3], arg1, arg2[2], arg2, x105_0, x105_1, x106_0, x106_1"; will spare: x99
; stacking up for x99
mov [ rsp + 0x20 ], rbp; spilling x99 to mem
;-- allocation: , x1 r10, x2 r11, x105 r12, x21 r13, x20 r14, x17 r15, x8 r8, x3 r9, arg2 rax, x106 rbp, x4 rbx, x7 rcx, x16 rdi, arg2[2] rdx, arg1 rsi
mulx rbp, r12, [ rsi + 0x18 ]; hix106, lox105<- arg1[3] * arg2[2] version infoandreser@andreser ~/CryptOpt % git log -1 HEAD~
commit b9608a038c9c6b92d81a4fefcc8e57add6cac778
Merge: b92a58c 31e2494
Author: Joel <[email protected]>
Date: Fri Sep 15 10:02:39 2023 +0930
Merge branch 'dev' into feature/reg199
andreser@andreser ~/CryptOpt % git diff HEAD~
diff --git a/src/model/model.class.ts b/src/model/model.class.ts
index b9e1d7b..f723158 100644
--- a/src/model/model.class.ts
+++ b/src/model/model.class.ts
@@ -219,7 +219,7 @@ export class Model {
}
// Idea is:
- // for each candidate, count how many times it is used as an argument in the upcoming mulx operations
+ // for each candidate, count how many consecutive upcoming mulx operations use it
// initalise counters:
const counters: { [candidateName: string]: number } = candidates.reduce(
@@ -231,11 +231,11 @@ export class Model {
.filter((op) => op.operation == "mulx");
msg += ` upcoming mulxs: ${upcomingMulxOps.map((m) => m.name.join("-")).join(", ")}`;
- upcomingMulxOps.forEach((op) => {
- op.arguments.forEach((arg) => {
- if (arg in counters) {
- counters[arg]++;
- }
+ candidates.forEach((arg) => {
+ upcomingMulxOps.every((op) => {
+ if (!op.arguments.includes(arg)) { return false; }
+ counters[arg]++;
+ return true;
});
});
msg += ` counters ${JSON.stringify(counters)}`; |
And the only examples with the constant-loading sillyness are from CryptOpt with my change trying to fix the issue in my last comment, so it's likely that I just made matters worse there and baseline CryptOpt does not have this issue. |
It looks like this line forgets that |
With the linked PR, I encountered the following sillyness: mov r8, 0xffffffff00000001 ; moving imm to reg
mov rdx, r8; 0xffffffff00000001 to rdx
mulx r11, r8, rcx; hix52, lox51<- 0xffffffff00000001 * x1 |
I'll look at this and the PR with the start of next week (04 October or so) |
ironically, manually editing this out can.make the program faster or slower |
CryptOpt code contains more spills than comparable handwritten code, and sometimes more than compiler-generated code. It is not clear whether this is important for performance, but I do have an implementation where CryptOpt is not winning over naive handwritten assembly and that exhibits this pattern.
What if register allocation first identified uses that can be unspilled using x86 memory operands, "allocated" temporaties for which all uses are of this form, and then allocated the remaining variables into registers based on the remaining uses. In particular, I believe this generic heuristic would do a decent job on the adx-multiplication pattern of
mulx a0 b0; mulx a0 b1; mulx b0 b2 ... ; mulx a1 b0; ...
: allb
can stay in memory and only onea
needs to be live at a time. This suggestion is based on the hypothesis that memory-operand access is cheaper than a separate load even if that load is amortized over multiple instructions; I am not sure whether this is always true.Another guess for a tweak would be to consider would be spilling and restoring callee-saved registers using push/pop at the very beginning and end of the function.
Have you tried something like these options already?
The text was updated successfully, but these errors were encountered: