Optimizing assembly code
Sometimes the simplest way to write something in assembly code isn't the best. All of your resources are limited: CPU speed, ROM size, RAM space, register use. You can rewrite code to use those resources more efficiently (sometimes by trading one for another).
Most of these tricks come from Jeff's GB Assembly Code Tips v1.0, WikiTI's Z80 Optimization page, z80 Heaven's optimization tutorial, and GBDev Wiki's ASM Snippets. (Note that the Game Boy CPU's assembly is called SM83, or colloquially GBZ80. It is not the same as Z80 assembly; the Z80 CPU has more registers and some different instructions.)
WikiTI's advice fully applies here:
Note that the following tricks act much like a peephole optimizer and are the last optimization step: remember to first optimize your algorithm and register allocation before applying any of the following if you really want the fastest speed and the smallest code.
Also note that nearly every trick makes the code less understandable, so documenting them is a good idea; you can easily forget how one works after a while of not reading that part of the code.
Be warned that some tricks are not exactly equivalent to the normal way and may have exceptions to their use; comments warn about them. Some tricks apply to other cases too, but again you have to be careful.
There are some tricks that are nothing more than the correct use of the available instructions on the Z80. Keeping an instruction set summary helps to visualize what you can do during coding.
(There's also a "cheat sheet" table of instructions summarizing their bytes, cycles, and affected flags, if you don't need a long listing of what each one does.)
- 8-bit registers
  - Set a to 0
  - Increment or decrement a
  - Multiply a by 2
  - Invert the bits of a
  - Rotate the bits of a
  - Load from HRAM to a or from a to HRAM
  - Set a to some constant minus a
  - Set a to one constant or another depending on the carry flag
  - Increment or decrement a when the carry flag is set
  - Increment or decrement a when the carry flag is not set
  - Toggle a between two different constants
  - Divide a by 8 (shift a right 3 bits)
  - Divide a by 16 (shift a right 4 bits)
  - Set a to some value plus or minus carry
  - Add or subtract the carry flag from a register besides a
  - Reverse the bits of a
  - Count the set bits of a register besides a
  - Merge some bits of a register with a
- 16-bit registers
  - Multiply hl by 2
  - Add a to a 16-bit register
  - Subtract a constant from a 16-bit register
  - Set a 16-bit register to a plus a constant
  - Set a 16-bit register to a multiplied by 16
  - Sign-extend a into a 16-bit register
  - Increment or decrement a 16-bit register
  - Add or subtract the carry flag from a 16-bit register
  - Load from an address to hl
  - Load from an address to sp
  - Exchange two 16-bit registers
  - Subtract two 16-bit registers
  - Load two constants into a register pair
  - Load a constant into [hl]
  - Increment or decrement [hl]
  - Load a constant into [hl] and increment or decrement hl
  - Load a register into [hl] and increment or decrement hl
- Multiply
- Branching (control flow)
- Subroutines (functions)
- Jump and lookup tables
Don't do this:
ld a, 0 ; 2 bytes, 2 cycles; no changes to flags
Instead, do this:
xor a ; 1 byte, 1 cycle, sets flags C to 0 and Z to 1
Or do this:
sub a ; 1 byte, 1 cycle, sets flags C to 0 and Z to 1
Don't use the optimized versions if you need to preserve flags. As such, ld a, 0 must be left intact in the code below:
ld a, [wIsTrainerBattle]
and a ; sets flag Z to 1 if [wIsTrainerBattle] == 0 or else to 0
ld a, 0 ; sets a to 0 without affecting flags
jr nz, .is_trainer_battle
... ; is not trainer battle
When possible, avoid doing this:
add 1 ; 2 bytes, 2 cycles; sets carry for -1 to 0 overflow
sub 1 ; 2 bytes, 2 cycles; sets carry for 0 to -1 underflow
If you don't need to set the carry flag, then do this:
inc a ; 1 byte, 1 cycle
dec a ; 1 byte, 1 cycle
Don't do this:
sla a ; 2 bytes, 2 cycles
Instead, do this:
add a ; 1 byte, 1 cycle
Don't do this:
xor $ff ; 2 bytes, 2 cycles
Instead, do this:
cpl ; 1 byte, 1 cycle
Don't do this:
rl a ; 2 bytes, 2 cycles; updates Z and C flags
rlc a ; 2 bytes, 2 cycles; updates Z and C flags
rr a ; 2 bytes, 2 cycles; updates Z and C flags
rrc a ; 2 bytes, 2 cycles; updates Z and C flags
Instead, do this:
rla ; 1 byte, 1 cycle; updates C flag
rlca ; 1 byte, 1 cycle; updates C flag
rra ; 1 byte, 1 cycle; updates C flag
rrca ; 1 byte, 1 cycle; updates C flag
The exception is if you need to set the zero flag when the operation results in 0 for a; the two-byte operations can set z, the one-byte operations cannot.
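For instance, a hypothetical check that branches when the rotated result is zero has to keep the two-byte form:
; minimal sketch; the one-byte rrca always clears the Z flag, so it could not be used for this test
rrc a ; 2 bytes, 2 cycles; sets Z if the rotated result is 0
jr z, .was_zero ; hypothetical label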
Don't do this:
ld a, [hFoobar] ; 3 bytes, 4 cycles
ld [hFoobar], a ; 3 bytes, 4 cycles
Instead, do this:
ldh a, [hFoobar] ; 2 bytes, 3 cycles
ldh [hFoobar], a ; 2 bytes, 3 cycles
("What's foobar?")
Don't do this:
; 4 bytes, 4 cycles
ld b, a
ld a, FOOBAR
sub b
Instead, do this:
; 3 bytes, 3 cycles
cpl
add FOOBAR + 1
(The example sets a to CVAL if the carry flag is set (c), or to NCVAL if the carry flag is not set (nc).)
Don't do this:
; 6 bytes, 6 or 7 cycles
ld a, CVAL
jr c, .carry
ld a, NCVAL
.carry
And don't do this:
; 6 bytes, 6 or 7 cycles
ld a, NCVAL
jr nc, .no_carry
ld a, CVAL
.no_carry
And if either is 0, don't do this:
; 5 bytes, 5 cycles
ld a, CVAL ; nor NCVAL
jr c, .carry ; nor jr nc
xor a
.carry
And if either is 1 more or less than the other, don't do this:
; 5 bytes, 5 cycles
ld a, CVAL ; nor NCVAL
jr c, .carry ; nor jr nc
inc a ; nor dec a
.carry
Instead use sbc a, which copies the carry flag to all bits of a. So do this:
; 5 bytes, 5 cycles
sbc a ; if carry, then $ff, else 0
and CVAL - NCVAL ; $ff becomes CVAL - NCVAL, 0 stays 0
add NCVAL ; CVAL - NCVAL becomes CVAL, 0 becomes NCVAL
Or do this:
; 5 bytes, 5 cycles
sbc a ; if carry, then $ff, else 0
and CVAL ^ NCVAL ; $ff becomes CVAL ^ NCVAL, 0 stays 0
xor NCVAL ; CVAL ^ NCVAL becomes CVAL, 0 becomes NCVAL
And if certain conditions apply, then do something more efficient:
If CVAL == $ff and NCVAL == 0, then do this:
; 1 byte, 1 cycle
sbc a ; if carry, then $ff, else 0
If CVAL == 0 and NCVAL == $ff, then do this:
; 2 bytes, 2 cycles
ccf ; invert carry flag
sbc a ; if originally carry, then 0, else $ff
If CVAL == 0 and NCVAL == 1, then do this:
; 2 bytes, 2 cycles
sbc a ; if carry, then $ff aka -1, else 0
inc a ; -1 becomes 0, 0 becomes 1
If CVAL == $ff, then do this:
; 3 bytes, 3 cycles
sbc a ; if carry, then $ff, else 0
or NCVAL ; $ff stays $ff, $00 becomes NCVAL
If NCVAL == 0, then do this:
; 3 bytes, 3 cycles
sbc a ; if carry, then $ff, else 0
and CVAL ; $ff becomes CVAL, 0 stays 0
If CVAL == NCVAL - 1, then do this:
; 3 bytes, 3 cycles
sbc a ; if carry, then $ff aka -1, else 0
add NCVAL ; -1 becomes NCVAL - 1 aka CVAL, 0 becomes NCVAL
If CVAL == NCVAL - 2, then do this:
; 3 bytes, 3 cycles
sbc a ; if carry, then $ff aka -1, else 0; doesn't change carry
sbc -NCVAL ; -1 becomes NCVAL - 2 aka CVAL, 0 becomes NCVAL
If CVAL == 0, then do this:
; 4 bytes, 4 cycles
ccf ; invert carry flag
sbc a ; if originally carry, then 0, else $ff
and NCVAL ; 0 stays 0, $ff becomes NCVAL
If NCVAL == $ff, then do this:
; 4 bytes, 4 cycles
ccf ; invert carry flag
sbc a ; if originally carry, then 0, else $ff
or CVAL ; $00 becomes CVAL, $ff stays $ff
If NCVAL == CVAL - 1, then do this:
; 4 bytes, 4 cycles
ccf ; invert carry flag
sbc a ; if originally carry, then 0, else $ff aka -1
add CVAL ; -1 becomes CVAL - 1 aka NCVAL, 0 becomes CVAL
If NCVAL == CVAL - 2, then do this:
; 4 bytes, 4 cycles
ccf ; invert carry flag
sbc a ; if carry, then 0, else $ff aka -1; doesn't change carry
sbc -CVAL ; -1 becomes CVAL - 2 aka NCVAL, 0 becomes CVAL
Don't do this:
; 3 bytes, 3 cycles
jr nc, .ok
inc a
.ok
; 3 bytes, 3 cycles
jr nc, .ok
dec a
.ok
Instead, do this:
adc 0 ; 2 bytes, 2 cycles
sbc 0 ; 2 bytes, 2 cycles
Don't do this:
; 3 bytes, 3 cycles
jr c, .ok
inc a
.ok
; 3 bytes, 3 cycles
jr c, .ok
dec a
.ok
Instead, do this:
sbc -1 ; 2 bytes, 2 cycles
adc -1 ; 2 bytes, 2 cycles
Don't do this:
; 12 bytes, 9 or 10 cycles
cp FOO
jr z, .foo_to_bar
jr .bar_to_foo
.foo_to_bar
ld a, BAR
jr .done
.bar_to_foo
ld a, FOO
.done
...
And don't do this:
; 10 bytes, 7 or 9 cycles
cp FOO
jr z, .foo_to_bar ; nor jr nz, .bar_to_foo
ld a, FOO ; nor ld a, BAR
jr .done
.foo_to_bar ; nor .bar_to_foo
ld a, BAR ; nor ld a, FOO
.done
...
(That would be applying the "Conditional fallthrough" optimization to the first way.)
Instead, do this:
xor FOO ^ BAR ; 2 bytes, 2 cycles
(This works for the same reason as the XOR swap algorithm for swapping the values of two variables.)
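As a worked sketch, suppose FOO equ $03 and BAR equ $05 (hypothetical values), so FOO ^ BAR is $06:
; if a == $03 (FOO): $03 xor $06 == $05, i.e. BAR
; if a == $05 (BAR): $05 xor $06 == $03, i.e. FOO
xor FOO ^ BAR ; here, xor $06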
Don't do this:
; 6 bytes, 9 cycles
; (15 bytes, at least 21 cycles, counting the definition of SimpleDivide)
ld c, 8 ; divisor
call SimpleDivide
ld a, b ; quotient
And don't do this:
; 6 bytes, 6 cycles
srl a
srl a
srl a
Instead, do this:
; 5 bytes, 5 cycles
rrca
rrca
rrca
and %00011111
Don't do this:
; 6 bytes, 9 cycles
; (15 bytes, at least 21 cycles, counting the definition of SimpleDivide)
ld c, 16 ; divisor
call SimpleDivide
ld a, b ; quotient
And don't do this:
; 8 bytes, 8 cycles
srl a
srl a
srl a
srl a
Instead, do this:
; 4 bytes, 4 cycles
swap a
and $f
(The example uses b and c, but any registers besides a would also work, including [hl].)
Don't do this:
; 4 bytes, 4 cycles
ld b, a
ld a, c
adc 0
; 4 bytes, 4 cycles
ld b, a
ld a, c
sbc 0
And don't do this:
; 4 bytes, 4 cycles
ld b, a
ld a, 0
adc c
; 4 bytes, 4 cycles
ld b, a
ld a, 0
sbc c
Instead, do this:
; 3 bytes, 3 cycles
ld b, a ; back up the original a
adc c ; a + c + carry
sub b ; minus the original a, leaving c plus carry
; 3 bytes, 3 cycles
ld b, a ; back up the original a
sbc b ; a - a - carry, i.e. 0 minus carry
add c ; c minus carry
Also, don't do this:
; 5 bytes, 5 cycles
ld b, a
ld a, N
adc 0
; 5 bytes, 5 cycles
ld b, a
ld a, N
sbc 0
And don't do this:
; 5 bytes, 5 cycles
ld b, a
ld a, 0
adc N
; 5 bytes, 5 cycles
ld b, a
ld a, 0
sbc N
Instead, do this:
; 4 bytes, 4 cycles
ld b, a
adc N
sub b
; 4 bytes, 4 cycles
ld b, a
sbc b
add N
(If the original value of a was not backed up in b, this optimization would not apply.)
(The example uses b, but any of c, d, e, h, l, or [hl] would also work.)
Don't do this:
; 4 bytes, 4 cycles
ld a, b
adc 0
ld b, a
; 4 bytes, 4 cycles
ld a, b
sbc 0
ld b, a
And don't do this:
; 4 bytes, 4 cycles
ld a, 0
adc b
ld b, a
; 4 bytes, 4 cycles
ld a, 0
sbc b
ld b, a
Instead, do this:
; 3 bytes, 3 or 4 cycles
jr nc, .no_carry
inc b
.no_carry
; 3 bytes, 3 or 4 cycles
jr nc, .no_carry
dec b
.no_carry
(This optimization is based on Retro Programming).
(The example uses b, but any of c, d, e, h, l, or [hl] would also work.)
Don't do this:
; 26 bytes, 26 cycles
rept 8
rra ; nor rla
rl b ; nor rr b
endr
ld a, b
And don't do this:
; 17 bytes, 17 cycles
ld b, a
rlca
rlca
xor b
and $aa
xor b
ld b, a
rlca
rlca
rlca
rrc b
xor b
and $66
xor b
Instead, do this:
; 15 bytes, 15 cycles
ld b, a
rlca
rlca
xor b
and $aa
xor b
ld b, a
swap b
xor b
and $33
xor b
rrca
Or if you really want to optimize for size over speed, then don't do this:
; 10 bytes, 59 cycles
ld bc, 8 ; lb bc, 0, 8
.loop
rra ; nor rla
rl b ; nor rr b
dec c
jr nz, .loop
ld a, b
Instead, do this:
; 8 bytes, 50 cycles
ld b, 1
.loop
rra
rl b
jr nc, .loop
ld a, b
Or if you can spare hl, then do this:
; 7 bytes, 50 cycles
ld h, a
ld a, $80
.loop
add hl, hl
rra
jr nc, .loop
Or if you really want to optimize for speed over total size, then do this:
; 6 bytes, 12 cycles
; (4 bytes, 5 cycles if you don't need the push hl/pop hl)
push hl
ld h, HIGH(ReversedBitTable)
ld l, a
ld a, [hl]
pop hl
; 256 bytes; placed in ROM0 or the same ROMX section as the bit reversal
SECTION "ReversedBitTable", ROM0, ALIGN[8]
ReversedBitTable::
for x, 256
; http://graphics.stanford.edu/~seander/bithacks.html#ReverseByteWith32Bits
db LOW(((((x * $802) & $22110) | ((x * $8020) & $88440)) * $10101) >> 16)
endr
(This optimization is based on WikiTI).
(The examples count the set bits of c and also use b, but any registers besides a would also work.)
Don't do this:
; 26 bytes, 26 cycles
xor a
ld b, a
rept 8
rrc c
adc b
endr
Instead, do this:
; 20 bytes, 20 cycles
ld a, c
and $aa
cpl
rrca
adc c
ld b, a
and $33
ld c, a
xor b
rrca
rrca
add c
ld c, a
swap a
add c
and $0f
Or if you want to optimize for size over speed, then don't do this:
; 12 bytes, 68 cycles; counts bits in c, uses b
ld a, c
ld bc, $800 ; lb bc, 8, 0
.loop
add a
jr nc, .next
inc c
.next
dec b
jr nz, .loop
ld a, c
But do this:
; 11 bytes, up to 67 cycles; counts bits in c
ld a, c
ld c, 0
.loop
add a
jr nc, .next
inc c
.next
and a
jr nz, .loop
ld a, c
(This optimization is based on Bit Twiddling Hacks).
(The example uses b, but any of c, d, e, h, l, or [hl] would also work.)
Don't do this:
; 7 bytes, 7 cycles; sets a = (a & MASK) | (b & ~MASK)
and MASK
ld c, a
ld a, b
and ~MASK ; or $ff ^ MASK, or $ff - MASK
or c
Instead, do this:
; 4 bytes, 4 cycles; no third register
xor b
and MASK
xor b
(For example, if MASK were $f0, then ~MASK would be $0f, and this would merge the high nybble of a with the low nybble of b.)
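Here is the same trick traced with hypothetical values a == $ab, b == $cd, and MASK equ $f0:
xor b ; a == $ab xor $cd == $66
and $f0 ; a == $60, keeping only the MASK bits
xor b ; a == $60 xor $cd == $ad: high nybble from the old a, low nybble from b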
Or if you can spare hl, then don't do this:
; 9 bytes, 10 cycles; sets a = (a & MASK) | ([wFoobar] & ~MASK)
and MASK
ld c, a
ld a, [wFoobar]
and ~MASK
or c
Instead, do this:
; 7 bytes, 9 cycles; uses hl
ld hl, wFoobar
xor [hl]
and MASK
xor [hl]
Don't do this:
; 4 bytes, 4 cycles
sla l
rl h
Instead, do this:
add hl, hl ; 1 byte, 2 cycles
(The example uses hl, but bc or de would also work.)
Don't do this:
; 6 bytes, 6 cycles
add l
ld l, a
ld a, 0
adc h
ld h, a
And don't do this:
; 6 bytes, 6 cycles
add l
ld l, a
ld a, h
adc 0
ld h, a
And don't do this:
; 5 bytes, 5 cycles
add l
ld l, a
jr nc, .no_carry
inc h
.no_carry
Instead, do this:
; 5 bytes, 5 cycles; no labels
add l ; a + l, the new low byte; may set carry
ld l, a
adc h ; (a + l) + h + carry
sub l ; minus the new low byte, leaving h plus carry
ld h, a
Or if you can spare another 16-bit register and want to optimize for size over speed, then do this:
; 4 bytes, 5 cycles
ld d, 0
ld e, a
add hl, de
(The example uses hl, but bc or de would also work.)
Do this:
; 8 bytes, 8 cycles
ld a, l
sub LOW(FooBar)
ld l, a
ld a, h
sbc HIGH(FooBar)
ld h, a
Or if the constant is 8-bit (i.e. HIGH(FooBar) == 0), then do this:
; 7 bytes, 7 cycles
ld a, l
sub FooBar
ld l, a
jr nc, .no_carry
dec h
.no_carry
(This is a case of "Add or subtract the carry flag from a register besides a", applied to the high part of a 16-bit register.)
Or if you can spare another 16-bit register, do this:
; 4 bytes, 5 cycles
ld de, -FooBar
add hl, de
(The example uses hl, but bc or de would also work.)
Don't do this:
; 7 bytes, 8 cycles; uses another 16-bit register
ld e, a
ld d, 0
ld hl, FooBar
add hl, de
And don't do this:
; 8 bytes, 8 cycles
ld hl, FooBar
add l
ld l, a
adc h
sub l
ld h, a
And don't do this:
; 8 bytes, 8 cycles
ld h, HIGH(FooBar)
add LOW(FooBar)
ld l, a
jr nc, .no_carry
inc h
.no_carry
Instead, do this:
; 7 bytes, 7 cycles
add LOW(FooBar)
ld l, a
adc HIGH(FooBar)
sub l
ld h, a
Or if the constant is 8-bit and nonzero (i.e. 0 < FooBar < 256), then do this:
; 6 bytes, 6 cycles
sub LOW(-FooBar) ; same low byte as a + FooBar; carry is set unless the sum passed $ff
ld l, a
sbc a ; $ff if carry (no overflow into the high byte), else 0
inc a ; so h is 0 if there was no overflow, 1 if there was
ld h, a
Or if the constant is zero (i.e. FooBar == 0 and a + FooBar == a), then do this:
; 3 bytes, 3 cycles
ld l, a
ld h, 0
(The example uses hl, but bc or de would also work.)
You can do this:
; 7 bytes, 11 cycles
ld l, a
ld h, 0
add hl, hl
add hl, hl
add hl, hl
add hl, hl
; 7 bytes, 11 cycles
ld l, a
ld h, 0
rept 4
add hl, hl
endr
But if a is definitely small enough, and its value can be changed, then do one of these:
; 7 bytes, 10 cycles; sets a = a * 2; requires a < $80
add a
ld l, a
ld h, 0
add hl, hl
add hl, hl
add hl, hl
; 7 bytes, 9 cycles; sets a = a * 4; requires a < $40
add a
add a
ld l, a
ld h, 0
add hl, hl
add hl, hl
; 7 bytes, 8 cycles; sets a = a * 8; requires a < $20
add a
add a
add a
ld l, a
ld h, 0
add hl, hl
; 5 bytes, 5 cycles; sets a = a * 16; requires a < $10
swap a
ld l, a
ld h, 0
Or if the value of a can be changed and you want to optimize for speed over size, then do one of these:
; 8 bytes, 8 cycles; sets a = l
swap a
ld l, a
and $f
ld h, a
xor l
ld l, a
; 8 bytes, 8 cycles; sets a = h
swap a
ld h, a
and $f0
ld l, a
xor h
ld h, a
(This optimization is based on Plutiedev and GBDev Wiki's ASM Snippets.)
(The example uses hl, but bc or de would also work.)
Don't do this:
; 10 bytes, 9 or 10 cycles
ld l, a
cp $80 ; nor bit 7, a
ld a, $00
jr c, .ok ; nor jr z, .ok
ld a, $ff
.ok
ld h, a
And don't do these:
; 9 bytes, 8 or 9 cycles
ld l, a
cp $80 ; nor bit 7, a
ld a, $00
jr c, .ok ; nor jr z, .ok
dec a
.ok
ld h, a
; 9 bytes, 8 or 9 cycles
ld l, a
cp $80 ; nor bit 7, a
ld a, $ff
jr nc, .ok ; nor jr nz, .ok
inc a
.ok
ld h, a
And don't do these:
; 9 bytes, 8 or 9 cycles
ld l, a
rlca ; nor add a
ld a, $00
jr nc, .ok
ld a, $ff
.ok
ld h, a
; 9 bytes, 8 or 9 cycles
ld l, a
rlca ; nor add a
ld a, $ff
jr c, .ok
ld a, $00
.ok
ld h, a
(Those would be applying the "Test whether a is negative (compare a to $80)" optimization to the first way.)
And don't do this:
; 6 bytes, 6 cycles
ld l, a
cp $80
ccf
sbc a
ld h, a
(That would be applying the "Set a to one constant or another depending on the carry flag" optimization to the first way.)
Instead, do this:
; 4 bytes, 4 cycles
ld l, a
rlca ; or add a
sbc a
ld h, a
(That applies both optimizations to the first way.)
When possible, avoid doing this:
inc hl ; 1 byte, 2 cycles
dec hl ; 1 byte, 2 cycles
If the low byte definitely won't overflow, then do this:
inc l ; 1 byte, 1 cycle
dec l ; 1 byte, 1 cycle
This is applicable, for instance, if you're reading a data table via hl one byte at a time, it has no more than 256 entries, and it's in its own SECTION which has been ALIGNed to 8 bits. It's unlikely to apply to pokecrystal's existing systems.
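A minimal sketch of such a case, with hypothetical names (none of this is in pokecrystal): the table gets its own ALIGNed section, and since there is no "add [hli]", the pointer has to be advanced separately, where inc l is safe.
SECTION "ExampleTable", ROMX, ALIGN[8] ; starts at an address whose low byte is 0
ExampleTable:
db 3, 1, 4, 1, 5, 9 ; no more than 256 entries
ExampleTableEnd:
; summing the entries elsewhere:
ld hl, ExampleTable
ld b, ExampleTableEnd - ExampleTable
xor a
.loop
add [hl]
inc l ; 1 byte, 1 cycle; l can't wrap past $ff within the table
dec b
jr nz, .loop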
(The example uses hl, but bc or de would also work.)
Don't do this:
; 8 bytes, 8 cycles
ld a, l ; nor ld a, 0
adc 0 ; nor adc l
ld l, a
ld a, h ; nor ld a, 0
adc 0 ; nor adc h
ld h, a
; 8 bytes, 8 cycles
ld a, l ; nor ld a, 0
sbc 0 ; nor sbc l
ld l, a
ld a, h ; nor ld a, 0
sbc 0 ; nor sbc h
ld h, a
And don't do this:
; 7 bytes, 7 cycles
ld a, l ; nor ld a, 0
adc 0 ; nor adc l
ld l, a
adc h
sub l
ld h, a
; 7 bytes, 7 cycles
ld a, l ; nor ld a, 0
sbc 0 ; nor sbc l
ld l, a
sbc h
add l
ld h, a
(That would be applying the "Set a to some value plus or minus carry" optimization to part of the first way.)
And don't do this:
; 7 bytes, 7 or 8 cycles
ld a, l ; nor ld a, 0
adc 0 ; nor adc l
ld l, a
jr nc, .no_carry
inc h
.no_carry
; 7 bytes, 7 or 8 cycles
ld a, l ; nor ld a, 0
sbc 0 ; nor sbc l
ld l, a
jr nc, .no_carry
dec h
.no_carry
(That would be applying the "Add or subtract the carry flag from a register besides a" optimization to part of the first way.)
Instead, do this:
; 3 bytes, 4 or 5 cycles
jr nc, .no_carry
inc hl
.no_carry
; 3 bytes, 4 or 5 cycles
jr nc, .no_carry
dec hl
.no_carry
Don't do this:
; 8 bytes, 10 cycles
ld a, [wFoobar] ; LSB first
ld l, a
ld a, [wFoobar+1]
ld h, a
Instead, do this:
; 6 bytes, 8 cycles
ld hl, wFoobar
ld a, [hli]
ld h, [hl]
ld l, a
And don't do this:
; 8 bytes, 10 cycles
ld a, [wFoobar] ; MSB first
ld h, a
ld a, [wFoobar+1]
ld l, a
Instead, do this:
; 6 bytes, 8 cycles
ld hl, wFoobar
ld a, [hli]
ld l, [hl]
ld h, a
Don't do this:
; 9 bytes, 12 cycles
ld a, [wFoobar]
ld l, a
ld a, [wFoobar+1]
ld h, a
ld sp, hl
And don't do this:
; 7 bytes, 10 cycles
ldh a, [hFoobar]
ld l, a
ldh a, [hFoobar+1]
ld h, a
ld sp, hl
And don't do this:
; 7 bytes, 10 cycles
ld hl, wFoobar
ld a, [hli]
ld h, [hl]
ld l, a
ld sp, hl
(That would be applying the "Load from an address to hl" optimization to the first way.)
Instead, do this:
; 5 bytes, 8 cycles
ld sp, wFoobar
pop hl
ld sp, hl
Or if the address is already in hl, then don't do this:
; 4 bytes, 7 cycles
ld a, [hli]
ld h, [hl]
ld l, a
ld sp, hl
Instead, do this:
; 3 bytes, 7 cycles
ld sp, hl
pop hl
ld sp, hl
(The example uses hl and de, but any pair of bc, de, or hl would also work.)
If you care about speed, then do this:
; 6 bytes, 6 cycles
ld a, d
ld d, h
ld h, a
ld a, e
ld e, l
ld l, a
If you care about size, then do this:
; 4 bytes, 9 cycles
push de
ld d, h
ld e, l
pop hl
(The example uses hl and de, but any pair of bc, de, or hl would also work.)
Don't do this:
; 9 bytes, 10 cycles; modifies subtrahend de
ld a, $ff
xor d
ld d, a
ld a, $ff
xor e
ld e, a
add hl, de
And don't do this:
; 7 bytes, 8 cycles; modifies subtrahend de
ld a, d
cpl
ld d, a
ld a, e
cpl
ld e, a
add hl, de
Instead, do this:
; 6 bytes, 6 cycles
ld a, l
sub e
ld l, a
ld a, h
sbc d
ld h, a
(The example uses bc, but hl or de would also work.)
Don't do this:
; 4 bytes, 4 cycles
ld b, FOO
ld c, BAR
Instead, do this:
ld bc, FOO << 8 | BAR ; 3 bytes, 3 cycles
Or better, use the lb macro in macros/code.asm:
lb bc, FOO, BAR ; 3 bytes, 3 cycles
Don't do this:
; 3 bytes, 4 cycles
ld a, FOOBAR
ld [hl], a
Instead, do this:
ld [hl], FOOBAR ; 2 bytes, 3 cycles
Don't do this:
; 3 bytes, 5 cycles
ld a, [hl]
inc a
ld [hl], a
; 3 bytes, 5 cycles
ld a, [hl]
dec a
ld [hl], a
Instead, do this:
inc [hl] ; 1 byte, 3 cycles
dec [hl] ; 1 byte, 3 cycles
Don't do this:
; 2 bytes, 4 cycles
ld [hl], a
inc hl
; 2 bytes, 4 cycles
ld [hl], a
dec hl
Instead, do this:
ld [hli], a ; 1 byte, 2 cycles
ld [hld], a ; 1 byte, 2 cycles
And if you can use a, then don't do this:
; 3 bytes, 5 cycles
ld [hl], FOO
inc hl
; 3 bytes, 5 cycles
ld [hl], FOO
dec hl
Instead, do this:
; 3 bytes, 4 cycles
ld a, FOO
ld [hli], a
; 3 bytes, 4 cycles
ld a, FOO
ld [hld], a
(The example uses b, but any of c, d, e, h, or l would also work.)
Do this:
; 2 bytes, 4 cycles
ld [hl], b
inc hl
; 2 bytes, 4 cycles
ld [hl], b
dec hl
Or if you can use a, then do this:
; 2 bytes, 3 cycles
ld a, b
ld [hli], a
; 2 bytes, 3 cycles
ld a, b
ld [hld], a
Don't do this:
jp Somewhere ; 3 bytes, 4 cycles
Instead, do this:
jr Somewhere ; 2 bytes, 3 cycles
This only applies if Somewhere is within ±128 bytes of the jump.
You can define a jmp macro to use instead of jp, which will warn you when it can be jr instead:
MACRO jmp
if _NARG == 1
jp \1
else
jp \1, \2
shift
endc
assert warn, (\1) - @ > 127 || (\1) - @ < -129, "jp can be jr"
ENDM
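For example (with hypothetical labels), uses look just like jp:
jmp SomewhereFar ; expands to jp SomewhereFar
jmp z, SomewhereNear ; expands to jp z, SomewhereNear and warns at assembly time if jr z would reach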
Don't do this:
cp 0 ; 2 bytes, 2 cycles
And don't do this:
or 0 ; 2 bytes, 2 cycles
And don't do this:
and $ff ; 2 bytes, 2 cycles
Instead, do this:
or a ; 1 byte, 1 cycle
Or do this:
and a ; 1 byte, 1 cycle
Do this:
cp 1 ; 2 bytes, 2 cycles; updates Z and C flags
Or if you don't care about the value in a, and don't need to set the carry flag, then do this:
dec a ; 1 byte, 1 cycle; decrements a, updates Z flag
Note that you can still do inc a afterwards, which is one cycle faster if the jump is taken. Compare this:
; 4 bytes, 4 or 5 cycles
cp 1
jr z, .equals1
with this:
; 4 bytes, 4 cycles
dec a
jr z, .equals1
inc a
(255, or $FF in hexadecimal, is the same as −1 due to two's complement.)
Do this:
cp $ff ; 2 bytes, 2 cycles; updates Z and C flags
Or if you don't care about the value in a, and don't need to set the carry flag, then do this:
inc a ; 1 byte, 1 cycle; increments a, updates Z flag
Note that you can still do dec a afterwards, which is one cycle faster if the jump is taken. Compare this:
; 4 bytes, 4 or 5 cycles
cp $ff
jr z, .equals255
with this:
; 4 bytes, 4 cycles
inc a
jr z, .equals255
dec a
Don't do this:
; 3 bytes, 3 cycles; sets zero flag if a == 0
and MASK
and a
Instead, do this:
and MASK ; 2 bytes, 2 cycles; sets zero flag if a == 0
Don't do this:
; 4 bytes, 4 cycles; sets zero flag if a == MASK and carry flag if a < MASK
and MASK
cp MASK
If you don't need to set the carry flag, and don't need the masked value of a, then do this:
; 3 bytes, 3 cycles; sets zero flag if a was equal to MASK
or ~MASK ; or $ff ^ MASK, or $ff - MASK
inc a
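For instance, with a hypothetical MASK of %00000011 (testing whether both low bits of a are set):
; if both MASK bits of a are set, or gives $ff and inc a gives 0, setting Z
; otherwise the result of or is below $ff and Z stays clear
or %11111100 ; ~MASK
inc a
jr z, .both_bits_set ; hypothetical label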
If you don't need to preserve the value in a, then don't do this:
; 4 bytes, 4 or 5 cycles
cp $80
jr nc, .negative
And don't do this:
; 4 bytes, 4 or 5 cycles
bit 7, a
jr nz, .negative
Instead, do this:
; 3 bytes, 3 or 4 cycles; modifies a
rlca ; or add a
jr c, .negative
Don't do this:
; 4 bytes, 10 cycles
call Function
ret
Instead, do this:
jp Function ; 3 bytes, 4 cycles
Don't do this:
; 5 bytes, 8 cycles
(some code)
ld de, .return
push de
jp hl
.return:
(some more code)
Instead, do this:
; 3 bytes, 6 cycles
; (4 bytes, 7 cycles, counting the definition of _hl_)
(some code)
call _hl_
(some more code)
_hl_ is a routine already defined in home/call_regs.asm:
_hl_::
jp hl
Don't do this:
; 4 additional bytes, 10 additional cycles
(some code)
call Function
(some more code)
Function:
(function code)
ret
if Function is only called a handful of times. Instead, do:
(some code)
; Function
(function code)
(some more code)
You shouldn't do this if Function used any returns besides the one at the very end, or if inlining its code would make some jr instructions too distant from their targets.
Don't do this:
(some code)
call Function
ret
Function:
(function code)
ret
And don't do this:
(some code)
jp Function
Function:
(function code)
ret
Instead, do this:
(some code)
; fallthrough
Function:
(function code)
ret
Fallthrough is what you get when you combine inlining with tail calls. You can still call Function elsewhere, but one tail call can be optimized into a fallthrough.
(The example uses z, but nz, c, or nc would also work.)
Don't do this:
(some code)
jr z, .foo
jr .bar
.foo
(foo code)
.bar
(bar code)
Instead, do this:
(some code)
jr nz, .bar
; fallthrough
.foo
(foo code)
.bar
(bar code)
(The example uses z, but nz, c, or nc would also work.)
Don't do this:
; 3 bytes, 3 or 6 cycles
jr z, .skip
ret
.skip
...
And don't do this:
; 3 bytes, 7 or 2 cycles
jr nz, .return
...
.return
ret
Instead, do this:
; 1 byte, 5 or 2 cycles
ret nz
...
(The example uses z, but nz, c, or nc would also work.)
Don't do this:
; 5 bytes, 3 or 9 cycles
jr nz, .skip
call Foo
.skip
Instead, do this:
; 3 bytes, 6 or 3 cycles
call z, Foo
And don't do this:
; 5 bytes, 3 or 9 cycles
jr nz, .skip
jp Foo
.skip
Instead, do this:
; 3 bytes, 6 or 3 cycles
jp z, Foo
(The example uses z, but nz, c, or nc would also work.)
Don't do this:
; 5 bytes, 3 or 14 cycles
call z, RstVector38
...
RstVector38:
rst $38
ret
And don't do this:
; 3 bytes, 3 or 6 cycles
jr nz, .no_rst_38
rst $38
.no_rst_38
...
And don't do this:
; 3 bytes, 3 or 6 cycles
call z, $0038
...
Instead, do this:
; 2 bytes, 2 or 7 cycles
jr z, @ + 1 ; the byte for @ + 1 is $ff, which is the opcode for rst $38
...
(The label @ evaluates to the current pc value, which in jr z, @ + 1 is the address of the jr instruction itself. The instruction consists of two bytes, the opcode and the relative offset, so @ + 1 is the address of the offset byte. The jr instruction encodes its offset relative to the end of the instruction, i.e. the next pc value after the instruction has been read, so the relative offset here is -1, aka $ff.)
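In other words, the assembled bytes are equivalent to this (db shown only for illustration):
db $28, $ff ; $28 is the jr z opcode; $ff is both the offset -1 and the rst $38 opcode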
Don't do this:
; 2 bytes, 5 cycles
ei
ret
Instead, do this:
; 1 byte, 4 cycles
reti
Don't do this:
cp 1
jr z, .equals1
cp 2
jr z, .equals2
cp 3
jr z, .equals3
...
Instead, do this:
dec a
jr z, .equals1
dec a
jr z, .equals2
dec a
jr z, .equals3
...
Or do this:
dec a
ld hl, .jumptable
ld e, a
ld d, 0
add hl, de
add hl, de
ld a, [hli]
ld h, [hl]
ld l, a
jp hl
.jumptable:
dw .equals1
dw .equals2
dw .equals3
...
Or better, do:
dec a
ld hl, .jumptable
rst JumpTable
ret
.jumptable:
dw .equals1
dw .equals2
dw .equals3
...
JumpTable is an rst routine already defined in home/header.asm:
JumpTable::
push de
ld e, a
ld d, 0
add hl, de
add hl, de
ld a, [hli]
ld h, [hl]
ld l, a
pop de
jp hl
Don't do this:
ld hl, Foo
ld bc, BAR
dec a
call AddNTimes
Instead, as long as you don't need to add 255 times when a is 0, then do this:
ld hl, Foo - BAR
ld bc, BAR
call AddNTimes
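(Since AddNTimes adds bc to hl a times, both versions end up at Foo + (a - 1) * BAR; the second one just folds the dec a into hl's starting value. With a == 0, though, it gives Foo - BAR instead of adding 255 times, hence the caveat.)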