Observations About Perceptron (Threshold) Networks
A threshold activation is `if x > 0.0 { 1.0 } else { 0.0 }`, so it actually yields only 1 bit of information.
Neurons take in these inputs pre-weighted and then apply a new threshold. This is actually a boolean expression: they are (expensively) doing a few operations on a few bits of information (!)
Networks with threshold activation could actually drop all math after the input weighting, and all neuron sums could be treated instead like boolean operations.
The optimization I have in mind is to represent the N inputs to a neuron as a bitstring (broken into registers, or a fixed-size array of register-sized integers). You can then use bitwise masking operations to produce output equivalent to the naive weighted-sum-and-threshold implementation (a rough sketch follows below).
We are simply yielding a single bit (`is_non_negative`), and it is possible to reason about which input combinations yield each of the two values.
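As a rough sketch (not anything in the codebase; the weights are hypothetical, chosen so the boolean form is easy to derive by hand), the naive evaluation and a bitwise equivalent of one particular neuron could look like this:

```rust
/// Naive evaluation of one threshold neuron: weighted sum, then `sum > 0.0`.
fn neuron_naive(inputs: &[f32], weights: &[f32]) -> f32 {
    let sum: f32 = inputs.iter().zip(weights).map(|(i, w)| i * w).sum();
    if sum > 0.0 { 1.0 } else { 0.0 }
}

/// Bitwise equivalent for this particular neuron, with its five inputs packed
/// into the low bits of a `u32`: it fires iff any of inputs 0..=3 is set and
/// input 4 (carrying a strongly negative "inhibitor" weight) is clear.
fn neuron_bitwise(packed: u32) -> u32 {
    (((packed & 0b0_1111) != 0) && ((packed & 0b1_0000) == 0)) as u32
}

fn main() {
    // Hypothetical weights: four small excitatory weights, one strong inhibitor.
    let weights = [0.6, 0.6, 0.6, 0.6, -3.0];
    // Exhaustively check that the two forms agree on all 2^5 input combinations.
    for packed in 0u32..32 {
        let inputs: Vec<f32> = (0..5).map(|b| ((packed >> b) & 1) as f32).collect();
        assert_eq!(neuron_naive(&inputs, &weights) as u32, neuron_bitwise(packed));
    }
}
```

Here two masks and two comparisons replace five multiplications and five additions; the hard part (discussed below) is deriving that mask sequence automatically from arbitrary weights.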
Sign Activation
A sign activation does not yield 1 bit of information.
It has 3 output states (-1, 0, +1) and yields log_2(3) ≈ 1.5849625 bits of information; let's call it 2 bits unless we can find a perfectly optimal packing of the bits into a bitstring that still allows for efficient manipulation.
The optimization could be performed on sign activations as well, although the problem size grows larger and the results will not be as fast generally speaking, as you'll probably need more bitwise operations to produce the same results.
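One packing that would keep the manipulation bitwise (purely illustrative, not something that exists in the code) is two bit-planes per sign vector, one marking positive lanes and one marking negative lanes; lane-wise multiplication of signs then reduces to a handful of ANDs and ORs:

```rust
/// Hypothetical 2-bit packing of sign values: bit i of `pos` means lane i is +1,
/// bit i of `neg` means lane i is -1, both clear means 0 (both set is unused).
#[derive(Clone, Copy)]
struct SignVec {
    pos: u64,
    neg: u64,
}

impl SignVec {
    /// Lane-wise product using only bitwise operations: (+,+) and (-,-) give +,
    /// (+,-) and (-,+) give -, and a zero lane absorbs everything.
    fn mul(self, other: SignVec) -> SignVec {
        SignVec {
            pos: (self.pos & other.pos) | (self.neg & other.neg),
            neg: (self.pos & other.neg) | (self.neg & other.pos),
        }
    }
}

fn main() {
    // Lane 0: (+1)*(-1) = -1, lane 1: (-1)*(-1) = +1, lane 2: 0*(+1) = 0.
    let a = SignVec { pos: 0b001, neg: 0b010 };
    let b = SignVec { pos: 0b100, neg: 0b011 };
    let p = a.mul(b);
    assert_eq!((p.pos, p.neg), (0b010, 0b001));
}
```

The summation and final sign, not shown here, are where the extra bitwise operations relative to the threshold case would pile up.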
Application Heuristic
So, this may all be screwed up by some modern CPUs, but let's analyze in a simplistic way, assuming an equal cost for floating-point and non-arithmetic bitwise operations (not realistic!):
Number of floating point operations in naive impl: 2N (N multiplications, N additions, constant cost for branching / thresholding)
Bitwise operations to get same result: K * N / R,
K: some minimal number of bitwise operations to get the same effect for these particular weights
N: number of inputs
R: size of the biggest suitable register available on the hardware (8 bits, 64 bits, 512 bits, etc).
All of the gain in this model is from the bit-level parallelism (you can do a 64-input neuron in K operations over a 64-bit register, for instance).
Setting KN/R < 2N, i.e. K/R < 2, yields a boundary for using this technique (in terms of instruction count, again under-modeled). For my RP2040 (32-bit registers, so the boundary is K < 64), optimization might be fruitful so long as this inequality is respected (measurements would be better than this, but it's an ok mental model). At max capacity (32 inputs), it certainly seems like doing a few bitwise operations would beat doing 32 floating-point multiplications and sums, especially because floating point does not exist in hardware on a ton of embedded targets.
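The same toy model as code (same caveats; nothing here is measured):

```rust
/// Instruction-count boundary from above: K, N and R mean the same thing as in
/// the prose, and every operation is (unrealistically) assumed to cost the same.
fn bitwise_looks_worthwhile(k: u32, n: u32, r: u32) -> bool {
    let naive_ops = 2 * n;              // N multiplications + N additions
    let input_words = (n + r - 1) / r;  // N input bits packed into ceil(N/R) words
    let bitwise_ops = k * input_words;  // K operations per word of packed inputs
    bitwise_ops < naive_ops
}

fn main() {
    // RP2040-flavoured numbers: 32-bit registers and a 32-input neuron, so
    // anything under 64 bitwise operations wins in this model.
    assert!(bitwise_looks_worthwhile(10, 32, 32));
    assert!(!bitwise_looks_worthwhile(80, 32, 32));
}
```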
Perfect Performance
I'm thinking it may be fruitful to use a boolean expression simplifier: represent the entire network (except the input nodes) as a boolean expression and have the entire thing simplified with global information. There are likely tautologies and recomputations contained in the internal network, and they should all be eliminated by this approach.
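A minimal sketch of what "the network as one boolean expression" might look like before it is handed to a simplifier (the type and names are illustrative, not existing code):

```rust
// Illustrative expression type; a simplifier would rewrite values of this type.
enum BoolExpr {
    Input(usize),        // a single input bit of the network
    Not(Box<BoolExpr>),
    And(Vec<BoolExpr>),
    Or(Vec<BoolExpr>),
}

// A single threshold neuron lowers to one such expression; the whole network
// is obtained by substituting each neuron's expression into its consumers.
// For example, the hand-derived neuron sketched earlier:
fn example_neuron() -> BoolExpr {
    BoolExpr::And(vec![
        BoolExpr::Or((0..4usize).map(BoolExpr::Input).collect()),
        BoolExpr::Not(Box::new(BoolExpr::Input(4))),
    ])
}

fn main() {
    let _network_root = example_neuron();
}
```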
Horrible Problems with Perfect Performance
There are several problems with this:
Boolean simplification is exponential in time in the worst case (!)
The result of the simplification will be a "simplified" boolean expression, but what I actually want is a sequence of bitwise operations that are efficient to perform on particular hardware. Having fewer operations is good and may lead to more efficient bitwise operations, but the two are not the same thing: the operation sequence must be synthesized after the fact, and I do not think LLVM will be able to help with this task. To complicate matters further, evolved networks tend to be irregular, lacking a clean set of layers. If you could evaluate two neurons simultaneously with different bit portions of the same register, this would improve efficiency further; that is hard for me to design by hand, let alone in an automated way.
Bit-level parallelism is always provided by the hardware for these operations, but I may or may not be able to capture this fact in the boolean expression (the fact that these K adjacent bits are bundled together).
The arrangement of the bits inside the registers will have an impact on how simple the operations become. A bad arrangement may require additional masking and shifting, while a good arrangement may require fewer steps (a small illustration follows this list of problems).
No one ever said we absolutely needed to yield 1 bit when x > 0, and a 0 bit otherwise. Reversing these inside the network could make some optimizations more fruitful, so long as the network has the same behavior.
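On the arrangement point, a tiny illustration of why placement matters (positions and values are made up):

```rust
// Two bits that must be ANDed and that sit at the same position in two words
// cost one AND, which also handles the other 31 lanes at the same time:
fn combine_aligned(a: u32, b: u32) -> u32 {
    a & b
}

// Two bits that must be ANDed but live at different positions within one word
// need a shift to line them up first, an extra operation per misaligned pairing:
fn combine_misaligned(x: u32) -> u32 {
    (x >> 3) & x
}

fn main() {
    assert_eq!(combine_aligned(0b1010, 0b0110), 0b0010);
    // Bit 5 of the result pairs original bit 8 with original bit 5.
    assert_eq!(combine_misaligned((1 << 8) | (1 << 5)) & (1 << 5), 1 << 5);
}
```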
I am using this issue to document my thoughts on the problem, and may or may not implement the feature.