Threshold no-math optimization #3

Open
wbrickner opened this issue May 18, 2022 · 0 comments
Labels
enhancement New feature or request

wbrickner commented May 18, 2022

Observations About Perceptron (Threshold) Networks

  • A threshold is `if x > 0.0 { 1.0 } else { 0.0 }`, actually yielding only 1 bit of information.
  • Neurons take in these inputs pre-weighted and then apply a new threshold. This is really a boolean expression: the network is (expensively) doing floating-point operations on a few bits of information (!)
  • Networks with threshold activation could therefore drop all math after the input weighting: every neuron's weighted sum and threshold could instead be treated as a boolean operation.

The optimization I have in mind is to represent the N inputs to a neuron as a bitstring (broken into registers, or a fixed-size array of register-sized integers). You can then use bitwise masking operations to produce output equivalent to the naive weighted-sum-and-threshold implementation.

We are simply yielding a single bit (`is_non_negative`), so it is possible to reason about which input combinations can yield each of the two output values.
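To make this concrete, here is a minimal sketch assuming the weights have been quantized to {-1, 0, +1}, so the weighted sum collapses to two popcounts over masks. That quantization is an assumption I'm introducing for illustration (arbitrary weights need a synthesized boolean expression, discussed below), and the `BitNeuron` name and its fields are hypothetical:

```rust
/// Minimal sketch, assuming weights quantized to {-1, 0, +1}.
/// `pos_mask` marks inputs with weight +1, `neg_mask` inputs with weight -1.
struct BitNeuron {
    pos_mask: u64,
    neg_mask: u64,
    /// Fires when (#positive inputs set) - (#negative inputs set) exceeds this.
    threshold: i32,
}

impl BitNeuron {
    /// Equivalent to the naive weighted-sum-and-threshold, with no
    /// floating-point math: a handful of ops for any N up to 64.
    fn activate(&self, inputs: u64) -> u64 {
        let sum = (inputs & self.pos_mask).count_ones() as i32
            - (inputs & self.neg_mask).count_ones() as i32;
        (sum > self.threshold) as u64 // a single bit out, as observed above
    }
}
```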

Sign Activation

  • A sign activation does not yield 1 bit of information.
  • It has 3 output states, and yields log_2(3) ≈ 1.585 bits of information; let's call it 2 bits unless we can find a perfectly optimal packing of the bits into a bitstring that still allows efficient manipulation.
  • The optimization could be performed on sign activations as well, although the problem size grows and, generally speaking, the results will not be as fast: you'll probably need more bitwise operations to produce the same result (see the sketch after this list).
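One plausible packing for those 2 bits (my illustration, nothing settled above) is a pair of bitplanes: one for "positive", one for "negative", with zero encoded as both bits clear. Manipulation stays cheap; for example, the elementwise product of two packed sign vectors costs four ANDs and two ORs per register. `SignBits` and `sign_mul` are hypothetical names:

```rust
/// Hypothetical 2-bit packing of sign outputs across 64 lanes:
/// bit i of `pos` set ⇔ lane i is +1, bit i of `neg` set ⇔ lane i is -1,
/// both clear ⇔ lane i is 0.
#[derive(Clone, Copy)]
struct SignBits {
    pos: u64,
    neg: u64,
}

/// Elementwise product of signs: (+)(+) = +, (-)(-) = +, (+)(-) = -, 0·x = 0.
fn sign_mul(a: SignBits, b: SignBits) -> SignBits {
    SignBits {
        pos: (a.pos & b.pos) | (a.neg & b.neg),
        neg: (a.pos & b.neg) | (a.neg & b.pos),
    }
}
```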

Application Heuristic

So, this may all be screwed up by some modern CPUs, but let's analyze it in a simplistic way, assuming equal cost for floating-point and bitwise operations (not realistic!):

  • Number of floating-point operations in the naive impl: 2N (N multiplications, N additions, plus a constant cost for branching / thresholding)

  • Bitwise operations to get the same result: K * N / R, where:

    • K: some minimal number of bitwise operations that produce the same effect for these particular weights
    • N: number of inputs
    • R: size of the biggest suitable register available on the hardware (8 bits, 64 bits, 512 bits, etc.)
    • All of the gain in this model comes from bit-level parallelism (you can do a 64-input neuron in K operations over a 64-bit register, for instance).
  • Setting K * N / R < 2N, i.e. K / R < 2, yields a boundary for using this technique (in terms of instruction count, again under-modeled). On my RP2040, the optimization might be fruitful so long as this inequality is respected (measurements would be better than this, but it's an okay mental model). At full register capacity (32 inputs), doing a few bitwise operations certainly seems better than doing 32 floating-point multiplications and additions, especially because floating-point hardware simply doesn't exist on a ton of embedded targets (see the sketch below).
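For reference, a sketch of the naive implementation the 2N count above refers to (function and parameter names are mine). On the RP2040, which has no FPU, every `f32` operation below compiles to a software routine of many instructions, so the real boundary is considerably more favorable to the bitwise approach than K / R < 2 suggests:

```rust
/// Naive weighted-sum-and-threshold: N multiplications + N additions,
/// then one comparison. On soft-float targets every f32 op is a libcall.
fn naive_activate(weights: &[f32], inputs: &[f32], bias: f32) -> f32 {
    let sum: f32 = weights
        .iter()
        .zip(inputs)
        .map(|(w, x)| w * x)
        .sum::<f32>()
        + bias;
    if sum > 0.0 { 1.0 } else { 0.0 }
}
```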

Perfect Performance

I'm thinking it may be fruitful to use a boolean expression simplifier: represent the entire network (except the input nodes) as one boolean expression, and have the whole thing simplified with global information. There are likely tautologies and recomputations inside the network, and this approach should eliminate all of them.
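A minimal sketch of what that global representation might look like (the `Expr` type is hypothetical; the simplifier itself, e.g. a Quine-McCluskey or Espresso-style minimizer, would rewrite values of this type):

```rust
/// Hypothetical boolean-expression IR for the whole network. Each hidden
/// and output neuron lowers to one Expr; shared subtrees are exactly the
/// recomputations a global simplifier should eliminate.
enum Expr {
    Input(usize), // an input node's bit
    Not(Box<Expr>),
    And(Box<Expr>, Box<Expr>),
    Or(Box<Expr>, Box<Expr>),
}
```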

Horrible Problems with Perfect Performance

There are several problems with this:

  1. Boolean simplification is exponential in time in the worst case (!)
  2. The result of the simplification will be a "simplified" boolean expression, but what I actually want is a sequence of bitwise operations that is efficient to perform on particular hardware. Fewer operations is good, and may yield a more efficient bitwise sequence, but the two are not the same: the instruction sequence must be synthesized after the fact, and I do not think LLVM will be able to help with this task. To complicate matters further, evolved networks tend to be irregular, lacking a clean set of layers. If you could evaluate two neurons simultaneously in different bit portions of the same register, that would improve efficiency further, but this is hard for me to design by hand, let alone in an automated way.
  3. Bit-level parallelism is always provided by the hardware for these operations, but I may or may not be able to capture this fact in the boolean expression (the fact that these K adjacent bits are bundled together).
  4. The arrangement of the bits inside the registers will have an impact on how simple the operations become. A bad arrangement may require additional masking and shifting, while a good arrangement would require fewer steps.
  5. No one ever said we absolutely need to yield a 1 bit when x > 0 and a 0 bit otherwise. Reversing these inside the network could make some optimizations more fruitful, so long as the network's external behavior stays the same (see the sketch after this list).
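A toy illustration of item 5 (my example, with hypothetical names): if a downstream neuron computes `!u & !v`, then flipping the polarity of `u` and `v` at their sources reduces it to a single AND:

```rust
// With the conventional polarity (bit set ⇔ x > 0), this neuron costs
// two NOTs and an AND per register:
fn downstream(u: u64, v: u64) -> u64 {
    !u & !v
}

// If u and v instead emit 1 when x <= 0 (reversed polarity), the same
// network behavior costs a single AND:
fn downstream_flipped(u_rev: u64, v_rev: u64) -> u64 {
    u_rev & v_rev
}
```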

I am using this issue to document my thoughts on the problem, and may or may not implement the feature.
