Add development guide
majianjia committed Jun 18, 2019
1 parent 099778f commit 776bc91
Showing 5 changed files with 168 additions and 44 deletions.
2 changes: 2 additions & 0 deletions README.md
@@ -51,6 +51,8 @@ There is no need to learn TensorFlow/Lite or other libs.

[5 min to NNoM Guide](docs/guide_5_min_to_nnom.md)

[Development Guide](docs/guide_development.md)

[The temporary guide](docs/A_Temporary_Guide_to_NNoM.md)

[Porting and optimising Guide](docs/Porting_and_Optimisation_Guide.md)
Binary file modified docs/figures/nnom_structures.png
207 changes: 163 additions & 44 deletions docs/guide_development.md
@@ -1,71 +1,87 @@
# Development Guide

> This guide gives detailed instructions for NNoM.
>
Currently, it is not yet a "guide". At least, it provides some further information.

## Introduction
-----

## Frequent Questions and Answers (Q&A)

### What is NNoM different from others?

NNoM is a higher level inference framework. The most obvious feature is the human understandable interface.
NNoM is a higher-level inference framework. The most obvious feature is the human understandable interface.

It is also a layer-based framework, instead of a operator-based. A layer might contains a few operators.
- It is also a layer-based framework, instead of operator-based. A layer might contain a few operators.

It natively supports complex model structure. High-efficency network always benefited from complex structure.
- It natively supports complex model structure. High-efficiency network always benefited from complex structure.

It provide layer-to-layer analysis to help developer optimize their models.
- It provides layer-to-layer analysis to help developer optimize their models.

### Develop ad-hoc model vs. use pre-trained models.
### Should I develop an ad-hoc model or use a pre-trained model?

The famous pre-trained models are more for the image processing side.
They are effecient on such mobile phones.
But they are still too buckly if the MCU didn't fit with at least 250K RAM and a hardware Neural Network Accelorator.
They are efficient on such mobile phones.
But they are still too bulky if the MCU doesn't provide at least 250K RAM and a hardware Neural Network Accelerator.

> *MobileNet V1 model with depth multi-plier (0.25x) ... STM32 F746 ... CMSIS-NN kernels to program the depthwise and pointwise convolutions ... approximately 0.75 frames/sec*
> MobileNet V1 model with depth multi-plier (0.25x) ... STM32 F746 ... CMSIS-NN kernels to program the depthwise and pointwise convolutions ... approximately 0.75 frames/sec
Source: [Visual Wake Words Dataset](https://arxiv.org/abs/1906.05721)

However, MCUs should not really do the image processing. The data they normally process are normally not visual but other time sequance measurement.
For example, the accelerometer data consist of 3 axis (channel) measurement per timestamp.
In most cases, MCUs should not really do image processing without a hardware accelerator. The data they normally process are a few channels of time-sequence measurements.
For example, accelerometer data consists of a 3-axis (channel) measurement per timestamp.

For such data, building an ad-hoc model for each application is the only option.

Building an ad-hoc model is sooo easy with NNoM since most of the codes are automative generated.

Building an ad-hoc model is sooo easy with NNoM since most of the codes are automatically generated.

### What can NNoM provide to embedded engineers?
It provide an **easy to use** and **easy to evaluate** inference tools for fast neural network development.
It provides **easy to use** and **easy to evaluate** inference tools for fast neural network development.

As embedded engineers, we might not know well how a neural network works or how to optimize it for the MCU.

NNoM together with Keras can help you to start practicing within half an hour. There is no need to learn other ML lib from scratch. Deployment can be done with one line of python code after you have trained an model using Keras.
NNoM together with Keras can help you to start practising within half an hour. There is no need to learn other ML libs from scratch. Deployment can be done with one line of python code after you have trained a model using Keras.

Other than building a model, NNoM also provides a set of evaluation methods. These evaluation methods will give the developer a layer-to-layer performance evaluation of the model.

Developers can then modify the ad-hoc model to increase efficency or to lower the memory cost.
(Please check the following Performance sections for detials.)
Developers can then modify the ad-hoc model to increase efficiency or to lower the memory cost.
(Please check the following Performance sections for detail.)

-----

## NNoM Structure

As mentioned in many other docs, NNoM uses a layer-based structure.
The most benefit is the model structure can be directly seemed from the codes.
The main benefit is that the model structure can be seen directly from the code.

It also makes model conversion from other layer-based libs (Keras, TensorLayer, Caffe) to an NNoM model very straightforward. When you use `generate_model(model, x_test, name='weights.h')` to generate an NNoM model, it simply reads the configuration out and rewrites it as C code.

Structure:
![](figures/nnom_structures.png)

It also makes the model convertion from other layer-based lib (Keras, TensorLayer, Caffe) to NNoM model very straight forward.
NNoM uses a compiler to manage the layer structure and other resources. After compiling, all layers inside the model are put into a shortcut list in running order. Besides that, arguments are filled in and memory is allocated to each layer (memory is reused between layers). Therefore, no memory allocation is performed at runtime, and performance is the same as running the backend functions directly.
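To illustrate what "putting layers into a list in running order" means, here is a minimal sketch in Python. It is not NNoM's compiler (which is written in C and may work differently); it only shows how a branching layer graph can be flattened into a run order with a standard topological sort. The layer names and the `run_order()` helper are hypothetical.

```python
from collections import deque

def run_order(graph):
    """Return layer names in an order where every layer runs after its inputs.

    `graph` maps each layer name to the list of layers that consume its output.
    This is Kahn's topological sort, used here purely as an illustration.
    """
    indegree = {name: 0 for name in graph}
    for consumers in graph.values():
        for consumer in consumers:
            indegree[consumer] += 1

    ready = deque(name for name, degree in indegree.items() if degree == 0)
    order = []
    while ready:
        name = ready.popleft()
        order.append(name)
        for consumer in graph[name]:
            indegree[consumer] -= 1
            if indegree[consumer] == 0:
                ready.append(consumer)

    if len(order) != len(graph):
        raise ValueError("layer graph contains a cycle")
    return order

# Hypothetical model with two parallel branches that are concatenated.
example_graph = {
    "input": ["conv_a", "conv_b"],
    "conv_a": ["concat"],
    "conv_b": ["concat"],
    "concat": ["output"],
    "output": [],
}
example_order = run_order(example_graph)
```

Because the order is fixed once at compile time, the runtime only needs to walk the list from start to end.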

NNoM uses a compiler to manage the layer structure and other resources. After compiling, all layers inside the model will be put into a shortcut list per the running order. Beside that, arguments will be filled in and the memory will be allocated to each layer (Memory are reused in between layers). Therefore, no memory allocation performed in the runtime, performance is the same as running backend function directly.
NNoM focuses on managing the higher-level structure, context arguments and memory. The actual arithmetic is done by the backend functions.

The NNoM is more on manage the higher-level structure, context argument and memory. The actual arithmetics are done by the backend functions.
Currently, NNoM supports a pure C backend and CMSIS-NN backend.
The CMSIS-NN is a highly optimized low-level NN core for ARM-Cortex-M microcontroller.
Please check the [optimization guide](Porting_and_Optimisation_Guide.md) for utilisation.

Currently, NNoM support a pure C backend and CMSIS-NN backend.
The CMSIS-NN is an highly optimized low-level NN core for ARM-Cortex-M microcontroller.
Please check the [optimize guide](Porting_and_Optimisation_Guide.md) for utilisation.

## Optimation
-----
## Quantisation

The CMSIS-NN can provide upto 5 times performance comparing to the pure C backend on Cortex-M MCUs. It maximises the performance by using SIMD and other instructions(__SSAT, ...).
NNoM currently supports only 8-bit weights and 8-bit activations. The model is quantised during model conversion by `generate_model(model, x_test, name='weights.h')`.

These optimizations come with different constrains. This is why CMSIS-NN provides many variances to one operators (such as 1x1 convolution, RGB convolution, none-square/square, they are all convolution only with different routines).
The input data (activations) need to be quantised and then fed to the model.
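As a rough illustration of quantising input data to 8 bits, the sketch below converts floating-point samples to signed 8-bit integers using a power-of-two scale. The power-of-two (fixed-point) scaling, the helper names and the sample values are assumptions made for this example; the actual scale for your model should be taken from the generated weights file.

```python
import math

def fractional_bits(values):
    """Pick how many of the 7 magnitude bits hold the fraction (assumed scheme)."""
    max_abs = max(abs(v) for v in values)
    int_bits = math.ceil(math.log2(max_abs)) if max_abs > 0 else 0
    return 7 - int_bits

def quantise_q7(values, dec_bits):
    """Scale by 2**dec_bits, round, and saturate to the int8 range [-128, 127]."""
    scale = 2.0 ** dec_bits
    quantised = []
    for v in values:
        q = int(round(v * scale))
        quantised.append(max(-128, min(127, q)))
    return quantised

# Hypothetical accelerometer samples in g.
samples = [3.2, -1.0, 0.25]
bits = fractional_bits(samples)       # 5 fractional bits cover the range up to 4.0
q_samples = quantise_q7(samples, bits)
```

Values that exceed the representable range are clipped rather than wrapped, which keeps a single outlier from flipping sign.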

NNoM will automaticly select the best operator for the layer when it is available. Sometime, it is not possible to use CMSIS-NN because the condition is not met. CMSIS-NN provides a subset operator to the local pure C backend. When it is not possible to use CMSIS-NN, NNoM will run the layer using the C backend end instead. It is vary from layer to layer whether use CMSIS-NN or C backend.
-----

## Optimization

CMSIS-NN can provide up to 5 times the performance of the pure C backend on Cortex-M MCUs. It maximises the performance by using SIMD and other instructions (`__SSAT`, ...).

These optimizations come with different constraints. This is why CMSIS-NN provides many variants of one operator (such as 1x1 convolution, RGB convolution, non-square/square; they are all convolutions, only with different routines).

NNoM automatically selects the best operator for a layer when one is available. Sometimes it is not possible to use CMSIS-NN because the conditions are not met. CMSIS-NN provides only a subset of the operators in the local pure C backend. When it is not possible to use CMSIS-NN, NNoM runs the layer using the C backend instead. Whether CMSIS-NN or the C backend is used varies from layer to layer.

Example conditions for convolutions are listed below:

@@ -80,41 +96,117 @@ Some of them can be further optimized by square shape, however, the optimization

> Trick: if you keep the channel size a multiple of 4, it should work in most cases.
If you are not sure whether the optimization is working, simply us the `model_stat()` in [Evaluation API](api_evaluation.md) to print the performance of each layer. Comparison will be shown in the following sections.
If you are not sure whether the optimization is working, simply use `model_stat()` in [Evaluation API](api_evaluation.md) to print the performance of each layer. The comparison will be shown in the following sections.

Fully connected layers and pooling layers are less constrained.

-----

## Performance

Performances are vary.
Performances vary from chip to chip.
Efficiencies are more constant.

We can use *Multiply–accumulate operation (MAC) per Hz (MACops/Hz)* to evaluate the efficency.
We can use *Multiply–accumulate operation (MAC) per Hz (MACops/Hz)* to evaluate the efficiency.
It simply means how many MAC can be done in one cycle.

Currently, NNoM only counts MAC operations on Convolution layers and Dense layers, since the operation counts of other layers (pooling, padding) are much lower.

Running an model on CMSIS-NN and NNoM will have the same performance, when a model is fully compliant with CMSIS-NN and running on Cortex-M4/7/33/35P. ("compliant" means it meets the optimization condition in above discussion).
Running a model on CMSIS-NN and NNoM will have the same performance when a model is fully compliant with CMSIS-NN and running on Cortex-M4/7/33/35P. ("compliant" means it meets the optimization condition in the above discussion).

For example, in [CMSIS-NN paper](https://arxiv.org/pdf/1801.06601), the authors used an STM32F746@216MHz to run a model with 24.7M(MACops) tooks 99.1ms in total.
For example, in the [CMSIS-NN paper](https://arxiv.org/pdf/1801.06601), the authors used an STM32F746@216MHz to run a model with `24.7M` ops, which took `99.1ms` in total.

The runtime of each layer were recorded. What hasn't been shown in the paper is this table. (refer to Table 1 in the paper)
The runtime of each layer was recorded (refer to Table 1 in the paper). What the paper does not show is the efficiency column in this table.

| |Layer|Input ch|output ch|Ops|Runtime|Efficiency|
| |Layer|Input ch|output ch|Ops|Runtime|Efficiency (MACops/Hz)|
|-------|-------|-------|-----|-------|----|-------|
|Layer 1| Convolution|3|32|4.9M|31.4ms| |
|Layer 3| Convolution|32|32|13.1M|42.8ms||
|Layer 5| Convolution|32|64|6.6M|22.6ms||
|Layer 7| Fully-connected|1024|10|20k|0.1ms||
|Total| |||24.7M|99.1||
|Layer 1|Conv|3|32|4.9M|31.4ms| 0.36|
|Layer 3|Conv|32|32|13.1M|42.8ms|0.71|
|Layer 5|Conv|32|64|6.6M|22.6ms|0.68|
|Layer 7|Dense|1024|10|20k|0.1ms|0.93|
|Total| |||24.7M|99.1ms|0.58|

> *ops = 2 x MACops,*
> *total is less due to other layers such as activation and pooling, please check the paper for full table*
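The efficiency column can be reproduced from the ops and runtime columns. The sketch below assumes the convention stated in the note (`ops = 2 x MACops`) and the 216 MHz clock; the function name is made up for this example. Under that convention the convolution rows and the total match the table, while the Dense row comes out near `0.46` rather than `0.93`, which suggests that row may have counted the 20k ops as MACs. Its `0.1ms` runtime is also heavily rounded, so treat that row with caution.

```python
CLOCK_HZ = 216e6  # STM32F746 running at 216 MHz

def mac_per_hz(ops, runtime_s, clock_hz=CLOCK_HZ):
    """MAC operations completed per clock cycle, assuming ops = 2 x MACops."""
    macs = ops / 2
    cycles = runtime_s * clock_hz
    return macs / cycles

# (name, ops, runtime in seconds) taken from the table above.
table_rows = [
    ("Layer 1", 4.9e6, 31.4e-3),
    ("Layer 3", 13.1e6, 42.8e-3),
    ("Layer 5", 6.6e6, 22.6e-3),
    ("Layer 7", 20e3, 0.1e-3),
    ("Total", 24.7e6, 99.1e-3),
]

efficiency = {name: mac_per_hz(ops, t) for name, ops, t in table_rows}
for name, value in efficiency.items():
    print(f"{name}: {value:.2f} MACops/Hz")
```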

In the table, layers 3 and 5 are both Convolution layers with input and output channel sizes equal to a multiple of 4. Layer 1 has input channel = 3.

You can already see the efficiency difference. When input channel = 3, the convolution is performed by `arm_convolve_HWC_q7_RGB()`. This method is only partially optimized since the input channel is not a multiple of 4, while layer 3 and layer 5 are fully optimized. The efficiency difference is already huge (`0.36` vs `0.71/0.68`).

To achieve high efficiency, keep the input channel size a multiple of 4 and the output channel size a multiple of 2.
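The rule of thumb above is easy to check while sketching a model. The helper below is hypothetical and only encodes the two channel conditions stated in the text; the real CMSIS-NN kernels have further conditions (kernel shape, padding and so on) that it does not cover.

```python
def meets_channel_rule(in_channels, out_channels):
    """True when input channels are a multiple of 4 and output channels a multiple of 2."""
    return in_channels % 4 == 0 and out_channels % 2 == 0

# Layers from the table above: (input channels, output channels).
print(meets_channel_rule(3, 32))   # Layer 1: RGB input, not fully optimized
print(meets_channel_rule(32, 32))  # Layer 3
print(meets_channel_rule(32, 64))  # Layer 5
```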

What does this number mean? You can estimate your runtime while designing your ad-hoc model.

In typical applications:

[Use motion sensor to recognise human activity](https://github.com/majianjia/nnom/tree/master/examples/uci-inception). A model takes `9` channels time sequence data, `0.67M MACops`, STM32F746 will take around `0.67M/0.58/216MHz = 5.3ms` to do one inference.

[Use microphone to spot key-word commands](https://github.com/majianjia/nnom/tree/master/examples/keyword_spotting). A model takes `63 x 12 x 1` MFCC data, `2.09M MACops`, STM32F746 will take around `2.09M/0.58/216MHz = 16.7ms` to do one inference.
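The two estimates above follow the same formula, `MACops / efficiency / clock`. A small sketch, assuming the overall `0.58` MACops/Hz figure carries over to your model (it will vary with how well the layers meet the optimization conditions); the function name is made up for this example:

```python
def estimate_runtime_ms(macs, macs_per_hz, clock_hz):
    """Estimated time for one inference in milliseconds."""
    cycles = macs / macs_per_hz
    return cycles / clock_hz * 1e3

activity_ms = estimate_runtime_ms(0.67e6, 0.58, 216e6)  # human activity model
keyword_ms = estimate_runtime_ms(2.09e6, 0.58, 216e6)   # key-word spotting model
print(f"{activity_ms:.1f} ms, {keyword_ms:.1f} ms")
```

Changing `clock_hz` gives a quick first guess for a slower or faster part, as long as the efficiency stays comparable.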

> *Note: MACops/Hz in NNoM is lower than for CMSIS-NN in the paper because NNoM treats the operator and its following activation as one single layer. For example, the running time of a convolution layer can be the time cost of `operator(Conv) + activation(ReLU)`.*
-----

## Evaluations

Evaluation is as important as building the model.

In NNoM, we provide a few different methods to evaluate the model. The details are listed in [Evaluation Methods](api_nnom_utils.md).
If your system supports printing through a console (such as a serial port), the evaluation can be printed on the console.

Firstly, the model structure is printed during compiling in `model_compile()`, which is normally called in `nnom_model_create()`.

Secondly, the runtime performance is printed by `model_stat()`.

Thirdly, there is a set of `prediction_*()` APIs to validate a set of testing data and print out Top-K accuracy, confusion matrix.
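For readers unfamiliar with these two metrics, the sketch below shows how Top-K accuracy and a confusion matrix are computed from model outputs. It is a plain Python illustration with made-up scores, not the implementation behind the `prediction_*()` APIs.

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    hits = 0
    for row, label in zip(scores, labels):
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)

def confusion_matrix(scores, labels, num_classes):
    """matrix[true_label][predicted_label] counts, using the top-1 prediction."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for row, label in zip(scores, labels):
        predicted = max(range(len(row)), key=lambda i: row[i])
        matrix[label][predicted] += 1
    return matrix

# Hypothetical outputs for three test samples and three classes.
scores = [[10, 50, 40], [70, 20, 10], [30, 40, 30]]
labels = [2, 0, 0]

top1 = top_k_accuracy(scores, labels, k=1)
top2 = top_k_accuracy(scores, labels, k=2)
matrix = confusion_matrix(scores, labels, num_classes=3)
```

Rows of the matrix are true classes and columns are predictions, so off-diagonal entries show which classes the model confuses.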

### An NNoM model

This is what a typical model looks like in `weights.h` or `model.h` or whatever you name it. This code is generated by the script.
In the user's `main()`, calling `nnom_model_create()` will create and compile the model.

~~~
...
/* nnom model */
static int8_t nnom_input_data[784];
static int8_t nnom_output_data[10];
static nnom_model_t* nnom_model_create(void)
{
    static nnom_model_t model;
    nnom_layer_t* layer[20];

    new_model(&model);

    layer[0] = Input(shape(28, 28, 1), nnom_input_data);
    layer[1] = model.hook(Conv2D(12, kernel(3, 3), stride(1, 1), PADDING_SAME, &conv2d_1_w, &conv2d_1_b), layer[0]);
    layer[2] = model.active(act_relu(), layer[1]);
    layer[3] = model.hook(MaxPool(kernel(2, 2), stride(2, 2), PADDING_SAME), layer[2]);
    layer[4] = model.hook(Cropping(border(1,2,3,4)), layer[3]);
    layer[5] = model.hook(Conv2D(24, kernel(3, 3), stride(1, 1), PADDING_SAME, &conv2d_2_w, &conv2d_2_b), layer[4]);
    layer[6] = model.active(act_relu(), layer[5]);
    layer[7] = model.hook(MaxPool(kernel(4, 4), stride(4, 4), PADDING_SAME), layer[6]);
    layer[8] = model.hook(ZeroPadding(border(1,2,3,4)), layer[7]);
    layer[9] = model.hook(Conv2D(24, kernel(3, 3), stride(1, 1), PADDING_SAME, &conv2d_3_w, &conv2d_3_b), layer[8]);
    layer[10] = model.active(act_relu(), layer[9]);
    layer[11] = model.hook(UpSample(kernel(2, 2)), layer[10]);
    layer[12] = model.hook(Conv2D(48, kernel(3, 3), stride(1, 1), PADDING_SAME, &conv2d_4_w, &conv2d_4_b), layer[11]);
    layer[13] = model.active(act_relu(), layer[12]);
    layer[14] = model.hook(MaxPool(kernel(2, 2), stride(2, 2), PADDING_SAME), layer[13]);
    layer[15] = model.hook(Dense(64, &dense_1_w, &dense_1_b), layer[14]);
    layer[16] = model.active(act_relu(), layer[15]);
    layer[17] = model.hook(Dense(10, &dense_2_w, &dense_2_b), layer[16]);
    layer[18] = model.hook(Softmax(), layer[17]);
    layer[19] = model.hook(Output(shape(10,1,1), nnom_output_data), layer[18]);

    model_compile(&model, layer[0], layer[19]);
    return &model;
}
~~~

### Model info, memory

This is an example printed by `model_compile()`, which is normally called by `nnom_model_create()`.

~~~
Start compiling model...
Layer(#) Activation output shape ops(MAC) mem(in, out, buf) mem blk lifetime
@@ -142,6 +234,16 @@ Compling done in 179 ms
~~~

It shows the run order, Layer names, activations, the output shape of the layer, the operation counts, the buffer size, and the memory block assignments.

Later, it prints the maximum memory cost of each memory block. Since memory blocks are shared between layers, the model uses only 3 memory blocks, which together give a total memory cost of `18144 Bytes`.

### Runtime statistics

This is an example printed by `model_stat()`.

> This method requires a microsecond timestamp porting, check [porting guide](Porting_and_Optimisation_Guide.md)
~~~
Print running stat..
Layer(#) - Time(us) ops(MACs) ops/us
@@ -170,10 +272,27 @@ NNOM: Total Mem: 20236
~~~

Calling this method will print out the time cost for each layer, and the efficiency in (MACops/us) of this layer.

This is very important when designing your ad-hoc model.

For example, layer #2 reaches only `14.47 MACops/us`, while #5, #8 and #10 are around `60 MACops/us`. This is because the input channel of layer #2 is 1, which cannot fulfil the optimisation conditions of CMSIS-NN. One simple optimization strategy is to reduce the complexity of layer #2 by reducing its output channel size.
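If you copy the per-layer numbers off the console, a few lines of Python can point out the layers worth redesigning. The sketch below is not part of NNoM; the layer names, times and MAC counts are hypothetical values chosen to resemble the figures quoted above, and the 50% threshold is an arbitrary choice.

```python
def macs_per_us(macs, time_us):
    """Layer efficiency in MAC operations per microsecond."""
    return macs / time_us

def slow_layers(stats, threshold=0.5):
    """Names of layers running below `threshold` times the best layer's efficiency."""
    rates = {name: macs_per_us(macs, t) for name, (macs, t) in stats.items()}
    best = max(rates.values())
    return [name for name, rate in rates.items() if rate < threshold * best]

# Hypothetical readings: layer name -> (MACs, time in microseconds).
layer_stats = {
    "#2 Conv2D": (84672, 5850),
    "#5 Conv2D": (600000, 10000),
    "#8 Conv2D": (300000, 5000),
}
flagged = slow_layers(layer_stats)
```

A flagged layer is a candidate for fewer output channels or a shape that meets the CMSIS-NN conditions.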

-----

## Others

### Memory management in NNoM

###
As mentioned, NNoM allocates memory to each layer during the compiling phase.
A memory block is the minimum unit a layer can apply for.
For example, a convolution layer normally applies for one block for input data, one block for output data and one block for the intermediate data buffer.

~~~
Layer(#) Activation output shape ops(MAC) mem(in, out, buf) mem blk lifetime
-------------------------------------------------------------------------------------------------
#2 Conv2D - ReLU - ( 28, 28, 12) 84k ( 784, 9408, 36) 1 1 1 - - - - -
~~~

The example shows an input buffer size of `784`, an output buffer size of `9408` and an intermediate buffer size of `36`. The following `mem blk lifetime` column shows how long each memory block lasts. All three blocks last only one step; they are freed after the layer. In NNoM, the output memory is passed directly to the next layer(s) as the input buffer, so there is no memory copy or memory allocation between layers.
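The effect of sharing blocks can be illustrated with a small simulation. This is a simplified sketch for a purely sequential model, not NNoM's actual allocator: each layer takes a free block for its output and, if needed, one for its buffer, then releases its input and buffer blocks. Each block ends up as large as the biggest request it ever served. Only the first layer's sizes come from the example above; the other two layers are hypothetical.

```python
def plan_blocks(input_size, layers):
    """Return the final size of each shared memory block.

    `layers` is a list of (output_size, buffer_size) tuples in running order.
    """
    sizes = []  # largest request each block has served so far
    busy = []   # whether each block is currently in use

    def acquire(size):
        # Reuse the first free block, growing it if needed; otherwise add one.
        for index, used in enumerate(busy):
            if not used:
                busy[index] = True
                sizes[index] = max(sizes[index], size)
                return index
        busy.append(True)
        sizes.append(size)
        return len(sizes) - 1

    current_input = acquire(input_size)
    for output_size, buffer_size in layers:
        output_block = acquire(output_size)
        buffer_block = acquire(buffer_size) if buffer_size > 0 else None
        # The layer has run: its input and buffer blocks are free again.
        busy[current_input] = False
        if buffer_block is not None:
            busy[buffer_block] = False
        # The output becomes the next layer's input without any copy.
        current_input = output_block
    return sizes

block_sizes = plan_blocks(784, [(9408, 36), (2352, 0), (4704, 216)])
total_bytes = sum(block_sizes)
```

In this example three blocks serve all layers, so the total is the sum of three block sizes rather than the sum of every layer's buffers.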

2 changes: 2 additions & 0 deletions docs/index.md
@@ -55,6 +55,8 @@ API manuals are available within this site.

[5 min to NNoM Guide](guide_5_min_to_nnom.md)

[Development Guide](guide_development.md)

[The temporary guide](A_Temporary_Guide_to_NNoM.md)

[Porting and optimising Guide](Porting_and_Optimisation_Guide.md)
1 change: 1 addition & 0 deletions mkdocs.yml
@@ -11,6 +11,7 @@ nav:
- Overview: 'index.md'
- Guides:
- 5 min to NNoM: 'guide_5_min_to_nnom.md'
- Development Guide: 'guide_development.md'
- APIs:
- Utils (Python): 'api_nnom_utils.md'
- Model: 'api_model.md'