This repo contains the assignments from Cornell Tech's ECE 5545 (Machine Learning Hardware and Systems), offered in Spring 2023.
The assignment comprised several tasks. The first was to research the peak FLOPs/s and memory bandwidth of at least 10 chips from diverse platforms (CPUs, GPUs, ASICs and SoCs) and draw their roofline plots. The next task involved calculating the FLOPs, memory footprint and operational intensity of 10 CNN architectures and building a roofline plot for each on both GPU and CPU in Google Colab. Finally, each DNN was benchmarked by plotting its inference latency against FLOPs and against parameter count for various batch sizes, again on both the Colab GPU and CPU.
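As a rough illustration of the roofline model used in these plots, the sketch below computes operational intensity and the attainable performance bound. The function names and the chip figures (10 TFLOP/s, 500 GB/s) are made up for the example and are not taken from the assignment's data.

```python
def operational_intensity(flops, bytes_moved):
    """FLOPs performed per byte of memory traffic."""
    return flops / bytes_moved


def attainable_flops(peak_flops, mem_bandwidth, intensity):
    """Roofline bound: limited by compute or by memory bandwidth, whichever is lower."""
    return min(peak_flops, mem_bandwidth * intensity)


def ridge_point(peak_flops, mem_bandwidth):
    """Operational intensity at which a kernel stops being memory-bound."""
    return peak_flops / mem_bandwidth


# Hypothetical chip, for illustration only.
PEAK = 10e12       # 10 TFLOP/s
BANDWIDTH = 500e9  # 500 GB/s

for oi in (1.0, 20.0, 100.0):
    bound = attainable_flops(PEAK, BANDWIDTH, oi)
    regime = "memory-bound" if oi < ridge_point(PEAK, BANDWIDTH) else "compute-bound"
    print(f"OI={oi:6.1f} FLOPs/byte -> {bound / 1e12:.2f} TFLOP/s ({regime})")
```

A model whose operational intensity falls to the left of the ridge point sits on the sloped part of the roofline, so reducing memory traffic helps it more than adding compute.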
This assignment focuses on model compression and deployment of an audio DNN to an Arduino Nano 33 BLE for word classification, using pruning and quantization. The initial tasks explored audio preprocessing, estimating the model's FLOPs and memory footprint, and training the model on 2-3 keywords. The model was then deployed to the Arduino Nano 33 BLE microcontroller, and the preprocessing, model inference and post-processing times were reported. The rest of the assignment covers two model compression techniques: quantization and pruning. The implementation of quantization can be found in a2/src/quant.py and a2/src/quant_conversion.py (where the model weights, biases and activations were quantized to lower bit precisions), while the implementation of pruning can be found in the notebook a2/src/6_pruning.ipynb. Quantization Aware Training (QAT) and Post-Training Quantization were implemented for 2, 4, 6 and 8 bit precisions, and their results are analysed in the report. Both structured and unstructured pruning were applied to the model, and experiments on model accuracy were conducted with and without finetuning. Finally, the pruned models were deployed on the Arduino Nano MCU, and the accuracy and runtime were measured for different pruning thresholds.
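To make the two compression ideas concrete, here is a minimal sketch of symmetric uniform quantization and threshold-based unstructured pruning on a flat list of weights. It is not the code in a2/src/quant.py or the pruning notebook; the helper names, the symmetric scheme and the sample weights are assumptions chosen for illustration.

```python
def quantize(values, num_bits):
    """Symmetric uniform quantization to signed integers (assumed scheme)."""
    qmax = 2 ** (num_bits - 1) - 1          # e.g. 127 for 8 bits, 1 for 2 bits
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Map integer codes back to approximate real values."""
    return [x * scale for x in q]


def prune_by_threshold(values, threshold):
    """Unstructured magnitude pruning: zero every weight below the threshold."""
    return [v if abs(v) >= threshold else 0.0 for v in values]


def max_abs_error(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


weights = [-1.0, -0.5, 0.0, 0.25, 1.0]  # hypothetical weights
for bits in (2, 4, 6, 8):
    q, scale = quantize(weights, bits)
    err = max_abs_error(weights, dequantize(q, scale))
    print(f"{bits}-bit: codes={q}, max error={err:.4f}")

print(prune_by_threshold(weights, 0.3))
```

The reconstruction error shrinks as the bit width grows, which is the trade-off the 2, 4, 6 and 8 bit experiments explore; structured pruning would remove whole channels or filters rather than individual weights.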
This assignment focuses on optimising DNN primitives such as 1D convolutions, 2D convolutions, matrix multiplications and depthwise-separable convolutions for CPU and GPU using the TVM compiler. Techniques like tiling, blocking and threading were utilised for optimising the computations on GPU, whereas parallelism, vectorization, loop unrolling and shared memory usage were utilised for optimising computations on the CPU. The submission achieved rank 2 on the class leaderboard with an average runtime of 0.514 ms, 0.003 ms behind rank 1. The optimizations are briefly described in the report at a3/a3_submission.pdf, the code can be found in a3/src/ops.py, and the experiments are in the notebooks provided in the a3 directory.
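The loop tiling mentioned above can be sketched without TVM. The example below is plain Python, not a TVM schedule and not the code in a3/src/ops.py; the function names and tile size are illustrative. It only shows how splitting each loop into an outer loop over tiles and an inner loop within a tile changes the iteration order while leaving the result unchanged.

```python
def matmul_naive(a, b):
    """Reference triple loop: c[i][j] = sum over p of a[i][p] * b[p][j]."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(a[i][p] * b[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]


def matmul_tiled(a, b, tile=2):
    """Same computation, visiting the iteration space one tile at a time."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):            # outer loops walk over tiles
        for j0 in range(0, m, tile):
            for k0 in range(0, k, tile):
                for i in range(i0, min(i0 + tile, n)):   # inner loops stay in a tile
                    for j in range(j0, min(j0 + tile, m)):
                        acc = c[i][j]
                        for p in range(k0, min(k0 + tile, k)):
                            acc += a[i][p] * b[p][j]
                        c[i][j] = acc
    return c


a = [[1, 2, 3], [4, 5, 6]]
b = [[7, 8], [9, 10], [11, 12]]
print(matmul_tiled(a, b))
```

On real hardware the benefit comes from each tile's working set fitting in cache or on-chip memory; in a compiler such as TVM the same restructuring is expressed as a schedule transformation rather than hand-written loops.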
This assignment focuses on implementing various approximations of DNN primitives, namely convolutions and matrix multiplications, in order to speed up these operations. For convolutions, the implemented methods are im2col, Winograd convolution and the Fast Fourier Transform. For matrix multiplications, SVD (for low-rank approximations of matrices) and LogMatMul (log matrix multiplication, which takes logarithms and adds instead of multiplying) were utilised. Experiments were conducted to measure the reconstruction error of the various approximations at different floating-point precisions, and to measure the speedup of low-rank approximation using SVD. Finally, experiments were conducted on the MNIST dataset to check the effect of the compression ratio on model accuracy when SVD low-rank approximations are applied to the final two layers of an MLP.
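The idea behind log matrix multiplication can be sketched as follows. This is one possible reading of "take the log and add instead of multiplying", written in plain Python with hypothetical helper names; the assignment's own LogMatMul may differ, for example in how logarithms are stored or rounded. Signs are tracked separately because the logarithm is only defined for positive magnitudes.

```python
import math


def log_mul(x, y):
    """Multiply two scalars by adding base-2 logs of their magnitudes."""
    if x == 0 or y == 0:
        return 0.0                              # log2(0) is undefined
    sign = 1.0 if (x > 0) == (y > 0) else -1.0  # sign handled outside the log domain
    return sign * 2.0 ** (math.log2(abs(x)) + math.log2(abs(y)))


def log_matmul(a, b):
    """Matrix product in which every scalar multiply goes through log_mul."""
    n, k, m = len(a), len(b), len(b[0])
    return [[sum(log_mul(a[i][p], b[p][j]) for p in range(k)) for j in range(m)]
            for i in range(n)]


a = [[1.5, -2.0], [0.0, 4.0]]
b = [[2.0, 0.5], [-1.0, 3.0]]
print(log_matmul(a, b))
```

With full-precision logarithms the result matches an ordinary product up to rounding. The reconstruction error measured in the experiments becomes visible once the logarithms are held at reduced precision.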
One of the greatest challenges in distributed training of deep neural networks is the communication bottleneck caused by the frequent model updates transmitted across compute nodes. To alleviate this bottleneck, a variety of gradient compression techniques and algorithms have been proposed over the past few years, which aim to reduce communication volume while minimising the loss in accuracy caused by lossy compression. This term paper aims to provide a comprehensive survey of the gradient compression techniques that enhance the performance and efficiency of distributed deep learning training.
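As one concrete example of lossy gradient compression, the sketch below shows top-k sparsification with error feedback, a widely used approach in which each worker sends only its largest-magnitude gradient entries and carries the rest forward locally. It is offered as an illustration of the general idea and is not drawn from the term paper; the function names and sample values are hypothetical.

```python
def topk_sparsify(grad, k):
    """Keep the k largest-magnitude entries as (index, value) pairs."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    kept = sorted(order[:k])
    return [(i, grad[i]) for i in kept]


def compress_with_error_feedback(grad, residual, k):
    """Add the unsent residual from earlier steps, then send only the top k."""
    corrected = [g + r for g, r in zip(grad, residual)]
    sparse = topk_sparsify(corrected, k)
    sent = {i for i, _ in sparse}
    # Entries that were not transmitted are remembered for the next step.
    new_residual = [0.0 if i in sent else v for i, v in enumerate(corrected)]
    return sparse, new_residual


grad = [0.1, -2.0, 0.3, 1.5]   # hypothetical local gradient
residual = [0.0] * len(grad)
sparse, residual = compress_with_error_feedback(grad, residual, k=2)
print(sparse)     # entries that would be transmitted
print(residual)   # entries held back locally
```

Only k index-value pairs cross the network instead of the full gradient, and the residual keeps the dropped information from being lost permanently.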