# High Performance Computing 4th Lecture

11/Oct/2016 Yuya Kobayashi

## **Reviewed Paper**

 Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, Pritish Narayanan.
 Deep Learning with Limited Numerical Precision.
 Proceedings of the 32nd International
 Conference on Machine Learning (ICML-15)

# Background

- Natural error resiliency of neural networks (NN) [Bottou & Bousquet, 2007].
  - In the presence of statistical approximation and estimation errors, high-precision computing is not necessary for DNNs.
- Large-scale systems specialized for DNNs do not exploit this natural error resiliency, except through asynchronous SGD.
- This paper evaluates NN training with 16-bit fixed-point numbers and presents a prototype hardware accelerator.
  - Fixed-point compute units are faster and consume fewer resources and less power.
  - The data takes up less memory.

# Idea of system



Limited-precision arithmetic uses a fixed-point number type written $\langle \mathtt{IL}, \mathtt{FL} \rangle$.

This notation specifies how many bits are assigned to the integer part (IL) and to the fractional part (FL) of a fixed-point number.
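As a rough illustration of what a given format can hold, the sketch below lists the word length, resolution and value range of a signed format. The helper name `fixed_point_info` is hypothetical, and the bounds follow the saturation limits used in the Convert definition of this lecture.

```python
def fixed_point_info(il: int, fl: int) -> dict:
    """Summarize a signed <IL, FL> fixed-point format (illustrative helper)."""
    eps = 2.0 ** -fl                     # smallest representable step
    return {
        "word_length": il + fl,          # total number of bits
        "epsilon": eps,
        "min": -(2.0 ** (il - 1)),       # lower saturation bound
        "max": 2.0 ** (il - 1) - eps,    # upper saturation bound
    }


for il, fl in [(2, 14), (8, 8)]:
    print((il, fl), fixed_point_info(il, fl))
```

Both formats use 16 bits; moving bits from FL to IL trades resolution for range.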

# **Rounding Mode**



$Round\left(x,\left< \mathtt{IL}, \mathtt{FL} \right> \right)$ is computed with one of two rounding modes:

- Round-to-nearest (RtN)
- Stochastic rounding
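A minimal NumPy sketch of the two modes, assuming the usual definitions with $\epsilon = 2^{-\mathtt{FL}}$: RtN picks the closest multiple of $\epsilon$, while stochastic rounding rounds up with probability proportional to the distance from the lower multiple. The function names and the tie-breaking rule (ties round up) are assumptions made for illustration, and saturation is not handled here.

```python
import numpy as np


def round_nearest(x, fl):
    """Round x to the nearest multiple of eps = 2**-fl (ties round up here)."""
    eps = 2.0 ** -fl
    return np.floor(np.asarray(x, dtype=np.float64) / eps + 0.5) * eps


def round_stochastic(x, fl, rng):
    """Round x down or up to a multiple of eps; P(up) = (x - floor(x)) / eps."""
    eps = 2.0 ** -fl
    scaled = np.asarray(x, dtype=np.float64) / eps
    low = np.floor(scaled)
    go_up = rng.random(scaled.shape) < (scaled - low)
    return (low + go_up) * eps


rng = np.random.default_rng(0)
x = np.full(100_000, 0.3)
print("RtN mean       :", round_nearest(x, 0).mean())
print("stochastic mean:", round_stochastic(x, 0, rng).mean())
```

With FL = 0, RtN always maps 0.3 to 0, whereas the stochastic results average out close to 0.3, which is the unbiasedness property that matters for small gradient updates.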






If a calculated result is outside the range of $\langle \mathtt{IL}, \mathtt{FL} \rangle$, it is saturated to the upper or lower bound of $\langle \mathtt{IL}, \mathtt{FL} \rangle$.

$$Convert (x, \langle \mathtt{IL}, \mathtt{FL} \rangle) = \begin{cases} -2^{\mathtt{IL}-1} & \text{if } x \leq -2^{\mathtt{IL}-1} \\ 2^{\mathtt{IL}-1} - 2^{-\mathtt{FL}} & \text{if } x \geq 2^{\mathtt{IL}-1} - 2^{-\mathtt{FL}} \\ Round(x, \langle \mathtt{IL}, \mathtt{FL} \rangle) & \text{otherwise} \end{cases} \tag{1}$$
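Equation (1) can be sketched in a few lines of NumPy. This version assumes round-to-nearest for the Round step (ties rounding up is an arbitrary choice), and the function name `convert` is illustrative. Clipping after rounding gives the same result as the three cases above because both bounds are themselves multiples of $2^{-\mathtt{FL}}$.

```python
import numpy as np


def convert(x, il, fl):
    """Saturating conversion to <IL, FL>, using round-to-nearest as Round()."""
    eps = 2.0 ** -fl
    lower = -(2.0 ** (il - 1))
    upper = 2.0 ** (il - 1) - eps
    rounded = np.floor(np.asarray(x, dtype=np.float64) / eps + 0.5) * eps
    return np.clip(rounded, lower, upper)


print(convert([-200.0, 1.5, 200.0], il=8, fl=8))
```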

## Multiply and accumulate (MACC) operation

Calculating $c_0 = \mathbf{a} \cdot \mathbf{b}$ in 2 steps.

- $\mathbf{a}, \mathbf{b}$: $d$-dimensional vectors of $\langle \mathtt{IL}, \mathtt{FL} \rangle$ fixed-point numbers
- $c_0$: $\langle \tilde{\mathtt{IL}}, \tilde{\mathtt{FL}} \rangle$ fixed-point number

1. Compute $z = \sum_{i=1}^{d} a_i b_i$
   - $a_i b_i$: $\langle 2\,\mathtt{IL}, 2\,\mathtt{FL} \rangle$ fixed-point
   - $z$: fixed-point with a bit length of $\log_2 d + 2(\mathtt{IL} + \mathtt{FL})$
2. Convert: $c_0 = Convert(z, \langle \tilde{\mathtt{IL}}, \tilde{\mathtt{FL}} \rangle)$


- Advantages of this 2-step approach
  - easy to implement on an FPGA
  - only one rounding per dot product, applied after all products are accumulated
  - easy to simulate on a CPU/GPU with BLAS libraries
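The two steps can be simulated on a CPU roughly as follows. This is a sketch under stated assumptions, not the paper's implementation: values are stored as integers in units of $2^{-\mathtt{FL}}$, the accumulation is exact, and a single stochastic rounding plus saturation is applied at the end. The helper names `to_int` and `macc` are hypothetical.

```python
import numpy as np


def to_int(x, fl):
    """Encode real values as integers in units of 2**-fl (round-to-nearest)."""
    return np.floor(np.asarray(x, dtype=np.float64) * 2.0 ** fl + 0.5).astype(np.int64)


def macc(a_int, b_int, fl_in, il_out, fl_out, rng):
    """Dot product with exact accumulation, then one stochastic Convert."""
    # Step 1: exact sum of products, in units of 2**-(2 * fl_in).
    z = int(np.dot(a_int, b_int))
    # Step 2: drop the extra fractional bits with stochastic rounding.
    shift = 2 * fl_in - fl_out
    assert shift >= 0, "output format must not have more fractional bits"
    low, rem = divmod(z, 1 << shift)           # floor and remainder
    q = low + int(rng.random() < rem / (1 << shift))
    # Saturate to the signed range of the output word.
    bound = 1 << (il_out + fl_out - 1)
    q = min(max(q, -bound), bound - 1)
    return q * 2.0 ** -fl_out


rng = np.random.default_rng(0)
a = to_int([0.5, 0.25], fl=8)
b = to_int([0.5, 0.5], fl=8)
print(macc(a, b, fl_in=8, il_out=8, fl_out=8, rng=rng))
```

Keeping the accumulator exact and rounding once means the random number generator is invoked once per output element rather than once per product.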

# Evaluation

The error of networks trained with 16-bit fixed-point arithmetic is evaluated by comparing it with that of networks trained with 32-bit floating-point arithmetic.

- Network
  - DNN
  - Convolutional Neural Network(CNN)
- Data set
  - MNIST
  - CIFAR10


- Weights and biases in the network are initialized randomly.
- Hyperparameters (e.g. number of layers, momentum, learning rate, ...) are the same in the baseline experiment and the 16-bit fixed-point one.
- Fixed-point numbers are represented in 16 bits.

#### MNIST

- 60,000 training images / 10,000 test images
- 28 × 28 pixels per image
- Each pixel has a value in [0,1].



From "Test Run - Working with the MNIST Image Recognition Data Set" (https://msdn.microsoft.com/ja-jp/magazine/dn745868.aspx)

#### DNN

- Fully connected network
- 2 hidden layers, each containing 1000 units with the ReLU activation function
- Each weight is initialized randomly from *N*(0, 0.01). The bias vectors are initialized to 0.
- Trained using minibatch SGD to minimize the cross-entropy objective function.
  - The minibatch size is 100.

- Fixed-point precision at which the test error is close to that of float:
  - <2,14> with the RtN scheme, or
  - <8,8> with the stochastic rounding scheme.
- RtN loses gradient information more readily: an update smaller than half of the resolution is rounded away, so some weights are never updated.
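The effect can be reproduced with a toy experiment. The sketch below is an assumption-laden illustration (a single weight, a constant update of 0.1 times the resolution, made-up values), not the paper's training setup: it applies the same small update 1000 times under both rounding schemes in the <8,8> format.

```python
import numpy as np

rng = np.random.default_rng(0)
fl = 8
eps = 2.0 ** -fl            # resolution of the <8,8> format
w0 = 0.5                    # starting weight, exactly representable
step = 0.1 * eps            # update well below eps / 2
n_steps = 1000

w_rtn = w_sr = w0
for _ in range(n_steps):
    # RtN: w - step is closer to w than to w - eps, so it rounds back to w.
    w_rtn = np.floor((w_rtn - step) / eps + 0.5) * eps
    # Stochastic: the weight drops by eps with probability of about 0.1.
    scaled = (w_sr - step) / eps
    low = np.floor(scaled)
    w_sr = (low + (rng.random() < scaled - low)) * eps

print("float     :", w0 - n_steps * step)
print("RtN       :", w_rtn)
print("stochastic:", w_sr)
```

Under RtN the weight never leaves its starting value, while the stochastic result follows the float trajectory on average.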





#### CNN

The network is similar to LeNet-5.

- 5×5 filters, 2×2 non-overlapping max pooling



- Hyperparameters
  - learning rate = 0.1 × 0.95<sup>(number of completed epochs)</sup>
  - momentum = 0.9
  - weight decay = 0.0005
- Outputs of the convolutional layers are represented in <6,10> fixed-point.
  - If IL < 6, the outputs fall outside the range that the fixed-point format can represent.
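The exponentially decaying learning rate above can be written as a one-line function. The name `mnist_cnn_learning_rate` and its default arguments are illustrative.

```python
def mnist_cnn_learning_rate(completed_epochs: int,
                            base: float = 0.1,
                            decay: float = 0.95) -> float:
    """Learning rate after a given number of completed epochs."""
    return base * decay ** completed_epochs


for epoch in (0, 1, 10):
    print(epoch, mnist_cnn_learning_rate(epoch))
```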



#### CIFAR10

- The CIFAR-10 dataset consists of 60,000 32×32 color images in 10 classes, with 6,000 images per class. There are 50,000 training images and 10,000 test images.
- The image RGB values are scaled to [0,1] for the evaluation.



- 3 convolutional layers, each containing 64 filters of size 5×5
- Max pooling over a 3×3 window with a stride of 2



- Parameters
  - The learning rate is 0.01 at the beginning, 0.005 after 50 epochs, 0.0025 after 75 epochs, and 0.00125 after 100 epochs.
- Outputs from the layers are represented in the <4,12> format.
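The step schedule amounts to halving the rate at three epoch boundaries. The sketch below assumes "after 50 epochs" means once 50 epochs have been completed; the function name is hypothetical.

```python
def cifar10_learning_rate(completed_epochs: int) -> float:
    """Start at 0.01 and halve the rate at epochs 50, 75 and 100."""
    rate = 0.01
    for boundary in (50, 75, 100):
        if completed_epochs >= boundary:
            rate /= 2
    return rate


for epoch in (0, 50, 75, 100):
    print(epoch, cifar10_learning_rate(epoch))
```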

RtN scheme results in divergence



Changing the precision to <4, 16> improves the network performance

# Hardware Prototyping

- FPGA-based hardware accelerator for matrix-matrix multiplication
  - FPGAs contain DSP units that are well suited to implementing fixed-point arithmetic.
  - FPGAs have the potential for high performance and power efficiency.

# Components of the prototype

- Xilinx Kintex325T FPGA
  - 840 DSP multiply-accumulate units
  - 2 MB of on-chip block RAM
- 8 GB DDR3 memory
  - The bandwidth between the off-chip DDR3 memory and the FPGA is 6.4 GB/s.
- PCIe bus between the FPGA and the host

### Inside of the accelerator















Data is moved from the DDR3 memory to the FPGA on-chip memory.







## Systolic Array(SA) Architecture



### matrix multiplication in SA
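To make the data flow concrete, here is a small cycle-by-cycle simulation of one common systolic dataflow (output-stationary: each node keeps its own accumulator, passes its left input to the right and its top input downwards). Whether the prototype uses exactly this arrangement is an assumption; the function name `systolic_matmul` is illustrative, and rounding of the accumulated results is left out.

```python
import numpy as np


def systolic_matmul(a, b):
    """Simulate an n x m grid of MACC nodes computing a @ b cycle by cycle."""
    n, k = a.shape
    k2, m = b.shape
    assert k == k2
    a_reg = np.zeros((n, m), dtype=np.int64)   # values travelling to the right
    b_reg = np.zeros((n, m), dtype=np.int64)   # values travelling downwards
    acc = np.zeros((n, m), dtype=np.int64)     # local accumulator of each node
    for t in range(k + n + m - 2):
        # Every node hands its inputs to its right and bottom neighbours.
        a_reg[:, 1:] = a_reg[:, :-1].copy()
        b_reg[1:, :] = b_reg[:-1, :].copy()
        # Feed the edges: row i of a is delayed by i cycles, column j of b by j.
        for i in range(n):
            a_reg[i, 0] = a[i, t - i] if 0 <= t - i < k else 0
        for j in range(m):
            b_reg[0, j] = b[t - j, j] if 0 <= t - j < k else 0
        # Each node multiplies its two current inputs and accumulates locally.
        acc += a_reg * b_reg
    return acc


rng = np.random.default_rng(0)
a = rng.integers(-8, 8, size=(3, 4))
b = rng.integers(-8, 8, size=(4, 2))
print(np.array_equal(systolic_matmul(a, b), a @ b))
```

The skewed feeding makes the matching elements of a row and a column meet at the right node in the same cycle, so the whole product takes k + n + m - 2 cycles.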



## Evaluating the prototype

- A 28×28 SA is implemented on the FPGA.
  - The estimated maximum operating frequency is 166 MHz and the estimated power consumption is 7 W.
  - The throughput is 260 G-ops/s.
  - => The power efficiency is 37 G-ops/s/W.
    - For comparison, the power efficiency of the NVIDIA GT650m and GTX780 and the Intel i7-3720QM ranges from 1 to 5 G-ops/s/W.
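The throughput and efficiency figures can be checked with a short calculation, assuming each of the 28×28 nodes performs one multiply and one add per cycle and that both are counted as operations (this counting convention is an assumption).

```python
nodes = 28 * 28                  # one MACC node per array element
ops_per_node_per_cycle = 2       # assumption: multiply and add counted separately
frequency_hz = 166e6
power_w = 7.0

throughput = nodes * ops_per_node_per_cycle * frequency_hz
efficiency = throughput / power_w

print(f"throughput: {throughput / 1e9:.1f} G-ops/s")
print(f"efficiency: {efficiency / 1e9:.1f} G-ops/s/W")
```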

| Resource   | Usage  | Available on XCVK325T | Utilization ratio |
|------------|--------|-----------------------|-------------------|
| LUTs       | 62922  | 203800                | 31%               |
| Flip-flops | 146510 | 407600                | 36%               |
| DSP        | 812    | 840                   | 97%               |
| Block RAM  | 334    | 445                   | 75%               |

Table 1. FPGA resource utilization.

## Related work

- (Iwata et al., 1989) proposes a back-propagation algorithm using 24-bit floating-point arithmetic.
- (Hammerstrom, 1990) presents a framework for on-chip learning using 8- to 16-bit fixed-point arithmetic.
- (Holt & Hwang, 1993) performs a theoretical analysis of a neural network's ability to learn when trained in a limited-precision setting.

# Conclusion

- They envision the emergence of hardware-software co-designed systems for large-scale machine learning based on relaxed, inexact models of computing.
  - Stochastic rounding may give better neural network performance than conventional rounding (round-to-nearest).
  - They implemented a high-throughput, energy-efficient prototype for matrix multiplication with a 16-bit fixed-point representation.