### Optimizing Memory Efficiency for Deep Convolutional Neural Networks on GPUs

Presenter: Lingqi ZHANG 2018-12-19

# Outline

- Background: Convolutional Neural Networks
- Introduction
- Memory Issue A: Data layout
- Memory Issue B: Off-chip memory accesses
- Results
- Conclusion
- Discussion

### **Convolutional Neural Networks (CNN)**

- Visual System<sup>[1]</sup>
  - Simple Receptive Field
    - Activation related to location
  - Complex Receptive Field
    - Activation related to patterns



The Visual Pathway.—Source: https://commons.wikimedia.org/wiki/File:Human\_visual\_pathway.svg



- Convolution Layer
- Pooling Layer
- Fully Connected Layer
- Loss Layer
  - Softmax
  - Sigmoid Cross-Entropy
  - Euclidean Loss



Architecture of a CNN. Source: https://www.mathworks.com/videos/introduction-to-deep-learning-what-are-convolutional-neural-networks--1489512765771.html



### **Convolutional Layer**

#### Function:

**Extracts features** 



Feature visualization of convolutional net trained on ImageNet from [Zeiler & Fergus 2013]

#### Features in a trained network

https://medium.com/technologymadeeasy/the-best-explanation-of-convolutional-neural-networks-on-the-internet-fbb8b1ad5df8

| Operation                      | Filter                                                                           |
|--------------------------------|----------------------------------------------------------------------------------|
| Identity                       | $\begin{bmatrix} 0 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}$              |
| Edge detection (variant 1)     | $\begin{bmatrix} 1 & 0 & -1 \\ 0 & 0 & 0 \\ -1 & 0 & 1 \end{bmatrix}$            |
| Edge detection (variant 2)     | $\begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix}$             |
| Edge detection (variant 3)     | $\begin{bmatrix} -1 & -1 & -1 \\ -1 & 8 & -1 \\ -1 & -1 & -1 \end{bmatrix}$      |
| Sharpen                        | $\begin{bmatrix} 0 & -1 & 0 \\ -1 & 5 & -1 \\ 0 & -1 & 0 \end{bmatrix}$          |
| Box blur (normalized)          | $\frac{1}{9} \begin{bmatrix} 1 & 1 & 1 \\ 1 & 1 & 1 \\ 1 & 1 & 1 \end{bmatrix}$  |
| Gaussian blur (approximation)  | $\frac{1}{16} \begin{bmatrix} 1 & 2 & 1 \\ 2 & 4 & 2 \\ 1 & 2 & 1 \end{bmatrix}$ |

**Effect of different filters on a picture**

https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
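To make the filter table concrete, here is a minimal sketch in plain Python of how a 3x3 filter is applied to a small image. The function name `convolve2d_valid` and the 4x4 test image are illustrative assumptions, not taken from the paper. The sketch uses no padding and stride 1, and it computes a cross-correlation, as CNN layers usually do; for the symmetric filters used here that is the same as a true convolution.

```python
def convolve2d_valid(image, kernel):
    """Slide `kernel` over `image` (no padding, stride 1) and return the result."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            acc = 0
            # Weighted sum of the window whose top-left corner is (i, j).
            for di in range(kh):
                for dj in range(kw):
                    acc += image[i + di][j + dj] * kernel[di][dj]
            row.append(acc)
        out.append(row)
    return out


IDENTITY = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
EDGE = [[0, 1, 0], [1, -4, 1], [0, 1, 0]]

# Hypothetical 4x4 image: flat background with one bright pixel at (1, 1).
image = [
    [1, 1, 1, 1],
    [1, 9, 1, 1],
    [1, 1, 1, 1],
    [1, 1, 1, 1],
]

identity_out = convolve2d_valid(image, IDENTITY)  # copies the window centers
edge_out = convolve2d_valid(image, EDGE)          # responds only near the bright pixel
```

The identity filter returns the window centers unchanged, while the edge filter gives zero on the flat background and a large response around the bright pixel.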



Example of Pooling http://mlss.tuebingen.mpg.de/2015/slides/fergus/Fergus\_1.pdf
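As a small illustration of what a pooling layer computes, the sketch below implements max pooling over a 2D feature map in plain Python. The function name, the 4x4 input, and the 2x2 window with stride 2 are assumptions chosen for the example.

```python
def max_pool2d(feature_map, window, stride):
    """Return the maximum of each window x window patch, moving by `stride`."""
    out = []
    for i in range(0, len(feature_map) - window + 1, stride):
        row = []
        for j in range(0, len(feature_map[0]) - window + 1, stride):
            row.append(max(feature_map[i + di][j + dj]
                           for di in range(window)
                           for dj in range(window)))
        out.append(row)
    return out


# Hypothetical 4x4 feature map.
feature_map = [
    [1, 1, 2, 4],
    [5, 6, 7, 8],
    [3, 2, 1, 0],
    [1, 2, 3, 4],
]

pooled = max_pool2d(feature_map, window=2, stride=2)
```

With a window equal to the stride the patches do not overlap. With a window larger than the stride (for example 3 and 2, as in several pooling layers of Table 1) neighbouring patches share input elements.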

# Softmax Layer

#### **Function:**

The paper narrows the 'Loss Layer' down to the 'Softmax Layer'.

> The paper states that "Before the softmax layer, there usually exist fully-connected layers", but it does not discuss fully-connected layers anywhere else.

$$Maxv[N_{x}] = \sum_{x=0}^{X} \sum_{y=0}^{Y} max(In[N_{x}][C_{y}])$$
  

$$Midv1[N_{x}][C_{y}] = \sum_{x=0}^{X} \sum_{y=0}^{Y} (In[N_{x}][C_{y}] - Maxv[N_{x}])$$
  

$$Sumv[N_{x}] = \sum_{x=0}^{X} \sum_{y=0}^{Y} sum(Midv2[N_{x}][C_{y}])$$
  

$$Out[N_{x}][C_{y}] = \sum_{x=0}^{X} \sum_{y=0}^{Y} (Midv2[N_{x}][C_{y}]/Sumv[N_{x}])$$
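The equations above never define Midv2. A minimal sketch of the standard numerically stable softmax is given below, assuming Midv2 is the element-wise exponential of Midv1 (that step is an assumption filled in from the usual definition of softmax, not something shown on the slide). The function name and the input values are illustrative.

```python
import math


def softmax_five_steps(inp):
    """inp[n][c] is the score of category c for image n; returns out[n][c]."""
    # Step 1: per-image maximum over the categories.
    maxv = [max(row) for row in inp]
    # Step 2: subtract the maximum so the exponentials cannot overflow.
    midv1 = [[v - m for v in row] for row, m in zip(inp, maxv)]
    # Step 3 (assumed): exponentiate to obtain Midv2.
    midv2 = [[math.exp(v) for v in row] for row in midv1]
    # Step 4: per-image sum over the categories.
    sumv = [sum(row) for row in midv2]
    # Step 5: normalize so that each image's outputs sum to 1.
    return [[v / s for v in row] for row, s in zip(midv2, sumv)]


probs = softmax_five_steps([[1.0, 2.0, 3.0], [1000.0, 1000.0, 1000.0]])
```

The second row shows why the maximum is subtracted first: exponentiating 1000 directly would overflow, while after the subtraction every entry is exp(0).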

# Data Layout

#### **Definition:**

A data layout is a structure applied to a system that defines how the data fields are organized.<sup>[1]</sup> (First Google search result)

I think the authors mean the arrangement of a multidimensional array in memory.

| Index | 0       | 1       | 2       | 3       | 4       |
|-------|---------|---------|---------|---------|---------|
| 0     | 65,340  | 12,483  | 138,189 | 902,960 | 633,877 |
| 1     | 5,246   | 424,642 | 650,380 | 821,254 | 866,122 |
| 2     | 89,678  | 236,781 | 601,691 | 329,274 | 913,534 |
| 3     | 103,902 | 4,567   | 733,611 | 263,010 | 85,550  |
| 4     | 2,778   | 658,305 | 128,788 | 978,155 | 620,702 |
| 5     | 45,024  | 55,058  | 705,586 | 89,672  | 384,605 |
| 6     | 780     | 47,538  | 523,784 | 556,801 | 617,107 |
| 7     | 32,667  | 350,890 | 834,753 | 638,108 | 85,188  |
| 8     | 56,083  | 145,582 | 775,040 | 548,322 | 756,587 |
| 9     | 41,123  | 543,542 | 537,738 | 513,048 | 418,482 |

[1] https://help.dsync.com/hc/en-us/articles/115006785467-What-is-a-data-layout-

## **CNN Libraries**



Elements with nearby indices are also physically adjacent in memory.
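Which index counts as "nearby" depends on the layout. The sketch below computes the linear offset of element (n, c, h, w) for the two layouts discussed in this talk, assuming the usual convention that the last letter of the layout name is the fastest-varying dimension. The function names and the tiny tensor shape are illustrative.

```python
def offset_nchw(n, c, h, w, N, C, H, W):
    """Linear offset when W is the innermost (fastest-varying) dimension."""
    return ((n * C + c) * H + h) * W + w


def offset_chwn(n, c, h, w, N, C, H, W):
    """Linear offset when N is the innermost (fastest-varying) dimension."""
    return ((c * H + h) * W + w) * N + n


# Hypothetical small tensor shape.
N, C, H, W = 2, 3, 4, 5
shape = (N, C, H, W)

# Distance in memory between two elements that differ by one step in W or in N.
step_w_nchw = offset_nchw(0, 0, 0, 1, *shape) - offset_nchw(0, 0, 0, 0, *shape)
step_n_nchw = offset_nchw(1, 0, 0, 0, *shape) - offset_nchw(0, 0, 0, 0, *shape)
step_w_chwn = offset_chwn(0, 0, 0, 1, *shape) - offset_chwn(0, 0, 0, 0, *shape)
step_n_chwn = offset_chwn(1, 0, 0, 0, *shape) - offset_chwn(0, 0, 0, 0, *shape)
```

In NCHW, neighbouring pixels of one image are adjacent and neighbouring images are C*H*W elements apart. In CHWN it is the other way round: the same pixel of neighbouring images is adjacent.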



# Neural Networks

| Name    | Dataset used when first proposed | Description                                |
|---------|----------------------------------|--------------------------------------------|
| LeNet   | MNIST                            | Handwritten character recognition (digits) |
| Cifar   | CIFAR10                          | 10 categories of objects                   |
| AlexNet | ImageNet                         | 1 million real-world images                |
| ZFNet   | ImageNet                         | 1 million real-world images                |
| VGG     | ImageNet                         | 1 million real-world images                |

## Introduction

### Current situation

- Success of CNNs (e.g. AlexNet)
- GPU optimizations
  - (e.g. Caffe)
- Reducing arithmetic complexity

### Problems

- Prior work has not focused on memory efficiency
- Two issues
  - Data layouts
  - Redundant off-chip memory accesses

### Data Layout



Fig. 1. Performance comparison between the CHWN layout (cuda-convnet2) and NCHW layout (cuDNNv4) on convolutional and pooling layers in AlexNet [12]

- POINT 1:
  - GPU thread organization depends heavily on the data layout
  - The data layout determines the memory access pattern
- EVIDENCE 1 (Fig. 1):
  - A suitable layout gives up to a 6.9x layer-level speedup
  - A suitable layout speeds up even a convolutional layer by up to 2.3x
- POINT 2: the size of each dimension affects performance
  - Because each dimension has a distinct memory access pattern
- DEDUCTION 1 (from POINT 2):
  - The performance impact of the data layout is complex and difficult to reason about.
- DEDUCTION 2 (from EVIDENCE 1):
  - A single data layout cannot deliver the best performance for all layers.
- CURRENT SITUATION (PROBLEM):
  - Current libraries employ only one data layout for all CNN layers.

### Redundant off-chip memory access

### • EVIDENCE (from authors' analysis):

- The memory efficiency of the memory-bound pooling layers and classifier (softmax) layers is far from optimal
- **DUE TO:** their off-chip memory accesses being overlooked
- PROBLEM 1:
  - A CNN requires multiple steps to complete, with sequential data dependences across the steps
  - **CURRENT SITUATION:** one kernel is used for each step
  - **PROBLEM:** data exchanged between kernels passes through the bandwidth-limited off-chip memory
- PROBLEM 2:
  - Leveraging data locality for high memory performance is an important optimization
  - **CURRENT SITUATION (PROBLEM):** existing CNN libraries do not address how to optimize locality for different data layouts.

# Contributions

#### **Benchmark 1 and Optimization 1**

1. Benchmarked the performance impact of different layouts in various CNN layers and derived a heuristic guide for layout selection.

#### **Optimization 2**

2. Proposed a layout transformation on GPUs and integrated automatic layout selection and transformation into Caffe.

#### **Benchmark 2**

3. Benchmarked the memory behavior of pooling and softmax layers and further optimized their memory access efficiency on GPUs.

#### **Experiments**

4. The authors "perform rigorous evaluation and result analysis on different types of layers and representative networks, and demonstrate high performance improvements for both single layers, and complete networks"

I think the authors want to say that they applied their optimizations to different types of layers and to representative networks.

# Benchmarks

- N: batch size
- C: number of feature maps
- H/W: image height/width
- Fh/Fw: filter height/width
- S: stride

### • LAYERS:

- Table 1 shows the layers chosen from well-known neural networks.
- The convolutional layers come from this table.
- The pooling layers come from this table.
- The softmax layers are benchmarked under several settings (described in Section VI).

It's interesting that this paper did not benchmark the convolutional layers in AlexNet or the pooling layers in VGG.

| Layer         | Ni                             | Co  | H/W | F<sub>w</sub>/F<sub>h</sub> | Ci  | S | Description                                          |
|---------------|--------------------------------|-----|-----|-----------------------------|-----|---|------------------------------------------------------|
| CONV1 (CV1)   | 128                            | 16  | 28  | 5                           | 1   | 1 | LeNet [17]: model error rate 0.18% (epoch 200)       |
| CONV2 (CV2)   | 128                            | 16  | 14  | 5                           | 16  | 1 |                                                      |
| POOL1 (PL1)   | 128                            | -   | 28  | 2                           | 16  | 2 |                                                      |
| POOL2 (PL2)   | 128                            | -   | 14  | 2                           | 16  | 2 |                                                      |
| CLASS1        | 128 images and 10 categories   |     |     |                             |     |   |                                                      |
| CONV3 (CV3)   | 128                            | 64  | 24  | 5                           | 3   | 1 | Cifar10 [15]: model error rate 14.04% (epoch 100)    |
| CONV4 (CV4)   | 128                            | 64  | 12  | 5                           | 64  | 1 |                                                      |
| POOL3 (PL3)   | 128                            | -   | 24  | 3                           | 64  | 2 |                                                      |
| POOL4 (PL4)   | 128                            | -   | 12  | 3                           | 64  | 2 |                                                      |
| CLASS2        | 128 images and 10 categories   |     |     |                             |     |   |                                                      |
| POOL5 (PL5)   | 128                            | -   | 55  | 3                           | 96  | 2 | ImageNet with AlexNet [12] model                     |
| POOL6 (PL6)   | 128                            | -   | 27  | 3                           | 192 | 2 |                                                      |
| POOL7 (PL7)   | 128                            | -   | 13  | 3                           | 256 | 2 |                                                      |
| CLASS3        | 128 images and 1000 categories |     |     |                             |     |   |                                                      |
| CONV5 (CV5)   | 64                             | 96  | 224 | 3                           | 3   | 2 | ImageNet with ZFNet model [25]                       |
| CONV6 (CV6)   | 64                             | 256 | 55  | 5                           | 96  | 2 |                                                      |
| CONV7 (CV7)   | 64                             | 384 | 13  | 3                           | 256 | 1 |                                                      |
| CONV8 (CV8)   | 64                             | 384 | 13  | 3                           | 384 | 1 |                                                      |
| POOL8 (PL8)   | 64                             | -   | 110 | 3                           | 96  | 2 |                                                      |
| POOL9 (PL9)   | 64                             | -   | 26  | 3                           | 256 | 2 |                                                      |
| POOL10 (PL10) | 64                             | -   | 13  | 3                           | 256 | 2 |                                                      |
| CLASS4        | 64 images and 1000 categories  |     |     |                             |     |   |                                                      |
| CONV9 (CV9)   | 32                             | 64  | 224 | 3                           | 3   | 1 | ImageNet with VGG model [22]                         |
| CONV10 (CV10) | 32                             | 256 | 56  | 3                           | 128 | 1 |                                                      |
| CONV11 (CV11) | 32                             | 512 | 28  | 3                           | 256 | 1 |                                                      |
| CONV12 (CV12) | 32                             | 512 | 14  | 3                           | 512 | 1 |                                                      |
| CLASS5        | 32 images and 1000 categories  |     |     |                             |     |   |                                                      |

TABLE 1: THE CNNS AND THEIR LAYERS USED IN THE EXPERIMENTS.

### Data Layout in Convolutional Layers

• Benchmark:

- Comparison of CHWN and NCHW using their best-performing implementations (cuda-convnet and cuDNN respectively)
  - I think the authors mean "Direct Convolution" and "Matrix Multiplication" here
- Observations:
  - cuda-convnet outperforms cuDNN for CV1-5 and CV9
    - Because C is no larger than 64 (I think the authors mean Ci here)



Fig. 3. Performance comparison between two different data layouts for the convolutional layers in Table 1. The performance is normalized to cuda-convnet measured on a GTX TITAN BLACK.

### **Data Layout in Convolutional Layers**

- Benchmark:
  - To further identify the sensitivity of the data layouts to each dimension, the researchers collect results with one dimension size varying (N or C; again, Ci here) and the other three fixed
- Analysis:
  - If the batch size N is 128, cuda-convnet enables each thread to handle 4 images
  - If N < 128:
    - There is less reuse for images, so cuda-convnet cannot achieve top performance
    - (Implementation related. Significant?)
  - If C < 32:
    - The overhead of unrolling the matrix (in cuDNN) is more evident
- Heuristic Layout Selection:
  - If C < Ct or N > Nt: better to choose CHWN
  - Else: better to choose NCHW
  - The thresholds vary depending on the system
    - Titan Black: (Ct=32, Nt=128)
    - Titan X: (Ct=128, Nt=64)
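The heuristic layout selection above can be written as a one-line rule. The sketch below is a hedged illustration: the function name `choose_layout` and the `THRESHOLDS` dictionary are hypothetical, C is taken to be the number of input feature maps (Ci), and the strict comparisons are copied from the slide as written, which may differ from the exact boundary cases in the paper.

```python
# Thresholds quoted on the slide for the two GPUs.
THRESHOLDS = {
    "Titan Black": {"Ct": 32, "Nt": 128},
    "Titan X": {"Ct": 128, "Nt": 64},
}


def choose_layout(C, N, Ct, Nt):
    """Return 'CHWN' if C < Ct or N > Nt, otherwise 'NCHW'."""
    return "CHWN" if (C < Ct or N > Nt) else "NCHW"


titan_black = THRESHOLDS["Titan Black"]

# CONV1 from Table 1: Ci = 1, N = 128.
layout_cv1 = choose_layout(C=1, N=128, **titan_black)
# CONV7 from Table 1: Ci = 256, N = 64.
layout_cv7 = choose_layout(C=256, N=64, **titan_black)
```

Because the thresholds are hardware dependent, the same layer can get a different answer on another GPU: with the Titan X thresholds, CONV6 (Ci = 96, N = 64) falls below Ct = 128 and is assigned CHWN.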





b. The performance with different values of C.

Fig. 4. Sensitivity study of data layouts on the N and C dimensions. CONV7 in Table 1 is used while others show similar trends.

### **Data Layout in Convolutional Layers**

### • Benchmark:

- Performance of various convolutional layers using FFT, FFT-Tiling and Matrix Multiplication with the NCHW layout, compared to cuda-convnet with the CHWN data layout.

#### • Observation:

- If N is large, the filter kernel is large, or C is large:
  - FFT is better than MM
  - REASON:



Fig. 5. Speedups of the FFT-based approach over cuda-convnet.

### **B. Data Layout in Pooling Layers**

#### • Benchmark:

- Performance of pooling layers with different data layouts
- cuda-convnet (CHWN) vs Caffe & cuDNN (NCHW)

### Conclusion:

- CHWN is always better than NCHW
- REASON:
  - The NCHW layout cannot ensure coalesced memory access.
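A rough way to see the coalescing argument is to count how many memory segments one warp touches for a single load. The sketch below is a simplified model under stated assumptions: 4-byte floats, 128-byte segments, a warp of 32 threads, CHWN threads mapped to consecutive images, and NCHW threads mapped to consecutive output columns with pooling stride 2. These mappings are assumptions made for illustration, not a description of how cuda-convnet, Caffe or cuDNN are actually implemented.

```python
FLOAT_BYTES = 4
SEGMENT_BYTES = 128  # assumed size of one coalesced memory transaction
WARP = 32


def segments_touched(element_offsets):
    """Number of distinct memory segments a warp touches for one load."""
    return len({(off * FLOAT_BYTES) // SEGMENT_BYTES for off in element_offsets})


pool_stride = 2

# CHWN: thread t reads the same (c, h, w) position of image t, and N is innermost,
# so the 32 addresses are consecutive.
chwn_offsets = [t for t in range(WARP)]

# NCHW: thread t computes output column t, so its window starts at input column
# t * pool_stride and the 32 addresses are strided.
nchw_offsets = [t * pool_stride for t in range(WARP)]

chwn_segments = segments_touched(chwn_offsets)
nchw_segments = segments_touched(nchw_offsets)
```

Under this model the strided pattern needs twice as many transactions for the same amount of useful data, and a larger stride makes it worse.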



Figure 6. Performance comparison between different data layouts for the pooling layers in Table 1. The performance is normalized to cuda-convnet. The numbers on top denote the highest bandwidth (GB/S) achieved for each layer.

### **C. A Fast Data Layout Transformation for CNNs**

Optimizations:

1. Change the 4D transform to a 2D transform
2. Shared memory tiling
3. Vectorization to use 8-byte accesses

Slide annotation on the launch configuration: `<10/32,32>, <H,W>`, according to the result of C%32.

Kernel (a), the naive transformation. The code comment says NHWC; in fact the transformation is CHWN->NCHW:

<pre>
__global__ void Transformation(float *in, float *out) {
  // from CHWN to NHWC.
  int tx = threadIdx.x, bx = blockIdx.x;
  int by = blockIdx.y, bz = blockIdx.z;
  out[(((tx*gridDim.z+bz)*gridDim.y+by)*gridDim.x)+bx] = …
}
</pre>

Kernel (b), the optimized transformation:

<pre>
template &lt;int C, int N&gt;
__global__ void OptTransformation(float2 *in, float *out) {
  int tx = threadIdx.x, ty = threadIdx.y, bx = blockIdx.x,
      by = blockIdx.y;
  // 1 Matrix flatten 4D to 2D: [C][H][W][N]-&gt;[C*H*W][N]
  int D2_W = N/2; int D2_H = gridDim.y*gridDim.x*blockDim.x;
  // Shared Memory Tile for Subblock Transpose
  __shared__ float2 sh[C][33]; // Padding 1 float2
  for (int i = 0; i &lt; N/64; i++) { // handle 64 images every time
    int m = by*gridDim.x*blockDim.y + bx*blockDim.y + ty;
    // 2 Subgrouping in Shared Memory
    int D3_H = m/32; int D3_W = m % 32;
    int index = D3_W + D3_H*32;
    sh[ty][tx] = in[index*D2_W + tx + i*32];
    __syncthreads();
    // 3 Vector Transpose Index
    if (C%32 == 0) {
      out[(2*ty+i*64)*D2_H + (bx)*32 + tx]   = sh[tx][ty].x;
      out[(2*ty+1+i*64)*D2_H + (bx)*32 + tx] = sh[tx][ty].y;
    } else if (C%16 == 0) {
      out[(2*ty+i*64)*D2_H + bx*32 + tx]        = sh[tx][ty].x;
      out[(2*ty+1+i*64)*D2_H + bx*32 + tx]      = sh[tx][ty].y;
      out[(2*(ty+16)+i*64)*D2_H + bx*32 + tx]   = sh[tx][ty+16].x;
      out[(2*(ty+16)+1+i*64)*D2_H + bx*32 + tx] = sh[tx][ty+16].y;
    }
    __syncthreads();
  } // end loop
} // end kernel
</pre>

Fig. 7. Kernel code for data layout transformation

I think vectorization improves performance here by increasing the achieved global memory bandwidth.
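The index arithmetic behind the transformation can be checked on the CPU. The sketch below is a plain Python reference, not the GPU kernel: it reorders a flat CHWN buffer into NCHW element by element, and then does the same thing by viewing the data as a [C*H*W][N] matrix and transposing it, which is the "flatten 4D to 2D" idea named in the Fig. 7 kernel comment. Function names and the tiny shape are illustrative assumptions; shared memory tiling and float2 vectorization are GPU-specific and are not modelled here.

```python
def chwn_to_nchw(src, N, C, H, W):
    """Reorder a flat CHWN buffer into a flat NCHW buffer, element by element."""
    dst = [None] * (N * C * H * W)
    for c in range(C):
        for h in range(H):
            for w in range(W):
                for n in range(N):
                    src_off = ((c * H + h) * W + w) * N + n   # CHWN offset
                    dst_off = ((n * C + c) * H + h) * W + w   # NCHW offset
                    dst[dst_off] = src[src_off]
    return dst


def chwn_to_nchw_2d(src, N, C, H, W):
    """Same result, treating the data as a [C*H*W][N] matrix and transposing it."""
    rows = C * H * W
    return [src[r * N + n] for n in range(N) for r in range(rows)]


# Hypothetical tiny tensor: 2 images, 1 channel, 1 x 3 pixels.
N, C, H, W = 2, 1, 1, 3
src = list(range(N * C * H * W))  # CHWN order: pixel 0 of both images, then pixel 1, ...

dst_4d = chwn_to_nchw(src, N, C, H, W)
dst_2d = chwn_to_nchw_2d(src, N, C, H, W)
```

Both functions give the same buffer, which shows that the 4D reordering is exactly a 2D transpose once C, H and W are merged into one dimension.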

### D. Wrap Up: Automatic CNN Data Layout Support

- Code Modification:
  - Add a new field to each convolutional and pooling layer to indicate the data layout choice.
    - Use the proposed heuristic to set the layout.
  - At the completion time of a layer, an additional check is needed to weigh the overhead of the data layout transformation against the performance improvement obtained from the more suitable data layout.
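The additional check amounts to a simple cost comparison. The sketch below is one hypothetical way to express it, assuming the three timings are available as measurements or estimates; the function name, the parameters and the numbers are illustrative, and the slide does not say how the paper obtains these values.

```python
def should_transform(time_current_layout, time_preferred_layout, transform_time):
    """Transform only if the preferred layout is still faster after paying for the transform."""
    return time_preferred_layout + transform_time < time_current_layout


# Hypothetical timings in milliseconds.
worth_it = should_transform(time_current_layout=10.0,
                            time_preferred_layout=6.0,
                            transform_time=1.5)
not_worth_it = should_transform(time_current_layout=10.0,
                                time_preferred_layout=9.0,
                                transform_time=1.5)
```

In the first case the layer saves 4 ms and pays 1.5 ms, so the transformation pays off. In the second case the 1 ms gain is smaller than the overhead.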

### Memory Analysis and Optimization on Pooling Layers

- ANOTHER PROBLEM:
  - Redundant data access
- SOLUTION:
  - THREAD FUSING
    - Cache the input data in Register File for reuse
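The redundancy comes from overlapping pooling windows: when the window is larger than the stride, neighbouring outputs read some of the same inputs. The sketch below counts input loads along one dimension when a single thread computes several consecutive outputs, with and without reusing values it has already loaded. It is a simplified 1D model with hypothetical function names; how many outputs the paper assigns to each thread is not stated on the slide.

```python
def loads_without_reuse(outputs, window):
    """Every output loads its whole window again."""
    return outputs * window


def loads_with_reuse(outputs, window, stride):
    """Distinct inputs covered by `outputs` consecutive windows (assumes stride <= window)."""
    return (outputs - 1) * stride + window


# Window 3 and stride 2, as in several pooling layers of Table 1; 4 outputs per thread.
naive = loads_without_reuse(outputs=4, window=3)
fused = loads_with_reuse(outputs=4, window=3, stride=2)
```

With window 3 and stride 2, four outputs need 12 loads if each is computed independently but only 9 distinct inputs. When the window equals the stride there is no overlap and nothing to save.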





### B. Memory Analysis and Optimization on Softmax Layers

### • PROBLEM:

- The highest bandwidth achieved for the softmax layers (BL\_Best) is far from optimal
- ANALYSIS:
  - Five kernels compute the softmax layer step by step, which causes redundant global memory traffic.
  - There is not enough parallelism in the inner loop

### • SOLUTION:

- Kernel fusion
- Inter-step communication through shared memory
- Parallel inner loop

$$\begin{aligned} Maxv[N_{x}] &= \sum_{x=0}^{X} \sum_{y=0}^{Y} max(In[N_{x}][C_{y}]) \\ Midv1[N_{x}][C_{y}] &= \sum_{x=0}^{X} \sum_{y=0}^{Y} (In[N_{x}][C_{y}] - Maxv[N_{x}]) \\ Sumv[N_{x}] &= \sum_{x=0}^{X} \sum_{y=0}^{Y} sum(Midv2[N_{x}][C_{y}]) \\ Out[N_{x}][C_{y}] &= \sum_{x=0}^{X} \sum_{y=0}^{Y} (Midv2[N_{x}][C_{y}]/Sumv[N_{x}]) \end{aligned}$$
Theoretical Peak: 235GB/s



Fig. 13. Performance comparison (GB/S) of softmax layers with a wide range of configurations. x/y means the batch size as x and the number of categories as y.

I guess the double sum in these equations means the computation over a window of size X x Y.

<pre>
dim3 threads(num_category); dim3 blocks(num_img);

__global__ void opt_kernel(float *mat, float *out) {
  __shared__ float in_tile[C];      // C &lt; 11K (K = 1024)
  __shared__ float tmp_tile[1024];  // for reduction
  int tidx = threadIdx.y * blockDim.x + threadIdx.x;
  // Each thread loads a strided subset of one image's scores into shared memory
  for (uint i = tidx; i &lt; num_category; i = i + blockDim.y * blockDim.x)
    in_tile[i] = mat[blockIdx.x * num_category + i];
  // step 1: block-wide max reduction, result in tmp_tile[0]
  max_reduction_thread_block(in_tile, tmp_tile);
  // step 2: subtract the max
  for (uint i = tidx; i &lt; num_category; i = i + blockDim.y * blockDim.x)
    in_tile[i] = in_tile[i] - tmp_tile[0];  // tmp_tile[0] stores the max
}
</pre>

Fig. 9. Optimized kernel after kernel fusion (C < 11K)
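A back-of-the-envelope count shows why fusing the five steps reduces off-chip traffic. The sketch below is an idealized model under stated assumptions: every separate kernel reads its inputs from and writes its outputs to global memory, caches are ignored, and the fused kernel keeps all intermediates on chip. The per-step read and write counts are assumptions derived from the five softmax steps, not measurements from the paper.

```python
def traffic_five_kernels(N, C):
    """Global-memory elements read plus written when each step is its own kernel."""
    steps = [
        (N * C, N),          # step 1: read In, write Maxv
        (N * C + N, N * C),  # step 2: read In and Maxv, write Midv1
        (N * C, N * C),      # step 3: read Midv1, write Midv2 (exponential)
        (N * C, N),          # step 4: read Midv2, write Sumv
        (N * C + N, N * C),  # step 5: read Midv2 and Sumv, write Out
    ]
    return sum(reads + writes for reads, writes in steps)


def traffic_fused(N, C):
    """One fused kernel: read In once, write Out once."""
    return N * C + N * C


# 128 images and 1000 categories, as in CLASS3 of Table 1.
separate = traffic_five_kernels(128, 1000)
fused = traffic_fused(128, 1000)
ratio = separate / fused
```

Under this model the five-kernel version moves about four times as many elements through global memory as the fused one.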

Speedup of a layout over an alternative one

### A. Results on Data Layout Optimization

- Experiment:
  - Performance of the transformation
  - Overhead of the transformation when it is integrated into a convolutional layer

### • CONCLUSION:

- Data layout has a significant performance impact
- The optimizations in the transformation work
- Even after accounting for the data layout transformation overhead, most layers still gain performance from the layout transformation.



Fig. 11. Achieved memory bandwidth using three methods for data layout transformation. The Transform-Opt2 is not applicable for CV10, CV11, CV12.

Slide annotations: an average of 7.5x speedup; memory overhead: 73.5MB



### **B. Results on Off-chip Memory Access Optimization**

#### • Experiments:

- Performance comparison of different pooling layers
- Memory bandwidth comparison between the optimized implementation and the best original alternative implementation.

Slide annotations:

- Reduced global memory transactions by 9.1% and DRAM accesses by 36%
- Achieved 14.3% higher performance on average



Fig. 13 (softmax layers), slide annotations:

- Communication: 2.81x average speedup
- Parallel inner loop: 5.13x average speedup

### **Results on Whole Networks**

- Experiments:
  - Integrate the optimizations into cuDNN and compare.
  - Detailed performance comparison of different layers in AlexNet.
- Conclusion:
  - Flexible data layout contributes 72% of the improvement
  - Off-chip memory access optimization contributes 28%
    - Pooling layers:
      - Speedup of 27.8%
    - Softmax layers:
      - 20.1x over cuDNN
      - 8.2x over cuda-convnet
  - Overall:
    - 16% over cuda-convnet
    - 46% over cuDNN-Best



Fig. 15. The performance comparison of different layers in AlexNet, The performance is normalized to cuDNN-MM.

# Conclusion (quoted from the paper)

- "Our detailed study unveils the impact of data layouts on different types of CNN layers and their performance implications."
- "We propose efficient data layout support as our solution."
- "We further look into the memory access patterns of the memory-bounded layers, and propose effective optimizations to substantially reduce their off-chip memory requests and inter-kernel communication."

## Discussion

- Things I consider big issues.
  - The experiment settings do not carefully control variables
  - Concepts change several times
  - Neglects the fact that pooling layers (better with CHWN) are nearly always inserted between two convolutional layers (which may be better with NCHW). (Lacks experiments)
  - Overstated claims
- Things I consider misses.
  - Inconsistency
  - Neglects the ReLU layer and the fully connected layer.
  - Code is not presented well
  - Never mentions time consumption
- Things that could be done better.
  - Further experiments to compare the performance of different convolutional implementations (with the same layout)
  - Provide an analysis of cache misses in pooling layers.
  - Explain why specific layers were chosen while others were not

# Changing Concepts

- NCHW = best implementation in the NCHW layout = cuDNN MM
- CHWN = CHWN with cuda-convnet
- BL\_Best (highest bandwidth achieved in existing libraries) = cuDNN
  - In fact, according to the paper's own experimental results, it should be cuda-convnet
- Choosing a layout = choosing an implementation

## Inconsistency

(Not sure) The authors talk about something that is not directly related to their work:

> The second one is *redundant off-chip memory accesses*. Our performance analysis shows that the memory efficiency of the memory-bounded pooling layers and classifier (i.e., softmax) layers is far from optimal due to the overlook on their off-chip memory data accesses. First, a CNN usually requires multiple steps to complete and there exists sequential data dependence across the steps. The common practice is to use a kernel for each step. However, it incurs high cost for inter-kernel data communication as the data pass through the bandwidth-limited off-chip memory. Second, leveraging data locality for high memory performance is an important optimization. However, how to optimize locality for different data layouts has not been addressed in existing CNN libraries.
