Easy benchmarking of all public open-source implementations of convnets. A summary is provided in the section below.
Machine: 6-core Intel Core i7-5930K CPU @ 3.50GHz + NVIDIA Titan X + Ubuntu 14.04 x86_64
##Imagenet Winners Benchmarking

I pick some popular ImageNet models and clock the time for a full forward + backward pass. I average the times over 10 runs. Dropout and softmax layers are ignored.
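The timing procedure described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark scripts themselves: `benchmark`, its `forward`/`backward` callables, and the `sync` hook are hypothetical names, and the warm-up call is an assumption rather than something the text states. On a GPU, `sync` would need to block until all queued kernels finish; otherwise the timer only measures launch overhead.

```python
import time
from statistics import mean


def benchmark(forward, backward, runs=10, sync=lambda: None):
    """Return (forward_ms, backward_ms, total_ms), averaged over `runs`.

    `forward` and `backward` are zero-argument callables that run one pass.
    `sync` is a hypothetical hook that waits for the device to go idle.
    """
    # Warm-up pass (an assumption): excluded from the measurements.
    forward()
    backward()
    sync()

    fwd_ms, bwd_ms = [], []
    for _ in range(runs):
        t0 = time.perf_counter()
        forward()
        sync()
        t1 = time.perf_counter()
        backward()
        sync()
        t2 = time.perf_counter()
        fwd_ms.append((t1 - t0) * 1e3)
        bwd_ms.append((t2 - t1) * 1e3)

    f, b = mean(fwd_ms), mean(bwd_ms)
    return f, b, f + b
```

Forward and backward times are averaged separately here, so small differences between the total and the sum of the rounded forward and backward columns are to be expected.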
AlexNet (One Weird Trick paper) - Input 128x3x224x224
Library | Class | Time (ms) | forward (ms) | backward (ms) |
---|---|---|---|---|
Nervana-fp16 | ConvLayer | 92 | 29 | 62 |
CuDNN[R3]-fp16 | cudnn.SpatialConvolution | 96 | 30 | 66 |
CuDNN[R3]-fp32 | cudnn.SpatialConvolution | 96 | 32 | 64 |
Nervana-fp32 | ConvLayer | 101 | 32 | 69 |
fbfft | fbnn.SpatialConvolution | 104 | 31 | 72 |
Chainer | Convolution2D | 177 | 40 | 136 |
cudaconvnet2* | ConvLayer | 177 | 42 | 135 |
CuDNN[R2] * | cudnn.SpatialConvolution | 231 | 70 | 161 |
TensorFlow | conv2d | 292 | 70 | 222 |
Caffe (native) | ConvolutionLayer | 324 | 121 | 203 |
Torch-7 (native) | SpatialConvolutionMM | 342 | 132 | 210 |
CL-nn (Torch) | SpatialConvolutionMM | 963 | 388 | 574 |
Caffe-CLGreenTea | ConvolutionLayer | 1442 | 210 | 1232 |
Overfeat [fast] - Input 128x3x231x231
Library | Class | Time (ms) | forward (ms) | backward (ms) |
---|---|---|---|---|
CuDNN[R3]-fp16 | cudnn.SpatialConvolution | 313 | 107 | 206 |
CuDNN[R3]-fp32 | cudnn.SpatialConvolution | 326 | 113 | 213 |
fbfft | SpatialConvolutionCuFFT | 342 | 114 | 227 |
Nervana-fp16 | ConvLayer | 355 | 112 | 242 |
Nervana-fp32 | ConvLayer | 398 | 124 | 273 |
Chainer | Convolution2D | 620 | 135 | 484 |
cudaconvnet2* | ConvLayer | 723 | 176 | 547 |
CuDNN[R2] * | cudnn.SpatialConvolution | 810 | 234 | 576 |
Caffe | ConvolutionLayer | 823 | 355 | 468 |
TensorFlow | conv2d | 856 | 204 | 652 |
Torch-7 (native) | SpatialConvolutionMM | 878 | 379 | 499 |
CL-nn (Torch) | SpatialConvolutionMM | 963 | 388 | 574 |
Caffe-CLGreenTea | ConvolutionLayer | 2857 | 616 | 2240 |
OxfordNet [Model-A] - Input 64x3x224x224
Library | Class | Time (ms) | forward (ms) | backward (ms) |
---|---|---|---|---|
Nervana-fp16 | ConvLayer | 529 | 167 | 362 |
Nervana-fp32 | ConvLayer | 590 | 180 | 410 |
CuDNN[R3]-fp16 | cudnn.SpatialConvolution | 615 | 179 | 436 |
CuDNN[R3]-fp32 | cudnn.SpatialConvolution | 615 | 196 | 418 |
Chainer | Convolution2D | 885 | 251 | 632 |
fbfft | SpatialConvolutionCuFFT | 1092 | 355 | 737 |
cudaconvnet2* | ConvLayer | 1229 | 408 | 821 |
CuDNN[R2] * | cudnn.SpatialConvolution | 1099 | 342 | 757 |
Caffe | ConvolutionLayer | 1068 | 323 | 745 |
Torch-7 (native) | SpatialConvolutionMM | 1105 | 350 | 755 |
TensorFlow | conv2d | 1656 | 347 | 1309 |
CL-nn (Torch) | SpatialConvolutionMM | 3437 | 875 | 2562 |
Caffe-CLGreenTea | ConvolutionLayer | 5620 | 988 | 4632 |
GoogleNet V1 - Input 128x3x224x224
Library | Class | Time (ms) | forward (ms) | backward (ms) |
---|---|---|---|---|
Nervana-fp16 | ConvLayer | 283 | 85 | 197 |
Nervana-fp32 | ConvLayer | 322 | 90 | 232 |
CuDNN[R3]-fp32 | cudnn.SpatialConvolution | 431 | 117 | 313 |
CuDNN[R3]-fp16 | cudnn.SpatialConvolution | 501 | 109 | 392 |
Chainer | Convolution2D | 687 | 189 | 497 |
TensorFlow | conv2d | 1237 | 246 | 991 |
Caffe | ConvolutionLayer | 1935 | 786 | 1148 |
CL-nn (Torch) | SpatialConvolutionMM | 7016 | 3027 | 3988 |
Caffe-CLGreenTea | ConvolutionLayer | 9462 | 746 | 8716 |
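To compare models that were run with different batch sizes, the times above can be converted to throughput. The sketch below assumes the input shape reads as batch x channels x height x width, so the first number is the batch size; `images_per_second` is a hypothetical helper, and the rows are copied from the tables above.

```python
def images_per_second(batch_size, total_ms):
    """Images processed per second for one forward + backward pass."""
    return batch_size / (total_ms / 1000.0)


# (model, library, batch size, total time in ms), taken from the tables above.
ROWS = [
    ("AlexNet", "Nervana-fp16", 128, 92),
    ("Overfeat [fast]", "CuDNN[R3]-fp16", 128, 313),
    ("OxfordNet [Model-A]", "Nervana-fp16", 64, 529),
    ("GoogleNet V1", "Nervana-fp16", 128, 283),
]

for model, library, batch, ms in ROWS:
    rate = images_per_second(batch, ms)
    print(f"{model:20s} {library:15s} {rate:8.0f} img/s")
```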
###Spatial Convolution layer (3D input 3D output, densely connected)
Original Library | Class/Function Benchmarked | Time (ms) | forward (ms) | backward (ms) |
---|---|---|---|---|
fbfft | SpatialConvolutionCuFFT | 256 | 101 | 155 |
cuda-convnet2 * | ConvLayer | 977 | 201 | 776 |
cuda-convnet** | pylearn2.cuda_convnet | 1077 | 312 | 765 |
CuDNN R2 * | cudnn.SpatialConvolution | 1019 | 269 | 750 |
Theano | CorrMM | 1225 | 407 | 818 |
Caffe | ConvolutionLayer | 1231 | 396 | 835 |
Torch-7 | SpatialConvolutionMM | 1295 | 418 | 877 |
DeepCL | ConvolutionLayer | 6280 | 2648 | 3632 |
cherry-picking**** | best per layer | 235 | 79 | 155 |
The table below has NOT been updated for the Titan X. These numbers were measured on a Titan Black and are kept only for informational and legacy purposes.
Original Library | Class/Function Benchmarked | Time (ms) | forward (ms) | backward (ms) |
---|---|---|---|---|
Theano (experimental)*** | conv2d_fft | 1178 | 304 | 874 |
Torch-7 | nn.SpatialConvolutionBHWD | 1892 | 581 | 1311 |
ccv | ccv_convnet_layer | 809+bw | 809 | |
Theano (legacy) | conv2d | 70774 | 3833 | 66941 |
- * indicates that the library was tested with Torch bindings of the specific kernels.
- ** indicates that the library was tested with Pylearn2 bindings.
- *** This is an experimental module that uses FFT to compute convolutions. According to @benanne, it uses a lot of memory.
- **** The last row shows results obtainable when choosing the best-performing library for each layer.
- L1 - Input: 128x128, Batch-size: 128, Feature maps: 3->96, Kernel Size: 11x11, Stride: 1x1
- L2 - Input: 64x64, Batch-size: 128, Feature maps: 64->128, Kernel Size: 9x9, Stride: 1x1
- L3 - Input: 32x32, Batch-size: 128, Feature maps: 128->128, Kernel Size: 9x9, Stride: 1x1
- L4 - Input: 16x16, Batch-size: 128, Feature maps: 128->128, Kernel Size: 7x7, Stride: 1x1
- L5 - Input: 13x13, Batch-size: 128, Feature maps: 384->384, Kernel Size: 3x3, Stride: 1x1
- The table is ranked by the total time of the forward + backward calls, summed over the layers (L1 + L2 + L3 + L4 + L5).
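To get a feel for how much work each of the five layer configurations represents, the sketch below computes the output size and the number of forward multiply-accumulates per layer. It assumes no padding, which the layer descriptions do not specify, so the derived sizes are illustrative only. `LayerSpec` and its fields are hypothetical names introduced for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerSpec:
    size: int      # input height and width (square input)
    batch: int     # batch size
    c_in: int      # input feature maps
    c_out: int     # output feature maps
    kernel: int    # kernel height and width (square kernel)
    stride: int = 1

    def out_size(self):
        # Assumption: no padding ("valid" convolution).
        return (self.size - self.kernel) // self.stride + 1

    def forward_macs(self):
        # One multiply-accumulate per (output pixel, c_out, c_in, kernel tap).
        o = self.out_size()
        return self.batch * self.c_out * o * o * self.c_in * self.kernel ** 2


LAYERS = {
    "L1": LayerSpec(128, 128, 3, 96, 11),
    "L2": LayerSpec(64, 128, 64, 128, 9),
    "L3": LayerSpec(32, 128, 128, 128, 9),
    "L4": LayerSpec(16, 128, 128, 128, 7),
    "L5": LayerSpec(13, 128, 384, 384, 3),
}

for name, spec in LAYERS.items():
    gmacs = spec.forward_macs() / 1e9
    print(f"{name}: output {spec.out_size()}x{spec.out_size()}, {gmacs:.1f} GMACs forward")
```

Under the no-padding assumption, L2 is by far the most expensive layer, which fits with it dominating the per-layer times for most libraries.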
#####Breakdown
Forward pass: columns L1, L2, L3, L4, L5, Total are times in milliseconds
Original Library | Class/Function Benchmarked | L1 | L2 | L3 | L4 | L5 | Total |
---|---|---|---|---|---|---|---|
fbfft | SpatialConvolutionCuFFT | 57 | 27 | 6 | 2 | 9 | 101 |
cuda-convnet2 * | ConvLayer | 36 | 113 | 40 | 4 | 8 | 201 |
cuda-convnet** | pylearn2.cuda_convnet | 38 | 183 | 68 | 7 | 16 | 312 |
CuDNN R2 | cudnn.SpatialConvolution | 56 | 143 | 53 | 6 | 11 | 269 |
Theano | CorrMM | 91 | 143 | 121 | 24 | 28 | 407 |
Caffe | ConvolutionLayer<Dtype> | 93 | 136 | 116 | 24 | 27 | 396 |
Torch-7 | nn.SpatialConvolutionMM | 94 | 149 | 123 | 24 | 28 | 418 |
DeepCL | ConvolutionLayer | 738 | 1241 | 518 | 47 | 104 | 2648 |
cherry-picking**** | best per layer | 36 | 27 | 6 | 2 | 8 | 79 |
Backward pass: columns L1, L2, L3, L4, L5, Total are times in milliseconds
Original Library | Class/Function Benchmarked | L1 | L2 | L3 | L4 | L5 | Total |
---|---|---|---|---|---|---|---|
fbfft | SpatialConvolutionCuFFT | 76 | 45 | 12 | 4 | 18 | 155 |
cuda-convnet2 * | ConvLayer | 103 | 467 | 162 | 15 | 29 | 776 |
cuda-convnet** | pylearn2.cuda_convnet | 136 | 433 | 147 | 15 | 34 | 765 |
CuDNN R2 | cudnn.SpatialConvolution | 139 | 401 | 159 | 19 | 32 | 750 |
Theano | CorrMM | 179 | 405 | 174 | 29 | 31 | 818 |
Caffe | ConvolutionLayer<Dtype> | 200 | 405 | 172 | 28 | 30 | 835 |
Torch-7 | nn.SpatialConvolutionMM | 206 | 432 | 178 | 29 | 32 | 877 |
DeepCL | ConvolutionLayer | 484 | 2144 | 747 | 59 | 198 | 3632 |
cherry-picking**** | best per layer | 76 | 45 | 12 | 4 | 18 | 155 |
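The cherry-picking rows can be reproduced from the per-layer numbers by taking the fastest library for each layer and summing. The sketch below does this for both breakdown tables; `cherry_pick` is a hypothetical helper, and the data is copied from the tables above.

```python
# Per-layer times in ms for L1..L5, copied from the breakdown tables.
FORWARD_MS = {
    "fbfft": [57, 27, 6, 2, 9],
    "cuda-convnet2": [36, 113, 40, 4, 8],
    "cuda-convnet": [38, 183, 68, 7, 16],
    "CuDNN R2": [56, 143, 53, 6, 11],
    "Theano": [91, 143, 121, 24, 28],
    "Caffe": [93, 136, 116, 24, 27],
    "Torch-7": [94, 149, 123, 24, 28],
    "DeepCL": [738, 1241, 518, 47, 104],
}
BACKWARD_MS = {
    "fbfft": [76, 45, 12, 4, 18],
    "cuda-convnet2": [103, 467, 162, 15, 29],
    "cuda-convnet": [136, 433, 147, 15, 34],
    "CuDNN R2": [139, 401, 159, 19, 32],
    "Theano": [179, 405, 174, 29, 31],
    "Caffe": [200, 405, 172, 28, 30],
    "Torch-7": [206, 432, 178, 29, 32],
    "DeepCL": [484, 2144, 747, 59, 198],
}


def cherry_pick(times):
    """Return ([(library, ms) per layer], total_ms) using the fastest library per layer."""
    n_layers = len(next(iter(times.values())))
    picks = []
    for i in range(n_layers):
        best = min(times, key=lambda name: times[name][i])
        picks.append((best, times[best][i]))
    return picks, sum(ms for _, ms in picks)


fwd_picks, fwd_total = cherry_pick(FORWARD_MS)
bwd_picks, bwd_total = cherry_pick(BACKWARD_MS)
print("forward :", fwd_picks, fwd_total)
print("backward:", bwd_picks, bwd_total)
```

For the forward pass the best choice alternates between cuda-convnet2 (L1, L5) and fbfft (L2 to L4), while for the backward pass fbfft is fastest on every layer, so cherry-picking gives no gain over fbfft alone there.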