use vendored version of cupy.pad with added performance optimizations (#482)

## Overview 

This version provides faster, elementwise kernel implementations for common padding modes. 

It is under `_vendored` because most of `pad.py` is copied from CuPy itself. The only new part there is the `_use_elementwise_kernel` utility and the conditional branch where it evaluates to True. The newly written code is mostly in `pad_elementwise.py`.

I could potentially further refactor `pad.py` to remove most of the code and just call out to `cupy.pad` instead whenever we aren't using the elementwise kernels. 
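The dispatch described above can be sketched roughly as follows. This is only an illustration under stated assumptions: `pad_dispatch`, `ELEMENTWISE_MODES`, `fast_pad` and `generic_pad` are hypothetical names, the real `_use_elementwise_kernel` check in `pad.py` may test different conditions, and the two callables stand in for the elementwise kernels and for `cupy.pad`.

```python
# Hypothetical sketch of the dispatch idea; not the actual code in pad.py.
# The four modes listed here are the ones this commit reports as accelerated.
ELEMENTWISE_MODES = frozenset({"edge", "symmetric", "reflect", "wrap"})


def pad_dispatch(array, pad_width, mode="constant", *,
                 fast_pad, generic_pad, **kwargs):
    """Route to the fast path when the mode allows it, else fall back.

    ``fast_pad`` and ``generic_pad`` are injected callables (assumptions):
    the first stands in for the elementwise-kernel path, the second for
    the generic implementation such as ``cupy.pad``.
    """
    # Assumption: the fast path handles only the plain form of each mode,
    # so any extra keyword argument sends the call to the generic path.
    if mode in ELEMENTWISE_MODES and not kwargs:
        return fast_pad(array, pad_width, mode)
    return generic_pad(array, pad_width, mode, **kwargs)
```

Injecting the two callables keeps the sketch runnable without a GPU; in a refactored `pad.py` the fallback branch would presumably just call `cupy.pad`.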

This version should also be submitted upstream to CuPy itself.

Padding performance is substantially improved for the modes `edge`, `symmetric`, `reflect` and `wrap`. In most places where cuCIM uses padding it is not the bottleneck, but the change should still provide a small performance improvement in several places. In the benchmarks I ran, the largest impact was around a 25% reduction in run time for `chan_vese`.
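To make the four modes concrete, here is a minimal pure-Python sketch of the per-axis index mapping that an elementwise approach relies on: each output position computes which input element it reads. The names `source_index` and `pad_1d` are hypothetical, and this follows the usual `numpy.pad` semantics for these modes rather than reproducing the code in `pad_elementwise.py`.

```python
def source_index(i, n, mode):
    """Map coordinate ``i`` (relative to the unpadded origin, so it may be
    negative or >= n) to a valid index in ``[0, n)`` for the given mode."""
    if mode == "edge":
        # Clamp to the nearest valid index.
        return min(max(i, 0), n - 1)
    if mode == "wrap":
        # Periodic extension with period n.
        return i % n
    if mode == "symmetric":
        # Mirror about the array edge; the edge sample is repeated.
        period = 2 * n
        j = i % period
        return j if j < n else period - 1 - j
    if mode == "reflect":
        # Mirror about the edge sample; the edge sample is not repeated.
        if n == 1:
            return 0
        period = 2 * n - 2
        j = i % period
        return j if j < n else period - j
    raise ValueError(f"unsupported mode: {mode}")


def pad_1d(values, pad_width, mode):
    """Pad a list on both sides by ``pad_width`` using ``source_index``."""
    n = len(values)
    return [values[source_index(i - pad_width, n, mode)]
            for i in range(n + 2 * pad_width)]


print(pad_1d([1, 2, 3], 2, "reflect"))    # [3, 2, 1, 2, 3, 2, 1]
print(pad_1d([1, 2, 3], 2, "symmetric"))  # [2, 1, 1, 2, 3, 3, 2]
```

Because every output element is computed independently from a closed-form index, the same mapping can be evaluated by one GPU thread per output element.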


## Benchmark Results (vs. `cupy.pad`)

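In the table below, the next-to-last column ("accel.") is the overall acceleration observed. It is large for small 2D or 3D images (>5x) and becomes relatively small for larger images (e.g. roughly 10% for C-ordered 4k images and about 3% for F-ordered ones). In the largest 3D case, (256, 256, 256) with `pad_width` 16, the `symmetric` and `wrap` modes are slightly slower than `cupy.pad` (0.873 and 0.899).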

The final column only relates to the amount of time spent on the host. That "accel. CPU" number always strongly favors the new implementation. It has lower host overhead because everything is done in a single kernel call rather than potentially using multiple kernels for each axis in turn. This kernel launch overhead explains why the overall benefit is much higher for the smaller image sizes.
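The single-pass idea can be illustrated with a small pure-Python sketch: instead of padding one axis at a time (one launch per axis), every output element maps all of its coordinates back to the source at once. The names `clamp` and `pad_single_pass` are hypothetical, only the `edge` mapping is shown, and a uniform `pad_width` is assumed; the loop body plays the role that a single GPU thread would play in an elementwise kernel.

```python
from itertools import product


def clamp(i, n):
    """'edge' mode: clamp a coordinate to the valid range [0, n)."""
    return min(max(i, 0), n - 1)


def pad_single_pass(src, shape, pad_width, index_map=clamp):
    """Pad a flat, C-ordered list ``src`` of the given ``shape`` in one pass.

    Returns the padded data as a flat C-ordered list plus its shape.
    """
    out_shape = tuple(n + 2 * pad_width for n in shape)

    # C-order strides of the source array, in elements.
    strides = []
    stride = 1
    for n in reversed(shape):
        strides.append(stride)
        stride *= n
    strides.reverse()

    out = []
    # product() varies the last axis fastest, i.e. C order.
    for coords in product(*(range(m) for m in out_shape)):
        # Map every axis of this output element in the same step.
        offset = sum(index_map(c - pad_width, n) * s
                     for c, n, s in zip(coords, shape, strides))
        out.append(src[offset])
    return out, out_shape


flat, shape = pad_single_pass([1, 2, 3, 4], (2, 2), 1)
print(shape)  # (4, 4)
print(flat)   # [1, 1, 2, 2, 1, 1, 2, 2, 3, 3, 4, 4, 3, 3, 4, 4]
```

With this structure the host issues one launch regardless of dimensionality, which is consistent with the roughly constant "accel. CPU" values within the 2D rows and within the 3D rows of the table.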

shape | pad_width | dtype | mode | order | duration, old (ms) | duration, new (ms) | accel. | accel. CPU
------|-----------|-------|------|-------|--------------------|--------------------|--------|-----------
(256, 256) | 2 | uint8 | edge | C | 0.1278 | 0.0230 | 5.563 | 6.298
(256, 256) | 2 | uint8 | symmetric | C | 0.1286 | 0.0230 | 5.583 | 6.268
(256, 256) | 2 | uint8 | reflect | C | 0.1294 | 0.0236 | 5.479 | 6.165
(256, 256) | 2 | uint8 | wrap | C | 0.1246 | 0.0228 | 5.468 | 6.149
(256, 256) | 16 | uint8 | edge | C | 0.1276 | 0.0229 | 5.563 | 6.269
(256, 256) | 16 | uint8 | symmetric | C | 0.1305 | 0.0231 | 5.645 | 6.366
(256, 256) | 16 | uint8 | reflect | C | 0.1300 | 0.0235 | 5.539 | 6.220
(256, 256) | 16 | uint8 | wrap | C | 0.1270 | 0.0228 | 5.568 | 6.268
(256, 256) | 2 | uint8 | edge | F | 0.1300 | 0.0234 | 5.567 | 6.281
(256, 256) | 2 | uint8 | symmetric | F | 0.1291 | 0.0236 | 5.471 | 6.157
(256, 256) | 2 | uint8 | reflect | F | 0.1294 | 0.0238 | 5.427 | 6.080
(256, 256) | 2 | uint8 | wrap | F | 0.1254 | 0.0234 | 5.363 | 6.043
(256, 256) | 16 | uint8 | edge | F | 0.1279 | 0.0232 | 5.506 | 6.315
(256, 256) | 16 | uint8 | symmetric | F | 0.1294 | 0.0236 | 5.472 | 6.319
(256, 256) | 16 | uint8 | reflect | F | 0.1300 | 0.0239 | 5.434 | 6.262
(256, 256) | 16 | uint8 | wrap | F | 0.1262 | 0.0238 | 5.310 | 6.134
(1024, 1024) | 2 | uint8 | edge | C | 0.1279 | 0.0255 | 5.020 | 6.287
(1024, 1024) | 2 | uint8 | symmetric | C | 0.1285 | 0.0258 | 4.980 | 6.259
(1024, 1024) | 2 | uint8 | reflect | C | 0.1286 | 0.0263 | 4.888 | 6.118
(1024, 1024) | 2 | uint8 | wrap | C | 0.1253 | 0.0255 | 4.905 | 6.170
(1024, 1024) | 16 | uint8 | edge | C | 0.1277 | 0.0258 | 4.947 | 6.270
(1024, 1024) | 16 | uint8 | symmetric | C | 0.1286 | 0.0261 | 4.931 | 6.296
(1024, 1024) | 16 | uint8 | reflect | C | 0.1280 | 0.0264 | 4.845 | 6.132
(1024, 1024) | 16 | uint8 | wrap | C | 0.1249 | 0.0260 | 4.798 | 6.095
(1024, 1024) | 2 | uint8 | edge | F | 0.1289 | 0.0581 | 2.217 | 6.084
(1024, 1024) | 2 | uint8 | symmetric | F | 0.1304 | 0.0586 | 2.227 | 6.064
(1024, 1024) | 2 | uint8 | reflect | F | 0.1331 | 0.0590 | 2.257 | 6.059
(1024, 1024) | 2 | uint8 | wrap | F | 0.1278 | 0.0586 | 2.180 | 5.994
(1024, 1024) | 16 | uint8 | edge | F | 0.1299 | 0.0604 | 2.149 | 6.238
(1024, 1024) | 16 | uint8 | symmetric | F | 0.1315 | 0.0607 | 2.168 | 6.255
(1024, 1024) | 16 | uint8 | reflect | F | 0.1309 | 0.0614 | 2.133 | 6.070
(1024, 1024) | 16 | uint8 | wrap | F | 0.1275 | 0.0606 | 2.103 | 6.105
(4096, 4096) | 2 | uint8 | edge | C | 0.1291 | 0.1143 | 1.130 | 6.202
(4096, 4096) | 2 | uint8 | symmetric | C | 0.1296 | 0.1132 | 1.145 | 6.183
(4096, 4096) | 2 | uint8 | reflect | C | 0.1295 | 0.1151 | 1.125 | 6.064
(4096, 4096) | 2 | uint8 | wrap | C | 0.1266 | 0.1138 | 1.112 | 6.029
(4096, 4096) | 16 | uint8 | edge | C | 0.1295 | 0.1157 | 1.119 | 6.212
(4096, 4096) | 16 | uint8 | symmetric | C | 0.1301 | 0.1150 | 1.131 | 6.208
(4096, 4096) | 16 | uint8 | reflect | C | 0.1302 | 0.1168 | 1.115 | 6.088
(4096, 4096) | 16 | uint8 | wrap | C | 0.1272 | 0.1153 | 1.103 | 6.065
(4096, 4096) | 2 | uint8 | edge | F | 0.6624 | 0.6433 | 1.030 | 6.228
(4096, 4096) | 2 | uint8 | symmetric | F | 0.6639 | 0.6438 | 1.031 | 6.133
(4096, 4096) | 2 | uint8 | reflect | F | 0.6640 | 0.6441 | 1.031 | 6.003
(4096, 4096) | 2 | uint8 | wrap | F | 0.6638 | 0.6454 | 1.028 | 6.037
(4096, 4096) | 16 | uint8 | edge | F | 0.6909 | 0.6713 | 1.029 | 6.318
(4096, 4096) | 16 | uint8 | symmetric | F | 0.6915 | 0.6717 | 1.029 | 6.229
(4096, 4096) | 16 | uint8 | reflect | F | 0.6919 | 0.6724 | 1.029 | 6.082
(4096, 4096) | 16 | uint8 | wrap | F | 0.6923 | 0.6720 | 1.030 | 6.136
(40, 40, 40) | 2 | uint8 | edge | C | 0.2057 | 0.0239 | 8.610 | 9.765
(40, 40, 40) | 2 | uint8 | symmetric | C | 0.2014 | 0.0241 | 8.357 | 9.450
(40, 40, 40) | 2 | uint8 | reflect | C | 0.1999 | 0.0245 | 8.169 | 9.227
(40, 40, 40) | 2 | uint8 | wrap | C | 0.1969 | 0.0237 | 8.299 | 9.405
(40, 40, 40) | 16 | uint8 | edge | C | 0.2028 | 0.0235 | 8.633 | 9.760
(40, 40, 40) | 16 | uint8 | symmetric | C | 0.2000 | 0.0255 | 7.844 | 9.502
(40, 40, 40) | 16 | uint8 | reflect | C | 0.1988 | 0.0250 | 7.946 | 9.339
(40, 40, 40) | 16 | uint8 | wrap | C | 0.1948 | 0.0248 | 7.871 | 9.371
(40, 40, 40) | 2 | uint8 | edge | F | 0.1980 | 0.0248 | 7.994 | 9.322
(40, 40, 40) | 2 | uint8 | symmetric | F | 0.1963 | 0.0250 | 7.840 | 9.159
(40, 40, 40) | 2 | uint8 | reflect | F | 0.1952 | 0.0253 | 7.729 | 8.985
(40, 40, 40) | 2 | uint8 | wrap | F | 0.1898 | 0.0251 | 7.567 | 8.847
(40, 40, 40) | 16 | uint8 | edge | F | 0.1997 | 0.0331 | 6.035 | 9.161
(40, 40, 40) | 16 | uint8 | symmetric | F | 0.1964 | 0.0349 | 5.622 | 8.393
(40, 40, 40) | 16 | uint8 | reflect | F | 0.1967 | 0.0339 | 5.793 | 8.808
(40, 40, 40) | 16 | uint8 | wrap | F | 0.1924 | 0.0334 | 5.762 | 8.909
(100, 100, 100) | 2 | uint8 | edge | C | 0.2042 | 0.0288 | 7.101 | 9.676
(100, 100, 100) | 2 | uint8 | symmetric | C | 0.1994 | 0.0317 | 6.294 | 9.334
(100, 100, 100) | 2 | uint8 | reflect | C | 0.2007 | 0.0302 | 6.634 | 9.287
(100, 100, 100) | 2 | uint8 | wrap | C | 0.1946 | 0.0308 | 6.315 | 9.179
(100, 100, 100) | 16 | uint8 | edge | C | 0.2023 | 0.0369 | 5.483 | 9.468
(100, 100, 100) | 16 | uint8 | symmetric | C | 0.2012 | 0.0451 | 4.465 | 9.197
(100, 100, 100) | 16 | uint8 | reflect | C | 0.2006 | 0.0411 | 4.886 | 9.073
(100, 100, 100) | 16 | uint8 | wrap | C | 0.1958 | 0.0429 | 4.561 | 9.060
(100, 100, 100) | 2 | uint8 | edge | F | 0.1996 | 0.0630 | 3.167 | 9.158
(100, 100, 100) | 2 | uint8 | symmetric | F | 0.1962 | 0.0636 | 3.084 | 8.816
(100, 100, 100) | 2 | uint8 | reflect | F | 0.1957 | 0.0638 | 3.068 | 8.725
(100, 100, 100) | 2 | uint8 | wrap | F | 0.1908 | 0.0636 | 2.999 | 8.719
(100, 100, 100) | 16 | uint8 | edge | F | 0.2041 | 0.0995 | 2.052 | 9.048
(100, 100, 100) | 16 | uint8 | symmetric | F | 0.2055 | 0.1039 | 1.978 | 8.858
(100, 100, 100) | 16 | uint8 | reflect | F | 0.2040 | 0.1038 | 1.965 | 8.821
(100, 100, 100) | 16 | uint8 | wrap | F | 0.1989 | 0.1071 | 1.858 | 8.757
(256, 256, 256) | 2 | uint8 | edge | C | 0.2063 | 0.1495 | 1.380 | 9.652
(256, 256, 256) | 2 | uint8 | symmetric | C | 0.2065 | 0.1613 | 1.280 | 9.647
(256, 256, 256) | 2 | uint8 | reflect | C | 0.2055 | 0.1540 | 1.334 | 9.328
(256, 256, 256) | 2 | uint8 | wrap | C | 0.1997 | 0.1569 | 1.273 | 9.326
(256, 256, 256) | 16 | uint8 | edge | C | 0.2090 | 0.1973 | 1.060 | 9.704
(256, 256, 256) | 16 | uint8 | symmetric | C | 0.2113 | 0.2419 | 0.873 | 9.573
(256, 256, 256) | 16 | uint8 | reflect | C | 0.2131 | 0.2124 | 1.003 | 9.351
(256, 256, 256) | 16 | uint8 | wrap | C | 0.2076 | 0.2311 | 0.899 | 9.410
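
As a reading aid for the table, the "accel." column appears to be the ratio of the old duration to the new one (an assumption; it matches the listed values up to rounding of the displayed durations). The sketch below recomputes it for two rows and converts it to a run-time reduction; `speedup` and `runtime_reduction` are hypothetical helper names.

```python
def speedup(old_ms, new_ms):
    """Acceleration factor: values above 1 mean the new code is faster."""
    return old_ms / new_ms


def runtime_reduction(old_ms, new_ms):
    """Fraction of run time saved; negative if the new code is slower."""
    return 1.0 - new_ms / old_ms


# Durations copied from two rows of the table above.
rows = {
    "(4096, 4096), pad 2, edge, C": (0.1291, 0.1143),
    "(256, 256, 256), pad 16, symmetric, C": (0.2113, 0.2419),
}

for label, (old_ms, new_ms) in rows.items():
    print(f"{label}: accel {speedup(old_ms, new_ms):.3f}, "
          f"run-time change {-100 * runtime_reduction(old_ms, new_ms):+.1f}%")
```

A factor of about 1.13 therefore corresponds to roughly an 11% shorter run time, while a factor below 1 (as in the second row) means the new implementation was slower for that configuration.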

Authors:
  - Gregory Lee (https://github.com/grlee77)
  - https://github.com/jakirkham

Approvers:
  - Gigon Bae (https://github.com/gigony)

URL: #482
grlee77 authored Feb 2, 2023
1 parent 4b19225 commit 7fd07d0
Showing 23 changed files with 1,132 additions and 65 deletions.
3 changes: 2 additions & 1 deletion python/cucim/src/cucim/core/operations/morphology/_pba_2d.py
@@ -5,6 +5,7 @@

 import cupy

+from cucim.skimage._vendored import pad
 from cucim.skimage._vendored._ndimage_util import _get_inttype

 try:
@@ -352,7 +353,7 @@ def _pba_2d(arr, sampling=None, return_distances=True, return_indices=False,
     orig_sy, orig_sx = arr.shape
     padding_width = _determine_padding(arr.shape, padded_size, block_size)
     if padding_width is not None:
-        arr = cupy.pad(arr, padding_width, mode="constant", constant_values=1)
+        arr = pad(arr, padding_width, mode="constant", constant_values=1)
     size = arr.shape[0]

     input_arr = _pack_int2(arr, marker=marker, int_dtype=int_dtype)
3 changes: 2 additions & 1 deletion python/cucim/src/cucim/core/operations/morphology/_pba_3d.py
@@ -4,6 +4,7 @@
 import cupy
 import numpy as np

+from cucim.skimage._vendored import pad
 from cucim.skimage._vendored._ndimage_util import _get_inttype

 from ._pba_2d import (_check_distances, _check_indices,
@@ -366,7 +367,7 @@ def _pba_3d(arr, sampling=None, return_distances=True, return_indices=False,
         arr.shape, block_size, m1, m2, m3, blockx, blocky
     )
     if padding_width is not None:
-        arr = cupy.pad(arr, padding_width, mode="constant", constant_values=1)
+        arr = pad(arr, padding_width, mode="constant", constant_values=1)
     size = arr.shape[0]

     # pba algorithm was implemented to use 32-bit integer to store compressed
1 change: 1 addition & 0 deletions python/cucim/src/cucim/skimage/_vendored/__init__.py
@@ -4,4 +4,5 @@
 """

 from cucim.skimage._vendored._pearsonr import pearsonr
+from cucim.skimage._vendored.pad import pad
 from cucim.skimage._vendored.signaltools import * # noqa
Original file line number Diff line number Diff line change
@@ -11,6 +11,7 @@
 from cucim.skimage._vendored import \
     _ndimage_spline_prefilter_core as _spline_prefilter_core
 from cucim.skimage._vendored import _ndimage_util as _util
+from cucim.skimage._vendored import pad
 from cucim.skimage._vendored._internal import _normalize_axis_index, prod


@@ -223,7 +224,7 @@ def _prepad_for_spline_filter(input, mode, cval):
             kwargs = dict(mode='constant', constant_values=cval)
         else:
             kwargs = dict(mode='edge')
-        padded = cupy.pad(input, npad, **kwargs)
+        padded = pad(input, npad, **kwargs)
     else:
         npad = 0
         padded = input
@@ -236,7 +237,7 @@ def _filter_input(image, prefilter, mode, cval, order):
     Spline orders > 1 need a prefiltering stage to preserve resolution.
     For boundary modes without analytical spline boundary conditions, some
-    prepadding of the input with cupy.pad is used to maintain accuracy.
+    prepadding of the input with pad is used to maintain accuracy.
     ``npad`` is an integer corresponding to the amount of padding at each edge
     of the array.
     """
@@ -289,8 +290,8 @@ def map_coordinates(input, coordinates, output=None, order=3,
     _check_parameter('map_coordinates', order, mode)

     if mode == 'opencv' or mode == '_opencv_edge':
-        input = cupy.pad(input, [(1, 1)] * input.ndim, 'constant',
-                         constant_values=cval)
+        input = pad(input, [(1, 1)] * input.ndim, 'constant',
+                    constant_values=cval)
         coordinates = cupy.add(coordinates, 1)
         mode = 'constant'