use vendored version of cupy.pad with added performance optimizations (#482)

## Overview

This version provides faster, elementwise kernel implementations for common padding modes. It is under `_vendored` because most of `pad.py` is copied from CuPy itself. The only new part there is the `_use_elementwise_kernel` utility and the conditional branch where it evaluates to True. The newly written code is mostly in `pad_elementwise.py`.

I could potentially further refactor `pad.py` to remove most of the code and just call out to `cupy.pad` instead whenever we aren't using the elementwise kernels. This version should also be submitted upstream to CuPy itself.

Padding performance is substantially improved for modes `edge`, `symmetric`, `reflect` and `wrap`. In most places where cuCIM uses padding, it is not the bottleneck, but this change should still provide a small performance improvement in several places. I ran some benchmarks, and the largest impact I saw was a reduction of around 25% in run time for `chan_vese`.

## Benchmark Results (vs. `cupy.pad`)

In the following, the next-to-last column is the overall acceleration observed. It is large for small 2D or 3D images (>5x) and becomes relatively small for larger images (e.g. ~10% for 4k images).

The final column only relates to the amount of time spent on the host. That "accel. CPU" number always strongly favors the new implementation. It has lower host overhead because everything is done in a single kernel call rather than potentially using multiple kernels for each axis in turn. This kernel launch overhead explains why the overall benefit is much higher for the smaller image sizes.

shape | pad_width | dtype | mode | order | duration, old (ms) | duration, new (ms) | accel. | accel. CPU
------|-----------|-------|------|-------|--------------------|--------------------|--------|-----------
(256, 256) | 2 | uint8 | edge | C | 0.1278 | 0.0230 | 5.563 | 6.298
(256, 256) | 2 | uint8 | symmetric | C | 0.1286 | 0.0230 | 5.583 | 6.268
(256, 256) | 2 | uint8 | reflect | C | 0.1294 | 0.0236 | 5.479 | 6.165
(256, 256) | 2 | uint8 | wrap | C | 0.1246 | 0.0228 | 5.468 | 6.149
(256, 256) | 16 | uint8 | edge | C | 0.1276 | 0.0229 | 5.563 | 6.269
(256, 256) | 16 | uint8 | symmetric | C | 0.1305 | 0.0231 | 5.645 | 6.366
(256, 256) | 16 | uint8 | reflect | C | 0.1300 | 0.0235 | 5.539 | 6.220
(256, 256) | 16 | uint8 | wrap | C | 0.1270 | 0.0228 | 5.568 | 6.268
(256, 256) | 2 | uint8 | edge | F | 0.1300 | 0.0234 | 5.567 | 6.281
(256, 256) | 2 | uint8 | symmetric | F | 0.1291 | 0.0236 | 5.471 | 6.157
(256, 256) | 2 | uint8 | reflect | F | 0.1294 | 0.0238 | 5.427 | 6.080
(256, 256) | 2 | uint8 | wrap | F | 0.1254 | 0.0234 | 5.363 | 6.043
(256, 256) | 16 | uint8 | edge | F | 0.1279 | 0.0232 | 5.506 | 6.315
(256, 256) | 16 | uint8 | symmetric | F | 0.1294 | 0.0236 | 5.472 | 6.319
(256, 256) | 16 | uint8 | reflect | F | 0.1300 | 0.0239 | 5.434 | 6.262
(256, 256) | 16 | uint8 | wrap | F | 0.1262 | 0.0238 | 5.310 | 6.134
(1024, 1024) | 2 | uint8 | edge | C | 0.1279 | 0.0255 | 5.020 | 6.287
(1024, 1024) | 2 | uint8 | symmetric | C | 0.1285 | 0.0258 | 4.980 | 6.259
(1024, 1024) | 2 | uint8 | reflect | C | 0.1286 | 0.0263 | 4.888 | 6.118
(1024, 1024) | 2 | uint8 | wrap | C | 0.1253 | 0.0255 | 4.905 | 6.170
(1024, 1024) | 16 | uint8 | edge | C | 0.1277 | 0.0258 | 4.947 | 6.270
(1024, 1024) | 16 | uint8 | symmetric | C | 0.1286 | 0.0261 | 4.931 | 6.296
(1024, 1024) | 16 | uint8 | reflect | C | 0.1280 | 0.0264 | 4.845 | 6.132
(1024, 1024) | 16 | uint8 | wrap | C | 0.1249 | 0.0260 | 4.798 | 6.095
(1024, 1024) | 2 | uint8 | edge | F | 0.1289 | 0.0581 | 2.217 | 6.084
(1024, 1024) | 2 | uint8 | symmetric | F | 0.1304 | 0.0586 | 2.227 | 6.064
(1024, 1024) | 2 | uint8 | reflect | F | 0.1331 | 0.0590 | 2.257 | 6.059
(1024, 1024) | 2 | uint8 | wrap | F | 0.1278 | 0.0586 | 2.180 | 5.994
(1024, 1024) | 16 | uint8 | edge | F | 0.1299 | 0.0604 | 2.149 | 6.238
(1024, 1024) | 16 | uint8 | symmetric | F | 0.1315 | 0.0607 | 2.168 | 6.255
(1024, 1024) | 16 | uint8 | reflect | F | 0.1309 | 0.0614 | 2.133 | 6.070
(1024, 1024) | 16 | uint8 | wrap | F | 0.1275 | 0.0606 | 2.103 | 6.105
(4096, 4096) | 2 | uint8 | edge | C | 0.1291 | 0.1143 | 1.130 | 6.202
(4096, 4096) | 2 | uint8 | symmetric | C | 0.1296 | 0.1132 | 1.145 | 6.183
(4096, 4096) | 2 | uint8 | reflect | C | 0.1295 | 0.1151 | 1.125 | 6.064
(4096, 4096) | 2 | uint8 | wrap | C | 0.1266 | 0.1138 | 1.112 | 6.029
(4096, 4096) | 16 | uint8 | edge | C | 0.1295 | 0.1157 | 1.119 | 6.212
(4096, 4096) | 16 | uint8 | symmetric | C | 0.1301 | 0.1150 | 1.131 | 6.208
(4096, 4096) | 16 | uint8 | reflect | C | 0.1302 | 0.1168 | 1.115 | 6.088
(4096, 4096) | 16 | uint8 | wrap | C | 0.1272 | 0.1153 | 1.103 | 6.065
(4096, 4096) | 2 | uint8 | edge | F | 0.6624 | 0.6433 | 1.030 | 6.228
(4096, 4096) | 2 | uint8 | symmetric | F | 0.6639 | 0.6438 | 1.031 | 6.133
(4096, 4096) | 2 | uint8 | reflect | F | 0.6640 | 0.6441 | 1.031 | 6.003
(4096, 4096) | 2 | uint8 | wrap | F | 0.6638 | 0.6454 | 1.028 | 6.037
(4096, 4096) | 16 | uint8 | edge | F | 0.6909 | 0.6713 | 1.029 | 6.318
(4096, 4096) | 16 | uint8 | symmetric | F | 0.6915 | 0.6717 | 1.029 | 6.229
(4096, 4096) | 16 | uint8 | reflect | F | 0.6919 | 0.6724 | 1.029 | 6.082
(4096, 4096) | 16 | uint8 | wrap | F | 0.6923 | 0.6720 | 1.030 | 6.136
(40, 40, 40) | 2 | uint8 | edge | C | 0.2057 | 0.0239 | 8.610 | 9.765
(40, 40, 40) | 2 | uint8 | symmetric | C | 0.2014 | 0.0241 | 8.357 | 9.450
(40, 40, 40) | 2 | uint8 | reflect | C | 0.1999 | 0.0245 | 8.169 | 9.227
(40, 40, 40) | 2 | uint8 | wrap | C | 0.1969 | 0.0237 | 8.299 | 9.405
(40, 40, 40) | 16 | uint8 | edge | C | 0.2028 | 0.0235 | 8.633 | 9.760
(40, 40, 40) | 16 | uint8 | symmetric | C | 0.2000 | 0.0255 | 7.844 | 9.502
(40, 40, 40) | 16 | uint8 | reflect | C | 0.1988 | 0.0250 | 7.946 | 9.339
(40, 40, 40) | 16 | uint8 | wrap | C | 0.1948 | 0.0248 | 7.871 | 9.371
(40, 40, 40) | 2 | uint8 | edge | F | 0.1980 | 0.0248 | 7.994 | 9.322
(40, 40, 40) | 2 | uint8 | symmetric | F | 0.1963 | 0.0250 | 7.840 | 9.159
(40, 40, 40) | 2 | uint8 | reflect | F | 0.1952 | 0.0253 | 7.729 | 8.985
(40, 40, 40) | 2 | uint8 | wrap | F | 0.1898 | 0.0251 | 7.567 | 8.847
(40, 40, 40) | 16 | uint8 | edge | F | 0.1997 | 0.0331 | 6.035 | 9.161
(40, 40, 40) | 16 | uint8 | symmetric | F | 0.1964 | 0.0349 | 5.622 | 8.393
(40, 40, 40) | 16 | uint8 | reflect | F | 0.1967 | 0.0339 | 5.793 | 8.808
(40, 40, 40) | 16 | uint8 | wrap | F | 0.1924 | 0.0334 | 5.762 | 8.909
(100, 100, 100) | 2 | uint8 | edge | C | 0.2042 | 0.0288 | 7.101 | 9.676
(100, 100, 100) | 2 | uint8 | symmetric | C | 0.1994 | 0.0317 | 6.294 | 9.334
(100, 100, 100) | 2 | uint8 | reflect | C | 0.2007 | 0.0302 | 6.634 | 9.287
(100, 100, 100) | 2 | uint8 | wrap | C | 0.1946 | 0.0308 | 6.315 | 9.179
(100, 100, 100) | 16 | uint8 | edge | C | 0.2023 | 0.0369 | 5.483 | 9.468
(100, 100, 100) | 16 | uint8 | symmetric | C | 0.2012 | 0.0451 | 4.465 | 9.197
(100, 100, 100) | 16 | uint8 | reflect | C | 0.2006 | 0.0411 | 4.886 | 9.073
(100, 100, 100) | 16 | uint8 | wrap | C | 0.1958 | 0.0429 | 4.561 | 9.060
(100, 100, 100) | 2 | uint8 | edge | F | 0.1996 | 0.0630 | 3.167 | 9.158
(100, 100, 100) | 2 | uint8 | symmetric | F | 0.1962 | 0.0636 | 3.084 | 8.816
(100, 100, 100) | 2 | uint8 | reflect | F | 0.1957 | 0.0638 | 3.068 | 8.725
(100, 100, 100) | 2 | uint8 | wrap | F | 0.1908 | 0.0636 | 2.999 | 8.719
(100, 100, 100) | 16 | uint8 | edge | F | 0.2041 | 0.0995 | 2.052 | 9.048
(100, 100, 100) | 16 | uint8 | symmetric | F | 0.2055 | 0.1039 | 1.978 | 8.858
(100, 100, 100) | 16 | uint8 | reflect | F | 0.2040 | 0.1038 | 1.965 | 8.821
(100, 100, 100) | 16 | uint8 | wrap | F | 0.1989 | 0.1071 | 1.858 | 8.757
(256, 256, 256) | 2 | uint8 | edge | C | 0.2063 | 0.1495 | 1.380 | 9.652
(256, 256, 256) | 2 | uint8 | symmetric | C | 0.2065 | 0.1613 | 1.280 | 9.647
(256, 256, 256) | 2 | uint8 | reflect | C | 0.2055 | 0.1540 | 1.334 | 9.328
(256, 256, 256) | 2 | uint8 | wrap | C | 0.1997 | 0.1569 | 1.273 | 9.326
(256, 256, 256) | 16 | uint8 | edge | C | 0.2090 | 0.1973 | 1.060 | 9.704
(256, 256, 256) | 16 | uint8 | symmetric | C | 0.2113 | 0.2419 | 0.873 | 9.573
(256, 256, 256) | 16 | uint8 | reflect | C | 0.2131 | 0.2124 | 1.003 | 9.351
(256, 256, 256) | 16 | uint8 | wrap | C | 0.2076 | 0.2311 | 0.899 | 9.410

Authors:
  - Gregory Lee (https://github.com/grlee77)
  - https://github.com/jakirkham

Approvers:
  - Gigon Bae (https://github.com/gigony)

URL: #482
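
To illustrate why a single elementwise pass is possible for the `edge`, `symmetric`, `reflect` and `wrap` modes, the following is a minimal host-side sketch in NumPy. It is not the code in `pad_elementwise.py`: the helper names `source_index` and `pad_single_pass` are hypothetical, and the sketch assumes the same scalar `pad_width` on every side. The idea it shows is that each output element can be computed independently by mapping its coordinate back to one input coordinate per axis, so no per-axis intermediate arrays are needed.

```python
import numpy as np


def source_index(i, n, mode):
    """Map coordinates i (relative to an unpadded axis of length n) into [0, n).

    Hypothetical helper for illustration only.
    """
    if mode == "edge":
        # Clamp to the nearest valid index.
        return np.clip(i, 0, n - 1)
    if mode == "wrap":
        # Periodic extension with period n.
        return np.mod(i, n)
    if mode == "symmetric":
        # Mirror including the edge sample: period 2n.
        period = 2 * n
        j = np.mod(i, period)
        return np.where(j >= n, period - 1 - j, j)
    if mode == "reflect":
        # Mirror excluding the edge sample: period 2n - 2.
        if n == 1:
            return np.zeros_like(i)
        period = 2 * n - 2
        j = np.mod(i, period)
        return np.where(j >= n, period - j, j)
    raise ValueError(f"unsupported mode: {mode}")


def pad_single_pass(arr, pad_width, mode):
    """Pad all axes at once by gathering from the input (illustrative sketch)."""
    out_shape = tuple(n + 2 * pad_width for n in arr.shape)
    # For each axis, shift output coordinates into the input frame and map
    # them to valid input indices.
    idx = [
        source_index(np.arange(m) - pad_width, n, mode)
        for m, n in zip(out_shape, arr.shape)
    ]
    # One gather over all axes; a GPU elementwise kernel would do the same
    # index arithmetic per output element.
    return arr[np.ix_(*idx)]


if __name__ == "__main__":
    a = np.arange(20, dtype=np.uint8).reshape(4, 5)
    for mode in ("edge", "symmetric", "reflect", "wrap"):
        same = np.array_equal(pad_single_pass(a, 2, mode), np.pad(a, 2, mode=mode))
        print(mode, same)
```

Because the whole output comes from one gather, a GPU version of this approach needs only one kernel launch regardless of the number of axes, which is consistent with the lower host overhead reported in the "accel. CPU" column.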