Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Fast FBGEMM path KT.regroup_as (#1910)
Summary: Use custom FBGEMM kernel when possible for inference/training. ~0-75% runtime speedup. Benchmark Results [Forward] [fallback] _regroup_keyed_tenors | B: 512 | F: 80 | device: cuda | Runtime (P90): 0.4 ms | Memory (P90): 24.0 [prod] KeyedTensor.regroup | B: 512 | F: 80 | device: cuda | Runtime (P90): 0.4 ms | Memory (P90): 36.0 [fallback] _regroup_keyed_tenors | B: 512 | F: 160 | device: cuda | Runtime (P90): 0.8 ms | Memory (P90): 48.0 [prod] KeyedTensor.regroup | B: 512 | F: 160 | device: cuda | Runtime (P90): 0.6 ms | Memory (P90): 72.0 [fallback] _regroup_keyed_tenors | B: 512 | F: 320 | device: cuda | Runtime (P90): 1.9 ms | Memory (P90): 96.0 [prod] KeyedTensor.regroup | B: 512 | F: 320 | device: cuda | Runtime (P90): 0.7 ms | Memory (P90): 144.0 [fallback] _regroup_keyed_tenors | B: 512 | F: 640 | device: cuda | Runtime (P90): 4.6 ms | Memory (P90): 192.0 [prod] KeyedTensor.regroup | B: 512 | F: 640 | device: cuda | Runtime (P90): 1.3 ms | Memory (P90): 288.0 [fallback] _regroup_keyed_tenors | B: 512 | F: 1280 | device: cuda | Runtime (P90): 13.2 ms | Memory (P90): 384.0 [prod] KeyedTensor.regroup | B: 512 | F: 1280 | device: cuda | Runtime (P90): 2.2 ms | Memory (P90): 576.0 [fallback] _regroup_keyed_tenors | B: 1024 | F: 80 | device: cuda | Runtime (P90): 0.3 ms | Memory (P90): 48.0 [prod] KeyedTensor.regroup | B: 1024 | F: 80 | device: cuda | Runtime (P90): 0.4 ms | Memory (P90): 72.0 [fallback] _regroup_keyed_tenors | B: 1024 | F: 160 | device: cuda | Runtime (P90): 0.8 ms | Memory (P90): 96.0 [prod] KeyedTensor.regroup | B: 1024 | F: 160 | device: cuda | Runtime (P90): 0.6 ms | Memory (P90): 144.0 [fallback] _regroup_keyed_tenors | B: 1024 | F: 320 | device: cuda | Runtime (P90): 1.8 ms | Memory (P90): 192.0 [prod] KeyedTensor.regroup | B: 1024 | F: 320 | device: cuda | Runtime (P90): 0.9 ms | Memory (P90): 288.0 [fallback] _regroup_keyed_tenors | B: 1024 | F: 640 | device: cuda | Runtime (P90): 4.1 ms | Memory (P90): 384.0 [prod] KeyedTensor.regroup | B: 1024 | F: 640 | device: cuda | Runtime (P90): 1.6 ms | Memory (P90): 576.0 [fallback] _regroup_keyed_tenors | B: 1024 | F: 1280 | device: cuda | Runtime (P90): 12.8 ms | Memory (P90): 768.0 [prod] KeyedTensor.regroup | B: 1024 | F: 1280 | device: cuda | Runtime (P90): 3.1 ms | Memory (P90): 1152.0 [fallback] _regroup_keyed_tenors | B: 2048 | F: 80 | device: cuda | Runtime (P90): 0.4 ms | Memory (P90): 96.0 [prod] KeyedTensor.regroup | B: 2048 | F: 80 | device: cuda | Runtime (P90): 0.5 ms | Memory (P90): 144.0 [fallback] _regroup_keyed_tenors | B: 2048 | F: 160 | device: cuda | Runtime (P90): 0.7 ms | Memory (P90): 192.0 [prod] KeyedTensor.regroup | B: 2048 | F: 160 | device: cuda | Runtime (P90): 0.8 ms | Memory (P90): 288.0 [fallback] _regroup_keyed_tenors | B: 2048 | F: 320 | device: cuda | Runtime (P90): 1.6 ms | Memory (P90): 384.0 [prod] KeyedTensor.regroup | B: 2048 | F: 320 | device: cuda | Runtime (P90): 1.4 ms | Memory (P90): 576.0 [fallback] _regroup_keyed_tenors | B: 2048 | F: 640 | device: cuda | Runtime (P90): 4.8 ms | Memory (P90): 768.0 [prod] KeyedTensor.regroup | B: 2048 | F: 640 | device: cuda | Runtime (P90): 2.8 ms | Memory (P90): 1152.0 [fallback] _regroup_keyed_tenors | B: 2048 | F: 1280 | device: cuda | Runtime (P90): 12.5 ms | Memory (P90): 1536.0 [prod] KeyedTensor.regroup | B: 2048 | F: 1280 | device: cuda | Runtime (P90): 5.6 ms | Memory (P90): 2304.0 [fallback] _regroup_keyed_tenors | B: 4096 | F: 80 | device: cuda | Runtime (P90): 0.4 ms | Memory (P90): 192.0 [prod] KeyedTensor.regroup | B: 4096 | F: 80 | device: cuda | Runtime (P90): 0.8 ms | Memory (P90): 288.0 [fallback] _regroup_keyed_tenors | B: 4096 | F: 160 | device: cuda | Runtime (P90): 0.9 ms | Memory (P90): 384.0 [prod] KeyedTensor.regroup | B: 4096 | F: 160 | device: cuda | Runtime (P90): 1.4 ms | Memory (P90): 576.0 [fallback] _regroup_keyed_tenors | B: 4096 | F: 320 | device: cuda | Runtime (P90): 1.7 ms | Memory (P90): 768.0 [prod] KeyedTensor.regroup | B: 4096 | F: 320 | device: cuda | Runtime (P90): 2.8 ms | Memory (P90): 1152.0 [fallback] _regroup_keyed_tenors | B: 4096 | F: 640 | device: cuda | Runtime (P90): 4.1 ms | Memory (P90): 1536.0 [prod] KeyedTensor.regroup | B: 4096 | F: 640 | device: cuda | Runtime (P90): 5.6 ms | Memory (P90): 2304.0 [fallback] _regroup_keyed_tenors | B: 4096 | F: 1280 | device: cuda | Runtime (P90): 12.2 ms | Memory (P90): 3072.0 [prod] KeyedTensor.regroup | B: 4096 | F: 1280 | device: cuda | Runtime (P90): 11.1 ms | Memory (P90): 4608.0 Benchmark Results [Fowrard + Backward] [prod] KeyedTensor.regroup | B: 512 | F: 80 | device: cuda | Runtime (P90): 2.2 ms | Memory (P90): 72.0 [fallback] _regroup_keyed_tenors | B: 512 | F: 160 | device: cuda | Runtime (P90): 4.7 ms | Memory (P90): 144.0 [prod] KeyedTensor.regroup | B: 512 | F: 160 | device: cuda | Runtime (P90): 3.4 ms | Memory (P90): 144.0 [fallback] _regroup_keyed_tenors | B: 512 | F: 320 | device: cuda | Runtime (P90): 9.0 ms | Memory (P90): 288.0 [prod] KeyedTensor.regroup | B: 512 | F: 320 | device: cuda | Runtime (P90): 6.5 ms | Memory (P90): 288.0 [fallback] _regroup_keyed_tenors | B: 512 | F: 640 | device: cuda | Runtime (P90): 19.9 ms | Memory (P90): 576.0 [prod] KeyedTensor.regroup | B: 512 | F: 640 | device: cuda | Runtime (P90): 11.4 ms | Memory (P90): 576.0 [fallback] _regroup_keyed_tenors | B: 512 | F: 1280 | device: cuda | Runtime (P90): 46.7 ms | Memory (P90): 1152.0 [prod] KeyedTensor.regroup | B: 512 | F: 1280 | device: cuda | Runtime (P90): 23.1 ms | Memory (P90): 1152.0 [fallback] _regroup_keyed_tenors | B: 1024 | F: 80 | device: cuda | Runtime (P90): 2.6 ms | Memory (P90): 144.0 [prod] KeyedTensor.regroup | B: 1024 | F: 80 | device: cuda | Runtime (P90): 2.5 ms | Memory (P90): 144.0 [fallback] _regroup_keyed_tenors | B: 1024 | F: 160 | device: cuda | Runtime (P90): 4.5 ms | Memory (P90): 288.0 [prod] KeyedTensor.regroup | B: 1024 | F: 160 | device: cuda | Runtime (P90): 3.9 ms | Memory (P90): 288.0 [fallback] _regroup_keyed_tenors | B: 1024 | F: 320 | device: cuda | Runtime (P90): 8.8 ms | Memory (P90): 576.0 [prod] KeyedTensor.regroup | B: 1024 | F: 320 | device: cuda | Runtime (P90): 6.7 ms | Memory (P90): 576.0 [fallback] _regroup_keyed_tenors | B: 1024 | F: 640 | device: cuda | Runtime (P90): 18.7 ms | Memory (P90): 1152.0 [prod] KeyedTensor.regroup | B: 1024 | F: 640 | device: cuda | Runtime (P90): 12.2 ms | Memory (P90): 1152.0 [fallback] _regroup_keyed_tenors | B: 1024 | F: 1280 | device: cuda | Runtime (P90): 42.8 ms | Memory (P90): 2304.0 [prod] KeyedTensor.regroup | B: 1024 | F: 1280 | device: cuda | Runtime (P90): 23.1 ms | Memory (P90): 2304.0 [fallback] _regroup_keyed_tenors | B: 2048 | F: 80 | device: cuda | Runtime (P90): 2.5 ms | Memory (P90): 288.0 [prod] KeyedTensor.regroup | B: 2048 | F: 80 | device: cuda | Runtime (P90): 2.4 ms | Memory (P90): 288.0 [fallback] _regroup_keyed_tenors | B: 2048 | F: 160 | device: cuda | Runtime (P90): 4.5 ms | Memory (P90): 576.0 [prod] KeyedTensor.regroup | B: 2048 | F: 160 | device: cuda | Runtime (P90): 4.2 ms | Memory (P90): 576.0 [fallback] _regroup_keyed_tenors | B: 2048 | F: 320 | device: cuda | Runtime (P90): 8.9 ms | Memory (P90): 1152.0 [prod] KeyedTensor.regroup | B: 2048 | F: 320 | device: cuda | Runtime (P90): 7.7 ms | Memory (P90): 1152.0 [fallback] _regroup_keyed_tenors | B: 2048 | F: 640 | device: cuda | Runtime (P90): 19.2 ms | Memory (P90): 2304.0 [prod] KeyedTensor.regroup | B: 2048 | F: 640 | device: cuda | Runtime (P90): 12.9 ms | Memory (P90): 2304.0 [fallback] _regroup_keyed_tenors | B: 2048 | F: 1280 | device: cuda | Runtime (P90): 45.1 ms | Memory (P90): 4608.0 [prod] KeyedTensor.regroup | B: 2048 | F: 1280 | device: cuda | Runtime (P90): 26.4 ms | Memory (P90): 4608.0 [fallback] _regroup_keyed_tenors | B: 4096 | F: 80 | device: cuda | Runtime (P90): 2.4 ms | Memory (P90): 576.0 [prod] KeyedTensor.regroup | B: 4096 | F: 80 | device: cuda | Runtime (P90): 2.7 ms | Memory (P90): 576.0 [fallback] _regroup_keyed_tenors | B: 4096 | F: 160 | device: cuda | Runtime (P90): 4.4 ms | Memory (P90): 1152.0 [prod] KeyedTensor.regroup | B: 4096 | F: 160 | device: cuda | Runtime (P90): 4.4 ms | Memory (P90): 1152.0 [fallback] _regroup_keyed_tenors | B: 4096 | F: 320 | device: cuda | Runtime (P90): 8.4 ms | Memory (P90): 2304.0 [prod] KeyedTensor.regroup | B: 4096 | F: 320 | device: cuda | Runtime (P90): 8.1 ms | Memory (P90): 2304.0 [fallback] _regroup_keyed_tenors | B: 4096 | F: 640 | device: cuda | Runtime (P90): 28.0 ms | Memory (P90): 4608.0 [prod] KeyedTensor.regroup | B: 4096 | F: 640 | device: cuda | Runtime (P90): 15.6 ms | Memory (P90): 4608.0 [fallback] _regroup_keyed_tenors | B: 4096 | F: 1280 | device: cuda | Runtime (P90): 43.2 ms | Memory (P90): 9216.0 [prod] KeyedTensor.regroup | B: 4096 | F: 1280 | device: cuda | Runtime (P90): 31.2 ms | Memory (P90): 9216.0 Reviewed By: PaulZhang12 Differential Revision: D56392296
- Loading branch information