Fast FBGEMM path KT.regroup_as #1910

Closed
wants to merge 1 commit into from

dstaay-fb

Summary:
Use custom FBGEMM kernel when possible for inference/training. ~0-75% runtime speedup.

Benchmark Results [Forward]
[fallback] _regroup_keyed_tenors | B: 512 | F: 80 | device: cuda | Runtime (P90): 0.4 ms | Memory (P90): 24.0
[prod] KeyedTensor.regroup | B: 512 | F: 80 | device: cuda | Runtime (P90): 0.4 ms | Memory (P90): 36.0
[fallback] _regroup_keyed_tenors | B: 512 | F: 160 | device: cuda | Runtime (P90): 0.8 ms | Memory (P90): 48.0
[prod] KeyedTensor.regroup | B: 512 | F: 160 | device: cuda | Runtime (P90): 0.6 ms | Memory (P90): 72.0
[fallback] _regroup_keyed_tenors | B: 512 | F: 320 | device: cuda | Runtime (P90): 1.9 ms | Memory (P90): 96.0
[prod] KeyedTensor.regroup | B: 512 | F: 320 | device: cuda | Runtime (P90): 0.7 ms | Memory (P90): 144.0
[fallback] _regroup_keyed_tenors | B: 512 | F: 640 | device: cuda | Runtime (P90): 4.6 ms | Memory (P90): 192.0
[prod] KeyedTensor.regroup | B: 512 | F: 640 | device: cuda | Runtime (P90): 1.3 ms | Memory (P90): 288.0
[fallback] _regroup_keyed_tenors | B: 512 | F: 1280 | device: cuda | Runtime (P90): 13.2 ms | Memory (P90): 384.0
[prod] KeyedTensor.regroup | B: 512 | F: 1280 | device: cuda | Runtime (P90): 2.2 ms | Memory (P90): 576.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 80 | device: cuda | Runtime (P90): 0.3 ms | Memory (P90): 48.0
[prod] KeyedTensor.regroup | B: 1024 | F: 80 | device: cuda | Runtime (P90): 0.4 ms | Memory (P90): 72.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 160 | device: cuda | Runtime (P90): 0.8 ms | Memory (P90): 96.0
[prod] KeyedTensor.regroup | B: 1024 | F: 160 | device: cuda | Runtime (P90): 0.6 ms | Memory (P90): 144.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 320 | device: cuda | Runtime (P90): 1.8 ms | Memory (P90): 192.0
[prod] KeyedTensor.regroup | B: 1024 | F: 320 | device: cuda | Runtime (P90): 0.9 ms | Memory (P90): 288.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 640 | device: cuda | Runtime (P90): 4.1 ms | Memory (P90): 384.0
[prod] KeyedTensor.regroup | B: 1024 | F: 640 | device: cuda | Runtime (P90): 1.6 ms | Memory (P90): 576.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 1280 | device: cuda | Runtime (P90): 12.8 ms | Memory (P90): 768.0
[prod] KeyedTensor.regroup | B: 1024 | F: 1280 | device: cuda | Runtime (P90): 3.1 ms | Memory (P90): 1152.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 80 | device: cuda | Runtime (P90): 0.4 ms | Memory (P90): 96.0
[prod] KeyedTensor.regroup | B: 2048 | F: 80 | device: cuda | Runtime (P90): 0.5 ms | Memory (P90): 144.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 160 | device: cuda | Runtime (P90): 0.7 ms | Memory (P90): 192.0
[prod] KeyedTensor.regroup | B: 2048 | F: 160 | device: cuda | Runtime (P90): 0.8 ms | Memory (P90): 288.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 320 | device: cuda | Runtime (P90): 1.6 ms | Memory (P90): 384.0
[prod] KeyedTensor.regroup | B: 2048 | F: 320 | device: cuda | Runtime (P90): 1.4 ms | Memory (P90): 576.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 640 | device: cuda | Runtime (P90): 4.8 ms | Memory (P90): 768.0
[prod] KeyedTensor.regroup | B: 2048 | F: 640 | device: cuda | Runtime (P90): 2.8 ms | Memory (P90): 1152.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 1280 | device: cuda | Runtime (P90): 12.5 ms | Memory (P90): 1536.0
[prod] KeyedTensor.regroup | B: 2048 | F: 1280 | device: cuda | Runtime (P90): 5.6 ms | Memory (P90): 2304.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 80 | device: cuda | Runtime (P90): 0.4 ms | Memory (P90): 192.0
[prod] KeyedTensor.regroup | B: 4096 | F: 80 | device: cuda | Runtime (P90): 0.8 ms | Memory (P90): 288.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 160 | device: cuda | Runtime (P90): 0.9 ms | Memory (P90): 384.0
[prod] KeyedTensor.regroup | B: 4096 | F: 160 | device: cuda | Runtime (P90): 1.4 ms | Memory (P90): 576.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 320 | device: cuda | Runtime (P90): 1.7 ms | Memory (P90): 768.0
[prod] KeyedTensor.regroup | B: 4096 | F: 320 | device: cuda | Runtime (P90): 2.8 ms | Memory (P90): 1152.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 640 | device: cuda | Runtime (P90): 4.1 ms | Memory (P90): 1536.0
[prod] KeyedTensor.regroup | B: 4096 | F: 640 | device: cuda | Runtime (P90): 5.6 ms | Memory (P90): 2304.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 1280 | device: cuda | Runtime (P90): 12.2 ms | Memory (P90): 3072.0
[prod] KeyedTensor.regroup | B: 4096 | F: 1280 | device: cuda | Runtime (P90): 11.1 ms | Memory (P90): 4608.0
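
To relate the rows above to the speedup quoted in the summary, the `[fallback]` and `[prod]` rows can be paired by (B, F) and compared. This is only a minimal sketch, assuming the log-line format shown above; `parse_row` and `runtime_reduction` are hypothetical helper names and not part of torchrec or FBGEMM. A negative value means `[prod]` was slower than `[fallback]` for that configuration.

```python
import re

# Matches one benchmark row in the format printed above (assumed stable).
ROW = re.compile(
    r"\[(?P<impl>\w+)\]\s+\S+\s*\|\s*B:\s*(?P<b>\d+)\s*\|\s*F:\s*(?P<f>\d+)\s*\|"
    r".*?Runtime \(P90\):\s*(?P<ms>[\d.]+) ms"
)


def parse_row(line):
    """Return (impl, B, F, runtime_ms) for one benchmark row (hypothetical helper)."""
    m = ROW.search(line)
    if m is None:
        raise ValueError(f"unrecognized row: {line!r}")
    return m["impl"], int(m["b"]), int(m["f"]), float(m["ms"])


def runtime_reduction(lines):
    """Map (B, F) -> percent runtime reduction of [prod] relative to [fallback]."""
    runtimes = {}
    for line in lines:
        impl, b, f, ms = parse_row(line)
        runtimes.setdefault((b, f), {})[impl] = ms
    return {
        key: round(100.0 * (r["fallback"] - r["prod"]) / r["fallback"], 1)
        for key, r in runtimes.items()
        if "fallback" in r and "prod" in r  # skip configurations missing one side
    }


# Three row pairs copied from the forward results above.
rows = [
    "[fallback] _regroup_keyed_tenors | B: 512 | F: 1280 | device: cuda | Runtime (P90): 13.2 ms | Memory (P90): 384.0",
    "[prod] KeyedTensor.regroup | B: 512 | F: 1280 | device: cuda | Runtime (P90): 2.2 ms | Memory (P90): 576.0",
    "[fallback] _regroup_keyed_tenors | B: 4096 | F: 80 | device: cuda | Runtime (P90): 0.4 ms | Memory (P90): 192.0",
    "[prod] KeyedTensor.regroup | B: 4096 | F: 80 | device: cuda | Runtime (P90): 0.8 ms | Memory (P90): 288.0",
    "[fallback] _regroup_keyed_tenors | B: 4096 | F: 1280 | device: cuda | Runtime (P90): 12.2 ms | Memory (P90): 3072.0",
    "[prod] KeyedTensor.regroup | B: 4096 | F: 1280 | device: cuda | Runtime (P90): 11.1 ms | Memory (P90): 4608.0",
]

for (b, f), pct in sorted(runtime_reduction(rows).items()):
    print(f"B={b} F={f}: {pct:+.1f}% runtime reduction")
```

The same helper can be run on the forward + backward rows; configurations where only one of the two implementations was logged are skipped rather than guessed.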

Benchmark Results [Forward + Backward]
[prod] KeyedTensor.regroup | B: 512 | F: 80 | device: cuda | Runtime (P90): 2.2 ms | Memory (P90): 72.0
[fallback] _regroup_keyed_tenors | B: 512 | F: 160 | device: cuda | Runtime (P90): 4.7 ms | Memory (P90): 144.0
[prod] KeyedTensor.regroup | B: 512 | F: 160 | device: cuda | Runtime (P90): 3.4 ms | Memory (P90): 144.0
[fallback] _regroup_keyed_tenors | B: 512 | F: 320 | device: cuda | Runtime (P90): 9.0 ms | Memory (P90): 288.0
[prod] KeyedTensor.regroup | B: 512 | F: 320 | device: cuda | Runtime (P90): 6.5 ms | Memory (P90): 288.0
[fallback] _regroup_keyed_tenors | B: 512 | F: 640 | device: cuda | Runtime (P90): 19.9 ms | Memory (P90): 576.0
[prod] KeyedTensor.regroup | B: 512 | F: 640 | device: cuda | Runtime (P90): 11.4 ms | Memory (P90): 576.0
[fallback] _regroup_keyed_tenors | B: 512 | F: 1280 | device: cuda | Runtime (P90): 46.7 ms | Memory (P90): 1152.0
[prod] KeyedTensor.regroup | B: 512 | F: 1280 | device: cuda | Runtime (P90): 23.1 ms | Memory (P90): 1152.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 80 | device: cuda | Runtime (P90): 2.6 ms | Memory (P90): 144.0
[prod] KeyedTensor.regroup | B: 1024 | F: 80 | device: cuda | Runtime (P90): 2.5 ms | Memory (P90): 144.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 160 | device: cuda | Runtime (P90): 4.5 ms | Memory (P90): 288.0
[prod] KeyedTensor.regroup | B: 1024 | F: 160 | device: cuda | Runtime (P90): 3.9 ms | Memory (P90): 288.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 320 | device: cuda | Runtime (P90): 8.8 ms | Memory (P90): 576.0
[prod] KeyedTensor.regroup | B: 1024 | F: 320 | device: cuda | Runtime (P90): 6.7 ms | Memory (P90): 576.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 640 | device: cuda | Runtime (P90): 18.7 ms | Memory (P90): 1152.0
[prod] KeyedTensor.regroup | B: 1024 | F: 640 | device: cuda | Runtime (P90): 12.2 ms | Memory (P90): 1152.0
[fallback] _regroup_keyed_tenors | B: 1024 | F: 1280 | device: cuda | Runtime (P90): 42.8 ms | Memory (P90): 2304.0
[prod] KeyedTensor.regroup | B: 1024 | F: 1280 | device: cuda | Runtime (P90): 23.1 ms | Memory (P90): 2304.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 80 | device: cuda | Runtime (P90): 2.5 ms | Memory (P90): 288.0
[prod] KeyedTensor.regroup | B: 2048 | F: 80 | device: cuda | Runtime (P90): 2.4 ms | Memory (P90): 288.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 160 | device: cuda | Runtime (P90): 4.5 ms | Memory (P90): 576.0
[prod] KeyedTensor.regroup | B: 2048 | F: 160 | device: cuda | Runtime (P90): 4.2 ms | Memory (P90): 576.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 320 | device: cuda | Runtime (P90): 8.9 ms | Memory (P90): 1152.0
[prod] KeyedTensor.regroup | B: 2048 | F: 320 | device: cuda | Runtime (P90): 7.7 ms | Memory (P90): 1152.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 640 | device: cuda | Runtime (P90): 19.2 ms | Memory (P90): 2304.0
[prod] KeyedTensor.regroup | B: 2048 | F: 640 | device: cuda | Runtime (P90): 12.9 ms | Memory (P90): 2304.0
[fallback] _regroup_keyed_tenors | B: 2048 | F: 1280 | device: cuda | Runtime (P90): 45.1 ms | Memory (P90): 4608.0
[prod] KeyedTensor.regroup | B: 2048 | F: 1280 | device: cuda | Runtime (P90): 26.4 ms | Memory (P90): 4608.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 80 | device: cuda | Runtime (P90): 2.4 ms | Memory (P90): 576.0
[prod] KeyedTensor.regroup | B: 4096 | F: 80 | device: cuda | Runtime (P90): 2.7 ms | Memory (P90): 576.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 160 | device: cuda | Runtime (P90): 4.4 ms | Memory (P90): 1152.0
[prod] KeyedTensor.regroup | B: 4096 | F: 160 | device: cuda | Runtime (P90): 4.4 ms | Memory (P90): 1152.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 320 | device: cuda | Runtime (P90): 8.4 ms | Memory (P90): 2304.0
[prod] KeyedTensor.regroup | B: 4096 | F: 320 | device: cuda | Runtime (P90): 8.1 ms | Memory (P90): 2304.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 640 | device: cuda | Runtime (P90): 28.0 ms | Memory (P90): 4608.0
[prod] KeyedTensor.regroup | B: 4096 | F: 640 | device: cuda | Runtime (P90): 15.6 ms | Memory (P90): 4608.0
[fallback] _regroup_keyed_tenors | B: 4096 | F: 1280 | device: cuda | Runtime (P90): 43.2 ms | Memory (P90): 9216.0
[prod] KeyedTensor.regroup | B: 4096 | F: 1280 | device: cuda | Runtime (P90): 31.2 ms | Memory (P90): 9216.0

Differential Revision: D56392296

facebook-github-bot added the CLA Signed label Apr 22, 2024
@facebook-github-bot

This pull request was exported from Phabricator. Differential Revision: D56392296

dstaay-fb added a commit to dstaay-fb/torchrec that referenced this pull request Apr 23, 2024
dstaay-fb added a commit to dstaay-fb/torchrec that referenced this pull request Apr 23, 2024
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 1280     | device: cuda     | Runtime (P90):  12.8 ms | Memory (P90): 768.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 1280     | device: cuda     | Runtime (P90):   3.1 ms | Memory (P90): 1152.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 80       | device: cuda     | Runtime (P90):   0.4 ms | Memory (P90):  96.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 80       | device: cuda     | Runtime (P90):   0.5 ms | Memory (P90): 144.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 160      | device: cuda     | Runtime (P90):   0.7 ms | Memory (P90): 192.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 160      | device: cuda     | Runtime (P90):   0.8 ms | Memory (P90): 288.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 320      | device: cuda     | Runtime (P90):   1.6 ms | Memory (P90): 384.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 320      | device: cuda     | Runtime (P90):   1.4 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 640      | device: cuda     | Runtime (P90):   4.8 ms | Memory (P90): 768.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 640      | device: cuda     | Runtime (P90):   2.8 ms | Memory (P90): 1152.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 1280     | device: cuda     | Runtime (P90):  12.5 ms | Memory (P90): 1536.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 1280     | device: cuda     | Runtime (P90):   5.6 ms | Memory (P90): 2304.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 80       | device: cuda     | Runtime (P90):   0.4 ms | Memory (P90): 192.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 80       | device: cuda     | Runtime (P90):   0.8 ms | Memory (P90): 288.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 160      | device: cuda     | Runtime (P90):   0.9 ms | Memory (P90): 384.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 160      | device: cuda     | Runtime (P90):   1.4 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 320      | device: cuda     | Runtime (P90):   1.7 ms | Memory (P90): 768.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 320      | device: cuda     | Runtime (P90):   2.8 ms | Memory (P90): 1152.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 640      | device: cuda     | Runtime (P90):   4.1 ms | Memory (P90): 1536.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 640      | device: cuda     | Runtime (P90):   5.6 ms | Memory (P90): 2304.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 1280     | device: cuda     | Runtime (P90):  12.2 ms | Memory (P90): 3072.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 1280     | device: cuda     | Runtime (P90):  11.1 ms | Memory (P90): 4608.0

Benchmark Results [Forward + Backward]
  [prod] KeyedTensor.regroup          | B: 512      | F: 80       | device: cuda     | Runtime (P90):   2.2 ms | Memory (P90):  72.0
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 160      | device: cuda     | Runtime (P90):   4.7 ms | Memory (P90): 144.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 160      | device: cuda     | Runtime (P90):   3.4 ms | Memory (P90): 144.0
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 320      | device: cuda     | Runtime (P90):   9.0 ms | Memory (P90): 288.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 320      | device: cuda     | Runtime (P90):   6.5 ms | Memory (P90): 288.0
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 640      | device: cuda     | Runtime (P90):  19.9 ms | Memory (P90): 576.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 640      | device: cuda     | Runtime (P90):  11.4 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 512      | F: 1280     | device: cuda     | Runtime (P90):  46.7 ms | Memory (P90): 1152.0
  [prod] KeyedTensor.regroup          | B: 512      | F: 1280     | device: cuda     | Runtime (P90):  23.1 ms | Memory (P90): 1152.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 80       | device: cuda     | Runtime (P90):   2.6 ms | Memory (P90): 144.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 80       | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 144.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 160      | device: cuda     | Runtime (P90):   4.5 ms | Memory (P90): 288.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 160      | device: cuda     | Runtime (P90):   3.9 ms | Memory (P90): 288.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 320      | device: cuda     | Runtime (P90):   8.8 ms | Memory (P90): 576.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 320      | device: cuda     | Runtime (P90):   6.7 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 640      | device: cuda     | Runtime (P90):  18.7 ms | Memory (P90): 1152.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 640      | device: cuda     | Runtime (P90):  12.2 ms | Memory (P90): 1152.0
  [fallback] _regroup_keyed_tenors    | B: 1024     | F: 1280     | device: cuda     | Runtime (P90):  42.8 ms | Memory (P90): 2304.0
  [prod] KeyedTensor.regroup          | B: 1024     | F: 1280     | device: cuda     | Runtime (P90):  23.1 ms | Memory (P90): 2304.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 80       | device: cuda     | Runtime (P90):   2.5 ms | Memory (P90): 288.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 80       | device: cuda     | Runtime (P90):   2.4 ms | Memory (P90): 288.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 160      | device: cuda     | Runtime (P90):   4.5 ms | Memory (P90): 576.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 160      | device: cuda     | Runtime (P90):   4.2 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 320      | device: cuda     | Runtime (P90):   8.9 ms | Memory (P90): 1152.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 320      | device: cuda     | Runtime (P90):   7.7 ms | Memory (P90): 1152.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 640      | device: cuda     | Runtime (P90):  19.2 ms | Memory (P90): 2304.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 640      | device: cuda     | Runtime (P90):  12.9 ms | Memory (P90): 2304.0
  [fallback] _regroup_keyed_tenors    | B: 2048     | F: 1280     | device: cuda     | Runtime (P90):  45.1 ms | Memory (P90): 4608.0
  [prod] KeyedTensor.regroup          | B: 2048     | F: 1280     | device: cuda     | Runtime (P90):  26.4 ms | Memory (P90): 4608.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 80       | device: cuda     | Runtime (P90):   2.4 ms | Memory (P90): 576.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 80       | device: cuda     | Runtime (P90):   2.7 ms | Memory (P90): 576.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 160      | device: cuda     | Runtime (P90):   4.4 ms | Memory (P90): 1152.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 160      | device: cuda     | Runtime (P90):   4.4 ms | Memory (P90): 1152.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 320      | device: cuda     | Runtime (P90):   8.4 ms | Memory (P90): 2304.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 320      | device: cuda     | Runtime (P90):   8.1 ms | Memory (P90): 2304.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 640      | device: cuda     | Runtime (P90):  28.0 ms | Memory (P90): 4608.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 640      | device: cuda     | Runtime (P90):  15.6 ms | Memory (P90): 4608.0
  [fallback] _regroup_keyed_tenors    | B: 4096     | F: 1280     | device: cuda     | Runtime (P90):  43.2 ms | Memory (P90): 9216.0
  [prod] KeyedTensor.regroup          | B: 4096     | F: 1280     | device: cuda     | Runtime (P90):  31.2 ms | Memory (P90): 9216.0
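
To compare the two implementations per configuration, the rows above can be parsed and turned into a relative runtime change. The sketch below is only an illustration: `parse_rows` and `runtime_reduction` are hypothetical helper names, and the four embedded rows are copied from the forward-pass results. A positive value means `[prod]` is faster than `[fallback]`, and a negative value means it is slower.

```python
import re
from typing import Dict, Tuple

# Rows copied verbatim from the forward-pass benchmark output.
ROWS = """
[fallback] _regroup_keyed_tenors    | B: 512      | F: 1280     | device: cuda     | Runtime (P90):  13.2 ms | Memory (P90): 384.0
[prod] KeyedTensor.regroup          | B: 512      | F: 1280     | device: cuda     | Runtime (P90):   2.2 ms | Memory (P90): 576.0
[fallback] _regroup_keyed_tenors    | B: 4096     | F: 80       | device: cuda     | Runtime (P90):   0.4 ms | Memory (P90): 192.0
[prod] KeyedTensor.regroup          | B: 4096     | F: 80       | device: cuda     | Runtime (P90):   0.8 ms | Memory (P90): 288.0
"""

ROW_RE = re.compile(
    r"\[(?P<impl>\w+)\].*?\|\s*B:\s*(?P<b>\d+)\s*\|\s*F:\s*(?P<f>\d+)\s*\|.*?"
    r"Runtime \(P90\):\s*(?P<ms>[\d.]+) ms"
)


def parse_rows(text: str) -> Dict[Tuple[str, int, int], float]:
    """Map (implementation, B, F) to the P90 runtime in milliseconds."""
    runtimes: Dict[Tuple[str, int, int], float] = {}
    for line in text.splitlines():
        m = ROW_RE.search(line)
        if m:
            runtimes[(m["impl"], int(m["b"]), int(m["f"]))] = float(m["ms"])
    return runtimes


def runtime_reduction(
    runtimes: Dict[Tuple[str, int, int], float]
) -> Dict[Tuple[int, int], float]:
    """Percent runtime reduction of [prod] relative to [fallback] per (B, F)."""
    result: Dict[Tuple[int, int], float] = {}
    for (impl, b, f), fallback_ms in runtimes.items():
        if impl != "fallback":
            continue
        prod_ms = runtimes.get(("prod", b, f))
        if prod_ms is None:
            continue  # Skip configurations without a matching [prod] row.
        result[(b, f)] = 100.0 * (fallback_ms - prod_ms) / fallback_ms
    return result


reductions = runtime_reduction(parse_rows(ROWS))
for (b, f), pct in sorted(reductions.items()):
    print(f"B={b} F={f}: {pct:+.1f}%")
```

For these four rows the result is roughly +83.3% for B=512, F=1280 and -100.0% for B=4096, F=80, so the benefit depends strongly on batch size and feature count.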

Reviewed By: PaulZhang12

Differential Revision: D56392296

Labels
CLA Signed, fb-exported