Add imbalance stats for total perf, hbm and ddr #2040

Closed
wants to merge 1 commit into from

Conversation

sarckk
Member

@sarckk sarckk commented May 24, 2024

Summary:
Add and log four ways to measure the imbalance of a generated sharding plan.

Suppose we have $k$ GPUs. Let $s$ be a vector where the $i^{th}$ component is the total size currently allocated to the $i^{th}$ device.

First we normalize vector $s$ into a vector $p$ whose elements sum to 1. We can view $p$ as a probability distribution, and to get a measure of imbalance, we can measure its deviation from the uniform distribution (every component equal to $1/k$) using one of the following ways:

  • total variations
  • total distance
  • chi divergence
  • KL divergence
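
The description names the four measures without giving formulas, so the following is only a minimal sketch under assumed definitions, not the code in this PR. The function names are hypothetical. It assumes "total variation" means the largest per-device deviation from $1/k$, "total distance" means the sum of absolute deviations, "chi divergence" means the chi-square divergence, and KL divergence is taken from $p$ to the uniform distribution. Note that the textbook total variation distance is half of the sum of absolute deviations, so the actual implementation may scale things differently.

```python
import math
from typing import List


def normalize(s: List[float]) -> List[float]:
    """Scale the per-device totals so that they sum to 1."""
    total = sum(s)
    if total <= 0:
        raise ValueError("per-device totals must have a positive sum")
    return [x / total for x in s]


def total_variation(p: List[float]) -> float:
    """ASSUMED definition: largest deviation of any device from 1/k."""
    k = len(p)
    return max(abs(p_i - 1.0 / k) for p_i in p)


def total_distance(p: List[float]) -> float:
    """ASSUMED definition: sum of absolute deviations from 1/k."""
    k = len(p)
    return sum(abs(p_i - 1.0 / k) for p_i in p)


def chi_square_divergence(p: List[float]) -> float:
    """ASSUMED chi-square form: sum_i (p_i - 1/k)^2 / (1/k)."""
    k = len(p)
    return sum(k * (p_i - 1.0 / k) ** 2 for p_i in p)


def kl_divergence(p: List[float]) -> float:
    """KL(p || uniform) = sum_i p_i * log(k * p_i), skipping empty devices."""
    k = len(p)
    return sum(p_i * math.log(k * p_i) for p_i in p if p_i > 0)


# Hypothetical per-device totals for k = 4 devices (units do not matter).
skewed = normalize([4.0, 2.0, 1.0, 1.0])
balanced = normalize([1.0, 1.0, 1.0, 1.0])
for name, fn in [
    ("total variation", total_variation),
    ("total distance", total_distance),
    ("chi-square divergence", chi_square_divergence),
    ("KL divergence", kl_divergence),
]:
    print(f"{name}: skewed={fn(skewed):.4f} balanced={fn(balanced):.4f}")
```

Under these assumed definitions every measure is 0 for a perfectly balanced plan and grows as the allocation drifts away from uniform, which is what makes each of them usable as an imbalance statistic.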

Differential Revision: D57465383

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label May 24, 2024
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D57465383

@sarckk sarckk force-pushed the export-D57465383 branch from bd37c33 to 07cb78f Compare May 30, 2024 19:53
sarckk added a commit to sarckk/torchrec that referenced this pull request May 30, 2024
Reviewed By: henrylhtsang

Differential Revision: D57465383
@sarckk sarckk force-pushed the export-D57465383 branch from 07cb78f to ea854cd Compare May 30, 2024 19:53
PaulZhang12 pushed a commit that referenced this pull request Jun 5, 2024
Summary:
Pull Request resolved: #2040

Reviewed By: henrylhtsang

Differential Revision: D57465383

fbshipit-source-id: a22d59b17461dea46478b12beafe1b95c3331f62
Labels
CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. fb-exported