Update IPEX MultiNode Docs #228

Merged 1 commit on Jul 8, 2024
139 changes: 69 additions & 70 deletions pytorch/README.md
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might look a little weird, but it is accurate: DeepSpeed isn't compatible with certain versions of torch + IDP.

@@ -105,16 +105,18 @@ After running the command above, copy the URL (something like `http://127.0.0.1:

The images below additionally include [Intel® oneAPI Collective Communications Library] (oneCCL) and Neural Compressor ([INC]):

| Tag(s) | Pytorch | IPEX | oneCCL | INC | Dockerfile |
| --------------------- | -------- | ------------ | -------------------- | --------- | --------------- |
| `2.3.0-pip-multinode` | [v2.3.0] | [v2.3.0+cpu] | [v2.3.0][ccl-v2.3.0] | [v2.5.1] | [v0.4.0-Beta] |
| `2.2.0-pip-multinode` | [v2.2.0] | [v2.2.0+cpu] | [v2.2.0][ccl-v2.2.0] | [v2.4.1] | [v0.3.4] |
| `2.1.0-pip-mulitnode` | [v2.1.0] | [v2.1.0+cpu] | [v2.1.0][ccl-v2.1.0] | [v2.3.1] | [v0.2.3] |
| `2.0.0-pip-multinode` | [v2.0.0] | [v2.0.0+cpu] | [v2.0.0][ccl-v2.0.0] | [v2.1.1] | [v0.1.0] |
| Tag(s) | Pytorch | IPEX | oneCCL | INC | Dockerfile |
| --------------------- | -------- | ------------ | -------------------- | --------- | -------------- |
| `2.3.0-pip-multinode` | [v2.3.0] | [v2.3.0+cpu] | [v2.3.0][ccl-v2.3.0] | [v2.6] | [v0.4.0-Beta] |
| `2.2.0-pip-multinode` | [v2.2.2] | [v2.2.0+cpu] | [v2.2.0][ccl-v2.2.0] | [v2.6] | [v0.4.0-Beta] |
| `2.1.100-pip-mulitnode` | [v2.1.2] | [v2.1.100+cpu] | [v2.1.0][ccl-v2.1.0] | [v2.6] | [v0.4.0-Beta] |
| `2.0.100-pip-multinode` | [v2.0.1] | [v2.0.100+cpu] | [v2.0.0][ccl-v2.0.0] | [v2.6] | [v0.4.0-Beta] |

> [!NOTE]
> Passwordless SSH connection is also enabled in the image, but the container does not contain any SSH ID keys. The user needs to mount those keys at `/root/.ssh/id_rsa` and `/etc/ssh/authorized_keys`.

> **Note:** Passwordless SSH connection is also enabled in the image.
> The container does not contain the SSH ID keys. The user needs to mount those keys at `/root/.ssh/id_rsa` and `/etc/ssh/authorized_keys`.
> Since the SSH key is not owned by the default user account in Docker, please also run `chmod 600 authorized_keys; chmod 600 id_rsa` to grant read access for the default user account.
> [!TIP]
> Before mounting any keys, modify the permissions of those files with `chmod 600 authorized_keys; chmod 600 id_rsa` to grant read access for the default user account.

#### Setup and Run IPEX Multi-Node Container

@@ -132,30 +134,52 @@ To add these files correctly please follow the steps described below.

1. Setup ID Keys

You can use the commands provided below to [generate the Identity keys](https://www.ssh.com/academy/ssh/keygen#creating-an-ssh-key-pair-for-user-authentication) for OpenSSH.
You can use the commands provided below to [generate the identity keys](https://www.ssh.com/academy/ssh/keygen#creating-an-ssh-key-pair-for-user-authentication) for OpenSSH.

```bash
ssh-keygen -q -N "" -t rsa -b 4096 -f ./id_rsa
touch authorized_keys
cat id_rsa.pub >> authorized_keys
```

2. Configure the permissions and ownership for all of the files you have created so far.
2. Configure the permissions and ownership for all of the files you have created so far

```bash
chmod 600 id_rsa config authorized_keys
chown root:root id_rsa.pub id_rsa config authorized_keys
```

3. Setup hostfile. The hostfile is needed for running torch distributed using the `ipexrun` utility. If you're not using `ipexrun`, you can skip this step.
3. Create a hostfile for `torchrun` or `ipexrun`. (Optional)

```txt
<Host 1 IP/Hostname>
<Host 2 IP/Hostname>
Host host1
HostName <Hostname of host1>
IdentitiesOnly yes
IdentityFile ~/.root/id_rsa
Port <SSH Port>
Host host2
HostName <Hostname of host2>
IdentitiesOnly yes
IdentityFile ~/.root/id_rsa
Port <SSH Port>
...
```

4. Now start the workers and execute DDP on the launcher.
4. Configure [Intel® oneAPI Collective Communications Library] in your Python script (a full usage sketch follows these steps)

```python
import os

import oneccl_bindings_for_pytorch  # registers the "ccl" backend with torch.distributed
import torch.distributed as dist

dist.init_process_group(
backend="ccl",
init_method="tcp://127.0.0.1:3022",
world_size=int(os.environ.get("WORLD_SIZE")),
rank=int(os.environ.get("RANK")),
)
```

5. Now start the workers and execute DDP on the launcher

1. Worker run command:

@@ -182,65 +206,36 @@
bash -c 'ipexrun cpu --nnodes 2 --nprocs-per-node 1 --master-addr 127.0.0.1 --master-port 3022 /workspace/tests/ipex-resnet50.py --ipex --device cpu --backend ccl'
```
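As referenced in step 4, the sketch below shows what a minimal end-to-end script using this oneCCL setup might look like. It is not the bundled `/workspace/tests/ipex-resnet50.py`; the toy model, data, and hyperparameters are placeholders, and `WORLD_SIZE`/`RANK` are assumed to be exported by the launcher (`torchrun`, `ipexrun`, or `mpirun`).

```python
import os

import oneccl_bindings_for_pytorch  # noqa: F401  (registers the "ccl" backend)
import intel_extension_for_pytorch as ipex
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

# Rank and world size are expected to come from the launcher environment.
dist.init_process_group(
    backend="ccl",
    init_method="tcp://127.0.0.1:3022",
    world_size=int(os.environ.get("WORLD_SIZE", 1)),
    rank=int(os.environ.get("RANK", 0)),
)

# Toy model and optimizer standing in for the real workload.
model = torch.nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
model, optimizer = ipex.optimize(model, optimizer=optimizer)
model = DDP(model)

criterion = torch.nn.CrossEntropyLoss()
for _ in range(10):  # a few synthetic training steps
    data = torch.randn(32, 128)
    target = torch.randint(0, 10, (32,))
    optimizer.zero_grad()
    loss = criterion(model(data), target)
    loss.backward()  # gradients are all-reduced over oneCCL
    optimizer.step()

dist.destroy_process_group()
```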

5. Start SSH server with a custom port.
If the user wants to define their own port for the SSH server, this can be done using the commands described below.

1. Worker command:

```bash
export SSH_PORT=<User SSH Port>
docker run -it --rm \
--net=host \
-v $PWD/authorized_keys:/etc/ssh/authorized_keys \
-v $PWD/tests:/workspace/tests \
-e SSH_PORT=${SSH_PORT} \
-w /workspace \
intel/intel-extension-for-pytorch:2.3.0-pip-multinode \
bash -c '/usr/sbin/sshd -D -p ${SSH_PORT}'
```

2. Add hosts to config. (**Note:** This is an optional step)

The user can optionally mount their own custom client config file to define the list of hosts and ports where the SSH server is running inside the container. An example config is provided below. This file should be mounted in the launcher container at `/etc/ssh/ssh_config`.
> [!NOTE]
> [Intel® MPI] can be configured based on your machine settings. If the above commands do not work for you, see the documentation for how to configure based on your network.

```bash
touch config
```
#### Enable [DeepSpeed*] optimizations

```txt
Host host1
HostName <Hostname of host1>
IdentitiesOnly yes
IdentityFile ~/.root/id_rsa
Port <SSH Port>
Host host2
HostName <Hostname of host2>
IdentitiesOnly yes
IdentityFile ~/.root/id_rsa
Port <SSH Port>
...
```
To enable [DeepSpeed*] optimizations with [Intel® oneAPI Collective Communications Library], add the following to your Python script:

3. Launcher run command:
```python
import deepspeed

```bash
docker run -it --rm \
--net=host \
-v $PWD/id_rsa:/root/.ssh/id_rsa \
-v $PWD/config:/etc/ssh/ssh_config \
-v $PWD/hostfile:/workspace/hostfile \
-v $PWD/tests:/workspace/tests \
-e SSH_PORT=${SSH_PORT} \
-w /workspace \
intel/intel-extension-for-pytorch:2.3.0-pip-multinode \
bash -c 'ipexrun cpu --nnodes 2 --nprocs-per-node 1 --master-addr 127.0.0.1 --master-port ${SSH_PORT} /workspace/tests/ipex-resnet50.py --ipex --device cpu --backend ccl'
```
# Rather than dist.init_process_group(), use deepspeed.init_distributed()
deepspeed.init_distributed(dist_backend="ccl")
```
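As a minimal sketch of how the rest of such a script might hand training over to DeepSpeed: the toy model and loop below are placeholders, and the `ds_config.json` path is an assumption matching the launcher command shown below (at minimum it needs a `train_batch_size` entry).

```python
import deepspeed
import torch

deepspeed.init_distributed(dist_backend="ccl")

# Placeholder model and optimizer; alternatively, define the optimizer in ds_config.json.
model = torch.nn.Linear(128, 10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    optimizer=optimizer,
    config="ds_config.json",  # assumed path, mounted into the container below
)

criterion = torch.nn.CrossEntropyLoss()
for _ in range(10):  # synthetic steps for illustration
    data = torch.randn(32, 128)
    target = torch.randint(0, 10, (32,))
    loss = criterion(engine(data), target)
    engine.backward(loss)  # replaces loss.backward()
    engine.step()          # replaces optimizer.step()
```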

> [!NOTE]
> [Intel® MPI] can be configured based on your machine settings. If the above commands do not work for you, see the documentation for how to configure based on your network.
Additionally, if you have a [DeepSpeed* configuration](https://www.deepspeed.ai/getting-started/#deepspeed-configuration), you can use the command below as your launcher to run your script with that configuration:

> [!TIP]
> Additionally, [DeepSpeed*] optimizations can be utilized in place of ipexrun with the `ccl` backend for multi-node training.
```bash
docker run -it --rm \
--net=host \
-v $PWD/id_rsa:/root/.ssh/id_rsa \
-v $PWD/tests:/workspace/tests \
-v $PWD/hostfile:/workspace/hostfile \
-v $PWD/ds_config.json:/workspace/ds_config.json \
-w /workspace \
intel/intel-extension-for-pytorch:2.3.0-pip-multinode \
bash -c 'deepspeed --launcher IMPI \
--master_addr 127.0.0.1 --master_port 3022 \
--deepspeed_config ds_config.json --hostfile /workspace/hostfile \
/workspace/tests/ipex-resnet50.py --ipex --device cpu --backend ccl --deepspeed'
```
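For completeness, here is a hypothetical sketch of how a training script might switch between the plain oneCCL path and the DeepSpeed path based on a `--deepspeed` flag; the argument names mirror the flags used in the commands above but are not taken from the actual `ipex-resnet50.py`.

```python
import argparse
import os

import deepspeed
import oneccl_bindings_for_pytorch  # noqa: F401  (registers the "ccl" backend)
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--backend", default="ccl")
parser.add_argument("--device", default="cpu")
parser.add_argument("--ipex", action="store_true")
parser.add_argument("--deepspeed", action="store_true")
parser.add_argument("--local_rank", type=int, default=0)  # injected by the deepspeed launcher
args = parser.parse_args()

if args.deepspeed:
    # DeepSpeed discovers rank and world size from the (I)MPI environment.
    deepspeed.init_distributed(dist_backend=args.backend)
else:
    dist.init_process_group(
        backend=args.backend,
        init_method="tcp://127.0.0.1:3022",
        world_size=int(os.environ.get("WORLD_SIZE", 1)),
        rank=int(os.environ.get("RANK", 0)),
    )
```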

---

@@ -277,7 +272,7 @@ The images below additionally include [Intel® oneAPI Collective Communications

| Tag(s) | Pytorch | IPEX | oneCCL | INC | Dockerfile |
| --------------------- | -------- | ------------ | -------------------- | --------- | --------------- |
| `2.3.0-idp-multinode` | [v2.3.0] | [v2.3.0+cpu] | [v2.3.0][ccl-v2.3.0] | [v2.5.1] | [v0.4.0-Beta] |
| `2.3.0-idp-multinode` | [v2.3.0] | [v2.3.0+cpu] | [v2.3.0][ccl-v2.3.0] | [v2.6] | [v0.4.0-Beta] |
| `2.2.0-idp-multinode` | [v2.2.0] | [v2.2.0+cpu] | [v2.2.0][ccl-v2.2.0] | [v2.4.1] | [v0.3.4] |
| `2.1.0-idp-mulitnode` | [v2.1.0] | [v2.1.0+cpu] | [v2.1.0][ccl-v2.1.0] | [v2.3.1] | [v0.2.3] |
| `2.0.0-idp-multinode` | [v2.0.0] | [v2.0.0+cpu] | [v2.0.0][ccl-v2.0.0] | [v2.1.1] | [v0.1.0] |
@@ -354,19 +349,23 @@ It is the image user's responsibility to ensure that any use of The images below
[v2.0.110+xpu]: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.0.110%2Bxpu

[v2.3.0]: https://github.com/pytorch/pytorch/releases/tag/v2.3.0
[v2.2.2]: https://github.com/pytorch/pytorch/releases/tag/v2.2.2
[v2.2.0]: https://github.com/pytorch/pytorch/releases/tag/v2.2.0
[v2.1.2]: https://github.com/pytorch/pytorch/releases/tag/v2.1.2
[v2.1.0]: https://github.com/pytorch/pytorch/releases/tag/v2.1.0
[v2.0.1]: https://github.com/pytorch/pytorch/releases/tag/v2.0.1
[v2.0.0]: https://github.com/pytorch/pytorch/releases/tag/v2.0.0

[v2.5.1]: https://github.com/intel/neural-compressor/releases/tag/v2.5.1
[v2.6]: https://github.com/intel/neural-compressor/releases/tag/v2.6
[v2.4.1]: https://github.com/intel/neural-compressor/releases/tag/v2.4.1
[v2.3.1]: https://github.com/intel/neural-compressor/releases/tag/v2.3.1
[v2.1.1]: https://github.com/intel/neural-compressor/releases/tag/v2.1.1

[v2.3.0+cpu]: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.3.0%2Bcpu
[v2.2.0+cpu]: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.2.0%2Bcpu
[v2.1.100+cpu]: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.1.100%2Bcpu
[v2.1.0+cpu]: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.1.0%2Bcpu
[v2.0.100+cpu]: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.0.100%2Bcpu
[v2.0.0+cpu]: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.0.0%2Bcpu

[ccl-v2.3.0]: https://github.com/intel/torch-ccl/releases/tag/v2.3.0%2Bcpu