Update IPEX MultiNode Docs (#228)
Signed-off-by: tylertitsworth <[email protected]>
Signed-off-by: Dina Suehiro Jones <[email protected]>
Tyler Titsworth authored and dmsuehir committed Jul 12, 2024
1 parent 243d10b commit 79dbc8a
Showing 1 changed file with 69 additions and 70 deletions.
139 changes: 69 additions & 70 deletions pytorch/README.md
@@ -105,16 +105,18 @@ After running the command above, copy the URL (something like `http://127.0.0.1:

The images below additionally include [Intel® oneAPI Collective Communications Library] (oneCCL) and Neural Compressor ([INC]):

| Tag(s) | Pytorch | IPEX | oneCCL | INC | Dockerfile |
| --------------------- | -------- | ------------ | -------------------- | --------- | --------------- |
| `2.3.0-pip-multinode` | [v2.3.0] | [v2.3.0+cpu] | [v2.3.0][ccl-v2.3.0] | [v2.5.1] | [v0.4.0-Beta] |
| `2.2.0-pip-multinode` | [v2.2.0] | [v2.2.0+cpu] | [v2.2.0][ccl-v2.2.0] | [v2.4.1] | [v0.3.4] |
| `2.1.0-pip-mulitnode` | [v2.1.0] | [v2.1.0+cpu] | [v2.1.0][ccl-v2.1.0] | [v2.3.1] | [v0.2.3] |
| `2.0.0-pip-multinode` | [v2.0.0] | [v2.0.0+cpu] | [v2.0.0][ccl-v2.0.0] | [v2.1.1] | [v0.1.0] |
| Tag(s) | Pytorch | IPEX | oneCCL | INC | Dockerfile |
| --------------------- | -------- | ------------ | -------------------- | --------- | -------------- |
| `2.3.0-pip-multinode` | [v2.3.0] | [v2.3.0+cpu] | [v2.3.0][ccl-v2.3.0] | [v2.6] | [v0.4.0-Beta] |
| `2.2.0-pip-multinode` | [v2.2.2] | [v2.2.0+cpu] | [v2.2.0][ccl-v2.2.0] | [v2.6] | [v0.4.0-Beta] |
| `2.1.100-pip-mulitnode` | [v2.1.2] | [v2.1.100+cpu] | [v2.1.0][ccl-v2.1.0] | [v2.6] | [v0.4.0-Beta] |
| `2.0.100-pip-multinode` | [v2.0.1] | [v2.0.100+cpu] | [v2.0.0][ccl-v2.0.0] | [v2.6] | [v0.4.0-Beta] |

> [!NOTE]
> Passwordless SSH connection is also enabled in the image, but the container does not contain any SSH ID keys. The user needs to mount those keys at `/root/.ssh/id_rsa` and `/etc/ssh/authorized_keys`.
> **Note:** Passwordless SSH connection is also enabled in the image.
> The container does not contain the SSH ID keys. The user needs to mount those keys at `/root/.ssh/id_rsa` and `/etc/ssh/authorized_keys`.
> Since the SSH key is not owned by the default user account in Docker, also run `chmod 600 authorized_keys; chmod 600 id_rsa` to grant read access to the default user account.
> [!TIP]
> Before mounting any keys, modify the permissions of those files with `chmod 600 authorized_keys; chmod 600 id_rsa` to grant read access for the default user account.
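
The permission requirement in the tip above can be checked before starting a container. The following is a minimal sketch, assuming the key files sit in the current directory under the names used in this README; the helper names `has_owner_only_mode` and `check_key_files` are hypothetical and not part of the image or of IPEX.

```python
import stat
from pathlib import Path


def has_owner_only_mode(path: Path) -> bool:
    """Return True when the file's permission bits are exactly 0o600."""
    return stat.S_IMODE(path.stat().st_mode) == 0o600


def check_key_files(directory: Path, names=("id_rsa", "authorized_keys")) -> dict:
    """Map each expected key file name to True/False.

    A missing file counts as False, so the result can be used as a
    simple pre-flight check before mounting the files into a container.
    """
    results = {}
    for name in names:
        candidate = directory / name
        results[name] = candidate.is_file() and has_owner_only_mode(candidate)
    return results


if __name__ == "__main__":
    # Report the state of the key files in the current working directory.
    for name, ok in check_key_files(Path.cwd()).items():
        print(f"{name}: {'ok' if ok else 'needs chmod 600 (or is missing)'}")
```

Running the script from the directory that holds `id_rsa` and `authorized_keys` prints one line per file; any file reported as needing attention should be fixed with `chmod 600` before it is mounted.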
#### Setup and Run IPEX Multi-Node Container

@@ -132,30 +134,52 @@ To add these files correctly please follow the steps described below.

1. Setup ID Keys

You can use the commands provided below to [generate the Identity keys](https://www.ssh.com/academy/ssh/keygen#creating-an-ssh-key-pair-for-user-authentication) for OpenSSH.
You can use the commands provided below to [generate the identity keys](https://www.ssh.com/academy/ssh/keygen#creating-an-ssh-key-pair-for-user-authentication) for OpenSSH.

```bash
ssh-keygen -q -N "" -t rsa -b 4096 -f ./id_rsa
touch authorized_keys
cat id_rsa.pub >> authorized_keys
```

2. Configure the permissions and ownership for all of the files you have created so far.
2. Configure the permissions and ownership for all of the files you have created so far

```bash
chmod 600 id_rsa config authorized_keys
chown root:root id_rsa.pub id_rsa config authorized_keys
```

3. Setup hostfile. The hostfile is needed for running torch distributed using `ipexrun` utility. If you're not using `ipexrun` you can skip this step.
3. Create a hostfile for `torchrun` or `ipexrun`. (Optional)

```txt
<Host 1 IP/Hostname>
<Host 2 IP/Hostname>
Host host1
    HostName <Hostname of host1>
    IdentitiesOnly yes
    IdentityFile ~/.ssh/id_rsa
    Port <SSH Port>
Host host2
    HostName <Hostname of host2>
    IdentitiesOnly yes
    IdentityFile ~/.ssh/id_rsa
    Port <SSH Port>
...
```

4. Now start the workers and execute DDP on the launcher.
4. Configure [Intel® oneAPI Collective Communications Library] in your Python script

```python
import os

import torch.distributed as dist
import oneccl_bindings_for_pytorch  # noqa: F401 (importing this makes the "ccl" backend available)

dist.init_process_group(
    backend="ccl",
    init_method="tcp://127.0.0.1:3022",
    world_size=int(os.environ.get("WORLD_SIZE")),
    rank=int(os.environ.get("RANK")),
)
```

5. Now start the workers and execute DDP on the launcher

1. Worker run command:

@@ -182,65 +206,36 @@ To add these files correctly please follow the steps described below.
bash -c 'ipexrun cpu --nnodes 2 --nprocs-per-node 1 --master-addr 127.0.0.1 --master-port 3022 /workspace/tests/ipex-resnet50.py --ipex --device cpu --backend ccl'
```
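
The process group configuration in the steps above reads `WORLD_SIZE` and `RANK` from the environment and hard-codes the rendezvous address. The following is a minimal sketch of one way to collect those values with clearer error messages. The helper name `ccl_process_group_kwargs` is hypothetical, and reading `MASTER_ADDR` and `MASTER_PORT` is an assumption about how the launcher exports its settings; the defaults fall back to the `127.0.0.1` address and port `3022` used in the commands above.

```python
import os


def ccl_process_group_kwargs(env=None) -> dict:
    """Build keyword arguments for torch.distributed.init_process_group.

    Hypothetical helper: WORLD_SIZE and RANK are required, while
    MASTER_ADDR and MASTER_PORT are optional and default to the address
    and port used in the launcher command.
    """
    env = os.environ if env is None else env

    # Fail early with a readable message instead of int(None) raising TypeError.
    missing = [name for name in ("WORLD_SIZE", "RANK") if name not in env]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")

    world_size = int(env["WORLD_SIZE"])
    rank = int(env["RANK"])
    if not 0 <= rank < world_size:
        raise ValueError(f"RANK {rank} is outside [0, {world_size})")

    addr = env.get("MASTER_ADDR", "127.0.0.1")
    port = int(env.get("MASTER_PORT", "3022"))
    return {
        "backend": "ccl",
        "init_method": f"tcp://{addr}:{port}",
        "world_size": world_size,
        "rank": rank,
    }
```

In a training script that has already imported `torch.distributed as dist` and `oneccl_bindings_for_pytorch`, the result could be passed as `dist.init_process_group(**ccl_process_group_kwargs())`.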

5. Start SSH server with a custom port.
To start the SSH server on a custom port, use the commands described below.
1. Worker command:
```bash
export SSH_PORT=<User SSH Port>
docker run -it --rm \
--net=host \
-v $PWD/authorized_keys:/etc/ssh/authorized_keys \
-v $PWD/tests:/workspace/tests \
-e SSH_PORT=${SSH_PORT} \
-w /workspace \
intel/intel-extension-for-pytorch:2.3.0-pip-multinode \
bash -c '/usr/sbin/sshd -D -p ${SSH_PORT}'
```
2. Add hosts to config. (**Note:** This is an optional step)
Users can optionally mount their own custom client config file that defines the hosts and ports where the SSH server runs inside the container. An example config file is provided below. Mount this file in the launcher container at `/etc/ssh/ssh_config`.
> [!NOTE]
> [Intel® MPI] can be configured based on your machine settings. If the above commands do not work for you, see the documentation for how to configure based on your network.

```bash
touch config
```
#### Enable [DeepSpeed*] optimizations

```txt
Host host1
    HostName <Hostname of host1>
    IdentitiesOnly yes
    IdentityFile ~/.ssh/id_rsa
    Port <SSH Port>
Host host2
    HostName <Hostname of host2>
    IdentitiesOnly yes
    IdentityFile ~/.ssh/id_rsa
    Port <SSH Port>
...
```
To enable [DeepSpeed*] optimizations with [Intel® oneAPI Collective Communications Library], add the following to your Python script:

3. Launcher run command:
```python
import deepspeed
```bash
docker run -it --rm \
--net=host \
-v $PWD/id_rsa:/root/.ssh/id_rsa \
-v $PWD/config:/etc/ssh/ssh_config \
-v $PWD/hostfile:/workspace/hostfile \
-v $PWD/tests:/workspace/tests \
-e SSH_PORT=${SSH_PORT} \
-w /workspace \
intel/intel-extension-for-pytorch:2.3.0-pip-multinode \
bash -c 'ipexrun cpu --nnodes 2 --nprocs-per-node 1 --master-addr 127.0.0.1 --master-port ${SSH_PORT} /workspace/tests/ipex-resnet50.py --ipex --device cpu --backend ccl'
```
# Rather than dist.init_process_group(), use deepspeed.init_distributed()
deepspeed.init_distributed(backend="ccl")
```

> [!NOTE]
> [Intel® MPI] can be configured based on your machine settings. If the above commands do not work for you, see the documentation for how to configure based on your network.
Additionally, if you have a [DeepSpeed* configuration](https://www.deepspeed.ai/getting-started/#deepspeed-configuration), you can use the command below as your launcher to run your script with that configuration:

> [!TIP]
> Additionally, [DeepSpeed*] optimizations can be utilized in place of ipexrun with the `ccl` backend for multi-node training.
```bash
docker run -it --rm \
--net=host \
-v $PWD/id_rsa:/root/.ssh/id_rsa \
-v $PWD/tests:/workspace/tests \
-v $PWD/hostfile:/workspace/hostfile \
-v $PWD/ds_config.json:/workspace/ds_config.json \
-w /workspace \
intel/intel-extension-for-pytorch:2.3.0-pip-multinode \
bash -c 'deepspeed --launcher IMPI \
--master_addr 127.0.0.1 --master_port 3022 \
--deepspeed_config ds_config.json --hostfile /workspace/hostfile \
/workspace/tests/ipex-resnet50.py --ipex --device cpu --backend ccl --deepspeed'
```
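
The launcher command above mounts a `ds_config.json` file that this README does not show. The following is a minimal sketch of generating one, assuming only the three batch-size keys from the DeepSpeed configuration documentation are needed; the helper names and the numeric values are illustrative assumptions, and a real configuration will usually carry more settings.

```python
import json


def build_ds_config(micro_batch_per_rank: int, grad_accum_steps: int, world_size: int) -> dict:
    """Return a minimal DeepSpeed config with consistent batch-size settings.

    DeepSpeed expects train_batch_size to equal the per-rank micro batch
    size times the gradient accumulation steps times the number of ranks.
    The key keeps its "per_gpu" name even when training on CPU.
    """
    train_batch_size = micro_batch_per_rank * grad_accum_steps * world_size
    return {
        "train_batch_size": train_batch_size,
        "train_micro_batch_size_per_gpu": micro_batch_per_rank,
        "gradient_accumulation_steps": grad_accum_steps,
    }


def write_ds_config(path: str, config: dict) -> None:
    """Write the config as indented JSON so it is easy to review and mount."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=2)
        handle.write("\n")


if __name__ == "__main__":
    # Illustrative values: 2 nodes with 1 process each, as in the commands above.
    write_ds_config("ds_config.json", build_ds_config(8, 1, 2))
```

Deriving `train_batch_size` from the other two values and the world size keeps the three settings consistent, which avoids a batch-size mismatch error at DeepSpeed start-up.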

---

@@ -277,7 +272,7 @@ The images below additionally include [Intel® oneAPI Collective Communications

| Tag(s) | Pytorch | IPEX | oneCCL | INC | Dockerfile |
| --------------------- | -------- | ------------ | -------------------- | --------- | --------------- |
| `2.3.0-idp-multinode` | [v2.3.0] | [v2.3.0+cpu] | [v2.3.0][ccl-v2.3.0] | [v2.5.1] | [v0.4.0-Beta] |
| `2.3.0-idp-multinode` | [v2.3.0] | [v2.3.0+cpu] | [v2.3.0][ccl-v2.3.0] | [v2.6] | [v0.4.0-Beta] |
| `2.2.0-idp-multinode` | [v2.2.0] | [v2.2.0+cpu] | [v2.2.0][ccl-v2.2.0] | [v2.4.1] | [v0.3.4] |
| `2.1.0-idp-mulitnode` | [v2.1.0] | [v2.1.0+cpu] | [v2.1.0][ccl-v2.1.0] | [v2.3.1] | [v0.2.3] |
| `2.0.0-idp-multinode` | [v2.0.0] | [v2.0.0+cpu] | [v2.0.0][ccl-v2.0.0] | [v2.1.1] | [v0.1.0] |
@@ -354,19 +349,23 @@ It is the image user's responsibility to ensure that any use of The images below
[v2.0.110+xpu]: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.0.110%2Bxpu
[v2.3.0]: https://github.com/pytorch/pytorch/releases/tag/v2.3.0
[v2.2.2]: https://github.com/pytorch/pytorch/releases/tag/v2.2.2
[v2.2.0]: https://github.com/pytorch/pytorch/releases/tag/v2.2.0
[v2.1.2]: https://github.com/pytorch/pytorch/releases/tag/v2.1.2
[v2.1.0]: https://github.com/pytorch/pytorch/releases/tag/v2.1.0
[v2.0.1]: https://github.com/pytorch/pytorch/releases/tag/v2.0.1
[v2.0.0]: https://github.com/pytorch/pytorch/releases/tag/v2.0.0
[v2.5.1]: https://github.com/intel/neural-compressor/releases/tag/v2.5.1
[v2.6]: https://github.com/intel/neural-compressor/releases/tag/v2.6
[v2.4.1]: https://github.com/intel/neural-compressor/releases/tag/v2.4.1
[v2.3.1]: https://github.com/intel/neural-compressor/releases/tag/v2.3.1
[v2.1.1]: https://github.com/intel/neural-compressor/releases/tag/v2.1.1
[v2.3.0+cpu]: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.3.0%2Bcpu
[v2.2.0+cpu]: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.2.0%2Bcpu
[v2.1.100+cpu]: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.1.100%2Bcpu
[v2.1.0+cpu]: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.1.0%2Bcpu
[v2.0.100+cpu]: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.0.100%2Bcpu
[v2.0.0+cpu]: https://github.com/intel/intel-extension-for-pytorch/releases/tag/v2.0.0%2Bcpu
[ccl-v2.3.0]: https://github.com/intel/torch-ccl/releases/tag/v2.3.0%2Bcpu