Training with half-precision doesn't work for the torch tile or CUDA bindings #623

coreylammie · 2024-03-08T15:34:12Z

Description

Training with half precision doesn't work for the torch tile or the CUDA bindings, e.g., when with torch.autocast(device_type="cuda", dtype=torch.bfloat16): is used in conjunction with rpu_config.runtime.data_type = RPUDataType.HALF.

How to reproduce

Convert a model with either InferenceRPUConfig() or TorchInferenceRPUConfig(), set rpu_config.runtime.data_type = RPUDataType.HALF, and use with torch.autocast(device_type="cuda", dtype=torch.bfloat16): in the training loop.

Expected behavior

https://aihwkit.readthedocs.io/en/latest/api/aihwkit.simulator.parameters.enums.html#aihwkit.simulator.parameters.enums.RPUDataType implies that this is supported.

jubueche · 2024-03-08T16:23:48Z

I see only fp16 in the Docs that you linked, but you are doing bf16. Does that explain it? And what exactly fails? Can you give a MWE?
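The fp16/bf16 distinction raised here matters because the two 16-bit formats trade range against precision differently. The sketch below uses only the Python standard library and does not touch aihwkit or torch; the helper names to_fp16 and to_bf16 are hypothetical and exist only for this illustration. It rounds a value to each format and back, so the difference can be inspected directly.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to IEEE half precision (5 exponent bits, 10 mantissa bits)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        # struct refuses values beyond the fp16 range (max 65504);
        # hardware would produce +/-inf instead.
        return float("inf") if x > 0 else float("-inf")


def to_bf16(x: float) -> float:
    """Round x to bfloat16 (8 exponent bits, 7 mantissa bits).

    bfloat16 is the upper 16 bits of a float32, so we round the
    float32 bit pattern to nearest-even and clear the lower half.
    NaN handling is omitted in this sketch.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits = (bits + 0x7FFF + ((bits >> 16) & 1)) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


if __name__ == "__main__":
    for value in (1.001, 70000.0):
        print(value, "fp16:", to_fp16(value), "bf16:", to_bf16(value))
```

fp16 keeps more mantissa bits, so it resolves 1.001 better, but it overflows above 65504. bf16 keeps the float32 exponent range and therefore represents 70000 as a finite (if coarse) value.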

coreylammie · 2024-03-08T16:34:08Z

@jubueche no, neither works. The error differs between bfloat16 and float16, and between the torch tile and the CUDA bindings.

MWE:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
from aihwkit.simulator.configs import InferenceRPUConfig, TorchInferenceRPUConfig
from aihwkit.nn.conversion import convert_to_analog
from aihwkit.optim import AnalogSGD
from aihwkit.simulator.parameters.enums import RPUDataType

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output
    
if __name__ == "__main__":
    model = Net()
    rpu_config = InferenceRPUConfig() # or TorchInferenceRPUConfig().
    rpu_config.runtime.data_type = RPUDataType.HALF
    model = convert_to_analog(model, rpu_config) 
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    dataset = datasets.MNIST('data', train=True, download=True, transform=transform)
    train_loader = torch.utils.data.DataLoader(dataset, batch_size=32)
    
    optimizer = AnalogSGD(model.parameters(), lr=0.1)
    model.to(device)
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        with torch.autocast(device_type="cuda", dtype=torch.float16): # or bfloat16
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = F.nll_loss(output, target)
            loss.backward()

As far as I am aware, we don't have any examples for half precision, so I'm not particularly surprised it doesn't work.

jubueche · 2024-03-08T16:40:45Z

I see that there are some compile options

option(RPU_USE_FP16 "EXPERIMENTAL: Build FP16 support (only available with CUDA)" OFF)
option(RPU_USE_DOUBLE "EXPERIMENTAL: Build DOUBLE support" OFF)
option(RPU_PARAM_FP16 "EXPERIMENTAL: Use FP16 for (4 + 2) CUDA params" OFF)
option(RPU_BFLOAT_AS_FP16 "EXPERIMENTAL: Use bfloat instead of half for FP16 (only supported for A100+, CUDA 12)" OFF)

@maljoras maybe you know how to enable that?
I can confirm that this currently does not work with the torch tile. I will look into it.

kaoutar55 · 2024-03-08T16:44:31Z

@coreylammie on which GPUs did you try this?

coreylammie · 2024-03-08T17:16:19Z

> @coreylammie in which GPUs did you try this one?

A100_80GB. Once we figure this out, it would be great to add an example for it. I intend to add an example for MobileBERT/SQuAD anyway, so perhaps we can add a single example that uses half-precision support for this network/task.

jubueche · 2024-03-08T17:48:59Z

@coreylammie Note that your MWE was not even training in FP32. I have changed it to the below:

import tqdm
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
from aihwkit.simulator.configs import InferenceRPUConfig, TorchInferenceRPUConfig
from aihwkit.nn.conversion import convert_to_analog
from aihwkit.optim import AnalogSGD
from aihwkit.simulator.parameters.enums import RPUDataType

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output
    
if __name__ == "__main__":
    model = Net()
    rpu_config = TorchInferenceRPUConfig()
    model = convert_to_analog(model, rpu_config) 
    nll_loss = torch.nn.NLLLoss()
    transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    dataset = datasets.MNIST('data', train=True, download=True, transform=transform)
    train_loader = torch.utils.data.DataLoader(dataset, batch_size=32)
    
    model = model.to(device=device, dtype=torch.bfloat16)
    optimizer = AnalogSGD(model.parameters(), lr=0.1)
    model = model.train()
    
    pbar = tqdm.tqdm(enumerate(train_loader))
    for batch_idx, (data, target) in pbar:
        data, target = data.to(device=device, dtype=torch.bfloat16), target.to(device=device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output.float(), target)
        loss.backward()
        optimizer.step()
        pbar.set_description(f"Loss {loss:.4f}")

Ideally, this should train. The autocast is unfortunately only supported for CUDA.
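The pattern used in the script above (cast the whole model and the inputs to bfloat16, then cast the output back to FP32 for the loss) can be checked in isolation. The sketch below is a minimal CPU-only illustration with a plain nn.Linear standing in for the converted analog model; it is an assumption that the analog tiles follow the same dtype rules, and the snippet does not test aihwkit itself.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for the converted model: parameters are stored in bf16.
model = nn.Linear(8, 4).to(dtype=torch.bfloat16)

# Inputs must be cast to the same dtype as the parameters.
data = torch.randn(2, 8).to(dtype=torch.bfloat16)
target = torch.tensor([1, 3])

# The forward pass runs entirely in bf16.
output = F.log_softmax(model(data), dim=1)

# The loss is computed in FP32 by casting the output explicitly.
loss = F.nll_loss(output.float(), target)
loss.backward()

print(output.dtype, loss.dtype, model.weight.grad.dtype)
```

Unlike autocast, this approach changes the dtype of the stored parameters, so the gradients and the optimizer update are also in bf16.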

jubueche · 2024-03-08T17:50:27Z

@coreylammie could you use the branch above and check whether all tests pass on GPU and whether you can run the example above? Also, feel free to add the autocast back in. Just remove the .float() cast on the output before you feed it into the NLL loss.

coreylammie · 2024-03-08T18:03:46Z

@jubueche first, the MWE was not intended to train. It was intended to reproduce the error, which is raised when loss.backward() is called. Second, the documentation here https://pytorch.org/docs/stable/amp.html#cpu-op-specific-behavior seems to indicate that autocast also supports CPU. One of its examples lists the following code:


# Creates model and optimizer in default precision
model = Net()
optimizer = optim.SGD(model.parameters(), ...)

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()

        # Runs the forward pass with autocasting.
        with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
            output = model(input)
            loss = loss_fn(output, target)

        loss.backward()
        optimizer.step()

Are you sure this is not supported?
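A quick way to check the documented CPU behaviour is a small self-contained run. The sketch below uses a plain nn.Linear rather than an analog layer, so it only shows what stock PyTorch does under CPU autocast and says nothing about whether the aihwkit tiles handle it.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Model and optimizer are created in default (FP32) precision.
model = nn.Linear(8, 4)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

data = torch.randn(2, 8)
target = torch.tensor([1, 3])

optimizer.zero_grad()

# Only the forward pass and the loss are wrapped in autocast.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    logits = model(data)
    loss = F.cross_entropy(logits.float(), target)

# backward() and step() run outside the autocast context.
loss.backward()
optimizer.step()

print(logits.dtype, model.weight.dtype, model.weight.grad.dtype)
```

Here the linear layer's output is bfloat16 while the stored weights and their gradients stay in FP32, which is the main difference from casting the model with model.to(dtype=torch.bfloat16).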

jubueche · 2024-03-08T19:48:18Z

I see. Maybe I forgot to set something. I will check soon. In the meantime, can you check if it runs on GPU in my PR?

kaoutar55 · 2024-03-14T15:32:20Z

Need to document and add an example.

coreylammie added the bug label Mar 8, 2024

jubueche mentioned this issue Mar 8, 2024

Fix the support of different dtypes for the torch model #625

Merged

kaoutar55 closed this as completed Jun 26, 2024

This was referenced Aug 12, 2024

Improve documentation and add an example of training with half precision #677

Open

feat(docs): add half-precision training section in using_simulator docs #678

Open
