Training with half-precision doesn't work for the torch tile or CUDA bindings #623

Closed
coreylammie opened this issue Mar 8, 2024 · 10 comments
Labels
bug Something isn't working

Comments

@coreylammie
Contributor

Description

Training with half precision doesn't work for either the torch tile or the CUDA bindings, e.g., when the context manager with torch.autocast(device_type="cuda", dtype=torch.bfloat16): is combined with rpu_config.runtime.data_type = RPUDataType.HALF.

How to reproduce

Convert a model with either InferenceRPUConfig() or TorchInferenceRPUConfig(), set rpu_config.runtime.data_type = RPUDataType.HALF, and wrap the body of the training loop in with torch.autocast(device_type="cuda", dtype=torch.bfloat16):.

Expected behavior

The RPUDataType documentation at https://aihwkit.readthedocs.io/en/latest/api/aihwkit.simulator.parameters.enums.html#aihwkit.simulator.parameters.enums.RPUDataType implies that this is supported.

@coreylammie coreylammie added the bug Something isn't working label Mar 8, 2024
@jubueche
Collaborator

jubueche commented Mar 8, 2024

I see only fp16 in the Docs that you linked, but you are doing bf16. Does that explain it? And what exactly fails? Can you give a MWE?
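
For context on the fp16 versus bf16 distinction raised here: both formats occupy 16 bits, but float16 uses 5 exponent and 10 mantissa bits, while bfloat16 uses 8 exponent and 7 mantissa bits. The following is a minimal standard-library sketch of that trade-off. The helper names (to_float16, to_bfloat16, fits_float16) are illustrative and not part of aihwkit or PyTorch, and the bfloat16 helper assumes plain truncation, whereas real conversions usually round to nearest.

```python
import struct


def to_float16(x: float) -> float:
    """Round-trip x through IEEE 754 binary16 (struct format 'e')."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_bfloat16(x: float) -> float:
    """Round-trip x through bfloat16 by keeping the top 16 bits of float32.

    Assumption: simple truncation; real conversions usually round to nearest.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]


def fits_float16(x: float) -> bool:
    """Return True if x can be packed as float16 without overflowing."""
    try:
        to_float16(x)
        return True
    except (OverflowError, struct.error):
        return False


if __name__ == "__main__":
    # float16 keeps more precision near 1.0 than bfloat16 does.
    print(to_float16(1.001), to_bfloat16(1.001))
    # bfloat16 keeps the float32 range; float16 overflows above 65504.
    print(fits_float16(70000.0), to_bfloat16(70000.0))
```

The practical consequence is that a fix that works for one 16-bit format does not automatically carry over to the other, so it helps to report failures for each dtype separately.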

@coreylammie
Contributor Author

coreylammie commented Mar 8, 2024

@jubueche no, neither works. The errors differ between bfloat16 and float16, and between the torch tile and the CUDA bindings.

MWE:

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
from aihwkit.simulator.configs import InferenceRPUConfig, TorchInferenceRPUConfig
from aihwkit.nn.conversion import convert_to_analog
from aihwkit.optim import AnalogSGD
from aihwkit.simulator.parameters.enums import RPUDataType

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output
    
if __name__ == "__main__":
    model = Net()
    rpu_config = InferenceRPUConfig() # or TorchInferenceRPUConfig().
    rpu_config.runtime.data_type = RPUDataType.HALF
    model = convert_to_analog(model, rpu_config) 
    transform = transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    dataset = datasets.MNIST('data', train=True, download=True, transform=transform)
    train_loader = torch.utils.data.DataLoader(dataset, batch_size=32)
    
    optimizer = AnalogSGD(model.parameters(), lr=0.1)
    model.to(device)
    model.train()
    for batch_idx, (data, target) in enumerate(train_loader):
        with torch.autocast(device_type="cuda", dtype=torch.float16): # or bfloat16
            data, target = data.to(device), target.to(device)
            optimizer.zero_grad()
            output = model(data)
            loss = F.nll_loss(output, target)
            loss.backward()

As far as I am aware, we don't have any examples for half precision, so I'm not particularly surprised it doesn't work.
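
Since the error reportedly differs for each combination of tile and dtype, one way to report all four outcomes at once is a small harness like the sketch below. It is only an illustration: collect_errors and the run_once callable are hypothetical names, and run_once is assumed to build the model for the given configuration, run one forward/backward step as in the MWE above, and raise on failure.

```python
import itertools
from typing import Callable, Dict, Tuple


def collect_errors(
    run_once: Callable[[str, str], None],
    tiles: Tuple[str, ...] = ("InferenceRPUConfig", "TorchInferenceRPUConfig"),
    dtypes: Tuple[str, ...] = ("float16", "bfloat16"),
) -> Dict[Tuple[str, str], str]:
    """Run every (tile, dtype) combination and record the outcome.

    `run_once` is a hypothetical, user-supplied callable; it is not an
    aihwkit API. It should raise if the training step fails.
    """
    results: Dict[Tuple[str, str], str] = {}
    for tile, dtype in itertools.product(tiles, dtypes):
        try:
            run_once(tile, dtype)
            results[(tile, dtype)] = "ok"
        except Exception as exc:  # keep going so every combination is reported
            results[(tile, dtype)] = f"{type(exc).__name__}: {exc}"
    return results
```

Pasting the resulting dictionary into the issue would show which combinations fail and with which message.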

@jubueche
Collaborator

jubueche commented Mar 8, 2024

I see that there are some compile options

option(RPU_USE_FP16 "EXPERIMENTAL: Build FP16 support (only available with CUDA)" OFF)
option(RPU_USE_DOUBLE "EXPERIMENTAL: Build DOUBLE support" OFF)
option(RPU_PARAM_FP16 "EXPERIMENTAL: Use FP16 for (4 + 2) CUDA params" OFF)
option(RPU_BFLOAT_AS_FP16 "EXPERIMENTAL: Use bfloat instead of half for FP16 (only supported for A100+, CUDA 12)" OFF)

@maljoras maybe you know how to enable that?
I can confirm that this currently does not work with the torch tile. I will look into it.

@kaoutar55
Collaborator

@coreylammie on which GPUs did you try this?

@coreylammie
Contributor Author

@coreylammie in which GPUs did you try this one?

A100_80GB. Once we figure this out, it would be great to add an example for it. I intend to add an example for MobileBERT/SQuAD anyway, so perhaps we can add a single example that uses half-precision support for this network/task.

@jubueche
Collaborator

jubueche commented Mar 8, 2024

@coreylammie Note that your MWE was not even training in FP32. I have changed it to the below:

import tqdm
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import datasets, transforms
from aihwkit.simulator.configs import InferenceRPUConfig, TorchInferenceRPUConfig
from aihwkit.nn.conversion import convert_to_analog
from aihwkit.optim import AnalogSGD
from aihwkit.simulator.parameters.enums import RPUDataType

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(1, 32, 3, 1)
        self.conv2 = nn.Conv2d(32, 64, 3, 1)
        self.dropout1 = nn.Dropout(0.25)
        self.dropout2 = nn.Dropout(0.5)
        self.fc1 = nn.Linear(9216, 128)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = self.conv1(x)
        x = F.relu(x)
        x = self.conv2(x)
        x = F.relu(x)
        x = F.max_pool2d(x, 2)
        x = self.dropout1(x)
        x = torch.flatten(x, 1)
        x = self.fc1(x)
        x = F.relu(x)
        x = self.dropout2(x)
        x = self.fc2(x)
        output = F.log_softmax(x, dim=1)
        return output
    
if __name__ == "__main__":
    model = Net()
    rpu_config = TorchInferenceRPUConfig()
    model = convert_to_analog(model, rpu_config) 
    nll_loss = torch.nn.NLLLoss()
    transform=transforms.Compose([
        transforms.ToTensor(),
        transforms.Normalize((0.1307,), (0.3081,))
    ])
    dataset = datasets.MNIST('data', train=True, download=True, transform=transform)
    train_loader = torch.utils.data.DataLoader(dataset, batch_size=32)
    
    model = model.to(device=device, dtype=torch.bfloat16)
    optimizer = AnalogSGD(model.parameters(), lr=0.1)
    model = model.train()
    
    pbar = tqdm.tqdm(enumerate(train_loader))
    for batch_idx, (data, target) in pbar:
        data, target = data.to(device=device, dtype=torch.bfloat16), target.to(device=device)
        optimizer.zero_grad()
        output = model(data)
        loss = F.nll_loss(output.float(), target)
        loss.backward()
        optimizer.step()
        pbar.set_description(f"Loss {loss:.4f}")

Ideally, this should train. The autocast is unfortunately only supported for CUDA.
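
To make the dtype flow of this full-cast approach explicit, here is a minimal sketch in plain PyTorch on CPU, with nn.Linear standing in for the converted analog model (an assumption for illustration; it does not exercise aihwkit). It shows that model.to(dtype=torch.bfloat16) casts the parameters themselves, and that the .float() call upcasts the output so the loss is computed in float32.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Stand-in for the converted model: parameters are cast to bfloat16.
model = nn.Linear(8, 4).to(dtype=torch.bfloat16)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

# Inputs must match the parameter dtype; class targets stay integer.
data = torch.randn(16, 8).to(dtype=torch.bfloat16)
target = torch.randint(0, 4, (16,))

optimizer.zero_grad()
output = F.log_softmax(model(data), dim=1)  # bfloat16 output
loss = F.nll_loss(output.float(), target)   # loss computed in float32
loss.backward()                             # gradients land in bfloat16
optimizer.step()

print(model.weight.dtype, output.dtype, loss.dtype, model.weight.grad.dtype)
```

Unlike autocast, which leaves the parameters in float32 and casts per operation, this approach keeps the weights and their gradients in bfloat16 throughout.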

@jubueche
Collaborator

jubueche commented Mar 8, 2024

@coreylammie could you use the branch above and check whether all tests pass on GPU and whether you can run the example above? Also, feel free to add the autocast back in. Just remove the .float() cast on the output before you feed it into the NLL loss.

@coreylammie
Contributor Author

Quoting @jubueche: "Note that your MWE was not even training in FP32. I have changed it to the below: [script as posted in the earlier comment] Ideally, this should train. The autocast is unfortunately only supported for CUDA."

@jubueche first, the MWE was not intended to train. It was intended to reproduce the error, which is raised when loss.backward() is called. Second, the documentation at https://pytorch.org/docs/stable/amp.html#cpu-op-specific-behavior seems to indicate that autocast also supports CPU. It lists the following example:


# Creates model and optimizer in default precision
model = Net()
optimizer = optim.SGD(model.parameters(), ...)

for epoch in epochs:
    for input, target in data:
        optimizer.zero_grad()

        # Runs the forward pass with autocasting.
        with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
            output = model(input)
            loss = loss_fn(output, target)

        loss.backward()
        optimizer.step()

Are you sure this is not supported?
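
As a quick check of this point in plain PyTorch (not aihwkit), the following sketch runs one training step under CPU autocast. It uses nn.Linear as a stand-in model, which is an assumption for illustration; whether the analog tiles behave the same way is exactly what this issue is about.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

model = nn.Linear(8, 4)  # parameters stay in float32
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
data = torch.randn(16, 8)
target = torch.randint(0, 4, (16,))

optimizer.zero_grad()
# Forward pass under CPU autocast; linear layers run in bfloat16.
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    logits = model(data)
    loss = F.cross_entropy(logits.float(), target)

# Backward and optimizer step happen outside the autocast context.
loss.backward()
optimizer.step()

print(logits.dtype, model.weight.dtype, model.weight.grad.dtype)
```

Here the layer output is bfloat16 while the weights and their gradients remain float32, which is the usual autocast behavior for a standard torch.nn module.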

@jubueche
Collaborator

jubueche commented Mar 8, 2024

I see. Maybe I forgot to set something. I will check soon. In the meantime, can you check if it runs on GPU in my PR?

@kaoutar55
Collaborator

Need to document and add an example.
