Strange behaviour with mps M1 #10178
Comments
@jgoo9410 yes there are silent errors in MPS inference, likely in the Detect() head. If you can help debug and trace the source of the differences that would help. I compared the feature outputs going into Detect and I believe they were identical. Perhaps anchor/grid tensors on different devices or dtypes might be the cause.
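One quick way to check that (a minimal sketch, not part of YOLOv5; assumes a torch.hub-loaded yolov5s and an MPS-capable machine — the attribute names match the current Detect() head, but the wrapper depth can vary):

import torch

# Inspect which device/dtype the Detect() grid/anchor tensors land on after moving the model to MPS.
model = torch.hub.load('ultralytics/yolov5', 'yolov5s')
model.to('mps')
detect = [m for m in model.modules() if m.__class__.__name__ == 'Detect'][-1]
print('stride:', detect.stride.device, detect.stride.dtype)
print('anchors:', detect.anchors.device, detect.anchors.dtype)
print('grids:', [g.device for g in detect.grid])
print('anchor grids:', [a.device for a in detect.anchor_grid])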
Hmm. I've had a look at similar bugs on here, but this one seems different. In some of the other bugs, using mps appears to generate multiple incorrect detections. In my case the detection is 'correct', just displaced. Interestingly though, if I allow the detections to continue, the majority of them end up being correct both in detection and in location, even in areas where objects were previously displaced. The problem persists over a few hundred frames, until eventually the detection bbox and the object converge and the detection becomes correct. The point is that there is definitely a pattern, and it appears only certain detections are bringing the issue to the fore. GDPR prevents me from uploading the footage here, but I'd be happy to share it privately. I'm happy to put in a shift to try and find the source of the inaccuracy, although I suspect it's not an inherent yolov5 issue.
@jgoo9410 one clue is that classification inference with MPS works correctly (same result as CPU), so this is why I say that the difference is likely in the Detect grids or anchor devices/dtypes. NMS itself has been converted to CPU when MPS is used, as MPS torch ops are not fully supported there, so I don't think NMS plays a part in the difference. I think you're right though that it's likely a torch bug rather than a YOLOv5 bug, but I do know we handle Detect grids/anchors a little strangely, i.e. using a custom _apply() function here to make sure they respect module.to() ops (Lines 646 to 656 in a9f895d).
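For reference, a simplified sketch of that _apply() pattern (paraphrased, not the verbatim lines referenced above): plain tensor attributes that are neither parameters nor registered buffers are moved along with the module.

import torch
import torch.nn as nn

class DetectLike(nn.Module):
    # Simplified illustration: _apply() is overridden so that .to(), .cpu(), .half()
    # etc. also move plain tensor attributes, keeping everything on one device/dtype.
    def __init__(self, nl=3):
        super().__init__()
        self.stride = torch.tensor([8., 16., 32.])
        self.grid = [torch.empty(0)] * nl
        self.anchor_grid = [torch.empty(0)] * nl

    def _apply(self, fn):
        self = super()._apply(fn)  # handles parameters and registered buffers
        self.stride = fn(self.stride)
        self.grid = [fn(g) for g in self.grid]
        self.anchor_grid = [fn(a) for a in self.anchor_grid]
        return self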
Okay @glenn-jocher, thanks for the clue. It will take me a couple of days to get my head around the process but I'll report back.
@jgoo9410 great!
@glenn-jocher, debugging is probably a slightly grandiose term for what I've been doing, but I've been making as many comparisons as I can. Not sure if any of them are going to be useful, but here are the key ones. Following the program flow, I've been examining the outputs function by function, and I find that the first deviation comes from the first forward function output. The outputs are as follows:

CPU:
(tensor([[[4.71834e+00, 4.22245e+00, 1.26789e+01, 1.28928e+01, 4.03146e-06, 9.99985e-01],
(tensor([[[7.56392e+00, 7.16428e+00, 1.50140e+01, 1.45258e+01, 1.16115e-05, 9.99985e-01],

MPS:
(tensor([[[7.56392e+00, 7.16428e+00, 1.50140e+01, 1.45258e+01, 1.16116e-05, 9.99985e-01],

Any thoughts? Is this a red herring?
@jgoo9410 not sure what you mean by the first forward function output. You mean the very first convolution in the model? |
@jgoo9410 oh sorry, forward and forward_once are only called once for the model as a whole. The output of forward and forward_once is the same as the output of the whole model. If I were you I'd print the inputs and values in Detect() with --device cpu and --device mps and start debugging there.
@glenn-jocher Okay, I'll take another look. Do you have a flowchart I could reference?
@jgoo9410 👋 Hello! Thanks for asking about YOLOv5 🚀 architecture visualization. We've made visualizing YOLOv5 🚀 architectures super easy. There are 3 main ways:
model.yaml
# YOLOv5 v6.0 backbone
backbone:
  # [from, number, module, args]
  [[-1, 1, Conv, [64, 6, 2, 2]],  # 0-P1/2
   [-1, 1, Conv, [128, 3, 2]],  # 1-P2/4
   [-1, 3, C3, [128]],
   [-1, 1, Conv, [256, 3, 2]],  # 3-P3/8
   [-1, 6, C3, [256]],
   [-1, 1, Conv, [512, 3, 2]],  # 5-P4/16
   [-1, 9, C3, [512]],
   [-1, 1, Conv, [1024, 3, 2]],  # 7-P5/32
   [-1, 3, C3, [1024]],
   [-1, 1, SPPF, [1024, 5]],  # 9
  ]

# YOLOv5 v6.0 head
head:
  [[-1, 1, Conv, [512, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 6], 1, Concat, [1]],  # cat backbone P4
   [-1, 3, C3, [512, False]],  # 13
   [-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],  # cat backbone P3
   [-1, 3, C3, [256, False]],  # 17 (P3/8-small)
   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 14], 1, Concat, [1]],  # cat head P4
   [-1, 3, C3, [512, False]],  # 20 (P4/16-medium)
   [-1, 1, Conv, [512, 3, 2]],
   [[-1, 10], 1, Concat, [1]],  # cat head P5
   [-1, 3, C3, [1024, False]],  # 23 (P5/32-large)
   [[17, 20, 23], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]
TensorBoard Graph
Simply start training a model, and then view the TensorBoard Graph for an interactive view of the model architecture. This example shows YOLOv5s viewed in our Notebook.
# Tensorboard
%load_ext tensorboard
%tensorboard --logdir runs/train
# Train YOLOv5s on COCO128 for 3 epochs
python train.py --weights yolov5s.pt --epochs 3
Netron viewer
Use https://netron.app to view exported ONNX models:
python export.py --weights yolov5s.pt --include onnx --simplify
Good luck 🍀 and let us know if you have any other questions!
@glenn-jocher that'll be a yes then. Thanks.
@glenn-jocher
I've compared the state of m for both cpu and mps and there is no difference other than the parameter 'device' being present when running mps. I've also individually compared the other states and found no differences. Any other ideas?
@jgoo9410 really strange. I think the difference is somewhere inside Detect(), so maybe print inference forward-pass feature values using both devices at different stages inside Detect(), e.g. print(x[i].mean()) throughout Detect() forward (Lines 56 to 79 in 7398d2d):
@glenn-jocher printing x reveals a difference. Forcing an error and tracing back through the sequence of function calls takes me back to AutoShape.forward(). Examining the variables within AutoShape.forward(), I noticed something strange.

class AutoShape(nn.Module):
    # ...

    @smart_inference_mode()
    def forward(self, ims, size=640, augment=False, profile=False):
        # print('x1')
        # Inference from various sources. For size(height=640, width=1280), RGB images example inputs are:
        #   file:   ims = 'data/images/zidane.jpg'  # str or PosixPath
        #   URI:        = 'https://ultralytics.com/images/zidane.jpg'
        #   OpenCV:     = cv2.imread('image.jpg')[:,:,::-1]  # HWC BGR to RGB x(640,1280,3)
        #   PIL:        = Image.open('image.jpg') or ImageGrab.grab()  # HWC x(640,1280,3)
        #   numpy:      = np.zeros((640,1280,3))  # HWC
        #   torch:      = torch.zeros(16,3,320,640)  # BCHW (scaled to size=640, 0-1 values)
        #   multiple:   = [Image.open('image1.jpg'), Image.open('image2.jpg'), ...]  # list of images

        dt = (Profile(), Profile(), Profile())
        with dt[0]:
            if isinstance(size, int):  # expand
                size = (size, size)
            p = next(self.model.parameters()) if self.pt else torch.empty(1, device=self.model.device)  # param
            autocast = self.amp and (p.device.type != 'cpu')  # Automatic Mixed Precision (AMP) inference
            if isinstance(ims, torch.Tensor):  # torch
                with amp.autocast(autocast):
                    return self.model(ims.to(p.device).type_as(p), augment=augment)  # inference

            # Pre-process
            n, ims = (len(ims), list(ims)) if isinstance(ims, (list, tuple)) else (1, [ims])  # number, list of images
            shape0, shape1, files = [], [], []  # image and inference shapes, filenames
            for i, im in enumerate(ims):
                f = f'image{i}'  # filename
                if isinstance(im, (str, Path)):  # filename or uri
                    im, f = Image.open(requests.get(im, stream=True).raw if str(im).startswith('http') else im), im
                    im = np.asarray(exif_transpose(im))
                elif isinstance(im, Image.Image):  # PIL Image
                    im, f = np.asarray(exif_transpose(im)), getattr(im, 'filename', f) or f
                files.append(Path(f).with_suffix('.jpg').name)
                if im.shape[0] < 5:  # image in CHW
                    im = im.transpose((1, 2, 0))  # reverse dataloader .transpose(2, 0, 1)
                im = im[..., :3] if im.ndim == 3 else cv2.cvtColor(im, cv2.COLOR_GRAY2BGR)  # enforce 3ch input
                s = im.shape[:2]  # HWC
                shape0.append(s)  # image shape
                g = max(size) / max(s)  # gain
                shape1.append([y * g for y in s])
                ims[i] = im if im.data.contiguous else np.ascontiguousarray(im)  # update
            shape1 = [make_divisible(x, self.stride) for x in np.array(shape1).max(0)] if self.pt else size  # inf shape
            x = [letterbox(im, shape1, auto=False)[0] for im in ims]  # pad
            x = np.ascontiguousarray(np.array(x).transpose((0, 3, 1, 2)))  # stack and BHWC to BCHW
            x = torch.from_numpy(x).to(p.device).type_as(p) / 255  # uint8 to fp16/32

        with amp.autocast(autocast):
            # Inference
            with dt[1]:
                pprint(x)  # <<<<<<<<<<<<<<<<<<<<<<<< here
                y = self.model(x, augment=augment)  # forward

When printing x on a CPU run:
When printing x on an mps run:
In the 40,000 subsequent lines, no values other than 1 and 0 are present for the mps run. It looks like a rounding process is taking place to the wrong number of significant figures, or something even stranger. In between initiating the detection from my program and it reaching the function above, it passes through torch/nn/modules/module.py "forward_call()" and torch/autograd/grad_mode.py "func()". My money is on the weirdness coming from somewhere in there, but I might need someone more experienced with these libraries to give me a hand if you want me to trace the problem any further.
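If it helps, here is a minimal check, independent of YOLOv5 and of autocast, of whether the uint8-to-float conversion itself misbehaves on MPS (a sketch; assumes an MPS-capable machine and uses a random frame in place of the real footage):

import numpy as np
import torch

# Random stand-in for a video frame (HWC uint8), since the real footage can't be shared.
im = np.random.randint(0, 256, (640, 640, 3), dtype=np.uint8)

for device in ('cpu', 'mps'):
    # Essentially the same uint8 -> float [0, 1] conversion AutoShape performs before inference.
    x = torch.from_numpy(im).permute(2, 0, 1).unsqueeze(0).to(device).float() / 255
    # On a healthy backend these statistics should match across devices,
    # rather than collapsing to 0/1 as described above.
    print(device, float(x.min()), float(x.max()), float(x.mean()))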
@jgoo9410 seems like autocast is not behaving well with MPS then. But I don't think the problem is autocast, because
I'm afraid I'm at the limit of my current understanding in relation to this issue, and therefore can't comment. I'm sure there are lots of other priorities with yolov5 at the moment, but being able to use mps really would be a game changer in terms of its use on such ubiquitous hardware. If there is any other way I could contribute to help get to the bottom of this issue, let me know.
@jgoo9410 I'll take a look today, hold on.
@jgoo9410 also yes this is a semi-priority; the confusion lies in the fact that torch itself does not yet fully support MPS. Some modules we rely on, like torchvision NMS and others, are not yet supported, so I've taken a wait-and-see approach until there is better support.
I hear you. Had it not worked at all I'd probably be resigned to waiting, but it so nearly works, and I've witnessed the performance increase. I can't go back to CPU, I've developed a taste for the good stuff!
Yeah it's almost working, and the performance increase is pretty dramatic, so when it does work that'll be great for all us Apple hardware ppl :)
Probably time to take out some shares in Apple.
@jgoo9410 yes, same situation now. We need aten::_unique2, aten::sort.values_stable and NMS, which are in various stages of support in pytorch/pytorch#77764, so I'd say contribute a thumbs up or a comment on those in the torch issue and sit back and wait a bit.
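A rough sketch for checking locally which of those ops still fall over on MPS (not part of YOLOv5; assumes MPS is available, and the exact error messages depend on the torch/torchvision build):

import torch
import torchvision

x = torch.rand(100, device='mps')
boxes = torch.rand(10, 4, device='mps')
boxes[:, 2:] += boxes[:, :2]  # ensure x2 > x1 and y2 > y1
scores = torch.rand(10, device='mps')

checks = {
    'torch.unique (aten::_unique2)': lambda: torch.unique(x),
    'torch.sort stable (aten::sort.values_stable)': lambda: torch.sort(x, stable=True),
    'torchvision.ops.nms': lambda: torchvision.ops.nms(boxes, scores, 0.5),
}
for name, fn in checks.items():
    try:
        fn()
        print(f'{name}: ok')
    except Exception as e:
        print(f'{name}: failed ({e})')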
Okay, I'll see if I can support the devs working on those features in some way. I assume that, as a workaround, you have implemented a CPU version of the missing mps components? If so, is it obvious to you which one is misbehaving?
@jgoo9410 torch itself has a fallback to revert to CPU, which is PYTORCH_ENABLE_MPS_FALLBACK=1. You can see that YOLOv5 classification is producing identical results on CPU and MPS with this:

PYTORCH_ENABLE_MPS_FALLBACK=1 python classify/predict.py --device cpu
PYTORCH_ENABLE_MPS_FALLBACK=1 python classify/predict.py --device mps

But detection is not. This is why I think there may be an issue with the Detect() head, because the rest of the detection model is very much in common with the classification version:

PYTORCH_ENABLE_MPS_FALLBACK=1 python detect.py --device cpu
PYTORCH_ENABLE_MPS_FALLBACK=1 python detect.py --device mps
Right, so if it's not an implemented feature it will fall back to using the CPU version of that feature? Perhaps then the datatype of some mps function is being 'cast' improperly during the transition, causing the output I listed above?
I can debug this very simply. If I place print(x[i].mean()) at the beginning of the Detect.forward method and print(y.mean()) at the end, I can see that x[i] is identical for CPU and MPS, but y is not, so it's likely that the grids/anchors are not transferring properly between devices.

def forward(self, x):
    z = []  # inference output
    for i in range(self.nl):
        x[i] = self.m[i](x[i])  # conv
        bs, _, ny, nx = x[i].shape  # x(bs,255,20,20) to x(bs,3,20,20,85)
        x[i] = x[i].view(bs, self.na, self.no, ny, nx).permute(0, 1, 3, 4, 2).contiguous()
        print(x[i].mean())  # <---- PRINT RESULT BEFORE GRIDS/ANCHORS ----

        if not self.training:  # inference
            if self.dynamic or self.grid[i].shape[2:4] != x[i].shape[2:4]:
                self.grid[i], self.anchor_grid[i] = self._make_grid(nx, ny, i)

            if isinstance(self, Segment):  # (boxes + masks)
                xy, wh, conf, mask = x[i].split((2, 2, self.nc + 1, self.no - self.nc - 5), 4)
                xy = (xy.sigmoid() * 2 + self.grid[i]) * self.stride[i]  # xy
                wh = (wh.sigmoid() * 2) ** 2 * self.anchor_grid[i]  # wh
                y = torch.cat((xy, wh, conf.sigmoid(), mask), 4)
            else:  # Detect (boxes only)
                xy, wh, conf = x[i].sigmoid().split((2, 2, self.nc + 1), 4)
                xy = (xy * 2 + self.grid[i]) * self.stride[i]  # xy
                wh = (wh * 2) ** 2 * self.anchor_grid[i]  # wh
                y = torch.cat((xy, wh, conf), 4)
            z.append(y.view(bs, self.na * nx * ny, self.no))
            print(y.mean())  # <---- PRINT RESULT AFTER GRIDS/ANCHORS ----

    return x if self.training else (torch.cat(z, 1),) if self.export else (torch.cat(z, 1), x)
Looks like self.anchor_grid differs between the two devices.
self.stride also, which I think is used to calculate self.anchor_grid, so it's probably the origin of the problem.
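For context, a simplified sketch (paraphrased, not the exact YOLOv5 _make_grid()) of how anchor_grid is derived from anchors multiplied by stride, which is why a wrong self.stride[i] would corrupt anchor_grid as well:

import torch

def make_grid_sketch(anchors_i, stride_i, nx=20, ny=20, na=3):
    # anchors_i: (na, 2) anchors for level i in grid units; stride_i: scalar stride for level i
    shape = (1, na, ny, nx, 2)
    yv, xv = torch.meshgrid(torch.arange(ny), torch.arange(nx), indexing='ij')
    grid = torch.stack((xv, yv), 2).expand(shape) - 0.5  # cell-centre offsets
    anchor_grid = (anchors_i * stride_i).view(1, na, 1, 1, 2).expand(shape)  # anchors in pixels
    return grid, anchor_grid

# Hypothetical level with stride 16: a wrong stride value rescales every anchor box.
grid, anchor_grid = make_grid_sketch(torch.tensor([[30., 61.], [62., 45.], [59., 119.]]) / 16,
                                     torch.tensor(16.))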
Is anchor_grid calculated from the model alone, or does it depend on the image?
@jgoo9410 anchor_grid depends on image size. There seem to be some bugs to work out in MPS. If I run this simple check I get erroneous output on the last term: the .mean() op is failing to run on the correct index. Seems like a PyTorch bug.

if not self.training:  # inference
    if self.dynamic or self.grid[i].shape[2:4] != x[i].shape[2:4]:
        self.grid[i], self.anchor_grid[i] = self._make_grid(nx, ny, i)
    print(i, self.stride, self.stride[i], self.stride[i].mean())

0 tensor([ 8., 16., 32.], device='mps:0') tensor(8., device='mps:0') tensor(8., device='mps:0')
1 tensor([ 8., 16., 32.], device='mps:0') tensor(16., device='mps:0') tensor(8., device='mps:0')
2 tensor([ 8., 16., 32.], device='mps:0') tensor(32., device='mps:0') tensor(8., device='mps:0')
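A stripped-down sketch of the same check outside YOLOv5, in case it helps reproduce or report the suspected indexing bug upstream (assumes an MPS-capable machine; behaviour will depend on the torch version):

import torch

stride = torch.tensor([8., 16., 32.])
for device in ('cpu', 'mps'):
    s = stride.to(device)
    for i in range(len(s)):
        # On a correct backend the two printed values agree for every i;
        # the output above shows .mean() returning the i=0 element on MPS.
        print(device, i, float(s[i]), float(s[i].mean()))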
@glenn-jocher okay. For my specific use case, I'm performing detections on a video, so my images are all the same size. In that case, if I were able to use a CPU run to get the anchor_grid and then apply it to the MPS run, in theory it should work? Assuming the issue is exclusively with the above.
@jgoo9410 you can experiment to see if you can find a solution in this area.
👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs. Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed! Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!
So, not much progress in finding the source of the issue, but I have another symptom: It appears that the whole issue is in the first x-coord in 'boxes'. If I use the difference in y (y2-y1) to determine the size of the square (won't work for rectangles obviously), and then anchor the square using the second x coord (x2), I get perfect tracking. Interestingly, when the first x value is incorrect, it is always pinned to the centre of the image, exactly 50% of the resolution. It may be the case that if I used a video that was 'portrait' rather than 'landscape' the issue would be with the Y coordinate, as I assume they are calculated identically.
Example of the error: Everything is working fine at this point and detections are being located correctly:
Here is where the issue starts. You can see that the first x value is being pinned to 352, which is 50% of the width.
I could understand if this was an overflow or rounding, or something of that nature, but I would have expected the second x value to have suffered in the same way, which it clearly hasn't. Anything about this jumping out to you @glenn-jocher? Here is a video of the issue: https://imgur.com/a/vkaEzbi
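For anyone hitting the same thing, a minimal sketch of the workaround described above (assumes roughly square objects and xyxy boxes; the function name and values are just illustrative):

def fix_box_xyxy(x1, y1, x2, y2):
    # Rebuild x1 from the (reliable) y-extent and x2, since only x1 appears corrupted.
    side = y2 - y1
    return x2 - side, y1, x2, y2

# Hypothetical displaced box: x1 pinned to 352 (half of a 704-px-wide frame),
# while y1, y2 and x2 are still correct for a ~68-px-tall object.
print(fix_box_xyxy(352.0, 100.0, 600.0, 168.0))  # -> (532.0, 100.0, 600.0, 168.0)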
@jgoo9410 This is indeed strange behavior. The fact that the first x value is consistently pinned at 50% of the width hints at a potential bug in a coordinate calculation or transformation. Unfortunately, without access to the actual video footage, pinpointing the exact cause is challenging. I would recommend continuing to investigate and possibly reaching out to the YOLO community or the Ultralytics team for further insight.
@glenn-jocher This is quite an old issue, but I have more information on it. It appears to be unrelated to Ultralytics, and is actually related to torch and torchvision. The issue was present when I created this post in Feb of this year, obviously. Around August, with an inadvertent update of torch and torchvision, the issue disappeared and all was well. In the last month or so, the latest version of torch has reintroduced the issue. I can confirm that with torch==2.0.1 and torchvision==0.15.2 the issue is not present. That's the version I'm sticking with for now. I know this is only half the picture, but hopefully someone can tell you a version it is definitely not working with to help with further investigation.
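A trivial sanity check (sketch) for confirming installed versions against the combination reported as working above:

import torch
import torchvision

# Reported working combination above: torch==2.0.1, torchvision==0.15.2
print('torch:', torch.__version__)
print('torchvision:', torchvision.__version__)
print('mps available:', torch.backends.mps.is_available())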
Thanks for sharing this valuable information, @jgoo9410. It's great to have additional context on this issue and the specific versions of torch and torchvision where the problem arises. This will be helpful for others encountering similar issues and for ongoing investigation. We appreciate your contribution to the community!
Search before asking
YOLOv5 Component
Detection
Bug
When using the yolov5 python module and targeting mps as the desired device (model = yolov5.load('best_1.pt', 'mps')), inference appears to yield inaccurate results. The detections are correct in the sense that there is a detectable object in frame. The problem is that the locations of the boxes are often, but not always, incorrect. Some of the time the boxes are shifted to the left by about 30% of the image; other times the boxes are exactly correct. When the boxes are incorrectly located they don't jitter, but track with the object, just offset by a distance.
My intuition tells me this is a rounding error somewhere. I'm imagining it's within the nightly build of torch or torchvision.
Has anyone else come across this issue? Would be great to be able to utilise the fast inferencing speed of the M1 GPU.
Targeting the cpu (default) yields the correct results (model = yolov5.load('best_1.pt')).
Environment
No response
Minimal Reproducible Example
No response
Additional
No response
Are you willing to submit a PR?