
Hi gradcheck failed #1

Open
mks0601 opened this issue Dec 9, 2017 · 16 comments

mks0601 commented Dec 9, 2017

Hi, thanks for sharing your implementation.

I want to use a RoIAlign layer in my PyTorch code, and I found yours.
To verify the implementation, I ran test.py, and the gradcheck failed.
Did you check the code?

longcw (Owner) commented Dec 10, 2017

I also noticed this problem: there is a gap between the numerical grad and the analytical grad.
But the outputs and grads of the PyTorch version and the TensorFlow version are almost the same.

 numerical:(
 0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000
 0.2012  0.0000  0.0000  0.0000
 0.6258  0.1490  0.0000  0.0000
 0.0000  0.6855  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000
 0.0373  0.0000  0.1788  0.0000
 0.1341  0.0298  0.5662  0.1192
 0.0000  0.1490  0.0000  0.5960
 0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0596  0.0000
 0.0000  0.0000  0.2086  0.0596
 0.0000  0.0000  0.0000  0.2384
 0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000
[torch.FloatTensor of size 25x4]
,)
analytical:(
 0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000
 0.2111  0.0000  0.0000  0.0000
 0.6141  0.1408  0.0000  0.0000
 0.0000  0.6844  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000
 0.0447  0.0000  0.1893  0.0000
 0.1300  0.0298  0.5507  0.1263
 0.0000  0.1449  0.0000  0.6138
 0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0665  0.0000
 0.0000  0.0000  0.1934  0.0444
 0.0000  0.0000  0.0000  0.2156
 0.0000  0.0000  0.0000  0.0000
 0.0000  0.0000  0.0000  0.0000
[torch.FloatTensor of size 25x4]
,)

mks0601 (Author) commented Dec 10, 2017 via email

longcw (Owner) commented Dec 10, 2017

I am not working on Mask RCNN.
BTW, I found that this layer passes gradcheck if I set eps=1e-3 (eps is the perturbation used for the finite differences).

gradcheck(roi_align, (image_torch, boxes, box_index), eps=1e-3)

output (max_val, min_error, max_error, mean_error):

('forward:', 0.87139809, 0.0, 7.0184469e-06, 5.5792748e-07)
('backward:', 1.0228419, 0.0, 1.7911196e-05, 9.7078487e-09)
test ok
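
For context, here is a minimal sketch (my own helper, not the repo's test code) of the central-difference estimate that gradcheck compares against the analytical gradient; eps is the h in the difference quotient:

import torch

def numerical_grad(f, x, eps):
    # central difference: df/dx_i ~ (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps)
    grad = torch.zeros_like(x)
    flat, gflat = x.view(-1), grad.view(-1)  # views share storage with x, grad
    for i in range(flat.numel()):
        orig = float(flat[i])
        flat[i] = orig + eps
        f_plus = float(f(x).sum())
        flat[i] = orig - eps
        f_minus = float(f(x).sum())
        flat[i] = orig  # restore before moving to the next coordinate
        gflat[i] = (f_plus - f_minus) / (2 * eps)
    return grad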

mks0601 (Author) commented Dec 10, 2017 via email

longcw (Owner) commented Dec 10, 2017

@mks0601 Try to modify the random input image:

# image_data = np.random.randn(batch_size, depth, im_height, im_width).astype(np.float32)
# =>
image_data = np.random.rand(batch_size, depth, im_height, im_width).astype(np.float32)

mks0601 (Author) commented Dec 10, 2017 via email

longcw (Owner) commented Dec 10, 2017

I don't think this is a problem with the implementation; it's a problem with how we use gradcheck.
Changing randn to rand actually decreases the max value of the inputs. The layer always passes the check when eps > max(inputs)/500, whatever the input is.

I don't know the root cause. You can read the gradcheck function and the layer's source code if you want to figure it out.
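
One plausible contributor, sketched below with made-up numbers rather than the repo's test: in float32, a perturbation that is tiny relative to the input magnitude is partly lost to rounding before it ever reaches the op, which makes the finite-difference quotient noisy; a larger eps such as 1e-3 stays well above the rounding noise:

import numpy as np

x = np.float32(1.0)  # stand-in for an input on the order of max(inputs)
for eps in (np.float32(1e-6), np.float32(1e-3)):
    # the perturbation that actually survives float32 rounding
    effective = (x + eps) - x
    print(eps, effective, abs(effective - eps) / eps)

# with eps=1e-6 the surviving perturbation is off by a few percent,
# while with eps=1e-3 the relative error is negligible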

mks0601 (Author) commented Dec 10, 2017 via email

longcw (Owner) commented Dec 10, 2017

What you want is crop_and_resize.

import numpy as np
import torch
from torch.autograd import Variable

from roi_align.roi_align import RoIAlign


def to_variable(arr, requires_grad=False, is_cuda=True):
    tensor = torch.from_numpy(arr)
    if is_cuda:
        tensor = tensor.cuda()
    var = Variable(tensor, requires_grad=requires_grad)
    return var


# inputs
is_cuda = False
image_data = np.tile(np.arange(7, dtype=np.float32), 7).reshape(7, 7)
image_data = image_data[np.newaxis, np.newaxis]
boxes_data = np.asarray([[0, 0, 2, 2]], dtype=np.float32)
box_index_data = np.asarray([0], dtype=np.int32)

image_torch = to_variable(image_data, requires_grad=True, is_cuda=is_cuda)
boxes = to_variable(boxes_data, requires_grad=False, is_cuda=is_cuda)
box_index = to_variable(box_index_data, requires_grad=False, is_cuda=is_cuda)

# setting transform_fpcoor to False gives plain crop_and_resize
roi_align = RoIAlign(3, 3, transform_fpcoor=False)
print(roi_align(image_torch, boxes, box_index))

output:

Variable containing:
(0 ,0 ,.,.) = 
  0  1  2
  0  1  2
  0  1  2
[torch.FloatTensor of size 1x1x3x3]

If you use the RoIAlign of this implementation instead:

# input
...
boxes_data = np.asarray([[0, 0, 3, 3]], dtype=np.float32)
...
roi_align = RoIAlign(3, 3, transform_fpcoor=True)
print(roi_align(image_torch, boxes, box_index))

output:

Variable containing:
(0 ,0 ,.,.) = 
  0  1  2
  0  1  2
  0  1  2
[torch.FloatTensor of size 1x1x3x3]

You can read more about RoIAlign here:
https://github.com/ppwwyyxx/tensorpack/blob/6d5ba6a970710eaaa14b89d24aace179eb8ee1af/examples/FasterRCNN/NOTES.md
https://github.com/ppwwyyxx/tensorpack/blob/6d5ba6a970710eaaa14b89d24aace179eb8ee1af/examples/FasterRCNN/model.py#L316

mks0601 (Author) commented Dec 10, 2017

Oh, I misunderstood your code. Thanks for clarifying.
However, the links you provided are hard for me to understand :(

Can you help me understand them? I read the NOTES.md, but I cannot see why crop_and_resize differs from roi_align other than the input form (normalized vs. unnormalized coordinates?). I also cannot understand the code.

If I just set boxes_data to [xmin, ymin, xmax+1, ymax+1] and set transform_fpcoor=True, it seems to work well so far.
Can you confirm that setting boxes_data to [xmin, ymin, xmax+1, ymax+1] is correct? Does fpcoor stand for feature plane coordinates?

longcw (Owner) commented Dec 11, 2017

crop_and_resize: bilinear sampling assumes that floating-point coordinate (0.0, 0.0) coincides with pixel (0, 0):

[image: crop_and_resize]

RoIAlign: first split the RoI into crop_size grids of equal size, then bilinear-sample the value for each grid:

[image: roi_align]

To use crop_and_resize for RoIAlign, we shift the grids by -0.5:

[image: roi_align_shifted]

In your case, the crop is

Variable containing:
(0 ,0 ,.,.) = 
  0.0000  0.0000  0.0000
  0.0000  0.5000  1.1667
  0.0000  0.5000  1.1667
[torch.FloatTensor of size 1x1x3x3]

if you set bbox=[0, 0, 2, 2]:

[image: roi_align_2]
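
To make the grids concrete, here is a small numpy sketch (helper names are mine, not the repo's API) of the 1-D sample coordinates in the two modes; plugging in the boxes above reproduces the crops, and coordinates that fall outside the image sample the zero padding, which is where the zero row and column come from:

import numpy as np

def crop_and_resize_coords(x0, x1, crop_size):
    # endpoints included: first sample lands on x0, last on x1
    return x0 + (x1 - x0) * np.arange(crop_size) / (crop_size - 1)

def roi_align_coords(x0, x1, crop_size):
    # equal bins over [x0, x1], sample each bin center, then shift by -0.5
    spacing = (x1 - x0) / crop_size
    return x0 + spacing * (np.arange(crop_size) + 0.5) - 0.5

print(crop_and_resize_coords(0, 2, 3))  # [0. 1. 2.] -> the 0,1,2 crop
print(roi_align_coords(0, 3, 3))        # [0. 1. 2.] -> same crop from box [0,0,3,3]
print(roi_align_coords(0, 2, 3))        # [-0.1667 0.5 1.1667] -> the 0.5/1.1667 crop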

mks0601 (Author) commented Dec 11, 2017

Great help. Thank you.
So the difference arises from dividing the RoI into grids.
Then I think just using [xmin, ymin, xmax+1, ymax+1], where each value is a float coordinate, outputs the desired values.
Is that right?

mks0601 (Author) commented Dec 14, 2017

Sorry, but I still cannot understand the code. Say the bounding-box coordinates of the RoI are (xmin, ymin, xmax, ymax); then what exactly is the input to your roi_align module?

spacing_w is a function of x1 - x0, not x1 - x0 + 1, so I think xmax and ymax should be incremented. Also, I cannot understand why we have to subtract 0.5.

what if just

x0, y0, x1, y1 = tf.split(boxes, 4, axis=1)

nx0 = x0 / tf.to_float(image_shape[1] - 1)
ny0 = y0 / tf.to_float(image_shape[0] - 1)

nx1 = x1 / tf.to_float(image_shape[1] - 1)
ny1 = y1 / tf.to_float(image_shape[0] - 1)

return tf.concat([ny0, nx0, ny1, nx1], axis=1)

and transform_fpcoor = False?

tensorboy commented

Hi @mks0601, you can see how to use it for Mask R-CNN here: https://github.com/tensorboy/Pytorch_Mask_RCNN :)

fitsumreda commented

@tensorboy I couldn't access the link. Could you share a working one?

pachiko commented Jun 21, 2019

(quoting longcw's crop_and_resize vs. RoIAlign explanation above)

The pictures are missing... Would be great if you could reupload them :)
