The consistency between the code and the description of the paper. #11

Open
sang-yc opened this issue Sep 29, 2021 · 8 comments

Comments

@sang-yc

sang-yc commented Sep 29, 2021

Hello, I found that M0 in the code and M0 in the paper do not have the same structure. Is the code for Micro-Block-A, Micro-Block-B, and Micro-Block-C consistent with the description in the paper, or are there any differences?
Thank you.

@liyunsheng13
Owner

All the models are consistent with the description in the paper. Using M0 as an example, if you take a look at Table 1 in the paper, you can find the hidden dimension C/R is {8,12,16,32,64,96}, which is exactly the same as the config file used for M0.

@sang-yc
Author

sang-yc commented Sep 30, 2021

First of all, thank you for your reply!
I still have questions about Dynamic Shift-Max. I have carefully studied your paper, Dynamic ReLU. As your paper says, when J = 1, Dynamic Shift-Max and Dynamic ReLU are the same.
Why is J set to 2? Shouldn't J be the same as the number of groups? When the number of groups changes, shouldn't J change accordingly?
Thank you!

@liyunsheng13
Owner

As you mentioned, when J=1, Dynamic Shift-Max is just Dynamic ReLU, with an expression like y=max(a1x1+b1x1) (x1 is the first channel of the feature map; a1 and b1 are the dynamic coefficients). When J=2, the output y becomes max(a1x1+a2x2, b1x1+b2x2), where channels x1 and x2 are fused. This is the key difference from Dynamic ReLU. In our implementation, we found that x2 should be obtained with a group shift instead of a channel shift, so it is x_{jC/G}. The value of J therefore has nothing to do with the number of groups; it depends only on how many channels you want to fuse, and of course J<=G.
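
For readers following along, here is a minimal pure-Python sketch of the fusion described in this reply. It is not the repository's implementation. It assumes one scalar per channel, that both max branches combine the same J group-shifted channels, and that the coefficients are passed in as fixed numbers (in the real layer they are computed from the input). The names `group_shift`, `dynamic_shift_max` and `coeffs` are hypothetical.

```python
def group_shift(x, j, groups):
    """Circularly shift the channel list x by j * (C / G) positions."""
    c = len(x)
    step = c // groups  # C / G, assumed to divide evenly
    return [x[(i + j * step) % c] for i in range(c)]


def dynamic_shift_max(x, coeffs, groups):
    """Per channel i, return max over branches k of sum_j coeffs[k][j][i] * x_shifted_j[i].

    coeffs[k][j][i]: coefficient for branch k, shift index j, channel i.
    """
    num_shifts = len(coeffs[0])  # J
    shifted = [group_shift(x, j, groups) for j in range(num_shifts)]
    out = []
    for i in range(len(x)):
        branches = [
            sum(branch[j][i] * shifted[j][i] for j in range(num_shifts))
            for branch in coeffs
        ]
        out.append(max(branches))
    return out


# C = 4 channels, G = 2 groups, K = 2 branches, J = 2 fused channels.
x = [1.0, -2.0, 3.0, 4.0]
a = [[1.0] * 4, [0.5] * 4]   # branch 1: a1 = 1, a2 = 0.5
b = [[0.0] * 4, [0.0] * 4]   # branch 2: b1 = b2 = 0
y = dynamic_shift_max(x, [a, b], groups=2)
# y == [2.5, 0.0, 3.5, 3.0]
```

Note that J only sets the length of the inner sum, while G only sets the shift step C/G, which matches the point that J is chosen independently of the group number.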

@sang-yc
Author

sang-yc commented Oct 2, 2021

Thank you for your reply!
Dynamic ReLU: y=max(a1x1+b1x1). According to your code and paper, I think the expression for Dynamic ReLU should be y=max(a1x1+b1) (x1 is the first channel of the feature map; a1 and b1 are the dynamic coefficients). I don't know whether I have misunderstood or whether it was a typo on your side.
Some parts of the code are still difficult to understand. What do the parameters of the Dynamic Shift-Max class mean?
For example, activation.py, line 111:
def __init__(self, inp, oup, reduction=4, act_max=1.0, act_relu=True, init_a=[0.0, 0.0], init_b=[0.0, 0.0], relu_before_pool=False, g=None, expansion=False)
The parameters are much more complex than those of Dynamic ReLU. I hope you can tell me what these parameters represent in Dynamic Shift-Max. I understand inp and oup.
Also, Dynamic ReLU has 2KC parameters, so Dynamic Shift-Max should have 2KCJ. Why is the number of parameters given in your paper KCJ?
Thank you!

@liyunsheng13
Owner

Oh sorry, my writing is incorrect. The expression of Dynamic ReLU is y=max(a1x1, b1x1). It just picks up the feature point with stronger activation.
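
A small sketch of the corrected expression may help. It assumes fixed per-channel slopes instead of coefficients predicted from the input, and the name `dynamic_relu_fixed` is hypothetical, not a function from the repository.

```python
def dynamic_relu_fixed(x, a, b):
    """Apply y_i = max(a_i * x_i, b_i * x_i) to each channel value."""
    return [max(a_i * x_i, b_i * x_i) for x_i, a_i, b_i in zip(x, a, b)]


x = [-2.0, 3.0]
# Slopes 1 and 0.5 behave like a leaky ReLU: the larger product wins.
leaky = dynamic_relu_fixed(x, a=[1.0, 1.0], b=[0.5, 0.5])
# leaky == [-1.0, 3.0]
```

With slopes 1 and 0 the same function behaves like a plain ReLU, which is one way to read "picks up the feature point with stronger activation".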

As for the meaning of the input parameters, unfortunately they concern implementation details such as initialization (init_a=[0.0, 0.0], init_b=[0.0, 0.0]), and they are hard for me to explain. Besides, they have nothing to do with understanding Dynamic Shift-Max. I suggest you just run the code step by step; you will then easily see how the parameters influence the implementation.

As for the number of parameters in Dynamic Shift-Max: since it considers channel shifting, it has to be implemented with more parameters. For J=2, Dynamic Shift-Max is max(a1x1+a2x2, b1x1+b2x2) with parameters a1, a2, b1 and b2, which doubles the number of parameters in Dynamic ReLU (y=max(a1x1, b1x1)).
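
The counting argument can be written out as a short sketch. It counts only the dynamic coefficients that appear in the expressions in this reply (one per branch, fused channel and output channel). It ignores any intercept terms and the weights of the network that predicts the coefficients. The function name `num_dynamic_coefficients` is hypothetical.

```python
def num_dynamic_coefficients(channels, k=2, j=1):
    """Number of coefficients for K max-branches, J fused channels, C channels."""
    return k * j * channels


c = 16  # example channel count, chosen only for illustration
relu_count = num_dynamic_coefficients(c, k=2, j=1)       # a1, b1 per channel
shift_max_count = num_dynamic_coefficients(c, k=2, j=2)  # a1, a2, b1, b2 per channel
# shift_max_count == 2 * relu_count
```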

@FlyMoonSky

For M0, the stem layer has 4 output channels in the code, while it has 6 in the paper. I'm confused.

@liyunsheng13
Owner

There is no inconsistency for M0. You might have read an old version of our paper.

@FlyMoonSky

There is no inconsistency for M0. You might read the old version of our paper.

Thank you for your kind reply! It was indeed a problem with the paper version. I have now referred to the latest version.

3 participants