Why did you use MomentumOptimizer? and dropout... #28

Open
taki0112 opened this issue Aug 8, 2017 · 5 comments

taki0112 commented Aug 8, 2017

Hello
After reading the DenseNet paper, I implemented it in TensorFlow (using MNIST data).

My questions are:

  1. In my experiments, AdamOptimizer performed better than MomentumOptimizer.
    Is this specific to MNIST? I have not run any experiments on CIFAR yet.

  2. For dropout, I apply it only to the bottleneck layers, not to the transition layers. Is this right?

  3. Is Batch Normalization applied only during training, or during both training and testing?

  4. What exactly is global average pooling,
    and how can I do it in TensorFlow?

Please let me know if there was a particular reason for these choices.
Also, if you can take a look at my TensorFlow code, I would appreciate it if you could check whether I implemented it correctly:
https://github.com/taki0112/Densenet-Tensorflow

Thank you

liuzhuang13 (Owner) commented Aug 8, 2017

Hello @taki0112

A1. As we mentioned in the paper, we directly followed ResNet's optimization settings (https://github.com/facebook/fb.resnet.torch), except that we train for 300 epochs instead of ~160. We didn't try any other optimizers.
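
For reference, here is a minimal TensorFlow sketch of that kind of schedule (SGD with Nesterov momentum 0.9, learning rate 0.1 divided by 10 at 50% and 75% of training, weight decay 1e-4, roughly matching the settings described in the paper). This is only an illustration, not code from either repository; `steps_per_epoch` and `cross_entropy` are assumed to be defined elsewhere:

    # Illustrative sketch of the SGD + Nesterov momentum schedule described above.
    import tensorflow as tf

    global_step = tf.train.get_or_create_global_step()
    # 50% / 75% of 300 epochs; steps_per_epoch is assumed to be defined.
    boundaries = [150 * steps_per_epoch, 225 * steps_per_epoch]
    learning_rate = tf.train.piecewise_constant(global_step, boundaries, [0.1, 0.01, 0.001])

    # L2 weight decay of 1e-4 added to the loss (cross_entropy is assumed).
    l2_loss = 1e-4 * tf.add_n([tf.nn.l2_loss(v) for v in tf.trainable_variables()])

    optimizer = tf.train.MomentumOptimizer(learning_rate, momentum=0.9, use_nesterov=True)
    train_op = optimizer.minimize(cross_entropy + l2_loss, global_step=global_step)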

A2. In our experiments, we applied dropout after every conv layer except the first one of the network. But I would guess there is no significant difference whether or not you apply dropout in the transition layers.
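
As a small illustration only (not the authors' code), the conv + dropout pattern in TensorFlow 1.x could look like the sketch below; the 0.2 rate and the `is_training` placeholder are assumptions for this example:

    # Sketch: dropout applied after a conv layer, disabled at test time.
    # `x` and `is_training` (a tf.placeholder(tf.bool)) are assumed to exist.
    import tensorflow as tf

    x = tf.layers.conv2d(x, filters=48, kernel_size=3, padding='same', use_bias=False)
    x = tf.layers.dropout(x, rate=0.2, training=is_training)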

A3. This depends on the package you are using. Sorry, I'm not familiar with TensorFlow's details.
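
For reference, the usual TensorFlow 1.x pattern (a generic sketch, not specific to this repo) is to pass a boolean `training` flag: batch statistics are used during training and the accumulated moving averages at test time, and the moving-average update ops have to be run together with the train step. Here `x`, `is_training`, `optimizer`, and `loss` are assumed to exist:

    # Sketch: batch norm behaves differently at train and test time.
    import tensorflow as tf

    x = tf.layers.batch_normalization(x, training=is_training)

    # The moving mean/variance are updated through ops in the UPDATE_OPS
    # collection, so run them together with the training step.
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        train_op = optimizer.minimize(loss)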

A4. Global average pooling means you pool each feature map down to a single number by taking its average. For example, if you have an 8x8 feature map, you take the average of those 64 numbers and produce one number.

For TensorFlow usage questions like 3 and 4, you can probably find answers by looking at the third-party TensorFlow implementations we posted on our README page. Thanks

taki0112 (Author) commented Aug 11, 2017

Thank you
I think I can do global average pooling as follows:

    import numpy as np
    import tensorflow as tf

    def Global_Average_Pooling(x, stride=1):
        # x: 4-D tensor [batch, height, width, channels] with a known spatial size.
        width = int(np.shape(x)[1])
        height = int(np.shape(x)[2])
        pool_size = [width, height]
        # The stride value does not matter: the window already covers the whole feature map.
        return tf.layers.average_pooling2d(inputs=x, pool_size=pool_size, strides=stride)
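
Equivalently (just a sketch), the same thing can be done by averaging over the spatial axes directly:

    # Global average pooling over an NHWC tensor:
    # input shape [batch, H, W, C] -> output shape [batch, C].
    def global_average_pooling(x):
        return tf.reduce_mean(x, axis=[1, 2])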

But I have some questions.

  1. I experimented on MNIST with a total of 100 layers and growth_k = 12. However, the result is worse than with 20 layers: training is very slow and the gain in accuracy is very small.

  2. Why is there no Transition Layer (4) in the paper?
    There are only three (Dense Block + Transition Layer) pairs, followed by the final dense block and the classification layer.

What is the reason?

liuzhuang13 (Owner) commented Aug 11, 2017

@taki0112

  1. Most people train networks with fewer than 5 layers and still achieve very high accuracy on MNIST, because it is such a simple dataset. If you train too large a network on MNIST, it may overfit the training set and the accuracy may get worse. Thanks

  2. Because transition layers serve the purpose of downsampling. After the last dense block we use global average pooling to do the downsampling, but we don't call it a transition layer.
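
To make the layout concrete, here is a rough skeleton of the forward pass (illustrative only, not code from either repository; bottleneck and compression details are omitted): each transition sits between two dense blocks, so with four dense blocks there are only three transitions, and the last block is followed by BN, global average pooling, and the classifier.

    # Illustrative DenseNet skeleton: no transition after the last dense block.
    import tensorflow as tf

    def dense_block(x, num_layers, growth_rate, training):
        for _ in range(num_layers):
            y = tf.layers.batch_normalization(x, training=training)
            y = tf.nn.relu(y)
            y = tf.layers.conv2d(y, growth_rate, 3, padding='same', use_bias=False)
            x = tf.concat([x, y], axis=3)  # dense connectivity
        return x

    def transition_layer(x, training):
        x = tf.layers.batch_normalization(x, training=training)
        x = tf.nn.relu(x)
        x = tf.layers.conv2d(x, int(x.get_shape()[3]), 1, use_bias=False)
        return tf.layers.average_pooling2d(x, pool_size=2, strides=2)  # downsampling

    def densenet(x, training, num_classes=10, growth_rate=12,
                 layers_per_block=(6, 12, 24, 16)):
        x = tf.layers.conv2d(x, 2 * growth_rate, 3, padding='same', use_bias=False)
        for i, n in enumerate(layers_per_block):
            x = dense_block(x, n, growth_rate, training)
            if i != len(layers_per_block) - 1:  # no transition after the last block
                x = transition_layer(x, training)
        x = tf.nn.relu(tf.layers.batch_normalization(x, training=training))
        x = tf.reduce_mean(x, axis=[1, 2])      # global average pooling
        return tf.layers.dense(x, num_classes)  # classification layer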

John1231983 commented

I think the author gave a good explanation. Regarding dropout: why didn't you use it in the ImageNet case? It is a big dataset, so we do not need it, right? Dropout is often used before the fully connected layer, but you did not use it for either ImageNet or CIFAR10. Why? Thanks

liuzhuang13 (Owner) commented

@John1231983 Because ImageNet is big, and also because we use heavy data augmentation, we don't use dropout. This also follows our base code framework, fb.resnet.torch.

For CIFAR10, when we use data augmentation (C10+), we don't use dropout. When we don't use data augmentation (C10), we actually use dropout. We've mentioned this in the paper.
