CaffeOnSpark slow in comparison with caffe #259
Make sure you are comparing the same total batch size.
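A minimal sketch of what "same total batch size" means here (the numbers are illustrative only, not from this thread):

```python
# Single-node Caffe:   batch_size = 64
# 4-node CaffeOnSpark: each executor uses batch_size = 16
executors = 4
per_executor_batch = 16
total_batch = executors * per_executor_batch
assert total_batch == 64  # only then is the comparison apples-to-apples
```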
Firstly, thank you for your answer. Below are two log files from CaffeOnSpark. I have the impression that the communication is done correctly, but the distributed Caffe takes more time.
I do not know your setup. We have seen a slight improvement with LeNet on GPU, but CaffeOnSpark was not really designed to speed up a tiny network/dataset like LeNet/MNIST. We see more gain on Inception (or VGG)/ImageNet. Spark does create quite a bit of overhead, though.
OK, I understand, but even with a bigger dataset like CIFAR-10 I don't see any improvement. This is my setup: Spark: Hardware: Do you have any idea what I can check? Or do you advise me to train on a much bigger dataset or a more complex network, to see whether size is the problem?
Your hardware is fine. It is not obvious to me why CaffeOnSpark is so slow.
Hi, I think I found something. Before, I was running 10,000 steps with a batch size of 32 and it took 16 min. Now I changed it to 100 steps with a batch size of 3200 and it took only 6 min.
My previous post is wrong, because for a fair comparison with Caffe I also have to change the batch size in Caffe.
The previous results were on 2 nodes; when I run the training on 4 nodes there is an acceleration and it runs faster than Caffe!
Great. One useful experiment would be to run CaffeOnSpark on a single node, then compare it to Caffe.
I applied your advice, and here are my final results for training on the MNIST (28x28 pixels) dataset on a Hadoop YARN cluster.
Thanks for the results. CaffeOnSpark incurs quite a bit of overhead on a single node. I don't know the answer to your second question. As for the first question, Spark puts a Caffe instance on each executor, and they train in a synchronous fashion: each executor gets a batch and runs forward, then backward; the gradients are averaged and then distributed before the next batch is fetched.
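A rough illustration of that synchronous scheme in plain Python/NumPy (not the actual CaffeOnSpark code; `forward_backward` is a hypothetical stand-in for one Caffe forward+backward pass on an executor's batch):

```python
import numpy as np

def forward_backward(weights, batch):
    # Stand-in gradient; in real training this would come from
    # backprop on the executor's current batch.
    return np.random.randn(*weights.shape) * 0.01

num_executors = 4
weights = np.zeros(10)  # shared model parameters
lr = 0.01               # learning rate

for step in range(100):
    # Each executor computes a gradient on its own batch...
    grads = [forward_backward(weights, batch=None) for _ in range(num_executors)]
    # ...then the gradients are averaged and applied before any
    # executor fetches its next batch (this sync is the overhead).
    weights -= lr * np.mean(grads, axis=0)
```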
At the beginning I was reducing my batch size so as to compare the same amount of data at a time, i.e. on a 4-node cluster each worker runs one quarter of the batch size (batch size: 64, training steps: 10,000).
The ultimate comparison should be this: how much time does it take to achieve a certain accuracy, say 90%, for 1 node, 2 nodes, etc.? This comparison is hard, since one has to adjust parameters such as the learning rate according to the total batch size. A simpler metric is to look at the overall processing rate, i.e. how many images are processed per second. For example, with 1 node, if you set the batch size to 128 and it takes 2 seconds per iteration, that is 64 images/second. For 2 nodes, say you set the batch size to 128 and it takes 3 seconds per iteration; that is 128*2/3 = 85 images/second. This metric has to be used carefully, however: if the training diverges, speed does not matter anymore, and you get garbage in the end.
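That throughput metric as a tiny helper, using the numbers quoted above:

```python
def images_per_second(batch_size, num_nodes, seconds_per_iter):
    # Overall processing rate: total images per iteration / iteration time.
    return batch_size * num_nodes / seconds_per_iter

print(images_per_second(128, 1, 2.0))  # 1 node:  64.0 images/second
print(images_per_second(128, 2, 3.0))  # 2 nodes: ~85.3 images/second
```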
Hi,
I am new to big data and Caffe, and I tried to run CaffeOnSpark in standalone mode and also on a cluster (4 nodes, each with 1 CPU, 16 GB RAM, 4 cores). On the cluster I always adapt the batch size to the cluster size, but there is no gain in time.
Whichever dataset I use (MNIST or CIFAR-10), I don't see any acceleration, and the performance compared to Caffe gets worse.
For example, CaffeOnSpark in standalone mode with MNIST (as in the example in the wiki) took 15 min, while Caffe with the MKL library took less than 5 min.
The connection between the nodes is not a problem, I think, because its speed is 1 GB/s.
Did I miss something? Can somebody help me, please?
Thank you.
Best regards,
fouad2910