CaffeOnSpark slow in comparison with caffe #259
Make sure you are comparing the same total batch size.
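A minimal sketch of what "same total batch size" means here (the numbers are illustrative only, not from this thread):

```python
# Single-node Caffe:   batch_size = 64
# 4-node CaffeOnSpark: each executor uses batch_size = 16
executors = 4
per_executor_batch = 16
total_batch = executors * per_executor_batch
assert total_batch == 64  # only then is the comparison apples-to-apples
```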
Firstly, thank you for your answer. Below are two log files from CaffeOnSpark. I have the impression that the communication is done correctly, but the distributed Caffe takes more time.
I do not know your setup. We have seen a slight improvement with LeNet on GPU, but CaffeOnSpark was not really designed to speed up a tiny network/dataset like LeNet/MNIST. We see more gain on Inception (or VGG)/ImageNet. Spark does create quite a bit of overhead, though.
OK, I understand, but even with a bigger dataset like CIFAR-10 I don't see any improvement. This is my setup: Spark: Hardware: Do you have any idea what I can check? Or do you advise me to train on a much bigger dataset or a more complex network, to see whether size is the problem?
Your hardware is fine. It is not obvious to me why CaffeOnSpark is so slow.
Hi, I think I found something. Before, I was running 10,000 steps with a batch size of 32 and it took 16 min. Now I changed it to 100 steps with a batch size of 3200 and it took only 6 min.
My previous post is wrong, because for a fair comparison with Caffe I also have to change the batch size in Caffe.
The previous results were on 2 nodes; when I run the training on 4 nodes there is an acceleration and it runs faster than Caffe!
Great. One useful experiment would be to run CaffeOnSpark on a single node, then compare it to Caffe.
I applied your advice, and here are my final results for training on the MNIST (28x28 pixels) dataset on a Hadoop YARN cluster.
Thanks for the results. CaffeOnSpark incurs quite a bit of overhead on a single node. I don't know the answer to your second question. As for the first question, Spark puts a Caffe instance on each executor, and they train in a synchronous fashion: each executor gets a batch and runs forward, then backward; the gradients are averaged and then distributed before the next batch is fetched.
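A rough illustration of that synchronous scheme in plain Python/NumPy (not the actual CaffeOnSpark code; `forward_backward` is a hypothetical stand-in for one Caffe forward+backward pass on an executor's batch):

```python
import numpy as np

def forward_backward(weights, batch):
    # Stand-in gradient; in real training this would come from
    # backprop on the executor's current batch.
    return np.random.randn(*weights.shape) * 0.01

num_executors = 4
weights = np.zeros(10)  # shared model parameters
lr = 0.01               # learning rate

for step in range(100):
    # Each executor computes a gradient on its own batch...
    grads = [forward_backward(weights, batch=None) for _ in range(num_executors)]
    # ...then the gradients are averaged and applied before any
    # executor fetches its next batch (this sync is the overhead).
    weights -= lr * np.mean(grads, axis=0)
```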
At the beginning I was reducing my batch size so as to compare the same amount of data at a time, i.e. on a 4-node cluster each worker runs one quarter of the batch size (batch size: 64, training steps: 10,000).
The ultimate comparison should be this: how much time does it take to achieve a certain accuracy, say 90%, for 1 node, 2 nodes, etc.? This comparison is hard, since one has to adjust parameters such as the learning rate according to the total batch size. A simpler metric is to look at the overall processing rate, i.e. how many images are processed per second. For example, with 1 node, if you set the batch size to 128 and it takes 2 seconds per iteration, that is 64 images/second. For 2 nodes, say you set the batch size to 128 and it takes 3 seconds per iteration; that is 128*2/3 = 85 images/second. This metric has to be used carefully, however: if the training diverges, speed does not matter anymore, and you get garbage in the end.
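That throughput metric as a tiny helper, using the numbers quoted above:

```python
def images_per_second(batch_size, num_nodes, seconds_per_iter):
    # Overall processing rate: total images per iteration / iteration time.
    return batch_size * num_nodes / seconds_per_iter

print(images_per_second(128, 1, 2.0))  # 1 node:  64.0 images/second
print(images_per_second(128, 2, 3.0))  # 2 nodes: ~85.3 images/second
```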
Hi,
I am new to big data and Caffe, and I tried to run CaffeOnSpark in standalone mode and also on a cluster (4 nodes, each with 1 CPU, 16 GB RAM, 4 cores). On the cluster I always adapt the batch size to the cluster size, but there is no gain in time.
Whichever dataset I use (MNIST or CIFAR-10), I don't see any acceleration, and the performance compared to Caffe gets worse.
For example, CaffeOnSpark in standalone mode with MNIST (as in the example in the wiki) took 15 min, while Caffe with the MKL library took less than 5 min.
The connection between the nodes is not a problem, I think, because its speed is 1 GB/s.
Did I miss something? Can somebody help me, please?
Thank you.
Best regards,
fouad2910