
[K8S] Encounter executors restart issue, when run Hibench Kmeans workload on AWS EKS environment, with Hibench small scale dataset #109

zhixingheyi-tian opened this issue Aug 2, 2021 · 9 comments


@zhixingheyi-tian

We are integrating OAP MLlib into the cloud. We encountered an executor restart issue (although the result was still produced) when running the HiBench KMeans workload on AWS EKS with the HiBench small-scale dataset.

Spark confs and command:

sh ./spark-submit-client.sh --conf spark.executor.instances=2 \
  --conf spark.executor.memoryOverhead=512m  \
  --conf spark.executor.cores=1  \
  --conf spark.default.parallelism=2  \
  --conf spark.sql.shuffle.partitions=2  \
  --conf spark.driver.extraClassPath=/opt/home/conda/envs/oap-1.2.0/oap_jars/oap-mllib-1.2.0.jar \
  --conf spark.executor.extraClassPath=/opt/home/conda/envs/oap-1.2.0/oap_jars/oap-mllib-1.2.0.jar \
  --class com.intel.hibench.sparkbench.ml.DenseKMeans /opt/home/sparkbench-assembly-8.0-SNAPSHOT-dist.jar -k 10 --numIterations 5 --storageLevel MEMORY_ONLY --initMode Random s3a://aws-emr-resources-348941870272-us-east-2/datagen/HiBench/Kmeans/Input/3000000/samples

Data scale conf:

hibench.kmeans.small.num_of_clusters		5
hibench.kmeans.small.dimensions			20
hibench.kmeans.small.num_of_samples		3000000
hibench.kmeans.small.samples_per_inputfile	600000
hibench.kmeans.small.max_iteration		5
hibench.kmeans.small.k				10
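
For reference, a rough size estimate (a hedged sketch assuming 8-byte doubles): 3,000,000 samples × 20 dimensions puts the raw feature data well under 1 GB.

# Back-of-the-envelope only: 3,000,000 samples x 20 dims x 8 bytes per double
echo $((3000000 * 20 * 8 / 1024 / 1024)) MiB   # prints "457 MiB"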

Logs:

21/07/30 13:01:25 INFO BlockManagerInfo: Added broadcast_10_piece0 in memory on 192.168.81.251:46745 (size: 16.8 KiB, free: 107.0 MiB)
21/07/30 13:02:25 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Disabling executor 1.
21/07/30 13:02:25 INFO DAGScheduler: Executor lost: 1 (epoch 1)
21/07/30 13:02:25 INFO TaskSetManager: Finished task 0.0 in stage 9.0 (TID 107) in 59757 ms on 192.168.32.242 (executor 2) (1/2)
21/07/30 13:02:25 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
21/07/30 13:02:25 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_6_6 !
21/07/30 13:02:25 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_37_8 !
21/07/30 13:02:25 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_6_9 !
21/07/30 13:02:25 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_6_15 !
21/07/30 13:02:25 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_6_3 !
21/07/30 13:02:25 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_37_14 !
21/07/30 13:02:25 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_6_11 !
21/07/30 13:02:25 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_37_16 !
21/07/30 13:02:25 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_6_14 !
21/07/30 13:02:25 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_6_1 !
...
21/07/30 06:37:18 INFO DAGScheduler: Job 10 finished: collect at ClusteringSummary.scala:49, took 124.838207 s
21/07/30 06:37:18 INFO CodeGenerator: Code generated in 17.830358 ms
21/07/30 06:37:18 INFO Instrumentation: [d644dcde] {"clusterSizes":"[210265,121453,427029,123869,684244,207123,389735,600544,118074,117664]"}
21/07/30 06:37:18 INFO Instrumentation: [d644dcde] training finished
Training time (ms) = 744550
Total cost = 1.3657241593861768E11.
21/07/30 06:37:18 INFO SparkUI: Stopped Spark web UI at http://spark-client-headless-service.default.svc.cluster.local:4040


But when vanilla Spark ran this workload with the same confs, everything was OK.

Vanilla Spark command:

sh ./spark-submit-client.sh --conf spark.executor.instances=2 \
  --conf spark.executor.memoryOverhead=512m  \
  --conf spark.executor.cores=1  \
  --conf spark.default.parallelism=2  \
  --conf spark.sql.shuffle.partitions=2  \
  --class com.intel.hibench.sparkbench.ml.DenseKMeans /opt/home/sparkbench-assembly-8.0-SNAPSHOT-dist.jar -k 10 --numIterations 5 --storageLevel MEMORY_ONLY --initMode Random s3a://aws-emr-resources-348941870272-us-east-2/datagen/HiBench/Kmeans/Input/3000000/samples


Vanilla Spark running log:

21/08/02 07:18:45 INFO TaskSetManager: Finished task 16.0 in stage 14.0 (TID 187) in 190 ms on 192.168.57.153 (executor 1) (17/20)
21/08/02 07:18:45 INFO TaskSetManager: Finished task 18.0 in stage 14.0 (TID 188) in 138 ms on 192.168.85.148 (executor 2) (18/20)
21/08/02 07:18:45 INFO TaskSetManager: Starting task 19.0 in stage 14.0 (TID 190) (192.168.57.153, executor 1, partition 19, PROCESS_LOCAL, 4572 bytes) taskResourceAssignments Map()
21/08/02 07:18:45 INFO TaskSetManager: Finished task 17.0 in stage 14.0 (TID 189) in 211 ms on 192.168.57.153 (executor 1) (19/20)
21/08/02 07:18:45 INFO TaskSetManager: Finished task 19.0 in stage 14.0 (TID 190) in 105 ms on 192.168.57.153 (executor 1) (20/20)
21/08/02 07:18:45 INFO TaskSchedulerImpl: Removed TaskSet 14.0, whose tasks have all completed, from pool
21/08/02 07:18:45 INFO DAGScheduler: ShuffleMapStage 14 (collect at ClusteringSummary.scala:49) finished in 2.552 s
21/08/02 07:18:45 INFO DAGScheduler: looking for newly runnable stages
21/08/02 07:18:45 INFO DAGScheduler: running: Set()
21/08/02 07:18:45 INFO DAGScheduler: waiting: Set(ResultStage 15)
21/08/02 07:18:45 INFO DAGScheduler: failed: Set()
21/08/02 07:18:45 INFO DAGScheduler: Submitting ResultStage 15 (MapPartitionsRDD[50] at collect at ClusteringSummary.scala:49), which has no missing parents
21/08/02 07:18:45 INFO MemoryStore: Block broadcast_26 stored as values in memory (estimated size 50.5 KiB, free 413.4 MiB)
21/08/02 07:18:45 INFO MemoryStore: Block broadcast_26_piece0 stored as bytes in memory (estimated size 22.3 KiB, free 413.4 MiB)
21/08/02 07:18:45 INFO BlockManagerInfo: Added broadcast_26_piece0 in memory on spark-client-headless-service.default.svc.cluster.local:33081 (size: 22.3 KiB, free: 413.8 MiB)
21/08/02 07:18:45 INFO SparkContext: Created broadcast 26 from broadcast at DAGScheduler.scala:1383
21/08/02 07:18:45 INFO DAGScheduler: Submitting 2 missing tasks from ResultStage 15 (MapPartitionsRDD[50] at collect at ClusteringSummary.scala:49) (first 15 tasks are for partitions Vector(0, 1))
21/08/02 07:18:45 INFO TaskSchedulerImpl: Adding task set 15.0 with 2 tasks resource profile 0
21/08/02 07:18:45 INFO TaskSetManager: Starting task 0.0 in stage 15.0 (TID 191) (192.168.57.153, executor 1, partition 0, NODE_LOCAL, 4465 bytes) taskResourceAssignments Map()
21/08/02 07:18:45 INFO TaskSetManager: Starting task 1.0 in stage 15.0 (TID 192) (192.168.85.148, executor 2, partition 1, NODE_LOCAL, 4465 bytes) taskResourceAssignments Map()
21/08/02 07:18:45 INFO BlockManagerInfo: Added broadcast_26_piece0 in memory on 192.168.85.148:36707 (size: 22.3 KiB, free: 91.2 MiB)
21/08/02 07:18:45 INFO BlockManagerInfo: Added broadcast_26_piece0 in memory on 192.168.57.153:45589 (size: 22.3 KiB, free: 107.0 MiB)
21/08/02 07:18:45 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 6 to 192.168.85.148:51316
21/08/02 07:18:45 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 6 to 192.168.57.153:55576
21/08/02 07:18:46 INFO TaskSetManager: Finished task 0.0 in stage 15.0 (TID 191) in 76 ms on 192.168.57.153 (executor 1) (1/2)
21/08/02 07:18:46 INFO TaskSetManager: Finished task 1.0 in stage 15.0 (TID 192) in 87 ms on 192.168.85.148 (executor 2) (2/2)
21/08/02 07:18:46 INFO TaskSchedulerImpl: Removed TaskSet 15.0, whose tasks have all completed, from pool
21/08/02 07:18:46 INFO DAGScheduler: ResultStage 15 (collect at ClusteringSummary.scala:49) finished in 0.140 s
21/08/02 07:18:46 INFO DAGScheduler: Job 8 is finished. Cancelling potential speculative or zombie tasks for this job
21/08/02 07:18:46 INFO TaskSchedulerImpl: Killing all running tasks in stage 15: Stage finished
21/08/02 07:18:46 INFO DAGScheduler: Job 8 finished: collect at ClusteringSummary.scala:49, took 2.736976 s
21/08/02 07:18:46 INFO CodeGenerator: Code generated in 12.199455 ms
21/08/02 07:18:46 INFO Instrumentation: [825afca3] {"clusterSizes":"[286782,121453,119044,123869,600000,313218,600000,600000,118074,117560]"}
21/08/02 07:18:46 INFO Instrumentation: [825afca3] training finished
Training time (ms) = 22674
Total cost = 1.9109701621924158E11.
21/08/02 07:18:46 INFO SparkUI: Stopped Spark web UI at http://spark-client-headless-service.default.svc.cluster.local:4040

@zhixingheyi-tian

@xwu99 @carsonwang, please take a look when you are available.

Thanks


xwu99 commented Aug 3, 2021

What version are you testing? The master branch has some GPU library dependencies that I am working to fix in #111.
Are you using K8S? (KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Disabling executor 1.) We haven't tested on K8S, and I am not sure whether there is a different way to set the classpath for K8S.
Could you also send the stdout / stderr output?
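
For example, as an untested sketch, one alternative way to put the jar on the classpath on K8S would be to ship it with --jars (path taken from the command above) instead of extraClassPath:

sh ./spark-submit-client.sh \
  --jars /opt/home/conda/envs/oap-1.2.0/oap_jars/oap-mllib-1.2.0.jar \
  ...   # remaining confs, class and arguments as in the original command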


xwu99 commented Aug 4, 2021

@zhixingheyi-tian Please try the latest master branch to see if everything is OK. Also @haojinIntel.
Note that you need to build with ./build.sh -p spark-3.1.1; there is an additional -p compared to the old version. Please check the updated README for the dependency versions. The CPU-only version is built by default.


zhixingheyi-tian commented Aug 5, 2021

@xwu99

Our team's cloud integration testing (on EMR, Dataproc, EKS, etc.) currently targets branch-1.2; cloud integration is also a goal of the 1.2 release.
The compile script from @yao531441 already includes -Pspark-3.1.1:

oap-mllib )
    cd $OAP_HOME/oap-mllib/mllib-dal
    source /opt/intel/oneapi/setvars.sh
    source /tmp/oneCCL/build/_install/env/setvars.sh
    mvn clean package  -Dmaven.test.skip=true  -Pspark-3.1.1
    ;;

"EKS" is simply Amazon Elastic Kubernetes Service; we used the K8S scheduler to launch the Spark executors. From the "KMeansDAL" log, the oap-mllib jar is enabled on the classpath.

The driver log is shown above.

The executor log before the executor was killed is below:

21/08/02 07:37:27 INFO CoarseGrainedExecutorBackend: Got assigned task 106
21/08/02 07:37:27 INFO Executor: Running task 1.0 in stage 9.0 (TID 106)
21/08/02 07:37:27 INFO TorrentBroadcast: Started reading broadcast variable 10 with 1 pieces (estimated total size 4.0 MiB)
21/08/02 07:37:27 INFO MemoryStore: Block broadcast_10_piece0 stored as bytes in memory (estimated size 16.8 KiB, free 35.9 MiB)
21/08/02 07:37:27 INFO TorrentBroadcast: Reading broadcast variable 10 took 20 ms
21/08/02 07:37:27 INFO MemoryStore: Block broadcast_10 stored as values in memory (estimated size 40.8 KiB, free 35.8 MiB)
21/08/02 07:37:27 INFO BlockManager: Found block rdd_39_1 locally
21/08/02 07:37:27 INFO OneCCL: Initializing with IP_PORT: 192.168.82.1_3000
OneCCL (native): init
2021:08:02-07:37:27:(43) |WARN| hostname is quite long, len: 63, name: sinput3000000samples105memoryonlyrandom-9ae02e7b05c8a699-exec-1
KVS get error: timeout limit (60 > 60), prefix: CCL_POD_ADDR1, key atl-ofi-hostname-0

I wonder whether there are some limitations on the Spark configuration, for example "executor.memory".
The test dataset is the HiBench small scale, less than 1 GB. Does OAP MLlib need specific conf items for small datasets?
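
A hedged way to rule out executor memory limits would be to re-run with larger memory settings (illustrative values only; spark.executor.memory and spark.executor.memoryOverhead are standard Spark confs):

sh ./spark-submit-client.sh --conf spark.executor.instances=2 \
  --conf spark.executor.memory=4g \
  --conf spark.executor.memoryOverhead=2g \
  --conf spark.executor.cores=1 \
  ...   # remaining confs, class and arguments unchanged from the command above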


xwu99 commented Aug 5, 2021

@zhixingheyi-tian From the log, it's not related to memory. The oneCCL ranks are not able to connect to rank 0, which is the executor listening on 192.168.82.1 port 3000. There is some network issue; maybe the port is blocked or the IP is not reachable.
There is also a warning: |WARN| hostname is quite long, len: 63, name: sinput3000000samples105memoryonlyrandom-9ae02e7b05c8a699-exec-1. I am not sure whether a bad hostname resolution is the cause. Could you check the network-related configuration (DNS, hostname, firewall, routing, etc.)?
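
For example, a rough check from inside one of the executor pods could look like this (the pod name is illustrative, nc/nslookup may not exist in a minimal image, and the rank-0 address 192.168.82.1:3000 is taken from the executor log above):

kubectl exec -it <executor-pod> -- hostname -f                      # hostname length and resolution
kubectl exec -it <executor-pod> -- nslookup 192.168.82.1            # reverse DNS for the rank-0 executor IP
kubectl exec -it <executor-pod> -- nc -zv -w 5 192.168.82.1 3000    # is the oneCCL KVS port reachable?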


xwu99 commented Aug 5, 2021

BTW, it's encouraged to use build.sh instead of invoking mvn directly; some additional environment variables need to be set before calling mvn.
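
Roughly, following the compile script quoted above (a sketch only; paths depend on the oneAPI / oneCCL installation, and build.sh is assumed to be invoked from mllib-dal):

source /opt/intel/oneapi/setvars.sh                 # oneAPI environment
source /tmp/oneCCL/build/_install/env/setvars.sh    # oneCCL environment
cd $OAP_HOME/oap-mllib/mllib-dal
./build.sh -p spark-3.1.1                           # CPU-only build by default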

@zhixingheyi-tian

@xwu99
As a baseline, we also ran the vanilla Spark test with the same configs and the same workload. The run was smooth and quick; see the description above. So the network issue can be excluded.

@zhixingheyi-tian

We added the "spark.shuffle.reduceLocality.enabled=false" conf item and verified two cases, with two executors and with one executor; issues still exist in both.

Two-executor case: the job hangs, on both the driver and the executor side.

sh ./spark-submit-client.sh --conf spark.executor.instances=2 \
  --conf spark.executor.memoryOverhead=512m  \
  --conf spark.executor.cores=1  \
  --conf spark.default.parallelism=2  \
  --conf spark.sql.shuffle.partitions=2  \
  --conf spark.shuffle.reduceLocality.enabled=false  \
  --conf spark.driver.extraClassPath=/opt/home/conda/envs/oap-1.2.0.emr630/oap_jars/oap-mllib-1.2.0.jar \
  --conf spark.executor.extraClassPath=/opt/home/conda/envs/oap-1.2.0.emr630/oap_jars/oap-mllib-1.2.0.jar \
  --class com.intel.hibench.sparkbench.ml.DenseKMeans /opt/home/tools/HiBench/sparkbench/assembly/target/sparkbench-assembly-8.0-SNAPSHOT-dist.jar -k 10 --numIterations 5 --storageLevel MEMORY_ONLY --initMode Random s3a://aws-emr-resources-348941870272-us-east-2/datagen/HiBench/Kmeans/Input/3000000/samples

Driver hang log:

21/08/10 08:06:16 INFO BlockManagerInfo: Removed broadcast_9_piece0 on spark-client-headless-service.default.svc.cluster.local:44815 in memory (size: 15.0 KiB, free: 413.9 MiB)
21/08/10 08:06:16 INFO BlockManagerInfo: Removed broadcast_8_piece0 on 192.168.47.215:42623 in memory (size: 1278.0 B, free: 111.0 MiB)
21/08/10 08:06:16 INFO BlockManagerInfo: Removed broadcast_8_piece0 on spark-client-headless-service.default.svc.cluster.local:44815 in memory (size: 1278.0 B, free: 413.9 MiB)
21/08/10 08:06:16 INFO BlockManagerInfo: Removed broadcast_8_piece0 on 192.168.11.77:41423 in memory (size: 1278.0 B, free: 87.3 MiB)
21/08/10 08:32:18 INFO BlockManagerInfo: Removed broadcast_1_piece0 on spark-client-headless-service.default.svc.cluster.local:44815 in memory (size: 11.4 KiB, free: 413.9 MiB)
21/08/10 08:32:18 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.47.215:42623 on disk (size: 11.4 KiB)
21/08/10 08:32:18 INFO BlockManagerInfo: Removed broadcast_1_piece0 on 192.168.11.77:41423 on disk (size: 11.4 KiB)

Executor hang log:

21/08/10 08:05:50 INFO Executor: Running task 1.0 in stage 9.0 (TID 107)
21/08/10 08:05:50 INFO TorrentBroadcast: Started reading broadcast variable 10 with 1 pieces (estimated total size 4.0 MiB)
21/08/10 08:05:50 INFO MemoryStore: Block broadcast_10_piece0 stored as bytes in memory (estimated size 16.8 KiB, free 110.5 MiB)
21/08/10 08:05:50 INFO TorrentBroadcast: Reading broadcast variable 10 took 13 ms
21/08/10 08:05:50 INFO MemoryStore: Block broadcast_10 stored as values in memory (estimated size 40.9 KiB, free 110.5 MiB)
21/08/10 08:05:50 INFO BlockManager: Found block rdd_39_1 locally
21/08/10 08:05:50 INFO OneCCL: Initializing with IP_PORT: 192.168.47.215_3000
OneCCL (native): init
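
A thread dump of the hung executor could show where the OneCCL init is blocked; a hedged sketch (the pod name is illustrative, jstack/jcmd must be available in the executor image, and in Spark-on-K8S images the JVM is typically PID 1):

kubectl exec -it <executor-pod> -- jstack 1 > executor-threads.txt
# or: kubectl exec -it <executor-pod> -- jcmd 1 Thread.print > executor-threads.txt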

One-executor case: the executor restarted without any error log. It restarted only once, and the job completed with the second executor.

sh ./spark-submit-client.sh --conf spark.executor.instances=1 \
  --conf spark.executor.memoryOverhead=512m  \
  --conf spark.executor.cores=1  \
  --conf spark.default.parallelism=1  \
  --conf spark.sql.shuffle.partitions=1  \
  --conf spark.shuffle.reduceLocality.enabled=false  \
  --conf spark.driver.extraClassPath=/opt/home/conda/envs/oap-1.2.0.emr630/oap_jars/oap-mllib-1.2.0.jar \
  --conf spark.executor.extraClassPath=/opt/home/conda/envs/oap-1.2.0.emr630/oap_jars/oap-mllib-1.2.0.jar \
  --class com.intel.hibench.sparkbench.ml.DenseKMeans /opt/home/tools/HiBench/sparkbench/assembly/target/sparkbench-assembly-8.0-SNAPSHOT-dist.jar -k 10 --numIterations 5 --storageLevel MEMORY_ONLY --initMode Random s3a://aws-emr-resources-348941870272-us-east-2/datagen/HiBench/Kmeans/Input/3000000/samples

Driver log:

21/08/10 07:31:38 INFO TaskSetManager: Finished task 11.0 in stage 6.0 (TID 100) in 5095 ms on 192.168.5.51 (executor 2) (18/20)
21/08/10 07:31:43 INFO BlockManagerInfo: Added rdd_6_15 in memory on 192.168.5.51:43073 (size: 19.7 MiB, free: 142.4 MiB)
21/08/10 07:31:44 INFO BlockManagerInfo: Added rdd_37_15 in memory on 192.168.5.51:43073 (size: 24.0 B, free: 142.4 MiB)
21/08/10 07:31:44 INFO TaskSetManager: Finished task 15.0 in stage 6.0 (TID 101) in 5313 ms on 192.168.5.51 (executor 2) (19/20)
21/08/10 07:32:44 WARN HeartbeatReceiver: Removing executor 1 with no recent heartbeats: 137201 ms exceeds timeout 120000 ms
21/08/10 07:32:44 INFO KubernetesClusterSchedulerBackend: Requesting to kill executor(s) 1
21/08/10 07:32:44 INFO KubernetesClusterSchedulerBackend: Actual list of executor(s) to be killed is 1
21/08/10 07:32:44 INFO TaskSchedulerImpl: Executor 1 on 192.168.11.77 killed by driver.
21/08/10 07:32:44 INFO TaskSetManager: task 10.0 in stage 6.0 (TID 92) failed because while it was being computed, its executor exited for a reason unrelated to the task. Not counting this failure towards the maximum number of failures for the task.
21/08/10 07:32:44 INFO BlockManagerMasterEndpoint: Trying to remove executor 1 from BlockManagerMaster.
21/08/10 07:32:44 INFO BlockManagerMaster: Removal of executor 1 requested
21/08/10 07:32:44 INFO KubernetesClusterSchedulerBackend$KubernetesDriverEndpoint: Asked to remove non-existent executor 1
21/08/10 07:32:44 INFO TaskSetManager: Starting task 10.1 in stage 6.0 (TID 102) (192.168.5.51, executor 2, partition 10, ANY, 4583 bytes) taskResourceAssignments Map()


The executor exited without any error:

21/08/10 07:30:32 INFO MemoryStore: Block rdd_37_8 stored as values in memory (estimated size 24.0 B, free 16.1 MiB)
21/08/10 07:30:32 INFO Executor: Finished task 8.0 in stage 6.0 (TID 90). 1552 bytes result sent to driver
21/08/10 07:30:32 INFO CoarseGrainedExecutorBackend: Got assigned task 91
21/08/10 07:30:32 INFO Executor: Running task 9.0 in stage 6.0 (TID 91)
21/08/10 07:30:32 INFO BlockManager: Found block rdd_6_9 locally
21/08/10 07:30:32 INFO OneDAL: Partition index: 9, numCols: 20, numRows: 168405
21/08/10 07:30:32 INFO OneDAL: Loading native libraries
21/08/10 07:30:32 INFO MemoryStore: Block rdd_37_9 stored as values in memory (estimated size 24.0 B, free 16.1 MiB)
21/08/10 07:30:32 INFO Executor: Finished task 9.0 in stage 6.0 (TID 91). 1552 bytes result sent to driver
21/08/10 07:30:32 INFO CoarseGrainedExecutorBackend: Got assigned task 92
21/08/10 07:30:32 INFO Executor: Running task 10.0 in stage 6.0 (TID 92)
21/08/10 07:30:32 INFO BlockManager: Found block rdd_6_10 locally
21/08/10 07:30:32 INFO OneDAL: Partition index: 10, numCols: 20, numRows: 168920
21/08/10 07:30:32 INFO OneDAL: Loading native libraries
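
Note that the driver log above shows executor 1 being removed for missing heartbeats (137201 ms, exceeding the 120000 ms timeout) while the native OneDAL/OneCCL call was running. A hedged way to check whether the heartbeat timeout alone triggers the restart would be to relax Spark's liveness timeouts (illustrative values; spark.executor.heartbeatInterval and spark.network.timeout are standard Spark confs, not OAP MLlib settings):

sh ./spark-submit-client.sh \
  --conf spark.executor.heartbeatInterval=60s \
  --conf spark.network.timeout=600s \
  ...   # remaining confs, class and arguments as in the command above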

cc @xwu99 @carsonwang

@zhixingheyi-tian

We added another test to make the executor pod name (and thus the hostname) shorter, with "spark.kubernetes.executor.podNamePrefix=oapmllib":

sh ./spark-submit-client.sh --conf spark.executor.instances=2 \
  --name oap-mllib  \
  --conf spark.kubernetes.executor.podNamePrefix=oapmllib  \
  --conf spark.executor.memoryOverhead=512m  \
  --conf spark.executor.cores=1  \
  --conf spark.default.parallelism=2  \
  --conf spark.sql.shuffle.partitions=2  \
  --conf spark.shuffle.reduceLocality.enabled=false  \
  --conf spark.driver.extraClassPath=/opt/home/conda/envs/oap-1.2.0.emr630/oap_jars/oap-mllib-1.2.0.jar \
  --conf spark.executor.extraClassPath=/opt/home/conda/envs/oap-1.2.0.emr630/oap_jars/oap-mllib-1.2.0.jar \
  --class com.intel.hibench.sparkbench.ml.DenseKMeans /opt/home/tools/HiBench/sparkbench/assembly/target/sparkbench-assembly-8.0-SNAPSHOT-dist.jar -k 10 --numIterations 5 --storageLevel MEMORY_ONLY --initMode Random s3a://aws-emr-resources-348941870272-us-east-2/datagen/HiBench/Kmeans/Input/3000000/samples

The executor still restarted:

21/08/11 06:45:28 INFO TorrentBroadcast: Reading broadcast variable 10 took 26 ms
21/08/11 06:45:28 INFO MemoryStore: Block broadcast_10 stored as values in memory (estimated size 40.8 KiB, free 161.5 MiB)
21/08/11 06:45:28 INFO BlockManager: Found block rdd_39_0 locally
21/08/11 06:45:28 INFO OneCCL: Initializing with IP_PORT: 192.168.60.79_3000
OneCCL (native): init
"client: barrier read data": read: can not process all data, size 4, shift 0

@xwu99 xwu99 changed the title Encounter executors restart issue, when run Hibench Kmeans workload on AWS EKS environment, with Hibench small scale dataset [K8S] Encounter executors restart issue, when run Hibench Kmeans workload on AWS EKS environment, with Hibench small scale dataset Aug 12, 2021