@Nikhilpa1, you passed: |
-
I'm trying to run a Spark job on GPU with the RAPIDS Shuffle Manager, using:
UCX 1.14
Spark 3.2.0
rapids-4-spark_2.12-22.10.0
The job is submitted with the following configuration:
$SPARK_HOME/bin/spark-submit \
--master spark://${MASTER_HOST}:7077 \
--conf spark.executor.extraClassPath=${SPARK_RAPIDS_PLUGIN_JAR} \
--conf spark.driver.extraClassPath=${SPARK_RAPIDS_PLUGIN_JAR} \
--conf spark.plugins=com.nvidia.spark.SQLPlugin \
--conf spark.shuffle.manager=com.nvidia.spark.rapids.spark320.RapidsShuffleManager \
--conf spark.rapids.sql.concurrentGpuTasks=1 \
--conf spark.driver.memory=64G \
--conf spark.executor.memory=64G \
--conf spark.executor.cores=1 \
--conf spark.executor.instances=1 \
--conf spark.task.cpus=1 \
--conf spark.executor.resource.gpu.amount=1 \
--conf spark.task.resource.gpu.amount=0.25 \
--conf spark.rapids.memory.pinnedPool.size=1G \
--conf spark.sql.files.maxPartitionBytes=128m \
--conf spark.rapids.shuffle.mode=UCX \
--conf spark.shuffle.service.enabled=false \
--conf spark.dynamicAllocation.enabled=false \
--conf spark.executorEnv.LD_LIBRARY_PATH=/home/kanaka.3/others/ucx/ucx-1.14-ins/lib:/home/kanaka.3/others/knem/knem-1.1.4-ins/lib \
--conf spark.driverEnv.LD_LIBRARY_PATH=/home/kanaka.3/others/ucx/ucx-1.14-ins/lib::/home/kanaka.3/others/knem/knem-1.1.4-ins/lib \
--conf spark.executorEnv.UCX_ERROR_SIGNALS= \
--conf spark.executorEnv.UCX_MEMTYPE_CACHE=n \
--conf spark.executorEnv.UCX_IB_RX_QUEUE_LEN=1024 \
--conf spark.executorEnv.UCX_TLS=rc_x,cuda_copy,cuda_ipc \
--conf spark.executorEnv.UCX_RNDV_SCHEME=put_zcopy \
--conf spark.executorEnv.UCX_MAX_RNDV_RAILS=1 \
${SPARK_JOBS}/myjob.py
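One thing that may be worth ruling out with a configuration like this is a malformed library path: the driver value above contains a doubled colon, which produces an empty entry, and the dynamic linker treats an empty entry as the current directory. Whether that has anything to do with the error below is not established. The following is only a minimal sketch using the Python standard library; `check_library_path` is a hypothetical helper name, not part of Spark, RAPIDS, or UCX.

```python
import os


def check_library_path(value):
    """Return (entry, problem) pairs for an LD_LIBRARY_PATH-style string."""
    problems = []
    for index, entry in enumerate(value.split(":")):
        if entry == "":
            # An empty entry is interpreted by the dynamic linker as the current directory.
            problems.append((f"<empty entry at position {index}>", "empty"))
        elif not os.path.isdir(entry):
            # The directory does not exist on the machine running this check.
            problems.append((entry, "not a directory"))
    return problems


if __name__ == "__main__":
    # Value copied from the driver setting in the command above.
    driver_path = (
        "/home/kanaka.3/others/ucx/ucx-1.14-ins/lib::"
        "/home/kanaka.3/others/knem/knem-1.1.4-ins/lib"
    )
    for entry, problem in check_library_path(driver_path):
        print(f"{problem}: {entry}")
```

The check only says something about the machine it runs on, so it would need to be run on each node where a driver or executor starts.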
I'm getting a connection error:
0/stderr:75:23/04/17 12:06:13 INFO UCXClientConnection: UCX Client UCXClientConnection(ucx=com.nvidia.spark.rapids.shuffle.ucx.UCX@100838be, peerExecutorId=1) started
0/stderr:76:23/04/17 12:06:13 ERROR UCX: UcpListener detected an error for executorId 1: UCXError(-6,Destination is unreachable)
0/stderr:77:23/04/17 12:06:13 WARN UCX: Removing endpoint UcpEndpoint(id=47937671962800, UcpEndpointParams{errorHandlingMode=UCP_ERR_HANDLING_MODE_PEER,socketAddress=/10.1.1.3:4195,) for 1
0/stderr:78:23/04/17 12:06:13 WARN UCX: Removed stale client connection for 1
0/stderr:79:23/04/17 12:06:13 ERROR UCX: Error while closing ep. Ignoring.
0/stderr:80:org.openucx.jucx.UcxException: Destination is unreachable
0/stderr:81: at org.openucx.jucx.ucp.UcpEndpoint.closeNonBlockingNative(Native Method)
0/stderr:82: at org.openucx.jucx.ucp.UcpEndpoint.closeNonBlockingFlush(UcpEndpoint.java:376)
0/stderr:83: at com.nvidia.spark.rapids.shuffle.ucx.UCX$UcpEndpointManager.$anonfun$closeEndpointOnWorkerThread$1(UCX.scala:900)
0/stderr:84: at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$5(UCX.scala:184)
0/stderr:85: at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$5$adapted(UCX.scala:178)
0/stderr:86: at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
0/stderr:87: at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
0/stderr:88: at com.nvidia.spark.rapids.shuffle.ucx.UCX.withResource(UCX.scala:69)
0/stderr:89: at com.nvidia.spark.rapids.shuffle.ucx.UCX.$anonfun$init$2(UCX.scala:178)
0/stderr:90: at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
0/stderr:91: at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
0/stderr:92: at com.nvidia.spark.rapids.GpuDeviceManager$$anon$1.$anonfun$newThread$1(GpuDeviceManager.scala:345)
0/stderr:93: at java.lang.Thread.run(Thread.java:748)
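Since the reported error is "Destination is unreachable", a quick first check could be whether the peer address from the log accepts a plain TCP connection from the failing node. The sketch below assumes the `socketAddress=/host:port` format seen in the log above; `extract_socket_addresses` and `can_connect` are hypothetical helper names. A successful TCP connect only shows basic reachability, not that the UCX transports selected by `UCX_TLS` will work.

```python
import re
import socket

# Matches the "socketAddress=/10.1.1.3:4195" fragment printed in the log above.
SOCKET_ADDRESS = re.compile(r"socketAddress=/([^:,\s]+):(\d+)")


def extract_socket_addresses(log_text):
    """Return unique (host, port) pairs in order of first appearance."""
    seen = []
    for host, port in SOCKET_ADDRESS.findall(log_text):
        pair = (host, int(port))
        if pair not in seen:
            seen.append(pair)
    return seen


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


if __name__ == "__main__":
    import sys

    # Read executor log text from standard input and probe each peer address found.
    for host, port in extract_socket_addresses(sys.stdin.read()):
        status = "reachable" if can_connect(host, port) else "unreachable"
        print(f"{host}:{port} {status}")
```

The listener port only exists while the executor is running, so the probe is meaningful only during a live job.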
I validated UCX communication between the nodes with ucx_perftest:
UCX_NET_DEVICES=mlx5_0:1 UCX_SOCKADDR_TLS_PRIORITY=sockcm UCX_TLS=rc_x,cuda_copy,cuda_ipc ./ucx_perftest -t tag_bw -m cuda -n 4000 -w 500 -c 0 -s 134217728 -p 13381