This repository has been archived by the owner on Jan 9, 2020. It is now read-only.

SparkPi Example: java.nio.channels.UnresolvedAddressException #523

Closed
apurvann opened this issue Oct 11, 2017 · 22 comments

@apurvann

I am trying to follow the official documentation for running the SparkPi example and am encountering the following problem.

We create the Kubernetes cluster using Magnum (https://wiki.openstack.org/wiki/Magnum) and run the example with the following command:

bin/spark-submit \
  --deploy-mode cluster \
  --master k8s://https://<MY-KUBE-MASTER-IP>:6443 \
  --kubernetes-namespace default \
  --conf spark.executor.instances=2 \
  --conf spark.app.name=spark-pi \
  --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver-py:v2.2.0-kubernetes-0.4.0 \
  --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor-py:v2.2.0-kubernetes-0.4.0 \
  --jars local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.4.0.jar \
    local:///opt/spark/examples/src/main/python/pi.py 10

Has anyone run into this before? Which address does the exception refer to, considering I have already given it the master IP?

2017-10-11 16:35:52 INFO SparkContext:54 - Running Spark version 2.2.0-k8s-0.4.0
2017-10-11 16:35:52 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-10-11 16:35:52 INFO SparkContext:54 - Submitted application: PythonPi
2017-10-11 16:35:52 INFO SecurityManager:54 - Changing view acls to: root
2017-10-11 16:35:52 INFO SecurityManager:54 - Changing modify acls to: root
2017-10-11 16:35:52 INFO SecurityManager:54 - Changing view acls groups to:
2017-10-11 16:35:52 INFO SecurityManager:54 - Changing modify acls groups to:
2017-10-11 16:35:52 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); groups with view permissions: Set(); users with modify permissions: Set(root); groups with modify permissions: Set()
2017-10-11 16:35:53 ERROR SparkContext:91 - Error initializing SparkContext.
java.nio.channels.UnresolvedAddressException
	at sun.nio.ch.Net.checkAddress(Net.java:101)
	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
	at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:127)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:501)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1218)
	at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:496)
	at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:481)
	at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:965)
	at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:210)
	at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:353)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
	at java.lang.Thread.run(Thread.java:748)
2017-10-11 16:35:53 INFO SparkContext:54 - Successfully stopped SparkContext
Traceback (most recent call last):
  File "/opt/spark/examples/src/main/python/pi.py", line 32, in <module>
    .appName("PythonPi")\
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/session.py", line 169, in getOrCreate
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 334, in getOrCreate
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 118, in __init__
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 180, in _do_init
  File "/opt/spark/python/lib/pyspark.zip/pyspark/context.py", line 273, in _initialize_context
  File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1401, in __call__
  File "/opt/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.api.java.JavaSparkContext.
: java.nio.channels.UnresolvedAddressException
	at sun.nio.ch.Net.checkAddress(Net.java:101)
	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
	at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:127)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:501)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1218)
	at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:496)
	at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:481)
	at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:965)
	at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:210)
	at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:353)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
	at java.lang.Thread.run(Thread.java:748)
Exception in thread "main" org.apache.spark.SparkUserAppException: User application exited with 1
	at org.apache.spark.deploy.PythonRunner$.main(PythonRunner.scala:97)
	at org.apache.spark.deploy.PythonRunner.main(PythonRunner.scala)

apurvann changed the title from "SparkPi Example:" to "SparkPi Example: java.nio.channels.UnresolvedAddressException" on Oct 11, 2017
@mccheah

mccheah commented Oct 11, 2017

In the driver pod, the driver attempts to connect to kubernetes.default.svc, which in turn requires kube-dns to be installed.
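
A quick way to check for kube-dns (assuming the standard deployment under kube-system; the label and the throwaway test pod below are illustrative):

kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl run dns-test --rm -it --image=busybox --restart=Never -- nslookup kubernetes.default.svc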

@mccheah

mccheah commented Oct 11, 2017

Also can we try this with Scala's SparkPi instead of Python?
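
For reference, a Scala invocation adapted from the submission command above might look like this (the non-Python image names are assumed to mirror the Python ones):

bin/spark-submit \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --master k8s://https://<MY-KUBE-MASTER-IP>:6443 \
  --kubernetes-namespace default \
  --conf spark.executor.instances=2 \
  --conf spark.app.name=spark-pi \
  --conf spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.4.0 \
  --conf spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.2.0-kubernetes-0.4.0 \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.2.0-k8s-0.4.0.jar 10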

@kimoonkim
Member

Just to add one more data point: I also saw this same error occurring during an integration test off #514. I ran the test manually in IntelliJ and saw it happen, even though that minikube had kube-dns running.

$ kubectl logs -n 7f8bc76fe2084a6e9259bc491f85f3d9 spark-test-app-1507154737198-driver
2017-10-04 22:04:04 ERROR SparkContext:91 - Error initializing SparkContext.
java.nio.channels.UnresolvedAddressException
	at sun.nio.ch.Net.checkAddress(Net.java:101)
	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
	at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:127)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:501)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1218)
	at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:496)
	at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:481)
	at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:965)
	at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:210)
	at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:353)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
	at java.lang.Thread.run(Thread.java:748)
Exception in thread "main" java.nio.channels.UnresolvedAddressException
	at sun.nio.ch.Net.checkAddress(Net.java:101)
	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
	at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:127)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:501)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1218)
	at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:496)
	at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:481)
	at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:965)
	at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:210)
	at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:353)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
	at java.lang.Thread.run(Thread.java:748)
2017-10-04 22:04:04 INFO SparkContext:54 - Successfully stopped SparkContext

@ifilonenko
Member

ifilonenko commented Oct 12, 2017

Just to checkpoint a working state where I have not seen this error: upon merging #414 this error did not exist. It seems to have been introduced by the most recent pushes, maybe #483. @mccheah @foxish

@mccheah

mccheah commented Oct 12, 2017

Can we double check that the headless service is actually being created in the cases where this error occurs? kubectl get svc, kubectl describe svc, etc.
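
Concretely, with placeholders for the namespace and the driver service name:

kubectl get svc -n <namespace>
kubectl describe svc <driver-service-name> -n <namespace>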

@kimoonkim
Member

Sure, let me try to reproduce this and double-check if the headless service exists.

@apurvann
Author

@mccheah In my case, I do not have kube-dns running, and I don't know how to enable it because we use Magnum to deploy the cluster. I did run the SparkPi example using Scala as well, but no luck there either.

@mccheah

mccheah commented Oct 16, 2017

Can we see the stack trace for the Scala job?

@mccheah

mccheah commented Oct 16, 2017

Actually, @kimoonkim has given us a stack trace already. I'd like to see if it's the same as what @apurva3000 is seeing.

@SOmeONee

SOmeONee commented Oct 17, 2017

In DriverAddressConfigurationStep.scala, the driver hostname is:
val driverHostname = s"${driverService.getMetadata.getName}.$namespace.svc.cluster.local"
so the kubelet must be using:
cluster_domain=cluster.local

But I think this should be passed in via config.
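
For illustration, a sketch of how the domain could be made configurable (the config key and the surrounding names are hypothetical, not taken from this repo):

// Hypothetical: read the cluster domain from a config key instead of hardcoding it.
val clusterDomain = submissionSparkConf.get("spark.kubernetes.clusterDomain", "cluster.local")
val driverHostname = s"${driverService.getMetadata.getName}.$namespace.svc.$clusterDomain"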

@kimoonkim
Member

I just tried to reproduce this in the same way as before, but this time it succeeded. Here's the kubectl describe service output:

$ kubectl describe service -n 31c2f493de2641369b2761a0e44fa28c spark-test-app-1508285351080-driver-svc
Name:			spark-test-app-1508285351080-driver-svc
Namespace:		31c2f493de2641369b2761a0e44fa28c
Labels:			<none>
Annotations:		<none>
Selector:		spark-app-locator=05bb57e0f1084af2b2370d6c04d1c376,spark-app-selector=spark-37307be4c1664e9ea0ee6794c374817d,spark-role=driver
Type:			ClusterIP
IP:			None
Port:			driver-rpc-port	7078/TCP
Endpoints:		172.17.0.4:7078
Port:			blockmanager	7079/TCP
Endpoints:		172.17.0.4:7079
Session Affinity:	None
Events:			<none>

I wonder if we have a race condition here. If kube-dns is slow to add an entry for the headless service, then the driver fails while attempting to listen on the address. Maybe the driver should just use the underlying pod IP address instead of the service address for listening? (I don't know if this suggestion makes sense)

@ifilonenko
Member

re: @kimoonkim [@foxish, doesn't the service address resolve to the appropriate pod IP address? And could it be that there is such a race condition? I also don't understand why it would sometimes fail and other times not.]

@apurvann
Author

@mccheah I am seeing the same stack trace:

2017-10-18 08:11:59 ERROR SparkContext:91 - Error initializing SparkContext.
java.nio.channels.UnresolvedAddressException
	at sun.nio.ch.Net.checkAddress(Net.java:101)
	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
	at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:127)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:501)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1218)
	at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:496)
	at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:481)
	at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:965)
	at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:210)
	at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:353)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
	at java.lang.Thread.run(Thread.java:748)
2017-10-18 08:11:59 INFO  SparkContext:54 - Successfully stopped SparkContext
Exception in thread "main" java.nio.channels.UnresolvedAddressException
	at sun.nio.ch.Net.checkAddress(Net.java:101)
	at sun.nio.ch.ServerSocketChannelImpl.bind(ServerSocketChannelImpl.java:218)
	at io.netty.channel.socket.nio.NioServerSocketChannel.doBind(NioServerSocketChannel.java:127)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.bind(AbstractChannel.java:501)
	at io.netty.channel.DefaultChannelPipeline$HeadContext.bind(DefaultChannelPipeline.java:1218)
	at io.netty.channel.AbstractChannelHandlerContext.invokeBind(AbstractChannelHandlerContext.java:496)
	at io.netty.channel.AbstractChannelHandlerContext.bind(AbstractChannelHandlerContext.java:481)
	at io.netty.channel.DefaultChannelPipeline.bind(DefaultChannelPipeline.java:965)
	at io.netty.channel.AbstractChannel.bind(AbstractChannel.java:210)
	at io.netty.bootstrap.AbstractBootstrap$2.run(AbstractBootstrap.java:353)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:399)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:446)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:131)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:144)
	at java.lang.Thread.run(Thread.java:748)

Hope this helps.

@liyinan926
Member

Regarding the race condition suggested by @kimoonkim: the submission client creates the driver pod before creating the headless service and the other Kubernetes resources the driver depends on. @apurva3000, as @mccheah suggested, can you run kubectl get svc to see if the service was indeed created when the error occurred?

@kimoonkim
Member

@liyinan926 Sure, that ordering exists. But does it mean the headless service's DNS entry is actually created in kube-dns before the driver pod starts? The client code, or even the k8s API server, may just send an asynchronous request to kube-dns without waiting for the DNS entry creation to complete.

@liyinan926
Member

@kimoonkim AFAIK, the endpoints controller watches for services and creates corresponding endpoints when applicable. In this case, the headless service has a selector, so the endpoints controller will create the endpoints once it sees the service.
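
One way to watch for that directly (the service name and namespace are placeholders):

kubectl get endpoints <driver-service-name> -n <namespace>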

@kimoonkim
Member

But the endpoint backing the service in this case is the driver pod itself. So is it possible that whoever watches the endpoint, or kube-dns, is slower than the driver JVM and creates the DNS entry too late? It's not the first time slow DNS has affected us. I wouldn't be surprised.

@liyinan926
Member

@kimoonkim I think it's possible that the driver pod starts and tries to bind to the hostname derived from the service name before the endpoints controller creates the endpoint and the DNS records are updated, if the latter is slow.

@apurvann
Author

apurvann commented Oct 19, 2017

@liyinan926 Okay, I tried describing the service as well, and this is what I get:

Name:              spark-pi-1508314311201-driver-svc
Namespace:         default
Labels:            <none>
Annotations:       <none>
Selector:          spark-app-selector=spark-c3d3cf27405f4997ab445817ea0e1ec9,spark-role=driver
Type:              ClusterIP
IP:                None
Port:              driver-rpc-port  7078/TCP
TargetPort:        7078/TCP
Endpoints:         
Port:              blockmanager  7079/TCP
TargetPort:        7079/TCP
Endpoints:         
Session Affinity:  None
Events:            <none>


So clearly the endpoints are blank in my output. What exactly am I missing here? The fact that I do not have kube-dns running? (And also the fact that I do not know how to enable it in OpenStack Magnum.)

@kimoonkim
Member

Oh, if kube-dns is not running, then I think this won't work at all. I'm curious how your cluster does not have kube-dns; I thought kube-dns was a required component of k8s these days.

@liyinan926
Member

Yes, without the kube-dns addon, this won't work. The driver won't be able to resolve the fully-qualified driver service name ({driver-service}.{namespace}.svc.cluster.local) to the corresponding IP address it is trying to bind to.
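
One way to test that resolution from inside the cluster (using the service name from the describe output above; the throwaway test pod is illustrative):

kubectl run dns-test --rm -it --image=busybox --restart=Never -- \
  nslookup spark-pi-1508314311201-driver-svc.default.svc.cluster.local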

@liyinan926
Member

liyinan926 commented Oct 19, 2017

@kimoonkim @mccheah @ifilonenko I think @kimoonkim's point above about using the underlying driver pod's IP address makes sense, since the derived driver hostname (${driverService.getMetadata.getName}.$namespace.svc.cluster.local) resolves to that IP address anyway. By using the IP address directly, we can avoid the possible race condition. With this, the submission client still creates the headless service, but sets spark.driver.bindAddress to the IP address of the driver pod; spark.driver.host will still use the derived hostname, though.
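
A minimal sketch of how the submission client could wire this up via the fabric8 client's downward API support (the env var name and the surrounding wiring are assumptions for illustration, not this repo's actual code):

import io.fabric8.kubernetes.api.model.{EnvVar, EnvVarBuilder}

// Expose the driver pod's own IP to the driver container via the downward API.
val bindAddressEnvVar: EnvVar = new EnvVarBuilder()
  .withName("SPARK_DRIVER_BIND_ADDRESS")  // hypothetical env var name
  .withNewValueFrom()
    .withNewFieldRef()
      .withFieldPath("status.podIP")
    .endFieldRef()
  .endValueFrom()
  .build()

// The driver would then bind to the pod IP while still advertising the service name:
//   spark.driver.bindAddress = ${SPARK_DRIVER_BIND_ADDRESS}
//   spark.driver.host        = <driver-svc>.<namespace>.svc.cluster.local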
