Why am I not able to run sparkPi example on a Kubernetes (K8s) cluster?

Question

I have a K8s cluster up and running on VMs inside VMware Workstation. I'm trying to deploy a Spark application natively, following the official documentation. I also came across an article that I felt explained it more clearly.

Earlier, my setup ran inside nested VMs: my machine runs Windows 10, and I had an Ubuntu VM inside which three more VMs ran the cluster (not the best idea, I know).

Following the article mentioned, I first created a service account called spark inside the cluster. I then created a clusterrolebinding called spark-role that grants the edit clusterrole to the spark service account, so that the Spark driver pod has sufficient permissions.

I then tried to run the example SparkPi job using this command line:
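
The service-account setup described above can be sketched with kubectl roughly as follows. This is only a sketch under assumptions: the question does not say which namespace was used, so the default namespace is assumed, and the function name setup_spark_rbac is made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: creates the "spark" service account and binds the
# built-in "edit" ClusterRole to it. The namespace is an assumption and
# defaults to "default".
setup_spark_rbac() {
  namespace="${1:-default}"

  # Service account that the Spark driver pod will run as.
  kubectl create serviceaccount spark --namespace="$namespace" || return 1

  # Bind the "edit" ClusterRole to that service account so the driver
  # has permission to create and delete executor pods.
  kubectl create clusterrolebinding spark-role \
    --clusterrole=edit \
    --serviceaccount="$namespace:spark"
}
```

Calling setup_spark_rbac once (optionally with a namespace argument) runs both commands. Afterwards, kubectl auth can-i create pods --as=system:serviceaccount:default:spark should answer yes if the binding took effect.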

bin/spark-submit \
  --master k8s://https://<k8-cluster-ip>:<k8-cluster-port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=kmaster:5000/spark:latest \
  --conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  local:///opt/spark/examples/jars/spark-examples_2.11-2.4.5.jar 100

It fails within a few seconds of creating the driver pod: the pod goes into the Running state and, about 3 seconds later, into the Error state.

Running kubectl logs spark-pi-driver gives me this log.

The second Caused by: in the log is always one of the following:

Caused by: java.net.SocketException: Broken pipe (Write failed) or,
Caused by: okhttp3.internal.http2.ConnectionShutdownException
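
To pull just the exception chain out of a long driver log, the output can be filtered for its Caused by: lines. A minimal sketch; the function name caused_by_chain is made up for illustration, and spark-pi-driver is the driver pod name from the command above.

```shell
#!/bin/sh
# Hypothetical helper: reads a log on stdin and prints only the
# "Caused by:" lines, numbered in the order they appear.
caused_by_chain() {
  grep 'Caused by:' | awk '{ printf "%d. %s\n", NR, $0 }'
}
```

Usage, assuming the driver pod still exists: kubectl logs spark-pi-driver | caused_by_chain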

Log #2 for reference.

After hitting dead ends with this, I tried --deploy-mode client to see whether it makes a difference and to get more verbose logs. You can read about the difference between client and cluster mode here.

In client mode the job still fails. However, I can now see that each time the driver (now running as a process on the local machine rather than as a pod) tries to create an executor pod, that pod goes into a Terminated state, and the driver loops indefinitely, creating a new executor pod with an incremented counter appended to the pod name. I can also see the Spark UI on port 4040, but the job doesn't move forward because it is stuck trying to create even a single executor pod.

I get this log.

To me, this suggested that it might be a resource crunch.

So to be sure, I deleted the nested VMs, set up 2 new VMs on my main machine, connected them using a NAT network, and set up the same K8s cluster.

But now when I try the exact same thing, it fails with the same error (Broken pipe/ConnectionShutdownException), except that now it fails even at creating the driver pod.

This is the log for reference.

Now I can't even fetch logs to see why it fails, because the driver pod is never created.

I've racked my brain over this and can't figure out why it's failing. I tried a lot of things to rule them out, but so far nothing has worked except one (which is a completely different solution).

I tried the spark-on-k8s-operator from GCP and it worked for me. I wasn't able to see the Spark UI because the job runs only briefly, but it prints the Pi value in the shell window, so I know it works. I'm guessing that even this spark-on-k8s-operator does the same thing internally, but I really need to be able to deploy natively, or at least know why it fails.

Any help here will be appreciated (I know it's a long post). Thank you.

Answer 1:

Make sure the Kubernetes version you are deploying is compatible with the Spark version you are using.

Apache Spark uses the Kubernetes Client library to communicate with the Kubernetes cluster.

As of today, the latest LTS Spark version is 2.4.5, which includes Kubernetes Client version 4.6.3.

According to the Kubernetes Client compatibility matrix, the supported Kubernetes versions go all the way up to v1.17.0.

In my personal experience, Apache Spark 2.4.5 works well with Kubernetes v1.15.3. I have had problems with more recent versions.

When an unsupported Kubernetes version is used, the logs look like the ones you describe:

Caused by: java.net.SocketException: Broken pipe (Write failed) or,
Caused by: okhttp3.internal.http2.ConnectionShutdownException
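
A quick way to check a cluster against that limit is to compare version strings before submitting the job. This is a minimal sketch, not part of Spark or kubectl: the function names are made up for illustration, the v1.17.0 ceiling is the figure quoted above from the compatibility matrix, and sort -V (version sort, as in GNU coreutils) is assumed to be available.

```shell
#!/bin/sh
# Hypothetical helper: succeeds if version $1 is lower than or equal to
# version $2. A leading "v" is ignored. Relies on "sort -V" (version sort),
# which is assumed to be available.
version_le() {
  a="${1#v}"
  b="${2#v}"
  lowest="$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n 1)"
  [ "$lowest" = "$a" ]
}

# Hypothetical helper: reports whether a Kubernetes server version is
# within the highest version the client library supports
# (default v1.17.0, the figure quoted in this answer).
check_k8s_version() {
  server="$1"
  max="${2:-v1.17.0}"
  if version_le "$server" "$max"; then
    echo "OK: $server is within the supported range (up to $max)"
  else
    echo "WARNING: $server is newer than $max and may not be supported"
    return 1
  fi
}
```

For example, check_k8s_version v1.18.0 prints the warning and returns a non-zero status, while check_k8s_version v1.15.3 reports OK. The server version to pass in is the one shown by kubectl version.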

Answer 2:

I faced the exact same issue with v1.18.0; downgrading to v1.15.3 made it work:

minikube start --cpus=4 --memory=4048 --kubernetes-version v1.15.3

Source: https://stackoverflow.com/questions/61565751/why-am-i-not-able-to-run-sparkpi-example-on-a-kubernetes-k8s-cluster

Tags

apache-spark

Kubernetes

rbac

kubernetes-pod