Dependency issue with PySpark running on Kubernetes using spark-on-k8s-operator

Question

I have spent days now trying to figure out a dependency issue I'm experiencing with (Py)Spark running on Kubernetes. I'm using the spark-on-k8s-operator and Spark's Google Cloud connector.

When I submit my Spark job without a dependency using sparkctl create sparkjob.yaml ... with the .yaml file below, it works like a charm.

apiVersion: "sparkoperator.k8s.io/v1beta2"
kind: SparkApplication
metadata:
  name: spark-job
  namespace: my-namespace
spec:
  type: Python
  pythonVersion: "3"
  hadoopConf:
    "fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem"
    "fs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS"
    "fs.gs.project.id": "our-project-id"
    "fs.gs.system.bucket": "gcs-bucket-name"
    "google.cloud.auth.service.account.enable": "true"
    "google.cloud.auth.service.account.json.keyfile": "/mnt/secrets/keyfile.json"
  mode: cluster
  image: "image-registry/spark-base-image"
  imagePullPolicy: Always
  mainApplicationFile: ./sparkjob.py
  deps:
    jars:
      - https://repo1.maven.org/maven2/org/apache/spark/spark-sql-kafka-0-10_2.11/2.4.5/spark-sql-kafka-0-10_2.11-2.4.5.jar
  sparkVersion: "2.4.5"
  restartPolicy:
    type: OnFailure
    onFailureRetries: 3
    onFailureRetryInterval: 10
    onSubmissionFailureRetries: 5
    onSubmissionFailureRetryInterval: 20
  driver:
    cores: 1
    coreLimit: "1200m"
    memory: "512m"
    labels:
      version: 2.4.5
    serviceAccount: spark-operator-spark
    secrets:
    - name: "keyfile"
      path: "/mnt/secrets"
      secretType: GCPServiceAccount
    envVars:
      GCS_PROJECT_ID: our-project-id
  executor:
    cores: 1
    instances: 1
    memory: "512m"
    labels:
      version: 2.4.5
    secrets:
    - name: "keyfile"
      path: "/mnt/secrets"
      secretType: GCPServiceAccount
    envVars:
      GCS_PROJECT_ID: our-project-id

The Docker image spark-base-image is built with the following Dockerfile:

FROM gcr.io/spark-operator/spark-py:v2.4.5

RUN rm $SPARK_HOME/jars/guava-14.0.1.jar
ADD https://repo1.maven.org/maven2/com/google/guava/guava/28.0-jre/guava-28.0-jre.jar $SPARK_HOME/jars

ADD https://repo1.maven.org/maven2/com/google/cloud/bigdataoss/gcs-connector/hadoop2-2.0.1/gcs-connector-hadoop2-2.0.1-shaded.jar $SPARK_HOME/jars

ENTRYPOINT [ "/opt/entrypoint.sh" ]

The main application file is uploaded to GCS when the application is submitted, then fetched from there and copied into the driver pod when the application starts. The problem starts whenever I want to supply my own Python module deps.zip as a dependency so that I can use it in my main application file sparkjob.py.

Here's what I have tried so far:

1. I added the following lines to spec.deps in sparkjob.yaml

pyFiles:
   - ./deps.zip

which resulted in the operator not even being able to submit the Spark application, failing with the error

java.lang.RuntimeException: java.lang.ClassNotFoundException: Class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem not found

./deps.zip is successfully uploaded to the GCS bucket along with the main application file. However, while the main application file can be fetched from GCS (I see this in the logs of jobs with no dependencies, as defined above), ./deps.zip somehow cannot be fetched from there. I also tried adding the gcs-connector jar to the spec.deps.jars list explicitly, but nothing changed.

2. I added ./deps.zip to the base Docker image used for starting the driver and executor pods by adding COPY ./deps.zip /mnt/ to the above Dockerfile, and referenced the dependency in sparkjob.yaml via

pyFiles:
    - local:///mnt/deps.zip

This time the Spark job can be submitted and the driver pod is started; however, I get a file:/mnt/deps.zip not found error when the Spark context is being initialized. I also tried additionally setting ENV SPARK_EXTRA_CLASSPATH=/mnt/ in the Dockerfile, without success. I even tried to explicitly mount the whole /mnt/ directory into the driver and executor pods using volume mounts, but that didn't work either.

Edit:

My workaround (2), adding dependencies to the Docker image and setting ENV SPARK_EXTRA_CLASSPATH=/mnt/ in the Dockerfile actually worked! Turns out the tag didn't update and I've been using an old version of the Docker image all along. Duh.
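Since the failure in attempt 2 came down to a stale image, a small check at the top of sparkjob.py could make that kind of problem visible before the Spark context is created. The following is only a minimal sketch and not part of the original setup: the helper name describe_py_archive is hypothetical, and /mnt/deps.zip is the path used in the question.

```python
import os
import zipfile


def describe_py_archive(path):
    """Return a short report on a Python dependency archive.

    Hypothetical helper: it only inspects the local filesystem of the
    pod it runs in and does not touch Spark.
    """
    report = {"path": path, "exists": False, "is_zip": False, "top_level": []}
    if not os.path.isfile(path):
        return report
    report["exists"] = True
    if not zipfile.is_zipfile(path):
        return report
    report["is_zip"] = True
    with zipfile.ZipFile(path) as archive:
        # Collect the first path component of every entry in the archive.
        report["top_level"] = sorted(
            {name.split("/")[0] for name in archive.namelist()}
        )
    return report


if __name__ == "__main__":
    # Path assumed from the COPY ./deps.zip /mnt/ line in the question.
    print(describe_py_archive("/mnt/deps.zip"))
```

If the report shows exists as False inside the driver pod, the pod is probably running an image that was built before the COPY line was added, which is what happened here.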

I still don't know why the (more elegant) solution 1 via the gcs-connector isn't working, but it might be related to the error MountVolume.Setup failed for volume "spark-conf-volume".

Answer 1:

Use the Google Cloud Storage path to the python dependencies since they're uploaded there.

spec:
  deps:
    pyFiles:
      - gs://gcs-bucket-name/deps.zip
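Files listed under pyFiles are placed on the Python path, so the layout inside deps.zip matters: the package directory has to sit at the root of the archive. The sketch below illustrates that import mechanism with plain Python and no Spark involved. The package name mydeps and the function greet are hypothetical stand-ins for whatever deps.zip actually contains.

```python
import importlib
import os
import sys
import tempfile
import zipfile


def build_demo_zip(directory):
    """Create a deps.zip with a hypothetical package `mydeps` at its root."""
    zip_path = os.path.join(directory, "deps.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("mydeps/__init__.py", "")
        archive.writestr(
            "mydeps/helpers.py",
            "def greet(name):\n    return 'hello ' + name\n",
        )
    return zip_path


def import_from_zip(zip_path, module_name):
    """Put the archive on sys.path and import a module from it."""
    if zip_path not in sys.path:
        sys.path.insert(0, zip_path)
    # Make sure the import system notices the newly created archive.
    importlib.invalidate_caches()
    return importlib.import_module(module_name)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        helpers = import_from_zip(build_demo_zip(tmp), "mydeps.helpers")
        print(helpers.greet("spark"))
```

With an archive laid out like this and listed under spec.deps.pyFiles, sparkjob.py would presumably just use from mydeps import helpers; the sys.path handling above is only there to imitate that behaviour locally.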

Source: https://stackoverflow.com/questions/62448894/dependency-issue-with-pyspark-running-on-kubernetes-using-spark-on-k8s-operator

Tags

Docker

apache-spark

Kubernetes

pyspark

dependency-management