Can't add jars to PySpark in a Jupyter notebook on Google Dataproc

小蘑菇 2021-01-15 08:07

I have a Jupyter notebook on Dataproc and I need a jar to run some job. I'm aware of editing spark-defaults.conf and using --jars=gs://spark-lib/bigquery/spark-bigquery-latest.jar when submitting from the command line, but is there a way to add the jar from within the running notebook?

1 Answer
  • 2021-01-15 08:18

    Unfortunately there isn't a built-in way to do this dynamically without effectively just editing spark-defaults.conf and restarting the kernel. There's an open feature request in Spark for this.
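
    For reference, here is a minimal sketch of that restart-based baseline, assuming the stock Dataproc layout where Spark reads /etc/spark/conf/spark-defaults.conf and using the BigQuery connector jar from the question:

    # Run on the master node, e.g. over SSH. If spark.jars is already set in the
    # file, merge the values into one comma-separated list instead of appending a
    # second line. A Jupyter kernel restart is still needed so that a fresh
    # SparkContext picks the jar up.
    echo "spark.jars gs://spark-lib/bigquery/spark-bigquery-latest.jar" \
      | sudo tee -a /etc/spark/conf/spark-defaults.conf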

    Zeppelin has some usability features for adding jars through the UI, but even in Zeppelin you have to restart the interpreter afterwards for the Spark context to pick them up in its classloader. Those options also require the jarfiles to already be staged on the local filesystem; you can't simply refer to remote file paths or URLs.

    One workaround is to create an initialization action that sets up a systemd service that regularly polls an HDFS directory and syncs any new jars into one of the existing classpath directories, such as /usr/lib/spark/jars:

    #!/bin/bash
    # Sets up continuous sync'ing of an HDFS directory into /usr/lib/spark/jars
    
    # Manually copy jars into this HDFS directory to have them sync into
    # ${LOCAL_DIR} on all nodes.
    HDFS_DROPZONE='hdfs:///usr/lib/jars'
    LOCAL_DIR='file:///usr/lib/spark/jars'
    
    readonly ROLE="$(/usr/share/google/get_metadata_value attributes/dataproc-role)"
    if [[ "${ROLE}" == 'Master' ]]; then
      hdfs dfs -mkdir -p "${HDFS_DROPZONE}"
    fi
    
    SYNC_SCRIPT='/usr/lib/hadoop/libexec/periodic-sync-jars.sh'
    cat << EOF > "${SYNC_SCRIPT}"
    #!/bin/bash
    while true; do
      sleep 5
      hdfs dfs -ls ${HDFS_DROPZONE}/*.jar 2>/dev/null | grep hdfs: | \
        sed 's/.*hdfs:/hdfs:/' | xargs -n 1 basename 2>/dev/null | sort \
        > /tmp/hdfs_files.txt
      hdfs dfs -ls ${LOCAL_DIR}/*.jar 2>/dev/null | grep file: | \
        sed 's/.*file:/file:/' | xargs -n 1 basename 2>/dev/null | sort \
        > /tmp/local_files.txt
      comm -23 /tmp/hdfs_files.txt /tmp/local_files.txt > /tmp/diff_files.txt
      if [ -s /tmp/diff_files.txt ]; then
        for FILE in \$(cat /tmp/diff_files.txt); do
          echo "$(date): Copying \${FILE} from ${HDFS_DROPZONE} into ${LOCAL_DIR}"
          hdfs dfs -cp "${HDFS_DROPZONE}/\${FILE}" "${LOCAL_DIR}/\${FILE}"
        done
      fi
    done
    EOF
    
    chmod 755 "${SYNC_SCRIPT}"
    
    SERVICE_CONF='/usr/lib/systemd/system/sync-jars.service'
    cat << EOF > "${SERVICE_CONF}"
    [Unit]
    Description=Periodic Jar Sync
    [Service]
    Type=simple
    ExecStart=/bin/bash -c '${SYNC_SCRIPT} &>> /var/log/periodic-sync-jars.log'
    Restart=on-failure
    [Install]
    WantedBy=multi-user.target
    EOF
    
    chmod a+r "${SERVICE_CONF}"
    
    systemctl daemon-reload
    systemctl enable sync-jars
    systemctl restart sync-jars
    systemctl status sync-jars
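
    Assuming you save the script above to a GCS bucket of your own (the filename, bucket, cluster, and region below are just placeholders), you would attach it as an initialization action when creating the cluster, for example:

    # Upload the init action, then create the cluster with it attached;
    # Dataproc runs the script on every node at creation time.
    gsutil cp sync-jars-init.sh gs://my-bucket/init/sync-jars-init.sh
    gcloud dataproc clusters create my-cluster \
      --region=us-central1 \
      --initialization-actions=gs://my-bucket/init/sync-jars-init.sh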
    

    Then, whenever you need a jarfile to be available everywhere, just copy it into hdfs:///usr/lib/jars; the periodic poller will automatically place it in /usr/lib/spark/jars, and a kernel restart will pick it up. You can add jars to that HDFS directory either by SSHing in and running hdfs dfs -cp directly, or simply by shelling out from your Jupyter notebook:

    import subprocess

    # Copy the connector jar from GCS into the HDFS dropzone watched by the
    # sync service; the poller will distribute it into /usr/lib/spark/jars.
    sp = subprocess.Popen(
        ['hdfs', 'dfs', '-cp',
         'gs://spark-lib/bigquery/spark-bigquery-latest.jar',
         'hdfs:///usr/lib/jars/spark-bigquery-latest.jar'],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE)
    out, err = sp.communicate()
    print(out.decode())
    print(err.decode())
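
    If you prefer the SSH route, the equivalent from your workstation would look something like this (the cluster name and zone are placeholders; Dataproc names the master node <cluster>-m):

    # Run the copy remotely on the master node via gcloud SSH.
    gcloud compute ssh my-cluster-m --zone=us-central1-b -- \
      hdfs dfs -cp \
      gs://spark-lib/bigquery/spark-bigquery-latest.jar \
      hdfs:///usr/lib/jars/spark-bigquery-latest.jar

    Either way, give the poller a few seconds to copy the jar into /usr/lib/spark/jars before restarting your kernel.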
    