I have a Jupyter notebook on Dataproc and I need a jar to run some job. I'm aware of editing spark-defaults.conf and of using --jars=gs://spark-lib/big
Unfortunately there isn't a built-in way to do this dynamically without effectively just editing spark-defaults.conf
and restarting the kernel. There's an open feature request in Spark for this.
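For reference, that static edit is just a one-line addition to spark-defaults.conf (assuming the standard Dataproc location /etc/spark/conf/spark-defaults.conf; the BigQuery connector URL here is only an example) followed by a kernel restart:

spark.jars gs://spark-lib/bigquery/spark-bigquery-latest.jar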
Zeppelin has some usability features for adding jars through the UI, but even there you have to restart the interpreter afterwards for the Spark context to pick them up in its classloader. Those options also require the jarfiles to already be staged on the local filesystem; you can't just point at remote file paths or URLs.
One workaround would be to create an init action which sets up a systemd service that regularly polls an HDFS directory and syncs any new jars into one of the existing classpath directories, such as /usr/lib/spark/jars:
#!/bin/bash
# Sets up continuous sync'ing of an HDFS directory into /usr/lib/spark/jars
# Manually copy jars into this HDFS directory to have them sync into
# ${LOCAL_DIR} on all nodes.
HDFS_DROPZONE='hdfs:///usr/lib/jars'
LOCAL_DIR='file:///usr/lib/spark/jars'
readonly ROLE="$(/usr/share/google/get_metadata_value attributes/dataproc-role)"
if [[ "${ROLE}" == 'Master' ]]; then
  hdfs dfs -mkdir -p "${HDFS_DROPZONE}"
fi
SYNC_SCRIPT='/usr/lib/hadoop/libexec/periodic-sync-jars.sh'
cat << EOF > "${SYNC_SCRIPT}"
#!/bin/bash
while true; do
  sleep 5
  hdfs dfs -ls ${HDFS_DROPZONE}/*.jar 2>/dev/null | grep hdfs: | \
      sed 's/.*hdfs:/hdfs:/' | xargs -n 1 basename 2>/dev/null | sort \
      > /tmp/hdfs_files.txt
  hdfs dfs -ls ${LOCAL_DIR}/*.jar 2>/dev/null | grep file: | \
      sed 's/.*file:/file:/' | xargs -n 1 basename 2>/dev/null | sort \
      > /tmp/local_files.txt
  comm -23 /tmp/hdfs_files.txt /tmp/local_files.txt > /tmp/diff_files.txt
  if [ -s /tmp/diff_files.txt ]; then
    for FILE in \$(cat /tmp/diff_files.txt); do
      echo "\$(date): Copying \${FILE} from ${HDFS_DROPZONE} into ${LOCAL_DIR}"
      hdfs dfs -cp "${HDFS_DROPZONE}/\${FILE}" "${LOCAL_DIR}/\${FILE}"
    done
  fi
done
EOF
chmod 755 "${SYNC_SCRIPT}"
SERVICE_CONF='/usr/lib/systemd/system/sync-jars.service'
cat << EOF > "${SERVICE_CONF}"
[Unit]
Description=Periodic Jar Sync
[Service]
Type=simple
ExecStart=/bin/bash -c '${SYNC_SCRIPT} &>> /var/log/periodic-sync-jars.log'
Restart=on-failure
[Install]
WantedBy=multi-user.target
EOF
chmod a+rw "${SERVICE_CONF}"
systemctl daemon-reload
systemctl enable sync-jars
systemctl restart sync-jars
systemctl status sync-jars
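To deploy this, save the script above to GCS and reference it as an init action at cluster-creation time. A minimal sketch, where the cluster name, bucket, and script name are all placeholders:

gcloud dataproc clusters create my-cluster \
    --initialization-actions gs://my-bucket/sync-jars-init.sh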
Then, whenever you need a jarfile to be available everywhere, you just copy it into hdfs:///usr/lib/jars, the periodic poller automatically sticks it into /usr/lib/spark/jars on all nodes, and you simply restart your kernel to pick it up. You can add jars to that HDFS directory either by SSH'ing in and running hdfs dfs -cp directly, or by shelling out with subprocess from your Jupyter notebook:
import subprocess

# Copy the connector jar from GCS into the HDFS dropzone watched by the
# sync service; it is then mirrored into /usr/lib/spark/jars on all nodes.
sp = subprocess.Popen(
    ['hdfs', 'dfs', '-cp',
     'gs://spark-lib/bigquery/spark-bigquery-latest.jar',
     'hdfs:///usr/lib/jars/spark-bigquery-latest.jar'],
    stdout=subprocess.PIPE,
    stderr=subprocess.PIPE)
out, err = sp.communicate()
print(out)
print(err)
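Once the jar has synced into /usr/lib/spark/jars and the kernel has been restarted, the classes should be on Spark's classpath. As a rough smoke test, assuming the spark-bigquery connector from the example above and the pre-created spark session in the Dataproc notebook kernel:

# The 'bigquery' data source only resolves if the connector jar made it
# onto the classpath; the public table here is just an illustration.
df = (spark.read.format('bigquery')
      .option('table', 'bigquery-public-data.samples.shakespeare')
      .load())
df.printSchema()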