Install com.databricks.spark.xml on EMR cluster

情到浓时终转凉 submitted on 2020-04-30 11:43:29

Question


Does anyone know how to install the com.databricks.spark.xml package on an EMR cluster?

I managed to connect to the EMR master node, but I don't know how to install packages on the cluster.

Code:

sc.install_pypi_package("com.databricks.spark.xml")

Answer 1:


`sc.install_pypi_package` only installs Python packages from PyPI; spark-xml is a JVM (Scala) library distributed through Maven, so you need to put its jar on Spark's classpath instead. On the EMR master node:

cd /usr/lib/spark/jars
sudo wget https://repo1.maven.org/maven2/com/databricks/spark-xml_2.11/0.9.0/spark-xml_2.11-0.9.0.jar

Make sure to select the jar that matches your Spark and Scala versions (the `_2.11` suffix above refers to the Scala version), following the guidelines at https://github.com/databricks/spark-xml.
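As an alternative to copying the jar by hand, Spark can also resolve the package from Maven at submit time with `--packages`. This is a sketch only: the script name `my_job.py` is a placeholder, and the `0.9.0` / Scala `2.11` coordinates are assumptions you should match to your own Spark version.

```shell
# Sketch: let Spark pull spark-xml (and its dependencies) from Maven Central
# at launch time instead of placing the jar in /usr/lib/spark/jars.
# Coordinates are an assumption for Spark 2.x (Scala 2.11); adjust as needed.
spark-submit \
  --packages com.databricks:spark-xml_2.11:0.9.0 \
  my_job.py
```

The same coordinate string can be set as the `spark.jars.packages` configuration property if you prefer configuring the session rather than the launch command.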

Then, launch your Jupyter notebook and you should be able to run the following:

df = spark.read.format('com.databricks.spark.xml').options(rootTag='objects').options(rowTag='object').load("s3://bucket-name/sample.xml")
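To clarify what `rootTag` and `rowTag` select, here is a small illustration using only the Python standard library (no Spark required). The XML structure and field names are made up for the example: with `rowTag='object'`, each `<object>` element inside the `<objects>` root becomes one DataFrame row, and its child elements become columns.

```python
import xml.etree.ElementTree as ET

# Hypothetical document shaped like the question's sample.xml:
# <objects> is the rootTag, each <object> is one row (rowTag).
sample = """<objects>
  <object><id>1</id><name>first</name></object>
  <object><id>2</id><name>second</name></object>
</objects>"""

root = ET.fromstring(sample)       # the <objects> root element
rows = root.findall("object")      # the elements spark-xml would turn into rows

print(len(rows))                   # number of rows: 2
print([r.find("id").text for r in rows])   # the 'id' column values
```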


Source: https://stackoverflow.com/questions/60298628/install-com-databricks-spark-xml-on-emr-cluster
