Question
I have a JSON file in which one of the columns is an XML string.
I tried extracting this field and writing it to a file in a first step, then reading that file in the next step. But each row carries its own XML header tag, so the resulting file is not a valid XML file.
How can I use the PySpark XML parser ('com.databricks.spark.xml') to read this string and parse out the values?
The following doesn't work:
tr = spark.read.json( "my-file-path")
trans_xml = sqlContext.read.format('com.databricks.spark.xml').options(rowTag='book').load(tr.select("trans_xml"))
Thanks, Ram.
Answer 1:
Try Hive XPath UDFs (LanguageManual XPathUDF):
>>> from pyspark.sql.functions import expr
>>> df.select(expr("xpath({0}, '{1}')".format(column_name, xpath_expression)))
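To make the XPath route concrete, here is a minimal sketch. It assumes the XML lives in the trans_xml column from the question and that each document has a book root with title and author children; those element names, the aliases, and the helper names xpath_expr and extract_with_xpath are hypothetical. xpath returns an array of matches, while xpath_string returns the text of the first match.

```python
def xpath_expr(function, column, path, alias):
    """Build a Spark SQL expression string, e.g.
    xpath_string(trans_xml, '/book/title') AS title."""
    return "{0}({1}, '{2}') AS {3}".format(function, column, path, alias)


# Hypothetical element names; adjust the paths to the real document layout.
EXPRESSIONS = [
    xpath_expr("xpath_string", "trans_xml", "/book/title", "title"),
    xpath_expr("xpath", "trans_xml", "/book/author/text()", "authors"),
]


def extract_with_xpath(df):
    """Apply the expressions to a DataFrame that has a trans_xml string column."""
    return df.selectExpr(*EXPRESSIONS)
```

With the DataFrame from the question this would be called as extract_with_xpath(spark.read.json("my-file-path")). Each row is evaluated on its own, so there is no need to merge the strings into a single XML file first.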
or a Python UDF:
>>> from pyspark.sql.types import *
>>> from pyspark.sql.functions import udf
>>> import xml.etree.ElementTree as ET
>>> schema = ... # Define schema
>>> def parse(s):
...     root = ET.fromstring(s)
...     result = ...  # Select values
...     return result
>>> df.select(udf(parse, schema)(xml_column))
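One way the placeholders in the Python UDF could be filled in is sketched below. The field names in FIELDS, the output column name parsed, and the helper add_parsed_column are assumptions for illustration, not part of the original answer. Every string is parsed as an independent document, so the per-row XML header is not a problem here.

```python
import xml.etree.ElementTree as ET

# Hypothetical fields; replace with the elements your XML actually contains.
FIELDS = ("title", "author")


def parse(s):
    """Return a tuple of text values for FIELDS; all None if s is missing or malformed."""
    if s is None:
        return (None,) * len(FIELDS)
    try:
        root = ET.fromstring(s)
    except ET.ParseError:
        return (None,) * len(FIELDS)
    # findtext returns None when a direct child element is absent.
    return tuple(root.findtext(name) for name in FIELDS)


def add_parsed_column(df, xml_column="trans_xml"):
    """Attach a struct column named 'parsed' built from the XML string column."""
    # Imported here so parse() can be tried out without a Spark installation.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StructType, StructField, StringType

    schema = StructType([StructField(name, StringType()) for name in FIELDS])
    return df.withColumn("parsed", udf(parse, schema)(df[xml_column]))
```

After add_parsed_column(tr), individual values can be selected as parsed.title and parsed.author. Returning None values for malformed rows instead of raising keeps one bad record from failing the whole job; whether that is the right trade-off depends on the data.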
Source: https://stackoverflow.com/questions/40445816/load-xml-string-from-column-in-pyspark