Question:
I am trying to read XML / nested XML in PySpark using the spark-xml jar:
df = sqlContext.read \
.format("com.databricks.spark.xml") \
.option("rowTag", "hierachy") \
.load("test.xml")
When I execute this, the DataFrame is not created properly:
+--------------------+
| att|
+--------------------+
|[[1,Data,[Wrapped...|
+--------------------+
The XML format I have is shown below.
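(The XML itself did not survive in this copy of the question; reconstructed from the output and schema in Answer 1, it would look roughly like this:)
<hierarchy>
    <att>
        <Order>1</Order>
        <attval>Data</attval>
        <children>
            <att>
                <Order>1</Order>
                <attval>Studyval</attval>
            </att>
            <att>
                <Order>2</Order>
                <attval>Site</attval>
            </att>
        </children>
    </att>
    <att>
        <Order>2</Order>
        <attval>Info</attval>
        <children>
            <att>
                <Order>1</Order>
                <attval>age</attval>
            </att>
            <att>
                <Order>2</Order>
                <attval>gender</attval>
            </att>
        </children>
    </att>
</hierarchy>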
Answer 1:
hierarchy should be the rootTag and att should be the rowTag, as in
df = spark.read \
.format("com.databricks.spark.xml") \
.option("rootTag", "hierarchy") \
.option("rowTag", "att") \
.load("test.xml")
and you should get
+-----+------+----------------------------+
|Order|attval|children |
+-----+------+----------------------------+
|1 |Data |[[[1, Studyval], [2, Site]]]|
|2 |Info |[[[1, age], [2, gender]]] |
+-----+------+----------------------------+
and the schema
root
|-- Order: long (nullable = true)
|-- attval: string (nullable = true)
|-- children: struct (nullable = true)
| |-- att: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Order: long (nullable = true)
| | | |-- attval: string (nullable = true)
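If you then need one row per child element, you can explode the nested array. Here is a minimal PySpark sketch against the schema above (the flat/childOrder/childAttval names are just illustrative):

from pyspark.sql.functions import col, explode

# Explode children.att so each nested <att> becomes its own row,
# keeping the parent's Order and attval alongside it.
flat = df.select(
    "Order",
    "attval",
    explode("children.att").alias("child")
).select(
    "Order",
    "attval",
    col("child.Order").alias("childOrder"),
    col("child.attval").alias("childAttval")
)
flat.show(truncate=False)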
Find more information in the databricks spark-xml documentation.
Answer 2:
Databricks has released a new version for reading XML into a Spark DataFrame:
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-xml_2.12</artifactId>
    <version>0.6.0</version>
</dependency>
The input XML file I used in this example is available in the GitHub repository.
val df = spark.read
.format("com.databricks.spark.xml")
.option("rowTag", "person")
.xml("persons.xml")
Schema:
root
|-- _id: long (nullable = true)
|-- dob_month: long (nullable = true)
|-- dob_year: long (nullable = true)
|-- firstname: string (nullable = true)
|-- gender: string (nullable = true)
|-- lastname: string (nullable = true)
|-- middlename: string (nullable = true)
|-- salary: struct (nullable = true)
| |-- _VALUE: long (nullable = true)
| |-- _currency: string (nullable = true)
Output:
+---+---------+--------+---------+------+--------+----------+---------------+
|_id|dob_month|dob_year|firstname|gender|lastname|middlename| salary|
+---+---------+--------+---------+------+--------+----------+---------------+
| 1| 1| 1980| James| M| Smith| null| [10000, Euro]|
| 2| 6| 1990| Michael| M| null| Rose|[10000, Dollor]|
+---+---------+--------+---------+------+--------+----------+---------------+
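The text and currency attribute of <salary> (presumably something like <salary currency="Euro">10000</salary>) land in the _VALUE and _currency fields of the struct, so they can be selected directly; a small PySpark sketch:

from pyspark.sql.functions import col

# Pull the struct fields out as top-level columns.
df.select(
    "firstname",
    col("salary._VALUE").alias("salary"),
    col("salary._currency").alias("currency")
).show()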
Note that the Spark XML API has some limitations, which are discussed here: Spark-XML API Limitations.
Hope this helps!
Answer 3:
You can use the Databricks jar to parse the XML into a DataFrame. You can resolve the dependency with Maven or sbt, or you can pass the jar directly on the command line, as with pyspark here:
pyspark --jars /home/sandipan/Downloads/spark_jars/spark-xml_2.11-0.6.0.jar
df = spark.read \
.format("com.databricks.spark.xml") \
.option("rootTag", "SmsRecords") \
.option("rowTag", "sms") \
.load("/home/sandipan/Downloads/mySMS/Sms/backupinfo.xml")
Schema:
>>> df.printSchema()
root
|-- address: string (nullable = true)
|-- body: string (nullable = true)
|-- date: long (nullable = true)
|-- type: long (nullable = true)
>>> df.select("address").distinct().count()
530
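spark-xml can also write a DataFrame back out as XML with the same rootTag/rowTag options; a sketch (the output path here is made up):

# Write the parsed messages back out as XML (hypothetical output path).
df.write \
    .format("com.databricks.spark.xml") \
    .option("rootTag", "SmsRecords") \
    .option("rowTag", "sms") \
    .save("/tmp/sms_out")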
Follow this: http://www.thehadoopguy.com/2019/09/how-to-parse-xml-data-to-saprk-dataframe.html
Source: https://stackoverflow.com/questions/50429315/read-xml-in-spark