Question:
I am trying to read XML / nested XML in PySpark using the spark-xml jar:
df = sqlContext.read \
.format("com.databricks.spark.xml") \
.option("rowTag", "hierachy") \
.load("test.xml")
When I execute this, the DataFrame is not created properly:
+--------------------+
| att|
+--------------------+
|[[1,Data,[Wrapped...|
+--------------------+
The XML format I have is shown below.
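(The XML itself did not survive in this copy of the question; reconstructed from the output and schema in Answer 1, it would look roughly like this:)
<hierarchy>
    <att>
        <Order>1</Order>
        <attval>Data</attval>
        <children>
            <att>
                <Order>1</Order>
                <attval>Studyval</attval>
            </att>
            <att>
                <Order>2</Order>
                <attval>Site</attval>
            </att>
        </children>
    </att>
    <att>
        <Order>2</Order>
        <attval>Info</attval>
        <children>
            <att>
                <Order>1</Order>
                <attval>age</attval>
            </att>
            <att>
                <Order>2</Order>
                <attval>gender</attval>
            </att>
        </children>
    </att>
</hierarchy>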
Answer 1:
hierarchy should be the rootTag and att should be the rowTag, as in
df = spark.read \
.format("com.databricks.spark.xml") \
.option("rootTag", "hierarchy") \
.option("rowTag", "att") \
.load("test.xml")
and you should get
+-----+------+----------------------------+
|Order|attval|children |
+-----+------+----------------------------+
|1 |Data |[[[1, Studyval], [2, Site]]]|
|2 |Info |[[[1, age], [2, gender]]] |
+-----+------+----------------------------+
and the schema
root
|-- Order: long (nullable = true)
|-- attval: string (nullable = true)
|-- children: struct (nullable = true)
| |-- att: array (nullable = true)
| | |-- element: struct (containsNull = true)
| | | |-- Order: long (nullable = true)
| | | |-- attval: string (nullable = true)
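If you then need one row per child element, you can explode the nested array. Here is a minimal PySpark sketch against the schema above (the flat/childOrder/childAttval names are just illustrative):

from pyspark.sql.functions import col, explode

# Explode children.att so each nested <att> becomes its own row,
# keeping the parent's Order and attval alongside it.
flat = df.select(
    "Order",
    "attval",
    explode("children.att").alias("child")
).select(
    "Order",
    "attval",
    col("child.Order").alias("childOrder"),
    col("child.attval").alias("childAttval")
)
flat.show(truncate=False)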
Find more information in the databricks spark-xml documentation.
Answer 2:
Databricks has released a new version for reading XML into a Spark DataFrame:
<dependency>
    <groupId>com.databricks</groupId>
    <artifactId>spark-xml_2.12</artifactId>
    <version>0.6.0</version>
</dependency>
The input XML file I used in this example is available in the GitHub repository.
val df = spark.read
.format("com.databricks.spark.xml")
.option("rowTag", "person")
.xml("persons.xml")
Schema:
root
|-- _id: long (nullable = true)
|-- dob_month: long (nullable = true)
|-- dob_year: long (nullable = true)
|-- firstname: string (nullable = true)
|-- gender: string (nullable = true)
|-- lastname: string (nullable = true)
|-- middlename: string (nullable = true)
|-- salary: struct (nullable = true)
| |-- _VALUE: long (nullable = true)
| |-- _currency: string (nullable = true)
Output:
+---+---------+--------+---------+------+--------+----------+---------------+
|_id|dob_month|dob_year|firstname|gender|lastname|middlename| salary|
+---+---------+--------+---------+------+--------+----------+---------------+
| 1| 1| 1980| James| M| Smith| null| [10000, Euro]|
| 2| 6| 1990| Michael| M| null| Rose|[10000, Dollor]|
+---+---------+--------+---------+------+--------+----------+---------------+
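The text and currency attribute of <salary> (presumably something like <salary currency="Euro">10000</salary>) land in the _VALUE and _currency fields of the struct, so they can be selected directly; a small PySpark sketch:

from pyspark.sql.functions import col

# Pull the struct fields out as top-level columns.
df.select(
    "firstname",
    col("salary._VALUE").alias("salary"),
    col("salary._currency").alias("currency")
).show()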
Note that the Spark XML API has some limitations, which are discussed here: Spark-XML API Limitations.
Hope this helps!
Answer 3:
You can use the Databricks jar to parse the XML into a DataFrame. You can resolve the dependency with Maven or sbt, or you can pass the jar directly on the command line, as with pyspark here:
pyspark --jars /home/sandipan/Downloads/spark_jars/spark-xml_2.11-0.6.0.jar
df = spark.read \
.format("com.databricks.spark.xml") \
.option("rootTag", "SmsRecords") \
.option("rowTag", "sms") \
.load("/home/sandipan/Downloads/mySMS/Sms/backupinfo.xml")
Schema:
>>> df.printSchema()
root
|-- address: string (nullable = true)
|-- body: string (nullable = true)
|-- date: long (nullable = true)
|-- type: long (nullable = true)
>>> df.select("address").distinct().count()
530
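spark-xml can also write a DataFrame back out as XML with the same rootTag/rowTag options; a sketch (the output path here is made up):

# Write the parsed messages back out as XML (hypothetical output path).
df.write \
    .format("com.databricks.spark.xml") \
    .option("rootTag", "SmsRecords") \
    .option("rowTag", "sms") \
    .save("/tmp/sms_out")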
Follow this: http://www.thehadoopguy.com/2019/09/how-to-parse-xml-data-to-saprk-dataframe.html
Source: https://stackoverflow.com/questions/50429315/read-xml-in-spark