How to read XML files with the Apache Spark framework?

梦毁少年i 2021-02-08 16:02

I came across a mini tutorial on data preprocessing with Spark here: http://ampcamp.berkeley.edu/big-data-mini-course/featurization.html

However, this discusses on

2 Answers
  •  [愿得一人]
    2021-02-08 16:40

    It looks like somebody has made an XML data source for Apache Spark:

    https://github.com/databricks/spark-xml

    It supports reading XML files by specifying the tag that marks each row, and it infers column types, e.g.:

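    import org.apache.spark.sql.SQLContext
    
    // `sc` is the SparkContext (spark-shell creates it for you)
    val sqlContext = new SQLContext(sc)
    // Each <book> element in books.xml becomes one row of the DataFrame
    val df = sqlContext.read
        .format("com.databricks.spark.xml")
        .option("rowTag", "book")
        .load("books.xml")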
    

    You can also use it with spark-shell as below:

    $ bin/spark-shell --packages com.databricks:spark-xml_2.11:0.3.0
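
    To make the rowTag option more concrete, here is a minimal sketch meant to be pasted into a spark-shell started with the package as above (using :paste mode so the chained calls are read as one statement). The sample XML, the element names catalog, title and price, and the temporary file are all made up for illustration; the sketch also assumes Spark can read from the local filesystem.

    ```scala
    import java.nio.charset.StandardCharsets
    import java.nio.file.Files
    import org.apache.spark.sql.SQLContext

    // Hypothetical sample data: two <book> rows inside a <catalog> root element.
    val xml =
      """<catalog>
        |  <book><title>First</title><price>10.5</price></book>
        |  <book><title>Second</title><price>20.0</price></book>
        |</catalog>""".stripMargin

    // Write the sample to a temporary file so the example is self-contained.
    val path = Files.createTempFile("books", ".xml")
    Files.write(path, xml.getBytes(StandardCharsets.UTF_8))

    // `sc` is the SparkContext that spark-shell provides.
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.read
      .format("com.databricks.spark.xml")
      .option("rowTag", "book")      // one DataFrame row per <book> element
      .load(path.toUri.toString)     // file:// URI, assuming a local filesystem

    df.printSchema()                      // show the inferred column types
    df.select("title", "price").show()    // child elements become columns
    ```

    The child elements of each row tag become the DataFrame's columns, so once the file is loaded you can work with it like any other DataFrame.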
    
