How to read XML files with the Apache Spark framework?

深忆病人 2021-02-08 16:25

I came across a mini tutorial on data preprocessing with Spark here: http://ampcamp.berkeley.edu/big-data-mini-course/featurization.html

However, this discusses on

3 Answers
  •  南笙 (OP)
    2021-02-08 16:50

    It looks like somebody made an XML data source for Apache Spark:

    https://github.com/databricks/spark-xml

    It reads XML files by treating each element with a given tag as a row, and it infers the column types, e.g.:

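    import org.apache.spark.sql.SQLContext
    
    // `sc` is an existing SparkContext (spark-shell creates one for you).
    val sqlContext = new SQLContext(sc)
    // Each <book> element becomes one row of the DataFrame.
    val df = sqlContext.read
        .format("com.databricks.spark.xml")
        .option("rowTag", "book")
        .load("books.xml")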
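
    Outside spark-shell there is no ready-made `sc`, so a standalone program has to create it first. The following is only a minimal sketch under some assumptions: it runs in local mode, the spark-xml package is on the classpath, `books.xml` is in the working directory, and each `<book>` element has `title` and `price` children. Those two field names and the object name `ReadBooksXml` are hypothetical and not taken from the answer.

    ```scala
    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object ReadBooksXml {
      def main(args: Array[String]): Unit = {
        // Local mode is an assumption; on a cluster the master is usually set by spark-submit.
        val conf = new SparkConf().setAppName("ReadBooksXml").setMaster("local[*]")
        val sc = new SparkContext(conf)
        val sqlContext = new SQLContext(sc)

        // Same read as in the answer: one DataFrame row per <book> element.
        val df = sqlContext.read
          .format("com.databricks.spark.xml")
          .option("rowTag", "book")
          .load("books.xml")

        // Print the inferred schema to see which columns and types were detected.
        df.printSchema()

        // "title" and "price" are assumed field names; replace them with columns
        // that actually appear in the printed schema.
        df.select("title", "price").show()

        sc.stop()
      }
    }
    ```

    Calling `printSchema()` before selecting anything is a quick way to check what the type inference produced for your own file.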
    

    You can also use it with spark-shell as below:

    $ bin/spark-shell --packages com.databricks:spark-xml_2.11:0.3.0
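
    If you build an application with sbt instead of using spark-shell (an assumption, the answer only covers the shell), the same Maven coordinates can be declared as a dependency. This is a sketch of a `build.sbt` fragment; the exact Scala patch version is an assumption, and only the 2.11 line matters so that `%%` resolves to `spark-xml_2.11`.

    ```scala
    // build.sbt (fragment)
    // Assumed Scala version; any 2.11.x release resolves to the _2.11 artifact.
    scalaVersion := "2.11.8"

    // Same artifact as in the --packages flag: com.databricks:spark-xml_2.11:0.3.0
    libraryDependencies += "com.databricks" %% "spark-xml" % "0.3.0"
    ```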
    
