How to read XML files from apache spark framework?

后端 未结 2 710
梦毁少年i
梦毁少年i 2021-02-08 16:02

I did come across a mini tutorial for data preprocessing using spark here: http://ampcamp.berkeley.edu/big-data-mini-course/featurization.html

However, this discusses on

2条回答
  •  猫巷女王i
    2021-02-08 16:58

    I have not used it myself, but the way would be same as you do it for hadoop. For example you can use StreamXmlRecordReader and process the xmls. The reason you need a record reader is you would like to control the record boundries for each element processed otherwise the default used would process line because it uses LineRecordReader. It would be helpful to get yourself more familiar with concept of recordReader in hadoop.

    And ofcourse you will have to use SparkContext's hadoopRDD or hadoopFile methods with option to pass a InputFormatClass. Incase java is your preferred language, similar alternatives exist.

提交回复
热议问题