I did come across a mini tutorial for data preprocessing with Spark here: http://ampcamp.berkeley.edu/big-data-mini-course/featurization.html
However, it does not cover reading XML input specifically.
It looks like Databricks has published an XML data source for Apache Spark:
https://github.com/databricks/spark-xml
It supports reading XML files by specifying a row tag, and it can infer the schema, e.g.:
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Each <book> element becomes one row; column names and types
// are inferred from the XML content.
val df = sqlContext.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .load("books.xml")
You can also use it from spark-shell by adding the package on the command line:
$ bin/spark-shell --packages com.databricks:spark-xml_2.11:0.3.0
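Note that the 0.3.0 artifact above targets the Spark 1.x / SQLContext API. On Spark 2.x and later, where SparkSession is the entry point, the equivalent read would look like this (a sketch; pick a spark-xml version that matches your Spark and Scala versions):

import org.apache.spark.sql.SparkSession

// SparkSession replaces SQLContext as the entry point in Spark 2.x+.
val spark = SparkSession.builder()
  .appName("xml-read-example")
  .getOrCreate()

// Same data source and options: each <book> element becomes a row.
val df = spark.read
  .format("com.databricks.spark.xml")
  .option("rowTag", "book")
  .load("books.xml")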