javax.xml.stream.XMLStreamException: Trying to output second root Spark-XML Spark Program

Submitted by 空扰寡人 on 2019-12-11 06:03:21

Question


I am trying to run this small spark-xml example, and it fails with an exception when I run it with spark-submit.

Sample REPO : https://github.com/punithmailme/spark-xml-new

Command: ./dse spark-submit --class MainDriver /Users/praj3/Desktop/projects/spark/main/build/libs/main.jar

import java.io.Serializable;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import lombok.Builder;
import lombok.Data;
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SQLContext;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

public class MainDriver {

  public static void main(String[] args) {
    SparkConf sparkConf = new SparkConf();
    sparkConf.setAppName("Unit Test");
    sparkConf.setMaster("local[2]");

    JavaSparkContext javaSparkContext = new JavaSparkContext(sparkConf);
    SQLContext sqlCon = new SQLContext(javaSparkContext);

    final JavaRDD<Book> parallelize = javaSparkContext
        .parallelize(Arrays.asList(Book.builder().name("2341").test("34").build(),
            Book.builder().name("2341").test("34").build()));

    final JavaRDD<Row> map = parallelize.map(book -> RowFactory.create(
        book.getName(),
        book.getTest()
    ));

    List<StructField> fields = new ArrayList<>();
    fields.add(new StructField("Name", DataTypes.StringType, true, Metadata.empty()));
    fields.add(new StructField("Test", DataTypes.StringType, true, Metadata.empty()));
    final StructType structType = DataTypes.createStructType(fields);

    final Dataset<Row> dataFrame = sqlCon.createDataFrame(map, structType);

    dataFrame
        .repartition(1)
        .write()
        .format("com.databricks.spark.xml")
        .mode(SaveMode.Overwrite)
        .option("rootTag", "n:Brands")
        .option("rowTag", "n:Brand")
        .save("new");

  }



}


@Data
@Builder
class Book implements Serializable {

  private final String name;
  private final String test;
}
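
For reference, the layout those writer options aim to produce is one `n:Brands` root per output part file, with one `n:Brand` element per row. A sketch of the intended output, with element values taken from the two sample rows above (note that the `n:` prefix is used without any `xmlns:n` declaration):

```xml
<n:Brands>
    <n:Brand>
        <Name>2341</Name>
        <Test>34</Test>
    </n:Brand>
    <n:Brand>
        <Name>2341</Name>
        <Test>34</Test>
    </n:Brand>
</n:Brands>
```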

Exception:

Caused by: javax.xml.stream.XMLStreamException: Trying to output second root, <n:Brand>
    at com.ctc.wstx.sw.BaseStreamWriter.throwOutputError(BaseStreamWriter.java:1537) ~[woodstox-core-asl-4.4.1.jar:4.4.1]
    at com.ctc.wstx.sw.BaseStreamWriter.throwOutputError(BaseStreamWriter.java:1544) ~[woodstox-core-asl-4.4.1.jar:4.4.1]
    at com.ctc.wstx.sw.BaseStreamWriter.reportNwfStructure(BaseStreamWriter.java:1572) ~[woodstox-core-asl-4.4.1.jar:4.4.1]
    at com.ctc.wstx.sw.BaseNsStreamWriter.checkStartElement(BaseNsStreamWriter.java:469) ~[woodstox-core-asl-4.4.1.jar:4.4.1]
    at com.ctc.wstx.sw.BaseNsStreamWriter.writeStartElement(BaseNsStreamWriter.java:290) ~[woodstox-core-asl-4.4.1.jar:4.4.1]
    at com.sun.xml.internal.txw2.output.DelegatingXMLStreamWriter.writeStartElement(DelegatingXMLStreamWriter.java:45) ~[na:1.8.0_144]
    at com.sun.xml.internal.txw2.output.IndentingXMLStreamWriter.writeStartElement(IndentingXMLStreamWriter.java:148) ~[na:1.8.0_144]
    at com.databricks.spark.xml.parsers.StaxXmlGenerator$.apply(StaxXmlGenerator.scala:128) ~[main.jar:na]
    at com.databricks.spark.xml.util.XmlFile$$anonfun$1$$anon$1.next(XmlFile.scala:108) ~[main.jar:na]
    at com.databricks.spark.xml.util.XmlFile$$anonfun$1$$anon$1.next(XmlFile.scala:96) ~[main.jar:na]
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:410) ~[scala-library-2.11.11.jar:na]
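
The exception comes from Woodstox enforcing well-formedness: a StAX writer allows exactly one root element per document, and spark-xml's generator is meant to write the rootTag once per partition file and nest every row element inside it. A minimal standalone StAX sketch of that per-partition shape (the class name and element names here are illustrative, not taken from spark-xml internals):

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamWriter;

public class BrandsXmlSketch {

    // Writes one root element wrapping one row element per record,
    // mirroring the shape spark-xml's StaxXmlGenerator produces per partition.
    static String render() throws Exception {
        StringWriter out = new StringWriter();
        XMLStreamWriter w = XMLOutputFactory.newInstance().createXMLStreamWriter(out);
        w.writeStartElement("Brands");              // single root per file
        for (String name : new String[]{"2341", "2341"}) {
            w.writeStartElement("Brand");           // one element per row
            w.writeStartElement("Name");
            w.writeCharacters(name);
            w.writeEndElement();                    // </Name>
            w.writeEndElement();                    // </Brand>
        }
        w.writeEndElement();                        // </Brands>
        w.flush();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(render());
    }
}
```

If a writer instead emits a start element after the single root has already been closed, Woodstox rejects it with exactly the "Trying to output second root" error seen in the trace above.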

Environment and dependencies: DataStax Enterprise 5.1.8 on Mac, with the dependencies below.

compile(
        [group: 'org.apache.spark', name: 'spark-sql_2.11', version: '2.1.2'],
        [group: 'org.projectlombok', name: 'lombok', version: lombokVersion],
        [group: 'com.datastax.spark', name: 'spark-cassandra-connector_2.11', version: '2.0.7'],
        [group: 'com.fasterxml.jackson.module', name: 'jackson-module-scala_2.11', version: '2.9.5'],
        [group: 'com.fasterxml.jackson.core', name: 'jackson-core', version: '2.9.5'],
        [group: 'com.fasterxml.jackson.core', name: 'jackson-databind', version: '2.9.5'],
        [group: 'com.databricks', name: 'spark-csv_2.11', version: '1.5.0'],
        [group: 'com.databricks', name: 'spark-xml_2.11', version: '0.4.1']
)

compile(group: 'org.apache.spark', name: 'spark-core_2.11', version: '2.1.2') {
    exclude group: 'org.slf4j', module: 'slf4j-log4j12' // because of a log4j/slf4j conflict
}

DSE 5.1.8 Components

  • Apache Cassandra™ 3.11.1.2261
  • Apache Solr™ 6.0.1.0.2224
  • Apache Spark™ 2.0.2.17
  • DSE Java Driver 1.2.6
  • Spark Jobserver 0.6.2.237

When I run this as a plain main method in a single JVM it works; it only fails under spark-submit.


Answer 1:


I've tried the sample code on a YARN cluster, and it works fine on AWS EMR with S3.

https://github.com/mkanchwala/spark-databricks-example

Please give it a try and let me know.



Source: https://stackoverflow.com/questions/50131641/javax-xml-stream-xmlstreamexception-trying-to-output-second-root-spark-xml-spar
