Not Serializable exception when integrating Spark SQL and Spark Streaming

Submitted by 点点圈 on 2019-12-22 12:27:39

Question


This is my source code, in which I'm receiving a continuous stream of data from the server side. For each RDD I apply a SQL schema, and once the table is created I try to select something from the resulting DStream.

    List<String> males = new ArrayList<String>();
    JavaDStream<String> data = streamingContext.socketTextStream("localhost", port);
    data.print();
    System.out.println("Socket connection established to read data from Subscriber Server");

    JavaDStream<SubscriberData> streamData = data
            .map(new Function<String, SubscriberData>() {
                public SubscriberData call(String record) {
                    String[] stringArray = record.split(",");
                    SubscriberData subscriberData = new SubscriberData();
                    subscriberData.setMsisdn(stringArray[0]);
                    subscriberData.setSubscriptionType(stringArray[1]);
                    subscriberData.setName(stringArray[2]);
                    subscriberData.setGender(stringArray[3]);
                    subscriberData.setProfession(stringArray[4]);
                    subscriberData.setMaritalStatus(stringArray[5]);
                    return subscriberData;
                }
            });

    streamData.foreachRDD(new Function<JavaRDD<SubscriberData>, Void>() {
        public Void call(JavaRDD<SubscriberData> rdd) {
            JavaSQLContext sqlContext = new JavaSQLContext(sc);
            JavaSchemaRDD subscriberSchema = sqlContext.applySchema(rdd, SubscriberData.class);
            subscriberSchema.registerAsTable("SUBSCRIBER_DIMENSION");
            System.out.println("all data");

            JavaSchemaRDD names = sqlContext.sql("SELECT msisdn FROM SUBSCRIBER_DIMENSION WHERE GENDER='Male'");
            System.out.println("afterwards");

            List<String> males = names.map(new Function<Row, String>() {
                public String call(Row row) {
                    return row.getString(0);
                }
            }).collect();

            System.out.println("before for");
            for (String name : males) {
                System.out.println(name);
            }
            return null;
        }
    });

    streamingContext.start();

But it throws this serialization exception, even though the classes I'm using do implement Serializable.

    14/11/06 12:55:20 ERROR scheduler.JobScheduler: Error running job streaming job 1415258720000 ms.1
 org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:166)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:158)
at org.apache.spark.SparkContext.clean(SparkContext.scala:1242)
at org.apache.spark.rdd.RDD.map(RDD.scala:270)
at org.apache.spark.api.java.JavaRDDLike$class.map(JavaRDDLike.scala:75)
at org.apache.spark.sql.api.java.JavaSchemaRDD.map(JavaSchemaRDD.scala:42)
at com.hp.tbda.rta.SubscriberClient$2.call(SubscriberClient.java:206)
at com.hp.tbda.rta.SubscriberClient$2.call(SubscriberClient.java:1)
at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:274)
at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$foreachRDD$1.apply(JavaDStreamLike.scala:274)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1.apply(DStream.scala:527)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:41)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:172)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.NotSerializableException: org.apache.spark.api.java.JavaSparkContext
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1183)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.defaultWriteFields(ObjectOutputStream.java:1547)
at java.io.ObjectOutputStream.writeSerialData(ObjectOutputStream.java:1508)
at java.io.ObjectOutputStream.writeOrdinaryObject(ObjectOutputStream.java:1431)
at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1177)
at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:42)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:73)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:164)
... 20 more

Answer 1:


SparkContext is not serializable: it lives only on the driver and must NOT be captured in any closure that Spark ships to executors. Here the JavaSQLContext is constructed from sc inside foreachRDD, so serializing that closure drags the JavaSparkContext along with it. I'm afraid support for SQL on Spark Streaming is only at the research stage at the moment; see this presentation from the Spark Summit for the details.
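The root cause can be reproduced with plain JDK serialization, independently of Spark: serializing a closure serializes everything it captures, and the attempt fails as soon as any captured object is not Serializable. This is a minimal sketch (the FakeContext class is a made-up stand-in for JavaSparkContext):

```java
import java.io.ByteArrayOutputStream;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class ClosureCaptureDemo {
    // Stand-in for JavaSparkContext: deliberately NOT Serializable.
    static class FakeContext { }

    public static void main(String[] args) throws Exception {
        FakeContext ctx = new FakeContext();

        // A Serializable closure that (accidentally) captures ctx,
        // just as the foreachRDD function in the question captures sc.
        Runnable task = (Runnable & Serializable) () -> System.out.println(ctx);

        try {
            new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(task);
            System.out.println("serialized OK");
        } catch (NotSerializableException e) {
            // Spark's ClosureCleaner surfaces exactly this failure;
            // the message names the offending captured class.
            System.out.println("NotSerializableException: " + e.getMessage());
        }
    }
}
```

This is why implementing Serializable on SubscriberData alone does not help: the closure itself is checked, together with its whole captured object graph.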

To create the intended RDD of male subscribers' ids, you can use filter and map instead of SQL:

maleSubscribers = subscribers.filter(subsc => subsc.getGender == "Male")
                             .map(subsc => subsc.getMsisdn)


Source: https://stackoverflow.com/questions/26774046/not-serializable-exception-when-integrating-spark-sql-and-spark-streaming
