Why do multiple print() methods in Spark Streaming affect the values in my list?

Submitted by 痴心易碎 on 2019-12-14 04:18:58

Question


I'm trying to receive one JSON line every two seconds, store the lines in a List whose elements are instances of a custom class I created, and print the resulting List after each execution of the context. So I'm doing something like this:

JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));

JavaReceiverInputDStream<String> streamData = ssc.socketTextStream(args[0], Integer.parseInt(args[1]),
        StorageLevels.MEMORY_AND_DISK_SER);

// stream_Value is a LinkedList<StreamValue> declared outside this function
JavaDStream<LinkedList<StreamValue>> getdatatoplace = streamData.map(new Function<String, LinkedList<StreamValue>>() {
    @Override
    public LinkedList<StreamValue> call(String s) throws Exception {
        // Access specific attributes in the JSON
        Gson gson = new Gson();
        Type type = new TypeToken<Map<String, String>>() {}.getType();
        Map<String, String> retMap = gson.fromJson(s, type);
        String a = retMap.get("exp");
        String idValue = retMap.get("id");

        // Insert values into the stream_Value LinkedList
        stream_Value.push(new StreamValue(idValue, a, UUID.randomUUID()));
        return stream_Value;
    }
});

getdatatoplace.print();

This works very well, and I get the following result:

//at the end of the first batch duration/cycle
 getdatatoplace[]={json1}

//at the end of the second batch duration/cycle
 getdatatoplace[]={json1,json2}

...

However, if I do multiple prints of getdatatoplace, let's say 3:

 getdatatoplace.print();
 getdatatoplace.print();
 getdatatoplace.print();

then I get this result:

 //at the end of the first print
 getdatatoplace[]={json1}

//at the end of the second print
 getdatatoplace[]={json1,json1}

 //at the end of the third print
 getdatatoplace[]={json1,json1,json1}

//Context ends with getdatatoplace.size()=3

//New cycle begins, and I get a new value json2

 //at the end of the first print
 getdatatoplace[]={json1,json1,json1,json2}
...

So what happens is that, even though the batch duration hasn't ended yet, stream_Value.push runs once for every print that I do, so each print adds another copy of the value to my List.

My question is: why does this happen, and how do I make it so that, independently of the number of print() calls I use, just one JSON line is stored in my list per batch duration/execution?

I hope I was not confusing; I am new to Spark and may have mixed up some of the vocabulary. Thank you so much.

PS: Even if I print another DStream, the same thing happens. Say I do this, where each DStream has the same 'architecture' as the stream above:

JavaDStream1.print();
JavaDStream2.print();

At the end of JavaDStream2.print(), the list within JavaDStream1 has one extra value.


Answer 1:


Spark Streaming uses the same computation model as Spark. The operations we declare on the data form a Directed Acyclic Graph (DAG) that is evaluated when actions are used to materialize those computations on the data.

In Spark Streaming, output operations such as print() schedule the materialization of these operations at every batch interval.

The DAG for this Stream would look something like this:

[TextStream]->[map]->[print]

print will schedule the map operation on the data received by the socketTextStream. When we add more print actions, our DAG looks like:

            /->[map]->[print]
[TextStream] ->[map]->[print]
            \->[map]->[print] 

And here the issue should become visible. The map operation is executed three times. That's expected behavior and normally not an issue, because map is supposed to be a stateless transformation.
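
For contrast, a stateless version of the same map would emit exactly one new value per input record and never touch shared state. A minimal sketch, using Java 8 lambda syntax and the asker's StreamValue class:

JavaDStream<StreamValue> parsed = streamData.map(s -> {
    Gson gson = new Gson();
    Type type = new TypeToken<Map<String, String>>() {}.getType();
    Map<String, String> retMap = gson.fromJson(s, type);
    // One input record yields one output record. There is no shared
    // mutable state, so extra print() calls cannot duplicate entries.
    return new StreamValue(retMap.get("id"), retMap.get("exp"), UUID.randomUUID());
});

Running this map three times per batch (once per print) is wasteful but harmless: each run produces one output element per record, with nothing accumulating anywhere.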

The root cause of the problem here is that map contains a mutation operation, as it adds elements to a global collection stream_Value defined outside the scope of the function passed to map.

This not only causes the duplication issue; it also will not work in general when Spark Streaming runs in its usual cluster mode, because the closure is serialized and shipped to the executors, so each executor mutates its own copy of the list and the driver's collection is never updated.
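
If the goal is to gather each batch's values on the driver and print them once per batch interval, one common pattern is foreachRDD, which hands you the RDD of exactly one batch. A minimal sketch, assuming parsed is the stateless stream above and that batches are small enough to collect() to the driver:

parsed.foreachRDD(rdd -> {
    // Runs once per batch interval, no matter how many other output
    // operations exist; collect() pulls this batch's records to the driver.
    List<StreamValue> batchValues = rdd.collect();
    System.out.println("Batch contents: " + batchValues);
});

If values need to accumulate across batches instead, Spark Streaming's stateful primitives (updateStateByKey, or mapWithState in later versions) are the supported way to do that, rather than a driver-side collection mutated from inside a transformation.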



Source: https://stackoverflow.com/questions/37000348/why-do-multiple-print-methods-in-spark-streaming-affect-the-values-in-my-list
