Question
I'm trying to receive one JSON line every two seconds, store them in a List whose elements belong to a custom class that I created, and print the resulting List after each execution of the context. So I'm doing something like this:
JavaStreamingContext ssc = new JavaStreamingContext(sparkConf, Durations.seconds(2));

JavaReceiverInputDStream<String> streamData = ssc.socketTextStream(args[0], Integer.parseInt(args[1]),
        StorageLevels.MEMORY_AND_DISK_SER);
JavaDStream<LinkedList<StreamValue>> getdatatoplace = streamData.map(new Function<String, LinkedList<StreamValue>>() {
    @Override
    public LinkedList<StreamValue> call(String s) throws Exception {
        // Access specific attributes in the JSON
        Gson gson = new Gson();
        Type type = new TypeToken<Map<String, String>>() {}.getType();
        Map<String, String> retMap = gson.fromJson(s, type);

        String a = retMap.get("exp");
        String idValue = retMap.get("id");

        // Insert the values into the stream_Value LinkedList (defined outside this function)
        stream_Value.push(new StreamValue(idValue, a, UUID.randomUUID()));
        return stream_Value;
    }
});
getdatatoplace.print();
This works very well, and I get the following result:
//at the end of the first batch duration/cycle
getdatatoplace[]={json1}
//at the end of the second batch duration/cycle
getdatatoplace[]={json1,json2}
...
However, if I do multiple prints of getdatatoplace, let's say 3:
getdatatoplace.print();
getdatatoplace.print();
getdatatoplace.print();
then I get this result:
//at the end of the first print
getdatatoplace[]={json1}
//at the end of the second print
getdatatoplace[]={json1,json1}
//at the end of the third print
getdatatoplace[]={json1,json1,json1}
//Context ends with getdatatoplace.size()=3
//New cycle begins, and I get a new value json2
//at the end of the first print
getdatatoplace[]={json1,json1,json1,json2}
...
So what happens is that, even though I do stream_Value.push beforehand, and the commands I gave in my batch duration haven't ended yet, stream_Value pushes a value onto my List for every print that I do.
My question is: why does this happen, and how do I make it so that, independently of the number of print() methods I use, I get just one JSON line stored in my list per batch duration/per execution?
I hope this was not confusing, as I am new to Spark and may have mixed up some of the vocabulary. Thank you so much.
PS: Even if I print another DStream, the same thing happens. Say I do this, each with the same 'architecture' as the stream above:
JavaDStream1.print();
JavaDStream2.print();
At the end of JavaDStream2.print(), the list within JavaDStream1 has one extra value.
Answer 1:
Spark Streaming uses the same computation model as Spark: the operations we declare on the data form a Directed Acyclic Graph (DAG) that is evaluated when actions are used to materialize those computations on the data.
In Spark Streaming, output operations such as print() schedule the materialization of those declared operations at every batch interval.
The DAG for this Stream would look something like this:
[TextStream]->[map]->[print]
print will schedule the map operation on the data received by socketTextStream. When we add more print actions, our DAG looks like:
             /->[map]->[print]
[TextStream] ->[map]->[print]
             \->[map]->[print]
And here the issue becomes visible: the map operation is executed three times. That's expected behavior and normally not an issue, because map is supposed to be a stateless transformation.
The root cause of the problem is that map contains a mutating operation: it adds elements to the global collection stream_Value, which is defined outside the scope of the function passed to map.
This not only causes the duplication issue, but will also not work in general when Spark Streaming runs in its usual cluster mode: the closure passed to map is serialized and executed on remote executors, so each executor mutates its own copy of stream_Value, and the driver's list never sees those changes.
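One way to keep the map from being re-executed once per output operation is to cache the DStream, so that each print of the same batch reuses the result that was already computed instead of re-running the map lineage. A minimal sketch, using the standard DStream cache() call on the question's own stream:

// Compute each batch once and reuse it for every output operation,
// so the function passed to map runs a single time per interval.
getdatatoplace.cache();
getdatatoplace.print();
getdatatoplace.print();
getdatatoplace.print();

This addresses the repeated execution, but not the underlying mutation problem.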
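To remove the root cause itself, the map can be made stateless: return one StreamValue per JSON line and let Spark carry the per-batch results, instead of pushing into a shared LinkedList. A minimal sketch, reusing the streamData stream, the StreamValue class, and the Gson/TypeToken setup from the question:

// Stateless: each input line is mapped to exactly one StreamValue,
// and no collection outside the function is mutated.
JavaDStream<StreamValue> values = streamData.map(new Function<String, StreamValue>() {
    @Override
    public StreamValue call(String s) throws Exception {
        Gson gson = new Gson();
        Type type = new TypeToken<Map<String, String>>() {}.getType();
        Map<String, String> retMap = gson.fromJson(s, type);
        return new StreamValue(retMap.get("id"), retMap.get("exp"), UUID.randomUUID());
    }
});

values.cache();  // safe to attach several output operations per batch
values.print();

Each batch then contains exactly the JSON lines received during that interval, no matter how many output operations are attached. If the values really must be accumulated across batches on the driver, that should be done explicitly, for example inside a foreachRDD call, rather than by mutating a global list from within map.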
Source: https://stackoverflow.com/questions/37000348/why-do-multiple-print-methods-in-spark-streaming-affect-the-values-in-my-list