Question
I am trying to add the max and min to each tuple of each RDD in a Spark DStream. I wrote the following code, but I can't work out how to pass the parameters min and max through to the mapping function. Can anyone suggest a way to do this transformation? I tried the following:
JavaPairDStream<Tuple2<Long, Integer>, Tuple3<Integer,Long,Long>> sortedtsStream = transformedMaxMintsStream.transformToPair(new Sort2());
class MinMax implements Function<JavaPairRDD<Tuple2<Long, Integer>, Integer>,
        JavaPairRDD<Tuple2<Long, Integer>, Tuple3<Integer, Long, Long>>> {
    Long max;
    Long min;

    @Override
    public JavaPairRDD<Tuple2<Long, Integer>, Tuple3<Integer, Long, Long>> call(
            JavaPairRDD<Tuple2<Long, Integer>, Integer> input) throws Exception {
        JavaPairRDD<Tuple2<Long, Integer>, Tuple3<Integer, Long, Long>> output;
        max = input.max(new CMP1())._1._1;
        min = input.min(new CMP1())._1._1;
        output = input.mapToPair(new maptoMinMax());
        return output;
    }

    class maptoMinMax implements PairFunction<Tuple2<Tuple2<Long, Integer>, Integer>,
            Tuple2<Long, Integer>, Tuple3<Integer, Long, Long>> {
        @Override
        public Tuple2<Tuple2<Long, Integer>, Tuple3<Integer, Long, Long>> call(
                Tuple2<Tuple2<Long, Integer>, Integer> tuple2IntegerTuple2) throws Exception {
            return new Tuple2<Tuple2<Long, Integer>, Tuple3<Integer, Long, Long>>(
                    new Tuple2<Long, Integer>(tuple2IntegerTuple2._1._1, tuple2IntegerTuple2._1._2),
                    new Tuple3<Integer, Long, Long>(tuple2IntegerTuple2._2, max, min));
        }
    }
}
I get the following error; essentially it seems that the min and max methods for JavaPairRDD were not found:
15/06/18 11:05:06 INFO BlockManagerInfo: Added input-0-1434639906000 in memory on localhost:42829 (size: 464.0 KB, free: 264.9 MB)
15/06/18 11:05:06 INFO BlockGenerator: Pushed block input-0-1434639906000
Exception in thread "JobGenerator" java.lang.NoSuchMethodError: org.apache.spark.api.java.JavaPairRDD.max(Ljava/util/Comparator;)Lscala/Tuple2;
at org.necla.ngla.spark_streaming.MinMax.call(Type4ViolationChecker.java:346)
at org.necla.ngla.spark_streaming.MinMax.call(Type4ViolationChecker.java:340)
at org.apache.spark.streaming.api.java.JavaDStreamLike$class.scalaTransform$3(JavaDStreamLike.scala:360)
at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$transformToPair$1.apply(JavaDStreamLike.scala:361)
at org.apache.spark.streaming.api.java.JavaDStreamLike$$anonfun$transformToPair$1.apply(JavaDStreamLike.scala:361)
at org.apache.spark.streaming.dstream.DStream$$anonfun$transform$1$$anonfun$apply$21.apply(DStream.scala:654)
at org.apache.spark.streaming.dstream.DStream$$anonfun$transform$1$$anonfun$apply$21.apply(DStream.scala:654)
at org.apache.spark.streaming.dstream.DStream$$anonfun$transform$2$$anonfun$5.apply(DStream.scala:668)
at org.apache.spark.streaming.dstream.DStream$$anonfun$transform$2$$anonfun$5.apply(DStream.scala:666)
at org.apache.spark.streaming.dstream.TransformedDStream.compute(TransformedDStream.scala:41)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1$$anonfun$apply$7.apply(DStream.scala:350)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStream.scala:349)
at org.apache.spark.streaming.dstream.DStream$$anonfun$getOrCompute$1$$anonfun$1.apply(DStrea
Answer 1:
We can use rdd.transform to apply several operations to the same RDD and arrive at our result for each batch interval, then attach that result to each tuple (as the question specifies):
data.transform { rdd =>
  val mx = rdd.map(x => (x, x)).reduce { case ((x1, x2), (y1, y2)) => ((x1 min y1), (x2 max y2)) }
  rdd.map(elem => (elem, mx))
}
For each batch interval this produces an RDD of pairs like the following (with random numbers between 1 and 999 inclusive):
(258,(0,998)) (591,(0,998)) ...
The Java version is semantically identical, but considerably more verbose due to all of those Tuple<...> objects.
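Without a Spark cluster at hand, the core of that transform (a componentwise min/max reduce over the batch, followed by a map that tags every element with the result) can be sketched in plain Java using streams. The batch contents below are stand-in values, not real DStream data; in actual Spark code the same shape would use JavaRDD.map/reduce with Tuple2 instead of int[]:

```java
import java.util.Arrays;
import java.util.List;

public class MinMaxSketch {
    public static void main(String[] args) {
        // Stand-in for one batch interval's RDD contents.
        List<Integer> batch = Arrays.asList(258, 591, 7, 998);

        // Mirrors rdd.map(x => (x, x)).reduce { case ((x1,x2),(y1,y2)) =>
        //   ((x1 min y1), (x2 max y2)) }: pair each value with itself, then
        // fold pairwise, keeping the componentwise min and max.
        int[] mx = batch.stream()
                .map(x -> new int[]{x, x})
                .reduce((a, b) -> new int[]{Math.min(a[0], b[0]), Math.max(a[1], b[1])})
                .get();

        // Mirrors rdd.map(elem => (elem, mx)): tag every element with (min, max).
        batch.forEach(elem ->
                System.out.println("(" + elem + ",(" + mx[0] + "," + mx[1] + "))"));
        // prints lines like (258,(7,998))
    }
}
```

Because the (min, max) pair is computed inside the per-batch closure rather than stored as a field on the function object, there is no stale-state problem across batches, which is what the question's MinMax class was running into.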
Source: https://stackoverflow.com/questions/30902090/adding-max-and-min-in-spark-stream-in-java