问题
I look for a way to retrieve the first elements of a DStream created as:
val dstream = ssc.textFileStream(args(1)).map(x => x.split(",").map(_.toDouble))
Unfortunately, there is no take function (as on RDD) on a dstream //dstream.take(2) !!!
Could someone has any idea on how to do it ?! thanks
回答1:
You can use transform method in the DStream object then take n elements of the input RDD and save it to a list, then filter the original RDD to be contained in this list. This will return a new DStream contains n elements.
val n = 10
val partOfResult = dstream.transform(rdd => {
val list = rdd.take(n)
rdd.filter(list.contains)
})
partOfResult.print
回答2:
The previous suggested solution did not compile for me as the take() method returns an Array, which is not serializable thus Spark streaming will fail with a java.io.NotSerializableException.
A simple variation on the previous code that worked for me:
val n = 10
val partOfResult = dstream.transform(rdd => {
rdd.filter(rdd.take(n).toList.contains)
})
partOfResult.print
回答3:
Sharing a java based solution that is working for me. The idea is to use a custom function, which can send the top row from a sorted RDD.
someData.transform(
rdd ->
{
JavaRDD<CryptoDto> result =
rdd.keyBy(Recommendations.volumeAsKey)
.sortByKey(new CryptoComparator()).values().zipWithIndex()
.map(row ->{
CryptoDto purchaseCrypto = new CryptoDto();
purchaseCrypto.setBuyIndicator(row._2 + 1L);
purchaseCrypto.setName(row._1.getName());
purchaseCrypto.setVolume(row._1.getVolume());
purchaseCrypto.setProfit(row._1.getProfit());
purchaseCrypto.setClose(row._1.getClose());
return purchaseCrypto;
}
).filter(Recommendations.selectTopinSortedRdd);
return result;
}).print();
The custom function selectTopinSortedRdd
looks like below:
public static Function<CryptoDto, Boolean> selectTopInSortedRdd = new Function<CryptoDto, Boolean>() {
private static final long serialVersionUID = 1L;
@Override
public Boolean call(CryptoDto value) throws Exception {
if (value.getBuyIndicator() == 1L) {
System.out.println("Value of buyIndicator :" + value.getBuyIndicator());
return true;
}
else {
return false;
}
}
};
It basically compares all incoming elements, and returns true
only for the first record from the sorted RDD.
回答4:
This seems to be always an issue with DStreams as well as regular RDDs.
If you don't want (or can't) to use .take() (especially in DStreams) you can think outside the box here and just use reduce instead. That is a valid function for both DStreams as well as RDD's.
Think about it. If you use reduce like this (Python example):
.reduce( lambda x, y : x)
Then what happens is: For every 2 elements you pass in, always return only the first. So if you have a million elements in your RDD or DStream it will shrink it to one element in the end which is the very first one in your RDD or DStream.
Simple and clean.
However keep in mind that .reduce()
does not take order into consideration. However you can easily overcome this with a custom function instead.
Example: Let's assume your data looks like this x = (1, [1,2,3]) and y = (2, [1,2]). A tuple x where the 2nd element is a list. If you are sorting by the longest list for example then your code could look like below maybe (adapt as needed):
def your_reduce(x,y):
if len(x[1]) > len(y[1]):
return x
else:
return y
yourNewRDD = yourOldRDD.reduce(your_reduce)
Accordingly you will get '(1, [1,2,3])' as that has the longer list. There you go!
This has caused me some headaches in the past until I finally tried this. Hopefully this helps.
来源:https://stackoverflow.com/questions/28697888/get-the-first-elements-take-function-of-a-dstream