Question
Cheerz,
Recently I have been trying out Spark, and so far I have observed quite interesting results, but currently I am stuck with the famous groupByKey
OOM problem. Basically, the job searches large datasets for periods where a measured value increases consecutively at least N times. I managed to get rid of the problem by writing the results to disk, but the application now runs much slower (which is expected, due to the disk IO). Now the question: is there any other memory-efficient strategy where I can run over sorted data and check whether adjacent values (for the same key) are increasing for at least N consecutive observations, without resorting to the groupByKey method?
I have designed an algorithm to do it with reduceByKey
, but there is one problem: reduce seems to ignore data ordering and yields completely wrong results at the end.
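For illustration, a minimal sketch of the kind of groupByKey approach described above (the (name, (timestamp, value)) record layout and the helper name are assumptions for the example, not the actual job):

```scala
import org.apache.spark.rdd.RDD

// Assumed record layout: (name, (timestamp, value)).
// groupByKey materialises every observation of a key in memory at once,
// which is what triggers the OOM on large keys.
def increasingRuns(data: RDD[(String, (Long, Double))], n: Int): RDD[(String, Int)] =
  data.groupByKey().mapValues { obs =>
    val values = obs.toSeq.sortBy(_._1).map(_._2)
    // count windows of n consecutive, strictly increasing observations
    values.sliding(n).count(w => w.size == n && w.zip(w.tail).forall { case (a, b) => b > a })
  }
```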
Any ideas appreciated, thanks.
Answer 1:
There are a few ways you can approach this problem:
- `repartitionAndSortWithinPartitions` with custom partitioner and ordering (a rough sketch follows this list):
  - `keyBy` (name, timestamp) pairs
  - create a custom partitioner which considers only the name
  - `repartitionAndSortWithinPartitions` using the custom partitioner
  - use `mapPartitions` to iterate over the data and yield matching sequences
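A minimal sketch of that first approach, assuming (name, timestamp, value) input records; the `NamePartitioner` class and the `findRuns` helper are illustrative names, not part of the original answer:

```scala
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Partition only by the name component of a (name, timestamp) key, so all
// observations for one name end up in the same partition.
class NamePartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key match {
    case (name: String, _) => ((name.hashCode % numPartitions) + numPartitions) % numPartitions
  }
}

// Assumed input: (name, timestamp, value) triples; n is the minimum run length.
def findRuns(data: RDD[(String, Long, Double)], n: Int, parts: Int): RDD[(String, Long)] =
  data
    .map { case (name, ts, v) => ((name, ts), v) }                    // keyBy (name, timestamp)
    .repartitionAndSortWithinPartitions(new NamePartitioner(parts))   // sorted by (name, ts) per partition
    .mapPartitions { iter =>
      // single linear pass over the sorted stream, tracking the current increasing run per name
      var prev: Option[((String, Long), Double)] = None
      var runStart: (String, Long) = ("", 0L)
      var runLen = 0
      val starts = scala.collection.mutable.ArrayBuffer.empty[(String, Long)]
      for (cur @ ((name, ts), v) <- iter) {
        prev match {
          case Some(((pName, _), pV)) if pName == name && v > pV => runLen += 1
          case _ => runLen = 1; runStart = (name, ts)
        }
        if (runLen == n) starts += runStart   // emit each qualifying run once, at its start
        prev = Some(cur)
      }
      starts.iterator                         // (name, timestamp) where runs of length >= n begin
    }
```

Because every name's observations land, already sorted, in a single partition, the scan only has to remember the previous record and the current run length.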
- `sortByKey` - this is similar to the first solution but provides higher granularity at the cost of additional post-processing (a sketch follows this list):
  - `keyBy` (name, timestamp) pairs
  - `sortByKey`
  - process individual partitions using `mapPartitionsWithIndex`, keeping track of leading / trailing patterns for each partition
  - adjust the final results to include patterns which span over more than one partition
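A sketch of that variant; the `PartitionResult` case class and the `partitionScan` helper are assumptions for the example. Each partition reports the windows it found plus its first and last n - 1 rows, which is enough to later recover windows that straddle a partition boundary:

```scala
import org.apache.spark.rdd.RDD

// What each partition reports after its local scan.
case class PartitionResult(
    index: Int,
    starts: Seq[(String, Long)],                 // where windows of n increasing values begin
    leading: Seq[((String, Long), Double)],      // first n - 1 rows of the partition
    trailing: Seq[((String, Long), Double)])     // last n - 1 rows of the partition

def partitionScan(data: RDD[((String, Long), Double)], n: Int): RDD[PartitionResult] =
  data.sortByKey().mapPartitionsWithIndex { (idx, iter) =>
    // materialising one partition is bounded by partition size, unlike one huge key
    val rows = iter.toVector
    val starts =
      if (rows.size < n) Vector.empty[(String, Long)]
      else rows.sliding(n).collect {
        // a window qualifies if all n rows share one name and the values strictly increase
        case w if w.map(_._1._1).distinct.size == 1 &&
                  w.map(_._2).zip(w.map(_._2).tail).forall { case (a, b) => b > a } =>
          w.head._1
      }.toVector
    Iterator(PartitionResult(idx, starts, rows.take(n - 1), rows.takeRight(n - 1)))
  }
```

The PartitionResult records are small, so they can be collected to the driver and each trailing edge rescanned together with the next partition's leading edge to pick up windows that cross a boundary.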
- create fixed-size windows over sorted data using `sliding` from `mllib.rdd.RDDFunctions` (a sketch follows this list):
  - `sortBy` (name, timestamp)
  - create a sliding RDD and filter windows which cover multiple names
  - check if any window contains the desired pattern.
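A sketch of the sliding-window route (again, the record layout and helper name are assumptions). Any run of at least n increasing values must contain a window of exactly n, so fixed-size windows are sufficient for detection:

```scala
import org.apache.spark.mllib.rdd.RDDFunctions._
import org.apache.spark.rdd.RDD

// Assumed input: (name, timestamp, value) triples; n is the minimum run length.
def slidingRuns(data: RDD[(String, Long, Double)], n: Int): RDD[(String, Long)] =
  data
    .sortBy { case (name, ts, _) => (name, ts) }   // global sort by (name, timestamp)
    .sliding(n)                                    // each element is an Array of n consecutive rows
    .filter { w =>
      w.map(_._1).distinct.length == 1 &&          // drop windows that span more than one name
      w.map(_._3).zip(w.map(_._3).tail).forall { case (a, b) => b > a }   // strictly increasing values
    }
    .map(w => (w(0)._1, w(0)._2))                  // (name, timestamp) at which a qualifying window starts
```

Here `sliding` only moves a few rows across partition boundaries to build the windows, so per-task memory stays bounded by the window size rather than by the size of a key's group.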
Source: https://stackoverflow.com/questions/35579619/detecting-repeating-consecutive-values-in-large-datasets-with-spark