Question
Cheerz,
Recently I have been trying out Spark, and so far I have observed quite interesting results, but currently I am stuck with the famous groupByKey
OOM problem. Basically, the job searches large datasets for periods where a measured value increases consecutively at least N times. I managed to get rid of the problem by writing the results to disk, but the application now runs much slower (which is expected, due to the disk IO). Now the question: is there any other memory-efficient strategy where I can run over sorted data and check whether adjacent values (for the same key) are increasing for at least N consecutive observations, without resorting to the groupByKey method?
I have designed an algorithm to do it with reduceByKey
, but there is one problem: reduce seems to ignore data ordering and yields completely wrong results at the end.
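For illustration, a minimal sketch of the kind of groupByKey approach described above (the (name, (timestamp, value)) record layout and the helper name are assumptions for the example, not the actual job):

```scala
import org.apache.spark.rdd.RDD

// Assumed record layout: (name, (timestamp, value)).
// groupByKey materialises every observation of a key in memory at once,
// which is what triggers the OOM on large keys.
def increasingRuns(data: RDD[(String, (Long, Double))], n: Int): RDD[(String, Int)] =
  data.groupByKey().mapValues { obs =>
    val values = obs.toSeq.sortBy(_._1).map(_._2)
    // count windows of n consecutive, strictly increasing observations
    values.sliding(n).count(w => w.size == n && w.zip(w.tail).forall { case (a, b) => b > a })
  }
```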
Any ideas appreciated, thanks.
Answer 1:
There are a few ways you can approach this problem:
- `repartitionAndSortWithinPartitions` with custom partitioner and ordering (a rough sketch follows this list):
  - `keyBy` (name, timestamp) pairs
  - create a custom partitioner which considers only the name
  - `repartitionAndSortWithinPartitions` using the custom partitioner
  - use `mapPartitions` to iterate over the data and yield matching sequences
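A minimal sketch of that first approach, assuming (name, timestamp, value) input records; the `NamePartitioner` class and the `findRuns` helper are illustrative names, not part of the original answer:

```scala
import org.apache.spark.Partitioner
import org.apache.spark.rdd.RDD

// Partition only by the name component of a (name, timestamp) key, so all
// observations for one name end up in the same partition.
class NamePartitioner(override val numPartitions: Int) extends Partitioner {
  def getPartition(key: Any): Int = key match {
    case (name: String, _) => ((name.hashCode % numPartitions) + numPartitions) % numPartitions
  }
}

// Assumed input: (name, timestamp, value) triples; n is the minimum run length.
def findRuns(data: RDD[(String, Long, Double)], n: Int, parts: Int): RDD[(String, Long)] =
  data
    .map { case (name, ts, v) => ((name, ts), v) }                    // keyBy (name, timestamp)
    .repartitionAndSortWithinPartitions(new NamePartitioner(parts))   // sorted by (name, ts) per partition
    .mapPartitions { iter =>
      // single linear pass over the sorted stream, tracking the current increasing run per name
      var prev: Option[((String, Long), Double)] = None
      var runStart: (String, Long) = ("", 0L)
      var runLen = 0
      val starts = scala.collection.mutable.ArrayBuffer.empty[(String, Long)]
      for (cur @ ((name, ts), v) <- iter) {
        prev match {
          case Some(((pName, _), pV)) if pName == name && v > pV => runLen += 1
          case _ => runLen = 1; runStart = (name, ts)
        }
        if (runLen == n) starts += runStart   // emit each qualifying run once, at its start
        prev = Some(cur)
      }
      starts.iterator                         // (name, timestamp) where runs of length >= n begin
    }
```

Because every name's observations land, already sorted, in a single partition, the scan only has to remember the previous record and the current run length.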
- `sortByKey` - this is similar to the first solution but provides higher granularity at the cost of additional post-processing (a sketch follows this list):
  - `keyBy` (name, timestamp) pairs
  - `sortByKey`
  - process individual partitions using `mapPartitionsWithIndex`, keeping track of leading / trailing patterns for each partition
  - adjust the final results to include patterns which span over more than one partition
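A sketch of that variant; the `PartitionResult` case class and the `partitionScan` helper are assumptions for the example. Each partition reports the windows it found plus its first and last n - 1 rows, which is enough to later recover windows that straddle a partition boundary:

```scala
import org.apache.spark.rdd.RDD

// What each partition reports after its local scan.
case class PartitionResult(
    index: Int,
    starts: Seq[(String, Long)],                 // where windows of n increasing values begin
    leading: Seq[((String, Long), Double)],      // first n - 1 rows of the partition
    trailing: Seq[((String, Long), Double)])     // last n - 1 rows of the partition

def partitionScan(data: RDD[((String, Long), Double)], n: Int): RDD[PartitionResult] =
  data.sortByKey().mapPartitionsWithIndex { (idx, iter) =>
    // materialising one partition is bounded by partition size, unlike one huge key
    val rows = iter.toVector
    val starts =
      if (rows.size < n) Vector.empty[(String, Long)]
      else rows.sliding(n).collect {
        // a window qualifies if all n rows share one name and the values strictly increase
        case w if w.map(_._1._1).distinct.size == 1 &&
                  w.map(_._2).zip(w.map(_._2).tail).forall { case (a, b) => b > a } =>
          w.head._1
      }.toVector
    Iterator(PartitionResult(idx, starts, rows.take(n - 1), rows.takeRight(n - 1)))
  }
```

The PartitionResult records are small, so they can be collected to the driver and each trailing edge rescanned together with the next partition's leading edge to pick up windows that cross a boundary.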
- create fixed-size windows over sorted data using `sliding` from `mllib.rdd.RDDFunctions` (a sketch follows this list):
  - `sortBy` (name, timestamp)
  - create a sliding RDD and filter windows which cover multiple names
  - check if any window contains the desired pattern.
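A sketch of the sliding-window route (again, the record layout and helper name are assumptions). Any run of at least n increasing values must contain a window of exactly n, so fixed-size windows are sufficient for detection:

```scala
import org.apache.spark.mllib.rdd.RDDFunctions._
import org.apache.spark.rdd.RDD

// Assumed input: (name, timestamp, value) triples; n is the minimum run length.
def slidingRuns(data: RDD[(String, Long, Double)], n: Int): RDD[(String, Long)] =
  data
    .sortBy { case (name, ts, _) => (name, ts) }   // global sort by (name, timestamp)
    .sliding(n)                                    // each element is an Array of n consecutive rows
    .filter { w =>
      w.map(_._1).distinct.length == 1 &&          // drop windows that span more than one name
      w.map(_._3).zip(w.map(_._3).tail).forall { case (a, b) => b > a }   // strictly increasing values
    }
    .map(w => (w(0)._1, w(0)._2))                  // (name, timestamp) at which a qualifying window starts
```

Here `sliding` only moves a few rows across partition boundaries to build the windows, so per-task memory stays bounded by the window size rather than by the size of a key's group.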
Source: https://stackoverflow.com/questions/35579619/detecting-repeating-consecutive-values-in-large-datasets-with-spark