Lazy foreach on a Spark RDD

zero323

I could retrofit foreach, or filter, or map for this purpose, but all of these will iterate through every element in that RDD

Actually, you're wrong. The Spark engine is smart enough to optimize the computation if you limit the results (using take or first). In the examples below, sc is the SparkContext available in the pyspark shell:

from __future__ import print_function  # must come before any other import

import numpy as np

np.random.seed(323)

acc = sc.accumulator(0)

def good_enough(x, threshold=7000):
    # Use the accumulator to count how many elements actually get evaluated.
    global acc
    acc += 1
    return x > threshold

rdd = sc.parallelize(np.random.randint(0, 10000) for i in xrange(1000000))

x = rdd.filter(good_enough).first()

Now let's check the accumulator:

>>> print("Checked {0} items, found {1}".format(acc.value, x))
Checked 6 items, found 7109

and just to be sure that everything works as expected:

acc = sc.accumulator(0)
rdd.filter(lambda x: good_enough(x, 100000)).take(1)
assert acc.value == rdd.count()

The same thing could be done, probably in a more efficient manner, using DataFrames and a UDF (a rough sketch follows).
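A minimal sketch of that DataFrame-and-UDF variant, assuming Spark 2.x+ with a SparkSession; the names spark, df, value and good_enough_udf are my own additions, not part of the original answer:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

# Assumes a SparkSession (available as `spark` in a pyspark shell).
spark = SparkSession.builder.getOrCreate()

# A synthetic single-column DataFrame of integers named "value".
df = spark.range(0, 1000000).toDF("value")

# Wrapping the predicate in a UDF mirrors the suggestion above; for a simple
# comparison, a plain column expression (col("value") > 7000) would be even
# better, since Catalyst can push that filter down.
good_enough_udf = udf(lambda v: v > 7000, BooleanType())

# first() limits the job the same way it does for RDDs.
row = df.filter(good_enough_udf(col("value"))).first()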

Note: In some cases it is even possible to use an infinite sequence in Spark and still get a result. You can check my answer to Spark FlatMap function for huge lists for an example.
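A rough sketch of what that note is getting at, relying on the laziness of PySpark's per-partition iterators; the use of itertools.count and the single-partition parallelize are my own illustration, not code from the linked answer:

from itertools import count

# Each input element expands into a lazy, unbounded generator. Because
# PySpark evaluates a partition as a chain of iterators, take() only pulls
# as many elements as it needs, so the job still terminates.
infinite = sc.parallelize([0], 1).flatMap(lambda start: (i for i in count(start)))
infinite.take(5)  # [0, 1, 2, 3, 4]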

Not really. There is no find method, as in the Scala collections that inspired the Spark APIs, that would stop looking once it finds an element satisfying a predicate. Your best bet is probably to use a data source that minimizes excess scanning, such as Cassandra, where the driver pushes down some query parameters. You might also look at the more experimental Berkeley project BlinkDB.
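That said, a find-style helper can be approximated on top of filter and take(1), in the spirit of the answer above; the name find_first and the helper itself are my own sketch, not an existing Spark API:

def find_first(rdd, predicate):
    # Emulates Scala's find: return the first element satisfying the
    # predicate, or None. take(1) launches jobs over successively more
    # partitions, so it avoids a full scan when a match exists early on.
    matches = rdd.filter(predicate).take(1)
    return matches[0] if matches else None

# For example, with the rdd defined in the answer above:
# find_first(rdd, lambda x: x > 7000)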

Bottom line: Spark, like MapReduce before it, is designed more for scanning data sets than for traditional, database-like queries.
