I have a big RDD of Strings (obtained through a union of several sc.textFile(...) calls).
I now want to search for a given string in that RDD, and I want the search to stop when a "good enough" match has been found.
I could retrofit foreach, or filter, or map for this purpose, but all of these will iterate through every element in that RDD, regardless of whether the match has been reached.
Is there a way to short-circuit this process and avoid iterating through the whole RDD?
> I could retrofit foreach, or filter, or map for this purpose, but all of these will iterate through every element in that RDD
Actually, you're wrong. Transformations are lazy, and the Spark engine is smart enough to stop early if you limit the results (using take or first):
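from __future__ import print_function  # must precede all other imports
import numpy as np

# sc is the SparkContext provided by the PySpark shell
np.random.seed(323)
acc = sc.accumulator(0)  # counts how many elements the predicate sees

def good_enough(x, threshold=7000):
    global acc
    acc += 1
    return x > threshold

# xrange is Python 2; use range on Python 3
rdd = sc.parallelize(np.random.randint(0, 10000) for i in xrange(1000000))
x = rdd.filter(good_enough).first()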
Now let's check the accumulator:
>>> print("Checked {0} items, found {1}".format(acc.value, x))
Checked 6 items, found 7109
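And just to be sure everything works as expected, here is the same check with a threshold that no element can reach, so the whole RDD has to be scanned: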
acc = sc.accumulator(0)
rdd.filter(lambda x: good_enough(x, 100000)).take(1)
assert acc.value == rdd.count()
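As a rough intuition for the two results above, here is a plain-Python analogy. It is not Spark code and does not describe Spark's internals; count_checks and values are names made up for this sketch. A lazy filter only evaluates the predicate when something downstream asks for the next element, so asking for a single match stops the scan as soon as one is found, while a predicate that never matches forces a full pass.

```python
from itertools import islice

def count_checks(values, threshold, limit=1):
    """Return (matches, number of predicate calls) for a lazy filter + take."""
    checked = 0

    def good_enough(x):
        nonlocal checked
        checked += 1  # plays the role of the accumulator in the Spark example
        return x > threshold

    lazy_matches = filter(good_enough, values)   # nothing is evaluated yet
    matches = list(islice(lazy_matches, limit))  # pulls only what is needed
    return matches, checked

values = [3, 8, 1, 9, 4, 7, 2]

# A match exists near the start: the scan stops right after it.
early = count_checks(values, threshold=7)

# No element can match: every element has to be checked.
none = count_checks(values, threshold=100)
```

In Spark the same laziness applies within each partition, and take evaluates partitions incrementally, so the exact number of elements checked depends on how the data is partitioned.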
The same thing could be done, probably more efficiently, using DataFrames and a UDF.
Note: In some cases it is even possible to use an infinite sequence in Spark and still get a result. You can check my answer to "Spark FlatMap function for huge lists" for an example.
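The filter-then-take(1) pattern used above can be wrapped in a small find-style helper. This is only a sketch: find_first is a hypothetical name, not part of the Spark API, and it relies on nothing beyond RDD.filter and RDD.take. It uses take(1) rather than first() because first() raises an error when nothing matches.

```python
def find_first(rdd, predicate):
    """Return one element satisfying predicate, or None if there is none.

    Hypothetical helper, not a Spark API: it only calls rdd.filter and rdd.take.
    """
    # take(1) returns a list with at most one element, so an empty list
    # means no element matched.
    matches = rdd.filter(predicate).take(1)
    return matches[0] if matches else None

# Usage, assuming the rdd from the example above:
#   find_first(rdd, lambda x: x > 7000)
```

Returning None is ambiguous if the RDD can itself contain None; in that case returning the list from take(1) directly may be the safer choice.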
Not really. There is no find method, as in the Scala collections that inspired the Spark APIs, which would stop looking once an element is found that satisfies a predicate. Probably your best bet is to use a data source that will minimize excess scanning, like Cassandra, where the driver pushes down some query parameters. You might also look at the more experimental Berkeley project called BlinkDB.
Bottom line, Spark is designed more for scanning data sets, like MapReduce before it, rather than traditional database-like queries.
Source: https://stackoverflow.com/questions/31542779/lazy-foreach-on-a-spark-rdd