Lazy foreach on a Spark RDD

zero323

I could retrofit foreach, or filter, or map for this purpose, but all of these will iterate through every element in that RDD

Actually, you're wrong. The Spark engine is smart enough to optimize the computation if you limit the results (using take or first). In the examples below, sc is the SparkContext available in the pyspark shell:

from __future__ import print_function  # must come before any other import

import numpy as np

np.random.seed(323)

acc = sc.accumulator(0)

def good_enough(x, threshold=7000):
    # Use the accumulator to count how many elements actually get evaluated.
    global acc
    acc += 1
    return x > threshold

rdd = sc.parallelize(np.random.randint(0, 10000) for i in xrange(1000000))

x = rdd.filter(good_enough).first()

Now let's check the accumulator:

>>> print("Checked {0} items, found {1}".format(acc.value, x))
Checked 6 items, found 7109

and just to be sure that everything works as expected:

acc = sc.accumulator(0)
rdd.filter(lambda x: good_enough(x, 100000)).take(1)
assert acc.value == rdd.count()

The same thing could be done, probably in a more efficient manner, using DataFrames and a UDF (a rough sketch follows).
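A minimal sketch of that DataFrame-and-UDF variant, assuming Spark 2.x+ with a SparkSession; the names spark, df, value and good_enough_udf are my own additions, not part of the original answer:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, udf
from pyspark.sql.types import BooleanType

# Assumes a SparkSession (available as `spark` in a pyspark shell).
spark = SparkSession.builder.getOrCreate()

# A synthetic single-column DataFrame of integers named "value".
df = spark.range(0, 1000000).toDF("value")

# Wrapping the predicate in a UDF mirrors the suggestion above; for a simple
# comparison, a plain column expression (col("value") > 7000) would be even
# better, since Catalyst can push that filter down.
good_enough_udf = udf(lambda v: v > 7000, BooleanType())

# first() limits the job the same way it does for RDDs.
row = df.filter(good_enough_udf(col("value"))).first()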

Note: In some cases it is even possible to use an infinite sequence in Spark and still get a result. You can check my answer to Spark FlatMap function for huge lists for an example.
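A rough sketch of what that note is getting at, relying on the laziness of PySpark's per-partition iterators; the use of itertools.count and the single-partition parallelize are my own illustration, not code from the linked answer:

from itertools import count

# Each input element expands into a lazy, unbounded generator. Because
# PySpark evaluates a partition as a chain of iterators, take() only pulls
# as many elements as it needs, so the job still terminates.
infinite = sc.parallelize([0], 1).flatMap(lambda start: (i for i in count(start)))
infinite.take(5)  # [0, 1, 2, 3, 4]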

Not really. There is no find method, as in the Scala collections that inspired the Spark APIs, that would stop looking once it finds an element satisfying a predicate. Your best bet is probably to use a data source that minimizes excess scanning, such as Cassandra, where the driver pushes down some query parameters. You might also look at the more experimental Berkeley project BlinkDB.
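That said, a find-style helper can be approximated on top of filter and take(1), in the spirit of the answer above; the name find_first and the helper itself are my own sketch, not an existing Spark API:

def find_first(rdd, predicate):
    # Emulates Scala's find: return the first element satisfying the
    # predicate, or None. take(1) launches jobs over successively more
    # partitions, so it avoids a full scan when a match exists early on.
    matches = rdd.filter(predicate).take(1)
    return matches[0] if matches else None

# For example, with the rdd defined in the answer above:
# find_first(rdd, lambda x: x > 7000)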

Bottom line: Spark, like MapReduce before it, is designed more for scanning data sets than for traditional, database-like queries.
