Question
I am trying to apply a filter on several columns of an RDD. I want to pass in a list of indices as a parameter to specify which columns to filter on, but PySpark only applies the last filter.
I've broken the code down into some simple test cases and tried the non-looped version, and it works.
test_input = [('0', '00'), ('1', '1'), ('', '22'), ('', '3')]
rdd = sc.parallelize(test_input, 1)
# Index 0 needs to be longer than length 0
# Index 1 needs to be longer than length 1
for i in [0, 1]:
    rdd = rdd.filter(lambda arr: len(arr[i]) > i)
rdd.top(5)
# rdd.top(5) gives [('0', '00'), ('', '22')]
# Only 2nd filter applied
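This matches Python's late-binding rule for closures: the lambda looks up i when it runs, not when it is defined, so by the time the filters execute both see i == 1 and effectively test len(arr[1]) > 1. A minimal Spark-free sketch of the same effect (the tuple values here are illustrative):
fns = []
for i in [0, 1]:
    fns.append(lambda arr: len(arr[i]) > i)
# Both lambdas read i when they are called, and by then i == 1,
# so both effectively test len(arr[1]) > 1
print(fns[0](('', 'yy')))  # True, even though len(arr[0]) > 0 is False
print(fns[1](('', 'yy')))  # True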
For comparison, the non-looped version:
test_input = [('0', '00'), ('1', '1'), ('', '22'), ('', '3')]
rdd = sc.parallelize(test_input, 1)
rdd = rdd.filter(lambda arr: len(arr[0]) > 0)
rdd = rdd.filter(lambda arr: len(arr[1]) > 1)
rdd.top(5)
# rdd.top(5) gives [('0', '00')] as expected
I expect the loop to give the same results as the non-looped version.
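One common fix for this closure pattern is to bind the current value of i as a default argument, so each filter keeps its own index. A minimal sketch of the loop rewritten that way:
test_input = [('0', '00'), ('1', '1'), ('', '22'), ('', '3')]
rdd = sc.parallelize(test_input, 1)
for i in [0, 1]:
    # i=i freezes the current value of i in the lambda's defaults,
    # so the two filters test index 0 and index 1 respectively
    rdd = rdd.filter(lambda arr, i=i: len(arr[i]) > i)
rdd.top(5)
# [('0', '00')]
The same binding can also be done with functools.partial instead of a default argument.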
Source: https://stackoverflow.com/questions/57154430/how-to-apply-multiple-filters-in-a-for-loop-for-pyspark