I am new to PySpark and I am trying to convert a Python list to an RDD, and then I need to find the index of an element using that RDD. For the first part I am doing:
l = [[1, 2], [1, 4]]
rdd = sc.parallelize(l)
Use filter and zipWithIndex:
(rdd.zipWithIndex()
    .filter(lambda kv: kv[0] == [1, 2])
    .map(lambda kv: kv[1])
    .collect())
Note that [1,2] here can easily be replaced by a variable, and the whole expression can be wrapped in a function.
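For example, a small helper along these lines (find_indices is just an illustrative name, not part of the PySpark API):

def find_indices(rdd, value):
    # Pair each element with its position, keep the pairs whose element
    # matches, and return the positions as a plain Python list.
    return (rdd.zipWithIndex()
               .filter(lambda kv: kv[0] == value)
               .map(lambda kv: kv[1])
               .collect())

find_indices(rdd, [1, 2])
> [0]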
zipWithIndex simply pairs each element with its position, returning tuples of (item, index) like so:
rdd.zipWithIndex().collect()
> [([1, 2], 0), ([1, 4], 1)]
filter keeps only those pairs that match a particular criterion (in this case, that the element itself equals a specific sublist):
rdd.zipWithIndex().filter(lambda kv: kv[0] == [1, 2]).collect()
> [([1, 2], 0)]
map is fairly obvious; we just pull the index out of each remaining pair:
(rdd.zipWithIndex()
    .filter(lambda kv: kv[0] == [1, 2])
    .map(lambda kv: kv[1])
    .collect())
> [0]
and then, if you only want the first match, you can simply take element [0] of that result.
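For instance, with the same rdd as above:

indices = rdd.zipWithIndex().filter(lambda kv: kv[0] == [1, 2]).map(lambda kv: kv[1]).collect()
indices[0]
> 0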