PySpark RDD: find the index of an element

清酒与你 2020-12-29 00:12

I am new to PySpark. I am trying to convert a Python list to an RDD, and then I need to find an element's index using that RDD. For the first part I am doing:

    l = [[1, 2], [1, 4]]
    rdd = sc.parallelize(l)

1 Answer
  • 2020-12-29 00:42

    Use filter and zipWithIndex:

    rdd.zipWithIndex() \
       .filter(lambda kv: kv[0] == [1, 2]) \
       .map(lambda kv: kv[1]) \
       .collect()
    

    Note that [1,2] here can easily be replaced by a variable, and the whole expression can be wrapped in a function. (Tuple-unpacking lambdas like lambda (key, index): ... only work in Python 2; in Python 3 you index into the tuple instead, as above.)
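    As the note suggests, the pipeline can be wrapped in a reusable helper. A minimal sketch, assuming an active SparkContext; the name `find_indices` is hypothetical, not part of the PySpark API:

    ```python
    # Hedged sketch: wrap the answer's zipWithIndex/filter/map pipeline
    # in a helper. `find_indices` is a hypothetical name, not a PySpark API.
    def find_indices(rdd, key):
        """Return the positions of every element of `rdd` equal to `key`."""
        return (rdd.zipWithIndex()                    # pair each element with its index
                   .filter(lambda kv: kv[0] == key)   # keep pairs whose element matches
                   .map(lambda kv: kv[1])             # drop the element, keep the index
                   .collect())                        # bring the indices to the driver

    # Usage (assuming `sc` is an active SparkContext):
    # rdd = sc.parallelize([[1, 2], [1, 4]])
    # find_indices(rdd, [1, 2])  # -> [0]
    ```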

    How It Works

    zipWithIndex pairs each element with its position, returning an RDD of (item, index) tuples:

    rdd.zipWithIndex().collect()
    > [([1, 2], 0), ([1, 4], 1)]
    

    filter finds only those that match a particular criterion (in this case, that key equals a specific sublist):

    rdd.zipWithIndex().filter(lambda kv: kv[0] == [1, 2]).collect()
    > [([1, 2], 0)]
    

    map then extracts just the index from each surviving pair:

    rdd.zipWithIndex() \
       .filter(lambda kv: kv[0] == [1, 2]) \
       .map(lambda kv: kv[1]).collect()
    > [0]
    

    If you only need the first match, index the collected result with [0].
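    For readers without a Spark installation, the same mechanics can be mimicked with Python builtins. This is only an illustration of what each RDD step computes, not Spark code:

    ```python
    # Plain-Python mirror of the RDD pipeline, step by step.
    data = [[1, 2], [1, 4]]

    # zipWithIndex: pair each element with its position
    pairs = list(zip(data, range(len(data))))          # [([1, 2], 0), ([1, 4], 1)]

    # filter: keep only pairs whose element equals the key
    matches = [kv for kv in pairs if kv[0] == [1, 2]]  # [([1, 2], 0)]

    # map: keep just the index from each surviving pair
    indices = [kv[1] for kv in matches]                # [0]

    print(indices[0])  # first matching index
    ```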
