How does the pyspark mapPartitions function work?

伪装坚强ぢ 2020-12-04 16:44

So I am trying to learn Spark using Python (PySpark). I want to know how the mapPartitions function works: what input it takes and what output it gives.

4 Answers
  • 2020-12-04 17:17

    mapPartitions should be thought of as a map operation over partitions, not over the elements of a partition. Its input is the set of current partitions, and its output is another set of partitions.

    The function you pass to map must take an individual element of your RDD.

    The function you pass to mapPartitions must take an iterable of your RDD's element type and return an iterable of some other type or the same type.

    In your case you probably just want to do something like:

    def filter_out_2(line):
        # line is a single element of the RDD, here a list of numbers
        return [x for x in line if x != 2]
    
    filtered_lists = data.map(filter_out_2)
    

    If you wanted to use mapPartitions, it would be:

    def filter_out_2_from_partition(list_of_lists):
        # list_of_lists iterates over every element (a list) in one partition
        final_iterator = []
        for sub_list in list_of_lists:
            final_iterator.append([x for x in sub_list if x != 2])
        return iter(final_iterator)
    
    filtered_lists = data.mapPartitions(filter_out_2_from_partition)
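
The two contracts described in this answer can be tried without a cluster. The sketch below is a pure-Python model, not PySpark itself: `partitions`, `model_map`, and `model_map_partitions` are hypothetical names that stand in for an RDD split into two partitions and for the way each operation hands data to your function.

```python
# Pure-Python model of the two contracts; no Spark needed.
# `partitions` is a hypothetical stand-in for an RDD split into two partitions.
partitions = [
    [[1, 2, 3], [2, 2, 4]],  # partition 0 holds two elements (each a list)
    [[5, 2], [6]],           # partition 1 holds two elements
]

def filter_out_2(line):
    # map-style function: receives ONE element of the RDD
    return [x for x in line if x != 2]

def filter_out_2_from_partition(list_of_lists):
    # mapPartitions-style function: receives an iterator over a whole partition
    for sub_list in list_of_lists:
        yield [x for x in sub_list if x != 2]

def model_map(func, parts):
    # Hypothetical helper: func is called once per element
    return [[func(element) for element in part] for part in parts]

def model_map_partitions(func, parts):
    # Hypothetical helper: func is called once per partition
    return [list(func(iter(part))) for part in parts]

via_map = model_map(filter_out_2, partitions)
via_map_partitions = model_map_partitions(filter_out_2_from_partition, partitions)
print(via_map)
print(via_map_partitions)
```

Both calls produce the same nested result here; the difference is only in how often the function is invoked and what it is handed each time.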
    
  • 2020-12-04 17:17
    def func(l):
        for i in l:
            yield i + "ajbf"


    mylist = ['madhu', 'sdgs', 'sjhf', 'mad']
    rdd = sc.parallelize(mylist)  # sc is an existing SparkContext
    t = rdd.mapPartitions(func)
    for i in t.collect():
        print(i)
    for i in t.collect():
        print(i)
    

    In the above code I am able to get data from the second for..in loop. Since func is a generator, I expected it not to produce values again once the first loop had iterated over it.
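
One likely explanation, offered as a sketch rather than a description of PySpark internals: collect() is an action, and unless the RDD is cached each action recomputes it, so func is called again for every partition and returns a fresh generator. The `LazyPartitions` class below is a hypothetical pure-Python stand-in for that behaviour; the final lines show that a single generator object really is exhausted after one pass.

```python
class LazyPartitions:
    """Hypothetical stand-in for an uncached RDD produced by mapPartitions."""

    def __init__(self, partitions, func):
        self.partitions = partitions
        self.func = func
        self.calls = 0  # how many times func has been invoked

    def collect(self):
        # Every collect() re-runs func on each partition from scratch.
        out = []
        for part in self.partitions:
            self.calls += 1
            out.extend(self.func(iter(part)))
        return out

def func(l):
    for i in l:
        yield i + "ajbf"

t = LazyPartitions([["madhu", "sdgs"], ["sjhf", "mad"]], func)
first = t.collect()
second = t.collect()
print(first == second, t.calls)

# A single generator object, by contrast, yields nothing on a second pass.
gen = func(iter(["madhu"]))
once = list(gen)
again = list(gen)
print(once, again)
```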

  • 2020-12-04 17:36

    The function has to end up returning an iterator. With yield, it would be:

    def filter_out_2(partition):
        for element in partition:
            sec_iterator = []
            for i in element:
                if i != 2:
                    sec_iterator.append(i)
            yield sec_iterator
    
    filtered_lists = data.mapPartitions(filter_out_2)
    for i in filtered_lists.collect(): print(i)
    
  • 2020-12-04 17:39

    It's easier to use mapPartitions with a generator function using the yield syntax:

    def filter_out_2(partition):
        for element in partition:
            if element != 2:
                yield element
    
    filtered_lists = data.mapPartitions(filter_out_2)
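
To make the input and output concrete: the generator receives an iterator over one partition and may yield any number of items, so the result can be shorter or longer than the input. The following is a minimal pure-Python sketch under that assumption; `model_map_partitions`, `sum_partition`, and `parts` are hypothetical names used only for illustration, with plain lists standing in for partitions.

```python
def filter_out_2(partition):
    for element in partition:
        if element != 2:
            yield element

def sum_partition(partition):
    # Yields exactly one item per partition, however many elements it holds.
    yield sum(partition)

def model_map_partitions(func, parts):
    # Hypothetical helper: call func once per partition and flatten the
    # yielded items into one list, the way collect() would present them.
    result = []
    for part in parts:
        result.extend(func(iter(part)))
    return result

parts = [[1, 2, 3], [2, 2], [4, 5]]  # three partitions of a flat RDD of ints
filtered = model_map_partitions(filter_out_2, parts)
sums = model_map_partitions(sum_partition, parts)
print(filtered)
print(sums)
```

Note that the middle partition contributes nothing to `filtered` but still contributes one value to `sums`.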
    