Spark RDD - Mapping with extra arguments

Asked by 星月不相逢 on 2021-02-01 02:49

Is it possible to pass extra arguments to the mapping function in PySpark? Specifically, I have the following code:

    raw_data_rdd = sc.textFile("data.json")
    json_data_rdd = raw_data_rdd.map(json.loads)
    mapped_rdd = json_data_rdd.flatMap(processDataLine)

How can I pass extra arguments, say arg1 and arg2, to processDataLine?
1 Answer
  • Answered 2021-02-01 03:24
    1. You can use an anonymous function, either directly in a flatMap:

      json_data_rdd.flatMap(lambda j: processDataLine(j, arg1, arg2))
      

      or to curry processDataLine:

      f = lambda j: processDataLine(j, arg1, arg2)
      json_data_rdd.flatMap(f)
      
    2. You can generate processDataLine like this:

      def processDataLine(arg1, arg2):
          def _processDataLine(dataline):
              return ... # Do something with dataline, arg1, arg2
          return _processDataLine
      
      json_data_rdd.flatMap(processDataLine(arg1, arg2))
      
    3. The toolz library provides a useful curry decorator:

      from toolz.functoolz import curry
      
      @curry
      def processDataLine(arg1, arg2, dataline): 
          return ... # Do something with dataline, arg1, arg2
      
      json_data_rdd.flatMap(processDataLine(arg1, arg2))
      

      Note that I've pushed the dataline argument to the last position. This is not required, but it means we don't have to use keyword arguments.
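
      If you keep dataline in the first position instead, curry also lets you bind the remaining parameters by keyword; a minimal sketch under that assumption:

      @curry
      def processDataLine(dataline, arg1, arg2):
          return ... # Do something with dataline, arg1, arg2

      # Binding only keywords returns a curried function still waiting for dataline
      json_data_rdd.flatMap(processDataLine(arg1=arg1, arg2=arg2))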

    4. Finally, there is functools.partial, already mentioned by Avihoo Mamka in the comments.
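
      A minimal sketch of that approach, reusing the same signature as in option 3 (arg1 and arg2 first, dataline last):

      from functools import partial

      def processDataLine(arg1, arg2, dataline):
          return ... # Do something with dataline, arg1, arg2

      # partial binds arg1 and arg2 up front; flatMap then supplies dataline
      json_data_rdd.flatMap(partial(processDataLine, arg1, arg2))

      Since processDataLine is a plain module-level function here, the partial object should serialize cleanly when Spark ships it to the executors.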
