How to process RDDs using a Python class?

面向向阳花 2020-12-03 00:12

I'm implementing a model in Spark as a Python class, and any time I try to map a class method to an RDD it fails. My actual code is more complicated, but this simplified ve

1 answer
  • 2020-12-03 00:37

    The problem here is a little more subtle than using nested RDDs or performing Spark actions inside transformations. Spark doesn't allow access to the SparkContext inside an action or transformation.

    Even if you don't access it explicitly, it is referenced inside the closure and has to be serialized and carried around. Your transformation method references self, and self holds an RDD (self.data) that in turn references the SparkContext, so the method keeps the SparkContext as well, hence the error.
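
    To make the point about self concrete, here is a minimal plain-Python sketch that needs no Spark installation. FakeContext and Model are hypothetical stand-ins invented for illustration (FakeContext plays the role of the SparkContext); they are not part of the original code or of PySpark. The sketch only shows that a bound method carries its instance around, while a static method does not.

```python
class FakeContext(object):
    """Hypothetical stand-in for a SparkContext (not a real Spark class)."""


class Model(object):
    def __init__(self, context):
        # The instance keeps a reference to the context, much like a class
        # that stores an RDD created from `sc`.
        self.context = context

    def transform(self, row):
        # Instance method: never touches self.context directly.
        fields = row.split(',')
        return fields[0] + fields[1]

    @staticmethod
    def static_transform(row):
        # Static method: a plain function with no reference to any instance.
        fields = row.split(',')
        return fields[0] + fields[1]


ctx = FakeContext()
m = Model(ctx)

bound = m.transform
# A bound method carries its instance, and through it the context.
carries_context = bound.__self__.context is ctx

# Accessed through the class, the static method is an ordinary function
# with no __self__ attribute, so there is no instance to drag along.
static_has_self = hasattr(Model.static_transform, '__self__')
```

    Anything that serializes m.transform has to deal with m and everything m references; Model.static_transform has no such baggage.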

    One way to handle this is to use a static method:

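    class model(object):
        @staticmethod
        def transformation_function(row):
            # No reference to self, so nothing from the instance is serialized.
            row = row.split(',')
            return row[0]+row[1]

        def __init__(self):
            # sc is assumed to be an existing SparkContext (e.g. the PySpark shell).
            self.data = sc.textFile('some.csv')

        def run_model(self):
            # Pass the static method via the class, not via self.
            self.data = self.data.map(model.transformation_function)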
    

    Edit:

    If you need access to instance variables, you can try something like this:

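    class model(object):
        @staticmethod
        def transformation_function(a_model):
            # Copy the needed value into a local variable on the driver.
            delim = a_model.delim
            def _transformation_function(row):
                # The closure captures only delim, not a_model.
                return row.split(delim)
            return _transformation_function

        def __init__(self):
            self.delim = ','
            # sc is assumed to be an existing SparkContext (e.g. the PySpark shell).
            self.data = sc.textFile('some.csv')

        def run_model(self):
            # Build the row function first, then map it over the RDD.
            self.data = self.data.map(model.transformation_function(self))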
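
    The factory pattern above can be checked in plain Python without Spark. The following is a minimal sketch under stated assumptions: Model mirrors the class above, but its context attribute is a hypothetical placeholder standing in for the RDD/SparkContext reference, and the built-in map is used in place of RDD.map. It inspects the returned function's closure to show that only delim is captured.

```python
class Model(object):
    def __init__(self):
        self.delim = ','
        # Hypothetical placeholder for something that must not be shipped
        # to workers (stands in for the RDD/SparkContext reference).
        self.context = object()

    @staticmethod
    def transformation_function(a_model):
        # Copy the value out of the instance before defining the closure.
        delim = a_model.delim

        def _transformation_function(row):
            return row.split(delim)

        return _transformation_function


m = Model()
fn = Model.transformation_function(m)

# Names and values the inner function closes over.
free_vars = fn.__code__.co_freevars
captured = [cell.cell_contents for cell in fn.__closure__]

# Built-in map used here purely as a local stand-in for RDD.map.
result = list(map(fn, ['a,b', 'c,d']))
```

    Here free_vars contains only 'delim' and captured holds only the delimiter string, so the instance itself never becomes part of the function that would be serialized.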
    