Error using reduceByKey: int object is unsubscriptable

被撕碎了的回忆 2021-01-03 17:40

I'm getting the error "int object is unsubscriptable" when executing the following script:

element.reduceByKey(lambda x, y: x[1] + y[1])


        
2 Answers
  • 2021-01-03 18:13

    Another approach is to convert the RDD to a DataFrame and aggregate there:

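    from pyspark.sql.functions import sum as sum_  # alias avoids shadowing the built-in sum

    rdd = sc.parallelize([('A', ('toto', 10)),('A', ('titi', 30)),('5', ('tata', 10)),('A', ('toto', 10))])
    # Index into the nested tuple explicitly; tuple-unpacking lambdas are a syntax error in Python 3.
    rdd.map(lambda kv: (kv[0], kv[1][0], kv[1][1])).toDF(['a','b','c']).groupBy('a').agg(sum_('c')).rdd.map(tuple).collect()
    
    [(u'5', 10), (u'A', 50)]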
    
  • 2021-01-03 18:20

    Here is an example that will illustrate what's going on.

    Let's consider what happens when you call reduce on a list with some function f:

    reduce(f, [a,b,c]) = f(f(a,b),c)
    

    If we take your example, f = lambda u, v: u[1] + v[1], then the above expression breaks down into:

    reduce(f, [a,b,c]) = f(f(a,b),c) = f(a[1]+b[1],c)
    

    But f(a,b) = a[1] + b[1] is an integer, and the outer call then evaluates u[1] with that integer as u. An integer has no __getitem__ method, hence your error.
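
    The same failure can be reproduced without Spark using Python's functools.reduce. This is only a minimal sketch: the values list is an assumed stand-in for the values Spark would collect under one key, shaped like the tuples in the question.

```python
from functools import reduce

# Assumed sample: the values that share one key, as (name, count) tuples.
values = [('toto', 10), ('titi', 30), ('toto', 10)]

f = lambda u, v: u[1] + v[1]

# With two values, f is called once on two tuples, so it succeeds.
two = reduce(f, values[:2])

# With three values, the second call receives the int from the first call as u,
# and indexing an int raises TypeError.
try:
    reduce(f, values)
    error = None
except TypeError as exc:
    error = str(exc)

print(two)
print(error)
```

    The reduce function must return the same kind of object it accepts, because its result is fed back in as an argument.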

    In general, the better approach (as shown below) is to use map() to first extract the data in the format that you want, and then apply reduceByKey().


    An MCVE with your data

    element = sc.parallelize(
        [
            ('A', ('toto' , 10)),
            ('A', ('titi' , 30)),
            ('5', ('tata', 10)),
            ('A', ('toto', 10))
        ]
    )
    

    You can almost get your desired output with a more sophisticated reduce function:

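    def add_tuple_values(a, b):
        # Each argument is either an original (name, count) tuple
        # or an int returned by an earlier call to this function.
        try:
            u = a[1]
        except TypeError:  # a is already an int
            u = a
        try:
            v = b[1]
        except TypeError:  # b is already an int
            v = b
        return u + v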
    
    print(element.reduceByKey(add_tuple_values).collect())
    

    Except that this results in:

    [('A', 50), ('5', ('tata', 10))]
    

    Why? Because there's only one value for the key '5', so there is nothing to reduce.
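
    The pass-through behaviour can be seen in a small local model. This is only a sketch: reduce_by_key_local is a hypothetical pure-Python helper that imitates this one aspect of reduceByKey, not Spark's actual implementation.

```python
from collections import defaultdict
from functools import reduce

def add_tuple_values(a, b):
    # Same logic as the answer's function: accept tuples or ints.
    u = a[1] if isinstance(a, tuple) else a
    v = b[1] if isinstance(b, tuple) else b
    return u + v

def reduce_by_key_local(pairs, func):
    """Hypothetical local stand-in: group values by key, then fold each group."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # reduce() on a one-element list returns that element without calling func.
    return {key: reduce(func, vals) for key, vals in groups.items()}

data = [
    ('A', ('toto', 10)),
    ('A', ('titi', 30)),
    ('5', ('tata', 10)),
    ('A', ('toto', 10)),
]

result = reduce_by_key_local(data, add_tuple_values)
print(result)
```

    The key '5' keeps its original tuple because the function is never invoked for it, while 'A' is folded down to an int.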

    For these reasons, it's best to first call map. To get your desired output, you could do:

    >>> print(element.map(lambda x: (x[0], x[1][1])).reduceByKey(lambda u, v: u+v).collect())
    [('A', 50), ('5', 10)]
    

    Update 1

    Here's one more approach:

    You could create tuples in your reduce function, and then call map to extract the value you want. (Essentially reverse the order of map and reduce.)

    print(
        element.reduceByKey(lambda u, v: (0,u[1]+v[1]))
            .map(lambda x: (x[0], x[1][1]))
            .collect()
    )
    [('A', 50), ('5', 10)]
    

    Notes

    • Had there been at least 2 records for each key, using add_tuple_values() would have given you the correct output.