(PySpark) Nested lists after reduceByKey

春和景丽 2021-01-06 23:07

I'm sure this is something very simple but I didn't find anything related to this.

My code is simple:

...
stream = stream.map(mapper)
stream = ...

2 Answers
  • 2021-01-06 23:58

    Alternatively, stream.groupByKey().mapValues(lambda x: list(x)).collect() gives

    key1 [value1]
    key2 [value2, value3]
    key3 [value4, value5, value6]
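
    A minimal sketch of that call, with made-up (key, value) pairs standing in for whatever the original mapper emits (value order within each list can depend on partitioning):

    >>> pairs = sc.parallelize([("key1", "value1"), ("key2", "value2"),
    ...                         ("key2", "value3"), ("key3", "value4"),
    ...                         ("key3", "value5"), ("key3", "value6")])
    >>> sorted(pairs.groupByKey().mapValues(lambda x: list(x)).collect())
    [('key1', ['value1']), ('key2', ['value2', 'value3']), ('key3', ['value4', 'value5', 'value6'])]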
    
  • 2021-01-07 00:02

    The problem here is your reduce function. For each key, reduceByKey calls your reduce function with pairs of values and expects it to produce combined values of the same type.

    For example, say that I wanted to perform a word count operation. First, I can map each word to a (word, 1) pair, then I can reduceByKey(lambda x, y: x + y) to sum up the counts for each word. At the end, I'm left with an RDD of (word, count) pairs.

    Here's an example from the PySpark API Documentation:

    >>> from operator import add
    >>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
    >>> sorted(rdd.reduceByKey(add).collect())
    [('a', 2), ('b', 1)]
    

    To understand why your example didn't work, you can imagine the reduce function being applied something like this:

    reduce(reduce(reduce(firstValue, secondValue), thirdValue), fourthValue) ...
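
    As a concrete sketch of the symptom, assuming a reduce function that builds a list from its two arguments, e.g. lambda x, y: [x, y] (a guess, since the original code is truncated): because the output (a list) is not the same type as the inputs, each call wraps the previous result one level deeper, and the exact nesting can even vary with how the data is partitioned. Pinning the data to a single partition makes the result deterministic:

    >>> rdd = sc.parallelize([("k", 1), ("k", 2), ("k", 3), ("k", 4)], 1)
    >>> rdd.reduceByKey(lambda x, y: [x, y]).collect()
    [('k', [[[1, 2], 3], 4])]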
    

    Based on your reduce function, it sounds like you might be trying to implement the built-in groupByKey operation, which groups each key with a list of its values.

    Also, take a look at combineByKey, a generalization of reduceByKey() that allows the reduce function's input and output types to differ (reduceByKey is implemented in terms of combineByKey).
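
    As a minimal sketch of collecting values into per-key lists with combineByKey (the sample data here is made up; the three function arguments are what matter):

    >>> pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3)])
    >>> grouped = pairs.combineByKey(
    ...     lambda v: [v],             # createCombiner: start a list for a new key
    ...     lambda acc, v: acc + [v],  # mergeValue: append a value within a partition
    ...     lambda a, b: a + b)        # mergeCombiners: concatenate lists across partitions
    >>> sorted(grouped.collect())
    [('a', [1, 3]), ('b', [2])]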
