PySpark groupByKey returning pyspark.resultiterable.ResultIterable

不思量自难忘° · asked 2021-01-30 16:24

I am trying to figure out why my groupByKey is returning the following:

[(0, <pyspark.resultiterable.ResultIterable object at 0x...>), (1, <pyspark.resultiterable.ResultIterable object at 0x...>), ...]
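For reference, a minimal reproduction that yields this kind of output (the RDD contents here are assumed for illustration):

    example = sc.parallelize([(0, u'D'), (0, u'D'), (1, u'E'), (2, u'F')])
    example.groupByKey().collect()
    # [(0, <pyspark.resultiterable.ResultIterable object at 0x...>), ...]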

        
6 Answers
  • 2021-01-30 16:47

    Instead of using groupByKey(), I would suggest you use cogroup(). You can refer to the example below:

    [(k, tuple(map(list, v))) for k, v in sorted(x.cogroup(y).collect())]
    

    Example:

    >>> x = sc.parallelize([("foo", 1), ("bar", 4)])
    >>> y = sc.parallelize([("foo", -1)])
    >>> z = [(k, tuple(map(list, v))) for k, v in sorted(x.cogroup(y).collect())]
    >>> print(z)
    [('bar', ([4], [])), ('foo', ([1], [-1]))]
    

    This pairs each key with a tuple of plain lists (one per RDD) instead of ResultIterable objects, which should give you the desired output.

  • 2021-01-30 16:47

    Say your code is:

    ex2 = ex1.groupByKey()
    

    And then you run:

    ex2.take(5)
    

    You're going to see an iterable. That's fine if you're going to do something with this data; you can just move on. But if all you want is to print/see the values before moving on, here is a bit of a hack:

    ex2.toDF().show(20, False)
    

    or just

    ex2.toDF().show()
    

    This will show the values of the data. You shouldn't use collect() here, because it returns all of the data to the driver; if you're working with a lot of data, that's going to blow up on you. Now, if ex2 = ex1.groupByKey() was your final step and you want those results returned, then yes, use collect(), but make sure you know that the data being returned is low volume.

    print(ex2.collect())
    

    Here is another nice post on using collect() on an RDD:

    View RDD contents in Python Spark?
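
    For example, here is a minimal sketch of the peek-then-materialize pattern (it assumes a SparkContext named sc, and the data is illustrative):

    ex1 = sc.parallelize([(0, 'a'), (0, 'b'), (1, 'c')])
    ex2 = ex1.groupByKey()

    # take(5) on the grouped RDD shows ResultIterable objects, not values
    ex2.take(5)
    # [(0, <pyspark.resultiterable.ResultIterable object at 0x...>), ...]

    # Materialize each group as a list first, then peek at a few rows
    ex2.mapValues(list).take(5)
    # [(0, ['a', 'b']), (1, ['c'])]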

  • 2021-01-30 16:49

    In addition to the above answers, if you want a sorted list of unique items per key, use the following:

    List of Distinct and Sorted Values

    example.groupByKey().mapValues(set).mapValues(sorted)
    

    Just List of Sorted Values

    example.groupByKey().mapValues(sorted)
    

    Alternatives to the above

    # List of distinct sorted items
    example.groupByKey().map(lambda x: (x[0], sorted(set(x[1]))))
    
    # just sorted list of items
    example.groupByKey().map(lambda x: (x[0], sorted(x[1])))
    
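    For instance, with hypothetical data (the example RDD below is assumed, not from the original answer):

    example = sc.parallelize([(0, 'b'), (0, 'a'), (0, 'b'), (1, 'c')])
    example.groupByKey().mapValues(set).mapValues(sorted).collect()
    # [(0, ['a', 'b']), (1, ['c'])]  -- key order may vary across runs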
  • 2021-01-30 16:52

    What you're getting back is an object which allows you to iterate over the results. You can turn the results of groupByKey into a list by calling list() on the values, e.g.

    example = sc.parallelize([(0, u'D'), (0, u'D'), (1, u'E'), (2, u'F')])
    
    example.groupByKey().collect()
    # Gives [(0, <pyspark.resultiterable.ResultIterable object ......]
    
    example.groupByKey().map(lambda x : (x[0], list(x[1]))).collect()
    # Gives [(0, [u'D', u'D']), (1, [u'E']), (2, [u'F'])]
    
  • 2021-01-30 17:00

    You can also use:

    example.groupByKey().mapValues(list)
    
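    Applied to the example RDD from the previous answer, this gives the same result as the map-based version:

    example.groupByKey().mapValues(list).collect()
    # [(0, [u'D', u'D']), (1, [u'E']), (2, [u'F'])]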
  • 2021-01-30 17:00

    Example:

    from functools import reduce
    from operator import add

    r1 = sc.parallelize([('a', 1), ('b', 2)])
    r2 = sc.parallelize([('b', 1), ('d', 2)])
    r1.cogroup(r2).mapValues(lambda x: tuple(reduce(add, map(list, x)))).collect()
    

    Result:

    [('d', (2,)), ('b', (2, 1)), ('a', (1,))]
    
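    Here reduce(add, map(list, x)) concatenates the per-key value lists from both RDDs into one flat list before converting it to a tuple; a key that appears in only one RDD contributes an empty list from the other side, which is why ('a', (1,)) and ('d', (2,)) end up as one-element tuples.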