List (or iterator) of tuples returned by MAP (PySpark)

误落风尘 2021-02-04 17:20

I have a mapper method:

def mapper(value):
    ...
    for key, value in some_list:
        yield key, value

What I need is not really far from

1 Answer
  •  温柔的废话
    2021-02-04 18:00

    You can use flatMap if you want a map function that returns multiple outputs.

    The function passed to flatMap can return an iterable:

    >>> words = sc.textFile("README.md")
    >>> def mapper(line):
    ...     return ((word, 1) for word in line.split())
    ...
    >>> words.flatMap(mapper).take(4)
    [(u'#', 1), (u'Apache', 1), (u'Spark', 1), (u'Lightning-Fast', 1)]
    >>> counts = words.flatMap(mapper).reduceByKey(lambda x, y: x + y)
    >>> counts.take(5)
    [(u'all', 1), (u'help', 1), (u'webpage', 1), (u'when', 1), (u'Hadoop', 12)]
    

    It can also be a generator function:

    >>> words = sc.textFile("README.md")
    >>> def mapper(line):
    ...     for word in line.split():
    ...         yield (word, 1)
    ...
    >>> words.flatMap(mapper).take(4)
    [(u'#', 1), (u'Apache', 1), (u'Spark', 1), (u'Lightning-Fast', 1)]
    >>> counts = words.flatMap(mapper).reduceByKey(lambda x, y: x + y)
    >>> counts.take(5)
    [(u'all', 1), (u'help', 1), (u'webpage', 1), (u'when', 1), (u'Hadoop', 12)]
    

    You mentioned that you tried flatMap but it flattened everything down to a list [key, value, key, value, ...] instead of a list [(key, value), (key, value), ...] of key-value pairs. I suspect that this is a problem in your map function: flatMap flattens only one level, so you would see this result if the function returned a single (key, value) tuple rather than an iterable of tuples. If you're still experiencing this problem, could you post a more complete version of your map function?
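
To illustrate how that kind of flattening can arise, here is a minimal sketch in plain Python. It does not use Spark: `flat_map` is a hypothetical stand-in that assumes flatMap applies the function and then flattens exactly one level, and both mapper functions and the sample `lines` are made up for the illustration. This is only a guess at what the original mapper might be doing, not a diagnosis.

```python
from itertools import chain


def flat_map(func, items):
    """Hypothetical stand-in for flatMap: apply func, then flatten one level."""
    return list(chain.from_iterable(func(item) for item in items))


def mapper_single_tuple(line):
    # Returns ONE (key, value) tuple per input. Flattening then iterates over
    # the tuple itself, so key and value come out as separate elements.
    return (line, 1)


def mapper_iterable_of_tuples(line):
    # Yields (key, value) tuples. Only the outer iterable is flattened,
    # so each tuple survives intact.
    for word in line.split():
        yield (word, 1)


lines = ["Apache Spark", "Spark"]  # made-up sample input

flattened_wrong = flat_map(mapper_single_tuple, lines)
flattened_right = flat_map(mapper_iterable_of_tuples, lines)

print(flattened_wrong)  # ['Apache Spark', 1, 'Spark', 1]
print(flattened_right)  # [('Apache', 1), ('Spark', 1), ('Spark', 1)]
```

If the real mapper returns a bare tuple like the first function, wrapping it in a list (`return [(key, value)]`) or switching to `yield` should keep the pairs together.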
