I have a mapper method:
def mapper(value):
...
for key, value in some_list:
yield key, value
what I need is not really far from
You can use flatMap
if you want a map function that returns multiple outputs.
The function passed to flatMap
can return an iterable:
>>> words = sc.textFile("README.md")
>>> def mapper(line):
... return ((word, 1) for word in line.split())
...
>>> words.flatMap(mapper).take(4)
[(u'#', 1), (u'Apache', 1), (u'Spark', 1), (u'Lightning-Fast', 1)]
>>> counts = words.flatMap(mapper).reduceByKey(lambda x, y: x + y)
>>> counts.take(5)
[(u'all', 1), (u'help', 1), (u'webpage', 1), (u'when', 1), (u'Hadoop', 12)]
It can also be a generator function:
>>> words = sc.textFile("README.md")
>>> def mapper(line):
... for word in line.split():
... yield (word, 1)
...
>>> words.flatMap(mapper).take(4)
[(u'#', 1), (u'Apache', 1), (u'Spark', 1), (u'Lightning-Fast', 1)]
>>> counts = words.flatMap(mapper).reduceByKey(lambda x, y: x + y)
>>> counts.take(5)
[(u'all', 1), (u'help', 1), (u'webpage', 1), (u'when', 1), (u'Hadoop', 12)]
You mentioned that you tried flatMap
but it flattened everything down to a list [key, value, key, value, ...]
instead of a list [(key, value), (key, value)...]
of key-value pairs. I suspect that this is a problem in your map function. If you're still experiencing this problem, could you post a more complete version of your map function?