Question
I posted a question earlier regarding this. I'm using DVD data sent from here. My mapper produces the required key and value with the following code:
import sys

# input comes from STDIN (standard input)
for line in sys.stdin:
    line = line.strip()
    line = line.split(",")
    if len(line) == 5:
        try:
            cpu = int(line[0])
            # <= 10 so that a value of exactly 10 lands in '(0-10]'
            # instead of falling through to the final else branch
            if cpu <= 10:
                cpu = '(0-10]'
            elif cpu >= 11 and cpu <= 20:
                cpu = '(11-20]'
            elif cpu >= 21 and cpu <= 30:
                cpu = '(21-30]'
            elif cpu >= 31 and cpu <= 40:
                cpu = '(31-40]'
            elif cpu >= 41 and cpu <= 50:
                cpu = '(41-50]'
            elif cpu >= 51 and cpu <= 60:
                cpu = '(51-60]'
            elif cpu >= 61 and cpu <= 70:
                cpu = '(61-70]'
            elif cpu >= 71 and cpu <= 80:
                cpu = '(71-80]'
            elif cpu >= 81 and cpu <= 90:
                cpu = '(81-90]'
            else:
                cpu = '(91-100]'
            # emit the bucket label as key and column 5 as value
            print("%s\t%s" % (cpu, line[4]))
        except ValueError:
            # skip rows whose first column is not an integer
            pass
Mapper output: <'(0-10]', line[4]>, where the key '(0-10]' stands for values between 0 and 10 in column 1 and the value is the corresponding column 5 entry.
However, I am trying to reduce so that all the corresponding values (from column 5) are collected in a list for each key. My reduce function only produces results for the last condition of my if/else statement.
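The bucketing done by the if/elif chain can also be written arithmetically. The following is only a sketch: `cpu_bucket`, `width` and `upper` are hypothetical names that do not appear in the original mapper, and it assumes the ten labels shown above, with 10 itself belonging to '(0-10]' and anything above 100 falling into the last bucket.

```python
import math


def cpu_bucket(cpu, width=10, upper=100):
    """Return the label of the bucket containing cpu, e.g. '(11-20]'."""
    # Clamp to the range covered by the labels used in the mapper.
    cpu = max(0, min(cpu, upper))
    # ceil(cpu / width) is the 1-based bucket index; 0 shares the first bucket.
    index = max(1, math.ceil(cpu / width))
    # The first label starts at 0, every later one at the previous bound + 1.
    low = 0 if index == 1 else (index - 1) * width + 1
    return '(%d-%d]' % (low, index * width)


for sample in (0, 10, 11, 20, 95):
    print(sample, cpu_bucket(sample))
```

Computing the label keeps the boundaries in one place, so a change of bucket width does not require editing every branch.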
from operator import itemgetter
import sys
from itertools import groupby
import statistics

# collect the memory values (column 5) emitted by the mapper
output = {}
memory_list = []
reducer_key = None
mapper_key = None

def stdev(l):
    mean = sum(l) / len(l)
    variance = sum([((x - mean) ** 2) for x in l]) / len(l)
    res = variance ** 0.5
    return res

for line in sys.stdin:
    line = line.strip()
    mapper_key, memory = line.split('\t')
    if mapper_key == '(0-10]':
        memory_list.append(memory)
        reducer_key = mapper_key
        #median = statistics.median(memory_list)
        #stdev = stdev(memory_list)
    elif mapper_key == '(11-20]':
        memory_list.append(memory)

print('%s\t%s\n\nMin: %s\n\nMax: %s\n\nMedian: %s\n\nStandard Deviation: %s'
      % (reducer_key, memory_list, min(memory_list), max(memory_list),
         1, 1))
Result for the first key looks alright : (0-10] ['20.26565539', '19.7227076', '21.37316718', '13.94801471', '9.422153029', '7.917970305', '10.00108978', '13.89255583', '11.1138818', '10.29432665', '11.34895276', '22.73229654', '23.1138449', '8.270710346', '5.751964522', '1.761210101', '1.403473432', '4.759806413', '12.40190548', '11.52469147', '13.20220123', '17.93094418', '9.038727772', '4.701368748', '3.160323442', '5.221341297', '10.67699047', '9.769708796', '11.48008641', '16.84765228', '14.12331248', '11.49042139', '8.561695581', '5.753995246', '4.046531112', '4.441022767', '7.473833196', '19.68088202', '11.40191563', '8.539957569', '10.45491398', '12.44943809', '22.72657144', '8.412798914', '3.299788321', '2.60126778', '13.41506318', '11.99137687', '6.608127202', '3.732756464', '10.34080263', '15.42412562', '10.05551595', '4.694121705', '5.621777424', '20.32544635', '8.219092814', '12.26876796', '23.20058915', '1.129506145', '11.71012113', '3.506646328', '9.863296707', '7.088767753']
Min: 1.129506145
Max: 9.863296707
Median: 1
Standard Deviation: 1
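The reported Max of 9.863296707 is smaller than several values in the list (for example 23.20058915). One likely explanation is that the values are still strings when min() and max() are called, so they are compared character by character. A small sketch using four values taken from the output above:

```python
values = ['23.20058915', '9.863296707', '1.129506145', '10.00108978']

# Compared as strings, '9...' sorts after '2...' because '9' > '2'.
string_max = max(values)
print(string_max)   # 9.863296707

# Converted to float first, the comparison is numeric.
numeric_max = max(float(v) for v in values)
print(numeric_max)  # 23.20058915
```

Converting each value with float() before appending it to the list would also make the commented-out median and standard deviation calls usable.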
    if mapper_key == '(0-10]':
        memory_list.append(memory)
        reducer_key = mapper_key
        #median = statistics.median(memory_list)
        #stdev = stdev(memory_list)

print('%s\t%s\n\nMin: %s\n\nMax: %s\n\nMedian: %s\n\nStandard Deviation: %s'
      % (reducer_key, memory_list, min(memory_list), max(memory_list),
         1, 1))
But once I add the second branch, the elif combines the values of the first 'if' branch with its own values. I have tried placing the print statement at different indentation levels.
Source: https://stackoverflow.com/questions/65187820/python-mapreduce-count-frequency-of-numbers-in-a-file
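The values get mixed because every branch appends to the same memory_list, which is never reset when the key changes. One possible way around this, sketched below, is to group the lines by key with itertools.groupby, so that each key gets its own list and no per-key branch is needed. This assumes the reducer input arrives sorted by key, as is typical for a streaming MapReduce job. `parse` and `summarize` are hypothetical helper names, the sample lines are made up for illustration, and statistics.pstdev is used because it matches the population formula in the stdev function above.

```python
import statistics
from itertools import groupby


def parse(lines):
    """Yield (key, value) pairs from tab-separated mapper output."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        key, memory = line.split('\t', 1)
        try:
            yield key, float(memory)
        except ValueError:
            continue  # skip values that are not numeric


def summarize(lines):
    """Yield one summary tuple per key; lines must be sorted by key."""
    for key, group in groupby(parse(lines), key=lambda pair: pair[0]):
        # A fresh list per key, so values from other keys cannot leak in.
        values = [value for _, value in group]
        yield (key, values, min(values), max(values),
               statistics.median(values), statistics.pstdev(values))


# Made-up sample lines in the mapper's output format.
sample = ['(0-10]\t20.5', '(0-10]\t9.5', '(11-20]\t13.0']
for key, values, low, high, median, stdev in summarize(sample):
    print('%s\t%s\n\nMin: %s\n\nMax: %s\n\nMedian: %s\n\nStandard Deviation: %s'
          % (key, values, low, high, median, stdev))
```

In an actual reducer script, summarize(sys.stdin) would take the place of the sample list.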