Python MapReduce - Count Frequency of numbers in a file

Submitted by 徘徊边缘 on 2020-12-15 03:50:19

Question


I posted a question earlier regarding this. I'm using DVD data sent from here. My mapper produces the required key and value with the following code:

import sys
import re

# input comes from STDIN (standard input)
for line in sys.stdin:
    line = line.strip()
    line = line.split(",")

    if len(line) == 5:
        try:
            cpu = int(line[0])
            if cpu <= 10:  # include 10 so it does not fall through to '(91-100]'
                cpu = '(0-10]'
            elif cpu >= 11 and cpu <= 20:
                cpu = '(11-20]'
            elif cpu >= 21 and cpu <= 30:
                cpu = '(21-30]'
            elif cpu >= 31 and cpu <= 40:
                cpu = '(31-40]'
            elif cpu >= 41 and cpu <= 50:
                cpu = '(41-50]'
            elif cpu >= 51 and cpu <= 60:
                cpu = '(51-60]'
            elif cpu >= 61 and cpu <= 70:
                cpu = '(61-70]'
            elif cpu >= 71 and cpu <= 80:
                cpu = '(71-80]'
            elif cpu >= 81 and cpu <= 90:
                cpu = '(81-90]'
            else:
                cpu = '(91-100]'
            print("%s\t%s" % (cpu, line[4]))
        except ValueError:
            pass

Mapper output: <'(0-10]', column 5 value>. The key '(0-10]' covers values between 0 and 10 in column 1, and the value is the corresponding entry from column 5.
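As a side note on the mapper, the if/elif chain could probably be collapsed into one small helper. This is only a minimal sketch: `cpu_bucket` is a hypothetical name that is not part of the script above, and it assumes that a value of exactly 10 belongs in '(0-10]' and that anything outside 0 to 100 is clamped into the nearest range.

```python
def cpu_bucket(cpu):
    """Return the range label for an integer CPU value (hypothetical helper)."""
    # Clamp out-of-range values so they land in the first or last bucket.
    cpu = max(0, min(cpu, 100))
    # Round up to the next multiple of 10 to get the bucket's upper bound.
    upper = max(10, (cpu + 9) // 10 * 10)
    # The first bucket starts at 0; every other bucket starts at upper - 9.
    lower = 0 if upper == 10 else upper - 9
    return "(%d-%d]" % (lower, upper)


for value in (5, 10, 11, 90, 91):
    print(value, cpu_bucket(value))
```

The labels match the ones the mapper prints, so the reducer would not need to change.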

However, when I try to reduce so that all the corresponding column 5 values are collected in a list for each key, my reduce function only produces results for the last condition of my if/else statement.

from operator import itemgetter 
import sys
from itertools import groupby
import statistics

# state shared across all input lines
output = {}
memory_list = []
reducer_key = None
mapper_key = None
def stdev(l):
    mean = sum(l) / len(l)
    variance = sum([((x - mean) ** 2) for x in l]) / len(l)
    res = variance ** 0.5
    return res
for line in sys.stdin:
    line = line.strip()
    mapper_key, memory= line.split('\t')
    if mapper_key == '(0-10]':
        memory_list.append(memory)
        reducer_key = mapper_key
        #median = statistics.median(memory_list) 
        #stdev = stdev(memory_list)
    elif mapper_key == '(11-20]':
        memory_list.append(memory)
print('%s\t%s\n\nMin: %s\n\nMax: %s\n\nMedian: %s\n\nStandard Deviation: %s' 
      % (reducer_key, memory_list, min(memory_list), max(memory_list),
      1, 1))

Result for the first key looks alright : (0-10] ['20.26565539', '19.7227076', '21.37316718', '13.94801471', '9.422153029', '7.917970305', '10.00108978', '13.89255583', '11.1138818', '10.29432665', '11.34895276', '22.73229654', '23.1138449', '8.270710346', '5.751964522', '1.761210101', '1.403473432', '4.759806413', '12.40190548', '11.52469147', '13.20220123', '17.93094418', '9.038727772', '4.701368748', '3.160323442', '5.221341297', '10.67699047', '9.769708796', '11.48008641', '16.84765228', '14.12331248', '11.49042139', '8.561695581', '5.753995246', '4.046531112', '4.441022767', '7.473833196', '19.68088202', '11.40191563', '8.539957569', '10.45491398', '12.44943809', '22.72657144', '8.412798914', '3.299788321', '2.60126778', '13.41506318', '11.99137687', '6.608127202', '3.732756464', '10.34080263', '15.42412562', '10.05551595', '4.694121705', '5.621777424', '20.32544635', '8.219092814', '12.26876796', '23.20058915', '1.129506145', '11.71012113', '3.506646328', '9.863296707', '7.088767753']

Min: 1.129506145

Max: 9.863296707

Median: 1

Standard Deviation: 1
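Note that the Max printed above is smaller than several values in the list. That is consistent with the values being stored as strings (they come straight from `line.split('\t')`), in which case `min` and `max` compare them character by character. A small sketch using four values taken from the list above:

```python
values = ['1.129506145', '9.863296707', '23.20058915', '10.00108978']

# Compared as strings, '9...' sorts after '2...' and '1...'.
string_max = max(values)

# Converted to float first, the comparison is numeric.
numbers = [float(v) for v in values]
float_max = max(numbers)

print(string_max)
print(float_max)
```

Converting with `float()` before appending to the list would probably also be needed for `statistics.median` and the `stdev` helper to work on these values.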

if mapper_key == '(0-10]':
    memory_list.append(memory)
    reducer_key = mapper_key
    #median = statistics.median(memory_list)
    #stdev = stdev(memory_list)
    print('%s\t%s\n\nMin: %s\n\nMax: %s\n\nMedian: %s\n\nStandard Deviation: %s'
        % (reducer_key, memory_list, min(memory_list), max(memory_list),
        1, 1))

But the second elif branch combines the values from the first if branch with its own values. I have tried placing the print statement at different indentation levels.
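The reducer above appends to a single `memory_list` in every branch, which would explain why the values of different keys end up mixed together. One possible way around this, shown here only as a sketch, is to keep one list per key in a dictionary instead of one branch per key. The names `group_by_key` and `summarize` and the sample lines are made up for illustration. `statistics.pstdev` is used because the `stdev` helper above divides by `len(l)`, which is the population form.

```python
import statistics
from collections import defaultdict


def group_by_key(lines):
    """Collect the values for each mapper key from 'key<TAB>value' lines."""
    groups = defaultdict(list)
    for line in lines:
        try:
            key, value = line.strip().split('\t')
            # Convert to float so min/max/median compare numbers, not text.
            groups[key].append(float(value))
        except ValueError:
            # Skip blank or malformed lines.
            continue
    return groups


def summarize(values):
    """Return (min, max, median, population standard deviation)."""
    return (min(values), max(values),
            statistics.median(values), statistics.pstdev(values))


# Made-up sample lines in the mapper's output format.
sample_lines = ['(0-10]\t1.5', '(0-10]\t9.5', '(11-20]\t4.0', '(11-20]\t6.0']

for key, values in sorted(group_by_key(sample_lines).items()):
    low, high, median, stdev = summarize(values)
    print('%s\t%s\n\nMin: %s\n\nMax: %s\n\nMedian: %s\n\nStandard Deviation: %s\n'
          % (key, values, low, high, median, stdev))
```

In an actual streaming reducer, `sys.stdin` would be passed to `group_by_key` in place of `sample_lines`. This version holds every value in memory; if the input arrives sorted by key, `itertools.groupby` could be used to process one key at a time instead.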

Source: https://stackoverflow.com/questions/65187820/python-mapreduce-count-frequency-of-numbers-in-a-file
