Facing issue in Mapper.py and Reducer.py when running code in Hadoop cluster

流过昼夜 submitted on 2019-12-24 19:33:51

Question


I am running this code on a Hadoop cluster to compute the probability of each value; my data is in a CSV file.

When I run it on the cluster I get the error "java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1". Can anyone fix my code?

#!/usr/bin/env python3
"""mapper.py"""
import sys

# Get input lines from stdin
for line in sys.stdin:
    # Remove spaces from beginning and end of the line
    line = line.strip()

    # Split it into tokens
    #tokens = line.split()

    #Get probability_mass values
    for probability_mass in line:
        print(str(probability_mass)+ '\t1')
#!/usr/bin/env python3
"""reducer.py"""
import sys
from collections import defaultdict


counts = defaultdict(int)

# Get input from stdin
for line in sys.stdin:
    #Remove spaces from beginning and end of the line
    line = line.strip()

    # skip empty lines
    if not line:
        continue  

    # parse the input from mapper.py
    k,v = line.split('\t', 1)
    counts[v] += 1

total = sum(counts.values())
probability_mass = {k:v/total for k,v in counts.items()}
print(probability_mass)
My input file:

marks
10
10
60
10
30
Expected output: the probability of each number

{10: 0.6, 60: 0.2, 30: 0.2}

but the result still shows:

{1:1} {1:1} {1:1} {1:1} {1:1} {1:1}


Answer 1:


The real error should be available in the YARN UI, but putting the probability as the key won't let you sum all the values at once, because they would all end up in different reducers.

If you have no key to group the values by, you can use the following, which funnels all the data to one reducer:

print('%s\t%s' % (None, probability_mass))
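For completeness, a corrected mapper along those lines might look like this (a sketch, not tested on a cluster; the `map_line` helper is just for illustration). The original `for probability_mass in line:` iterates over the *characters* of the line, which is why every mark gets broken into single digits; splitting per line and skipping the CSV header fixes that:

```python
#!/usr/bin/env python3
"""mapper.py (corrected sketch)"""
import sys

def map_line(line):
    """Return the tab-separated record for one input line, or None to skip it."""
    line = line.strip()
    # skip blank lines and the CSV header ("marks")
    if not line or not line[0].isdigit():
        return None
    # a constant key funnels every record to a single reducer
    return '%s\t%s' % (None, line)

if __name__ == '__main__':
    for line in sys.stdin:
        record = map_line(line)
        if record is not None:
            print(record)
```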

Here is a working example for the output that you wanted, which I tested with just an input file, not in Hadoop:

import sys
from collections import defaultdict

counts = defaultdict(int)

# Get input from stdin
for line in sys.stdin:
    #Remove spaces from beginning and end of the line
    line = line.strip()

    # skip empty lines
    if not line:
        continue  

    # parse the input from mapper.py
    k,v = line.split('\t', 1)
    counts[v] += 1

total = float(sum(counts.values()))
probability_mass = {k:v/total for k,v in counts.items()}
print(probability_mass)

Output

{'10': 0.6, '60': 0.2, '30': 0.2}
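The keys come out as strings because everything read from stdin is text. If integer keys matter to you (as in the expected output above), you can convert them in the final dict comprehension, e.g.:

```python
counts = {'10': 3, '60': 1, '30': 1}   # example tallies from the reducer
total = float(sum(counts.values()))
# int(k) turns the text keys back into numbers
probability_mass = {int(k): v / total for k, v in counts.items()}
```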

You can test your code without Hadoop with cat file.txt | python mapper.py | sort | python reducer.py (note: plain sort, not sort -u, since deduplicating the mapper output would throw off the counts).
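The same map-sort-reduce flow can also be simulated in plain Python, which makes it easy to step through in a debugger (a rough sketch; the `mapper`/`reducer` function names here are made up for illustration):

```python
from collections import defaultdict

def mapper(lines):
    """Emit 'None\t<mark>' for every numeric line, mimicking mapper.py."""
    for line in lines:
        line = line.strip()
        if line and line[0].isdigit():
            yield 'None\t%s' % line

def reducer(lines):
    """Tally the values and turn the counts into probabilities."""
    counts = defaultdict(int)
    for line in lines:
        _, v = line.split('\t', 1)
        counts[v] += 1
    total = float(sum(counts.values()))
    return {k: c / total for k, c in counts.items()}

data = ['marks', '10', '10', '60', '10', '30']
# sorted() stands in for Hadoop's shuffle/sort phase
result = reducer(sorted(mapper(data)))
print(result)
```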

Plus, mrjob and PySpark are higher-level frameworks that provide more useful features.



Source: https://stackoverflow.com/questions/59140021/facing-issue-in-mapper-py-and-reducer-py-when-running-code-in-hadoop-cluster
