Question
I posted a similar question about an hour ago, but have since deleted it after realising I was asking the wrong question. I have the following pickled defaultdict:
ccollections
defaultdict
p0
(c__builtin__
list
p1
tp2
Rp3
V"I love that"
p4
(lp5
S'05-Aug-13 10:17'
p6
aS'05-Aug-13 10:17'
When using Hadoop, the input is always read in using:
for line in sys.stdin:
I tried reading the pickled defaultdict using this:
myDict = pickle.load(sys.stdin)
for text, date in myDict.iteritems():
But to no avail. The rest of the code works, as I tested it locally using pickle.load() on a file. Am I doing this wrong? How can I load the information?
Update:
After following an online tutorial, I amended my code to this:
def read_input(file):
    for line in file:
        print line

def main(separator='\t'):
    myDict = read_input(sys.stdin)
This prints out each line, showing it is successfully reading the file; however, no semblance of the defaultdict structure is kept, with this output:
p769
aS'05-Aug-13 10:19'
p770
aS'05-Aug-13 15:19'
p771
as"I love that"
Obviously this is no good. Does anybody have any suggestions?
Answer 1:
Why is your input data in the pickle format? Where does your input data come from? One of the goals of Hadoop/MapReduce is to process data that's too large to fit into the memory of a single machine. Thus, reading the whole input data and then trying to deserialize it runs contrary to the MR design paradigm and most likely won't even work with production-scale data sets.
The solution is to format your input data as, for example, a TSV text file, with exactly one tuple of your dictionary per row. You can then process each tuple on its own, e.g.:
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")  # avoids shadowing the built-in 'tuple'
    key, value = process(fields)            # process() and emit() are placeholders for your own logic
    emit(key, value)
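As a concrete illustration of that preprocessing step, here is a minimal sketch that flattens the pickled defaultdict(list) into such a TSV file before the Hadoop job runs. The filenames myDict.pkl and input.tsv are placeholders, and it assumes the dict fits in memory on the machine doing the one-off conversion:

import pickle

# One-off local conversion: write one "text<TAB>date" row per (key, date) pair.
with open('myDict.pkl', 'rb') as f:        # placeholder input filename
    myDict = pickle.load(f)

with open('input.tsv', 'w') as out:        # placeholder output filename
    for text, dates in myDict.iteritems():
        for date in dates:                 # each key maps to a list of date strings
            # encode assumes the keys are unicode, as the 'V' opcode in the dump suggests
            out.write("%s\t%s\n" % (text.encode('utf-8'), date))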
Answer 2:
If you read in the data completely, I believe you can use pickle.loads():
myDict = pickle.loads(sys.stdin.read())
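For completeness, a minimal mapper sketch along those lines (again assuming the whole pickle fits in memory; the tab-separated print format is illustrative, not part of the original answer):

import sys
import pickle

# Read all of stdin first, then unpickle the whole string at once.
myDict = pickle.loads(sys.stdin.read())

for text, dates in myDict.iteritems():
    # emit one tab-separated record per key; format is an example only
    print "%s\t%s" % (text, dates)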
Source: https://stackoverflow.com/questions/18580321/loading-a-defaultdict-in-hadoop-using-pickle-and-sys-stdin