Question
Hope you can help me. I've got a head-scratching problem with Hadoop MapReduce. I've been using the "-files" option successfully on a MapReduce job with Hadoop version 1.0.3. However, when I use the "-archives" option, it copies the files but does not uncompress them. What am I missing? The documentation says "Archives (zip, tar and tgz/tar.gz files) are un-archived at the slave nodes", but that's not what I'm seeing.
I have created three files: a text file "alice.txt", a zip file "bob.zip" (containing b1.txt and bdir/b2.txt), and a tar file "claire.tar" (containing c1.txt and cdir/c2.txt). I then invoke the Hadoop job via
hadoop jar myJar myClass -files ./etc/alice.txt -archives ./etc/bob.zip,./etc/claire.tar <input_path> <output_path>
The files are indeed there and well-formed:
% ls -l etc/alice.txt etc/bob.zip etc/claire.tar
-rw-rw-r-- 1 hadoop hadoop 6 Aug 20 18:44 etc/alice.txt
-rw-rw-r-- 1 hadoop hadoop 282 Aug 20 18:44 etc/bob.zip
-rw-rw-r-- 1 hadoop hadoop 10240 Aug 20 18:44 etc/claire.tar
% tar tf etc/claire.tar
c1.txt
cdir/c2.txt
I then have my mapper test for the existence of the files in question, like so, where 'lineNumber' is the key passed into the mapper:
String key = Long.toString(lineNumber.get());
String[] files = {
    "alice.txt",
    "bob.zip",
    "claire.tar",
    "bdir",
    "cdir",
    "b1.txt",
    "b2.txt",
    "bdir/b2.txt",
    "c1.txt",
    "c2.txt",
    "cdir/c2.txt"
};
String fName = files[(int) (lineNumber.get() % files.length)];
String val = codeFile(fName);
output.collect(new Text(key), new Text(val));
The support routine 'codeFile' is:
private String codeFile(String fName) {
    Vector<String> clauses = new Vector<String>();
    clauses.add(fName);
    File f = new File(fName);
    if (!f.exists()) {
        clauses.add("nonexistent");
    } else {
        if (f.canRead()) clauses.add("readable");
        if (f.canWrite()) clauses.add("writable");
        if (f.canExecute()) clauses.add("executable");
        if (f.isDirectory()) clauses.add("dir");
        if (f.isFile()) clauses.add("file");
    }
    return Joiner.on(',').join(clauses);
}
This uses the Guava 'Joiner' class. The output values from the mapper look like this:
alice.txt,readable,writable,executable,file
bob.zip,readable,writable,executable,dir
claire.tar,readable,writable,executable,dir
bdir,nonexistent
b1.txt,nonexistent
b2.txt,nonexistent
bdir/b2.txt,nonexistent
cdir,nonexistent
c1.txt,nonexistent
c2.txt,nonexistent
cdir/c2.txt,nonexistent
So you see the problem - the archive files are there, but they are not unpacked. What am I missing? I have also tried using DistributedCache.addCacheArchive() instead of using -archives, but the problem is still there.
Answer 1:
The distributed cache doesn't unpack the archive files into the local working directory of your task. Instead, there is a location on each task tracker for the job as a whole, and the archives are unpacked there.
You'll need to check the DistributedCache to find this location and look for the files there. The Javadocs for DistributedCache show an example mapper pulling this information.
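For example, with the old org.apache.hadoop.mapred API used in the question, a mapper's configure() method can ask the DistributedCache where each archive was localized. This is only a minimal sketch; the class name and the output it emits are illustrative, not from the original post:

import java.io.IOException;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class ArchiveAwareMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

    private Path[] localArchives;

    @Override
    public void configure(JobConf job) {
        try {
            // Local filesystem paths where the task tracker unpacked each
            // archive passed via -archives or DistributedCache.addCacheArchive().
            localArchives = DistributedCache.getLocalCacheArchives(job);
        } catch (IOException e) {
            throw new RuntimeException("cannot read distributed cache info", e);
        }
    }

    public void map(LongWritable key, Text value,
                    OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Emit the localized archive directories so you can see where they live.
        if (localArchives != null) {
            for (Path p : localArchives) {
                output.collect(new Text("archive unpacked at"), new Text(p.toString()));
            }
        }
    }
}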
You can also use symbolic linking when defining the -files and -archives generic options; a symlink will then be created in the local working directory of the map/reduce tasks, which makes this easier:
hadoop jar myJar myClass -files ./etc/alice.txt#file1.txt \
-archives ./etc/bob.zip#bob,./etc/claire.tar#claire
And then you can use the fragment names in your mapper when trying to open files in the archive:
new File("bob").isDirectory() == true
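For completeness, here is how the mapper could then reach entries inside the unpacked archives through those symlinks. The paths assume the #bob and #claire fragments above and the archive layout described in the question:

// Paths are relative to the task's working directory; "bob" and "claire"
// are the symlinks created from the #fragment names.
File b2 = new File("bob/bdir/b2.txt");     // entry from the zip
File c2 = new File("claire/cdir/c2.txt");  // entry from the tar
BufferedReader reader = new BufferedReader(new FileReader(c2));
String firstLine = reader.readLine();
reader.close();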
Source: https://stackoverflow.com/questions/18343371/hadoop-map-reduce-archives-not-unpacking-archives