Hadoop DistributedCache is deprecated - what is the preferred API?

情深已故 2020-11-28 04:14

My map tasks need some configuration data, which I would like to distribute via the Distributed Cache.

The Hadoop MapReduce Tutorial shows the usage of the DistributedCache class, which is now marked deprecated. What is the preferred API for distributing files to map tasks instead?
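
Roughly, the old tutorial-style usage looks like this (a from-memory sketch of the deprecated org.apache.hadoop.filecache.DistributedCache API, not the question's original snippet):

    // Deprecated API: register the file in the driver ...
    JobConf conf = new JobConf();
    DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), conf);

    // ... and, on the task side, resolve the local copies from the configuration
    Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);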

6 Answers
  • 2020-11-28 04:55

    I had the same problem, and not only is DistributedCache deprecated, but getLocalCacheFiles and "new Job" are too. So what worked for me is the following:

    Driver:

    Configuration conf = getConf();
    Job job = Job.getInstance(conf);
    ...
    job.addCacheFile(new Path(filename).toUri());
    

    In Mapper/Reducer setup:

    @Override
    protected void setup(Context context) throws IOException, InterruptedException
    {
        super.setup(context);
    
        URI[] files = context.getCacheFiles(); // may return null if no cache files were added
    
        Path file1path = new Path(files[0]);
        ...
    }
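
    The URI returned by getCacheFiles() still points at the original (typically HDFS) location. One way to actually read it, as a sketch that is not part of the original answer and assumes the job's default FileSystem can open that path:

    FileSystem fs = FileSystem.get(context.getConfiguration());
    try (BufferedReader reader = new BufferedReader(
            new InputStreamReader(fs.open(file1path)))) {
        String line;
        while ((line = reader.readLine()) != null) {
            // use each line of the cached configuration file here
        }
    }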
    
  • 2020-11-28 05:05

    The APIs for the Distributed Cache can be found in the Job class itself; check the documentation here: http://hadoop.apache.org/docs/stable2/api/org/apache/hadoop/mapreduce/Job.html. The code should be something like:

    Job job = new Job();
    ...
    job.addCacheFile(new Path(filename).toUri());
    

    In your mapper code:

    Path[] localPaths = context.getLocalCacheFiles();
    ...
    
  • 2020-11-28 05:15

    None of the solutions mentioned worked for me completely, possibly because the Hadoop API keeps changing; I am using Hadoop 2.6.4. Essentially, DistributedCache is deprecated, so I didn't want to use it. Some of the posts suggest using addCacheFile(), but its usage has changed a bit. Here is how it worked for me:

    job.addCacheFile(new URI("hdfs://X.X.X.X:9000/EnglishStop.txt#EnglishStop.txt"));
    

    Here X.X.X.X can be the master node's IP address or localhost. EnglishStop.txt is stored in HDFS at the root (/) directory:

    hadoop fs -ls /
    

    The output is

    -rw-r--r--   3 centos supergroup       1833 2016-03-12 20:24 /EnglishStop.txt
    drwxr-xr-x   - centos supergroup          0 2016-03-12 19:46 /test
    

    Funny but convenient: the #EnglishStop.txt fragment means we can now access the file simply as "EnglishStop.txt" in the mapper. Here is the code for that:

    @Override
    public void setup(Context context) throws IOException, InterruptedException
    {
        // Thanks to the #EnglishStop.txt fragment, the cached file is available
        // under that name in the task's working directory
        File stopwordFile = new File("EnglishStop.txt");
        FileInputStream fis = new FileInputStream(stopwordFile);
        BufferedReader reader = new BufferedReader(new InputStreamReader(fis));

        String stopWord;
        while ((stopWord = reader.readLine()) != null) {
            // stopWord is one stop word read from the cached file
        }
        reader.close();
    }
    

    This just worked for me; you can read lines from a file stored in HDFS this way.
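
    In practice you would usually keep the words rather than just loop over them, for example by collecting them into a field on the mapper. A short sketch (the stopWords set is an illustration, not part of the original answer):

    private final Set<String> stopWords = new HashSet<>();

    @Override
    public void setup(Context context) throws IOException, InterruptedException
    {
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream("EnglishStop.txt")))) {
            String stopWord;
            while ((stopWord = reader.readLine()) != null) {
                stopWords.add(stopWord.trim()); // keep each stop word for use in map()
            }
        }
    }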

  • 2020-11-28 05:16

    I did not use job.addCacheFile(). Instead, I used the -files option, e.g. "-files /path/to/myfile.txt#myfile", as before. Then in the mapper or reducer code I use the method below:

    /**
     * Looks up a distributed cache file by the symlink name given after the # sign.
     * This method can be used with local execution or HDFS execution.
     * 
     * @param context the job context
     * @param symLink the symlink name of the cached file
     * @param throwExceptionIfNotFound whether to throw if the file is not in the cache
     * @return the local File, or null if not found and throwExceptionIfNotFound is false
     * @throws IOException
     */
    public static File findDistributedFileBySymlink(JobContext context, String symLink, boolean throwExceptionIfNotFound) throws IOException
    {
        URI[] uris = context.getCacheFiles();
        if(uris==null||uris.length==0)
        {
            if(throwExceptionIfNotFound)
                throw new RuntimeException("Unable to find file with symlink '"+symLink+"' in distributed cache");
            return null;
        }
        URI symlinkUri = null;
        for(URI uri: uris)
        {
            if(symLink.equals(uri.getFragment()))
            {
                symlinkUri = uri;
                break;
            }
        }   
        if(symlinkUri==null)
        {
            if(throwExceptionIfNotFound)
                throw new RuntimeException("Unable to find file with symlink '"+symLink+"' in distributed cache");
            return null;
        }
        //if we run this locally the file system URI scheme will be "file" otherwise it should be a symlink
        return "file".equalsIgnoreCase(FileSystem.get(context.getConfiguration()).getScheme())?(new File(symlinkUri.getPath())):new File(symLink);
    
    }
    

    Then in mapper/reducer:

    @Override
    protected void setup(Context context) throws IOException, InterruptedException
    {
        super.setup(context);
    
        File file = HadoopUtils.findDistributedFileBySymlink(context,"myfile",true);
        ... do work ...
    }
    

    Note that if I used "-files /path/to/myfile.txt" directly then I would need to use "myfile.txt" to access the file, since that is the default symlink name.
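
    Also note that the -files option is parsed by GenericOptionsParser, so it only takes effect when the driver is run through ToolRunner. A minimal sketch, where the class name MyDriver is just an example:

    public class MyDriver extends Configured implements Tool {

        @Override
        public int run(String[] args) throws Exception {
            Job job = Job.getInstance(getConf(), "MyJob");
            // ... set mapper/reducer classes, input/output paths ...
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            // ToolRunner's GenericOptionsParser strips -files from args and
            // registers the listed files (and their #symlinks) in the cache
            System.exit(ToolRunner.run(new Configuration(), new MyDriver(), args));
        }
    }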

  • 2020-11-28 05:18

    To expand on @jtravaglini's answer, the preferred way of using DistributedCache for YARN/MapReduce 2 is as follows:

    In your driver, use Job.addCacheFile():

    public int run(String[] args) throws Exception {
        Configuration conf = getConf();
    
        Job job = Job.getInstance(conf, "MyJob");
    
        job.setMapperClass(MyMapper.class);
    
        // ...
    
        // Mind the # sign after the absolute file location.
        // You will be using the name after the # sign as your
        // file name in your Mapper/Reducer
        job.addCacheFile(new URI("/user/yourname/cache/some_file.json#some"));
        job.addCacheFile(new URI("/user/yourname/cache/other_file.json#other"));
    
        return job.waitForCompletion(true) ? 0 : 1;
    }
    

    And in your Mapper/Reducer, override the setup(Context context) method:

    @Override
    protected void setup(
            Mapper<LongWritable, Text, Text, Text>.Context context)
            throws IOException, InterruptedException {
        if (context.getCacheFiles() != null
                && context.getCacheFiles().length > 0) {
    
            File some_file = new File("./some");
            File other_file = new File("./other");
    
            // Do things to these two files, like read them
            // or parse as JSON or whatever.
        }
        super.setup(context);
    }
    
  • 2020-11-28 05:19

    The new DistributedCache API for YARN/MR2 is found in the org.apache.hadoop.mapreduce.Job class.

       Job.addCacheFile()
    

    Unfortunately, there aren't yet many comprehensive tutorial-style examples of this.

    http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/Job.html#addCacheFile%28java.net.URI%29
