If I have a constructor that requires a path to a file, how can I “fake” that if it is packaged into a jar?

半阙折子戏 2020-12-17 05:09

The context of this question is that I am trying to use the MaxMind Java API in a Pig script that I have written... I do not think that knowing about either is necessary to answer the question, though.

6 Answers
  • 2020-12-17 05:31

    Dump your data to a temp file, and feed the temp file to it:

    File tmpFile = File.createTempFile("XX", ".dat");
    tmpFile.deleteOnExit();
    
    try (InputStream is = MyClass.class.getResourceAsStream("/path/in/jar/XX.dat");
         OutputStream os = new FileOutputStream(tmpFile)) {
        // read from is, write to os; the streams close automatically
        byte[] buffer = new byte[8192];
        int bytesRead;
        while ((bytesRead = is.read(buffer)) != -1) {
            os.write(buffer, 0, bytesRead);
        }
    }
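
    Then hand the temp file to the constructor that needs a real path. A minimal sketch, assuming the maxmind LookupService from the question:

    LookupService lookup = new LookupService(tmpFile);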
    
  • 2020-12-17 05:35

    One recommended way is to use the Distributed Cache rather than trying to bundle the file into a jar.

    Zip GeoIP.dat and copy it to hdfs://host:port/path/GeoIP.dat.zip, then add these options to the Pig command:

    pig ...
      -Dmapred.cache.archives=hdfs://host:port/path/GeoIP.dat.zip#GeoIP.dat 
      -Dmapred.create.symlink=yes
    ...
    

    Then LookupService lookupService = new LookupService("./GeoIP.dat"); should work in your UDF, as the file will be present locally for the tasks on each node.
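
    A minimal sketch of such a UDF, assuming the legacy com.maxmind.geoip API; the class name CountryOfIp is hypothetical:

    import java.io.IOException;
    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;
    import com.maxmind.geoip.LookupService;

    public class CountryOfIp extends EvalFunc<String> {
        private LookupService lookupService;

        @Override
        public String exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0) {
                return null;
            }
            if (lookupService == null) {
                // "./GeoIP.dat" is the symlink the Distributed Cache
                // creates in the task's working directory
                lookupService = new LookupService("./GeoIP.dat");
            }
            return lookupService.getCountry((String) input.get(0)).getName();
        }
    }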

  • 2020-12-17 05:38

    Use the classloader.getResource(...) method to look the file up on the classpath; this will pull it out of the JAR file.

    This means you will have to alter the existing code to override the loading. The details on how to do that depend heavily on your existing code and environment. In some cases subclassing and registering the subclass with the framework might work. In other cases, you might have to determine the ordering of class loading along the classpath and place an identically signed class "earlier" in the classpath.
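
    For example, a minimal sketch of the classpath lookup itself; the resource path and MyClass are assumptions:

    // both return null if the resource is not on the classpath;
    // note that ClassLoader paths do NOT start with a leading slash
    URL url = MyClass.class.getClassLoader().getResource("path/in/jar/GeoIP.dat");
    InputStream in = MyClass.class.getClassLoader().getResourceAsStream("path/in/jar/GeoIP.dat");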

  • 2020-12-17 05:45

    Try:

    new File(MyWrappingClass.class.getResource(<resource>).toURI())
    
  • 2020-12-17 05:45

    This works for me.

    Assuming you have a package org.foo.bar.util that contains GeoLiteCity.dat:

    // the leading slash makes the path absolute within the classpath
    URL fileURL = this.getClass().getResource("/org/foo/bar/util/GeoLiteCity.dat");
    File geoIPData = new File(fileURL.toURI());
    LookupService cl = new LookupService(geoIPData, LookupService.GEOIP_MEMORY_CACHE);
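
    A quick usage sketch; the IP address is just an example, and the legacy API exposes Location's fields publicly:

    Location loc = cl.getLocation("64.17.254.216");
    System.out.println(loc.city + ", " + loc.countryName);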
    
  • 2020-12-17 05:57

    Here's how we use the maxmind GeoIP:

    We put the GeoIPCity.dat file into the cloud and use the cloud location as an argument when we launch the process. The code where we get the GeoIPCity.dat file and create a new LookupService is:

    // the -files generic option lands GeoIPCity.dat in the task's local cache
    if (DistributedCache.getLocalCacheFiles(context.getConfiguration()) != null) {
        List<Path> localFiles = Utility.arrayToList(DistributedCache.getLocalCacheFiles(context.getConfiguration()));
        for (Path localFile : localFiles) {
            if ((localFile.getName() != null) && (localFile.getName().equalsIgnoreCase("GeoIPCity.dat"))) {
                m_geoipLookupService = new LookupService(new File(localFile.toUri().getPath()));
            }
        }
    }
    

    Here is an abbreviated version of the command we use to run our process:

    $HADOOP_HOME/bin/hadoop jar /usr/lib/COMPANY/analytics/libjars/MyJar.jar -files hdfs://PDHadoop1.corp.COMPANY.com:54310/data/geoip/GeoIPCity.dat -libjars /usr/lib/COMPANY/analytics/libjars/geoiplookup.jar

    The critical pieces for running the MaxMind component are -files and -libjars. These are generic options in the GenericOptionsParser.

    -files <comma separated list of files> specify comma separated files to be copied to the map reduce cluster
    -libjars <comma separated list of jars> specify comma separated jar files to include in the classpath.

    I'm assuming that Hadoop uses the GenericOptionsParser because I can find no reference to it anywhere in my project. :)
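
    For completeness, a minimal sketch of the driver shape that makes those generic options work; the class and job names here are hypothetical:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class GeoIpJob extends Configured implements Tool {
        @Override
        public int run(String[] args) throws Exception {
            // by the time run() is called, -files and -libjars have already
            // been folded into the Configuration returned by getConf()
            Job job = new Job(getConf(), "geoip lookup");
            job.setJarByClass(GeoIpJob.class);
            // ... set mapper, input/output formats and paths here ...
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            // ToolRunner invokes GenericOptionsParser, which strips the
            // generic options before passing the remaining args to run()
            System.exit(ToolRunner.run(new Configuration(), new GeoIpJob(), args));
        }
    }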

    If you put GeoIPCity.dat on the cloud and specify it using the -files argument, it will be put into the local cache, which the mapper can then read in the setup function. The lookup doesn't have to happen in setup, but it only needs to be done once per mapper, so that is an excellent place for it. Then use the -libjars argument to specify the geoiplookup.jar (or whatever you've called yours) and it will be able to use it. We don't put the geoiplookup.jar on the cloud; I'm rolling with the assumption that Hadoop will distribute the jar as it needs to.

    I hope that all makes sense. I am getting fairly familiar with Hadoop/MapReduce, but I didn't write the pieces that use the maxmind GeoIP component in this project, so I've had to do a little digging to understand it well enough to give the explanation here.

    EDIT: Additional description for -files and -libjars:

    -files: The files argument is used to distribute files through the Hadoop Distributed Cache. In the example above, we are distributing the MaxMind geo-IP data file through the Distributed Cache. We need access to that file to map the user's IP address to the appropriate country, region, city, and timezone. The API requires that the data file be present locally, which is not feasible in a distributed processing environment (we are not guaranteed which nodes in the cluster will process the data), so the Distributed Cache delivers it to each processing node. The GenericOptionsParser and the ToolRunner automatically facilitate this via the -files argument. Note that the file we distribute should already be available in the cloud (HDFS).

    -libjars: The -libjars argument is used to distribute any additional dependencies required by the map-reduce jobs. Like the data file, the dependent libraries also need to be copied to the nodes in the cluster where the job will run. The GenericOptionsParser and the ToolRunner automatically facilitate this via the -libjars argument.
