I need to create a Hadoop job jar file that uses Mahout and a bunch of other libraries. I need to be able to run the job without needing additional jar files, such that all required libraries are bundled in the job jar.
Hadoop has the ability to read jars-in-jar: amend your Ant script to include all the dependency jars in a folder called lib, and add this lib folder to your output jar. This is sometimes a better choice if you have a number of larger jars, as it decreases your jar build time.
See this article for a number of options you have when using 3rd-party libs with Hadoop.
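For the Ant route, a minimal sketch of a jar target that produces such a jars-in-jar layout (the property names build.dir, lib.dir and dist.dir, and the jar name myjob.jar, are illustrative assumptions, not from the question):
<!-- Sketch only: property names and the output jar name are assumptions. -->
<target name="jobjar" depends="compile">
    <jar destfile="${dist.dir}/myjob.jar">
        <!-- your own compiled classes go at the root of the jar -->
        <fileset dir="${build.dir}"/>
        <!-- dependency jars go under lib/, where Hadoop's job runner looks for them -->
        <zipfileset dir="${lib.dir}" includes="*.jar" prefix="lib"/>
    </jar>
</target>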
Note that the additional jars have to be put under a lib/ subdirectory (yes, jars within a jar). I use the following Maven assembly descriptor, which I found somewhere else.
<assembly xmlns="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://maven.apache.org/plugins/maven-assembly-plugin/assembly/1.1.0 http://maven.apache.org/xsd/assembly-1.1.0.xsd">
  <id>job</id>
  <formats>
    <format>jar</format>
  </formats>
  <includeBaseDirectory>false</includeBaseDirectory>
  <dependencySets>
    <dependencySet>
      <unpack>false</unpack>
      <scope>runtime</scope>
      <outputDirectory>lib</outputDirectory>
      <excludes>
        <exclude>org.apache.hadoop:hadoop-core</exclude>
        <exclude>${artifact.groupId}:${artifact.artifactId}</exclude>
      </excludes>
    </dependencySet>
    <dependencySet>
      <unpack>false</unpack>
      <scope>system</scope>
      <outputDirectory>lib</outputDirectory>
      <excludes>
        <exclude>${artifact.groupId}:${artifact.artifactId}</exclude>
      </excludes>
    </dependencySet>
  </dependencySets>
  <fileSets>
    <fileSet>
      <directory>${basedir}/target/classes</directory>
      <outputDirectory>/</outputDirectory>
      <excludes>
        <exclude>*.jar</exclude>
      </excludes>
    </fileSet>
  </fileSets>
</assembly>
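To actually use a descriptor like this, it gets wired into the maven-assembly-plugin in the pom. A minimal sketch, assuming the descriptor is saved as src/main/assembly/job.xml (that path is an assumption; adjust it to wherever you keep the file):
<!-- Sketch only: the descriptor path is an assumption. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <descriptors>
      <descriptor>src/main/assembly/job.xml</descriptor>
    </descriptors>
  </configuration>
  <executions>
    <execution>
      <id>make-job-jar</id>
      <!-- bind to package so "mvn package" also builds the *-job.jar -->
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
    </execution>
  </executions>
</plugin>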
In the generic sense it is sometimes impossible, as JAR files can contain resources that must be in particular locations, and two conflicting but necessary resources might prevent the combination (think META-INF/MANIFEST.MF).
However, in many cases it is very easy. Basically, you unzip the JAR file to be added (it is just the ZIP file format) and add the classes and whatnot to the existing JAR file.
A better choice, if you are making an executable JAR file, is to add a Class-Path entry to the MANIFEST.MF of the launching JAR and ship both JAR files in a directory structure compatible with that Class-Path entry.
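As a sketch of that manifest route in Ant (jar names, the main class and the lib/ layout are illustrative assumptions; Class-Path entries are resolved relative to the directory the launching jar sits in):
<!-- Sketch only: jar names, main class and lib/ layout are assumptions. -->
<jar destfile="${dist.dir}/app.jar" basedir="${build.dir}">
  <manifest>
    <attribute name="Main-Class" value="com.example.MyDriver"/>
    <!-- space-separated paths, relative to app.jar's directory at runtime -->
    <attribute name="Class-Path" value="lib/mahout-core.jar lib/mahout-math.jar"/>
  </manifest>
</jar>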
A JAR is just a ZIP container. You can manually unzip and modify your JAR file with the classes needed, or you can make use of, e.g., the jar-with-dependencies descriptor of the Maven build system.
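That descriptor is built in, so instead of pointing the assembly plugin at a custom descriptor file (as in the earlier sketch) you reference it by name. A minimal sketch of the plugin configuration:
<!-- Sketch only: uses the built-in jar-with-dependencies descriptor reference. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-assembly-plugin</artifactId>
  <configuration>
    <descriptorRefs>
      <descriptorRef>jar-with-dependencies</descriptorRef>
    </descriptorRefs>
  </configuration>
  <executions>
    <execution>
      <phase>package</phase>
      <goals>
        <goal>single</goal>
      </goals>
    </execution>
  </executions>
</plugin>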
Configure your build file to copy all the referenced classes to the build directory. For example, in Ant:
<!-- every dependency jar except the one we don't want unpacked -->
<path id="classpathunjar">
  <fileset dir="${lib.dir}" includes="*.jar" excludes="sqljdbc4.jar"/>
</path>
<target name="compile" depends="clean">
  ...
  <!-- unpack the dependency classes into the build dir so they end up in the final jar -->
  <unjar dest="${build.dir}">
    <path refid="classpathunjar"/>
  </unjar>
  ...
</target>
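After the unjar step, the build directory holds both your own classes and the unpacked dependency classes, so a plain jar task over it yields a single self-contained jar. A minimal sketch (target and property names are assumptions):
<!-- Sketch only: packages everything unpacked into ${build.dir} as one flat jar. -->
<target name="fatjar" depends="compile">
  <jar destfile="${dist.dir}/myjob-all.jar" basedir="${build.dir}"/>
</target>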
But it is better if you can manage without doing this. If you are doing this to run MapReduce jobs on a Hadoop cluster, use the -libjars feature to distribute the jars to all nodes instead.
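A sketch of what that looks like on the command line (the jar, class and path names are assumptions; note that -libjars is only honored when the driver parses its arguments through ToolRunner/GenericOptionsParser):
# Sketch only: jar, driver class and HDFS paths are assumptions.
hadoop jar myjob.jar com.example.MyDriver \
    -libjars mahout-core.jar,mahout-math.jar \
    /input /output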