I am trying to run a Hadoop job on Google Compute Engine against our compressed data, which is sitting in Google Cloud Storage. While trying to read the data through SequenceFile
A bdutil deployment will contain Snappy by default.
Your last question is the easiest to answer in the general case, so I'll begin there. The general guidance for shipping dependencies is that applications should use the distributed cache to distribute JARs and libraries to workers (Hadoop 1 or 2). If your code already uses GenericOptionsParser, you can distribute JARs with the -libjars flag. A longer discussion, which also covers fat JARs, can be found on Cloudera's blog: http://blog.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/
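For example, assuming a driver that goes through ToolRunner/GenericOptionsParser (the JAR, main class, and dependency paths below are placeholders, not anything from your job), a submission that ships extra JARs to the workers looks roughly like:
# Hypothetical example; replace the JAR, main class and paths with your own.
bin/hadoop jar myjob.jar com.example.MyJob \
    -libjars /path/to/dependency-1.jar,/path/to/dependency-2.jar \
    gs://<some bucket>/input gs://<some bucket>/output
Note that -libjars must appear before the job's own positional arguments for GenericOptionsParser to pick it up.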
For installing and configuring other system-level components, bdutil supports an extension mechanism. A good example is the Spark extension bundled with bdutil: extensions/spark/spark_env.sh. When running bdutil, extensions are added with the -e flag, e.g., to deploy Spark alongside Hadoop:
./bdutil -e extensions/spark/spark_env.sh deploy
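Extension files are just shell fragments of environment overrides, so the same -e mechanism can carry your own settings. As a minimal sketch (the file name is arbitrary, and I'm assuming any env-style file can be passed with -e, as the bundled extensions suggest), you could keep your customizations out of the stock env files:
# my_overrides_env.sh -- hypothetical override file; any variable from bdutil_env.sh can be set here
HADOOP_TARBALL_URI='gs://<some bucket>/hadoop-1.2.1.tar.gz'
and then deploy with:
./bdutil -e my_overrides_env.sh deploy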
With regard to your first and second questions: there are two obstacles when dealing with Snappy in Hadoop on GCE. The first is that the native support libraries built by Apache and bundled with the Hadoop 2 tarballs are built for i386, while GCE instances are amd64. Hadoop 1 bundles binaries for both platforms, but Snappy is not locatable without either bundling it or modifying the environment. Because of this architecture mismatch, no native compressors (Snappy or otherwise) are usable in Hadoop 2, and Snappy is not easily available in Hadoop 1. The second obstacle is that libsnappy itself is not installed by default.
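You can confirm both problems yourself on a stock Apache Hadoop 2 tarball: file shows the architecture of the bundled native library, and Hadoop 2's checknative command reports which native codecs are actually loadable (paths assume you are in the unpacked Hadoop 2 directory):
# Inspect the bundled native library architecture (expect 32-bit on stock Apache tarballs):
file lib/native/libhadoop.so.1.0.0
# Ask Hadoop which native codecs it can load (hadoop, zlib, snappy, lz4, ...):
bin/hadoop checknative -a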
The easiest way to overcome both of these is to create your own Hadoop tarball containing amd64 native Hadoop libraries as well as libsnappy. The steps below should help you do this and stage the resulting tarball for use by bdutil.
To start, launch a new GCE VM using a Debian Wheezy backports image and grant the VM's service account read/write access to Cloud Storage. We'll use this as our build machine, and we can safely discard it as soon as we're done building and storing the binaries.
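As a sketch (image names change over time and the exact flags depend on your gcloud release, so treat this as illustrative rather than exact), the build VM can be created along these lines:
# Hypothetical invocation; substitute a current Debian Wheezy backports image and your preferred zone.
gcloud compute instances create hadoop-build \
    --zone us-central1-a \
    --image <wheezy-backports-image> \
    --scopes storage-rw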
To build Hadoop 1.2.1 with Snappy, SSH to your new instance and run the following commands, checking for any errors along the way:
sudo apt-get update
sudo apt-get install pkg-config libsnappy-dev libz-dev libssl-dev gcc make cmake automake autoconf libtool g++ openjdk-7-jdk maven ant
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/
wget http://apache.mirrors.lucidnetworks.net/hadoop/common/hadoop-1.2.1/hadoop-1.2.1.tar.gz
tar zxvf hadoop-1.2.1.tar.gz
pushd hadoop-1.2.1/
# Bundle libsnappy so we don't have to apt-get install it on each machine
cp /usr/lib/libsnappy* lib/native/Linux-amd64-64/
# Test to make certain Snappy is being loaded and is working:
bin/hadoop jar ./hadoop-test-1.2.1.jar testsequencefile -seed 0 -count 1000 -compressType RECORD xxx -codec org.apache.hadoop.io.compress.SnappyCodec -check
# Create a new tarball of Hadoop 1.2.1:
popd
rm hadoop-1.2.1.tar.gz
tar zcvf hadoop-1.2.1.tar.gz hadoop-1.2.1/
# Store the tarball on GCS:
gsutil cp hadoop-1.2.1.tar.gz gs://<some bucket>/hadoop-1.2.1.tar.gz
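If you want a quick sanity check that the tarball landed where bdutil will later look for it:
gsutil ls -l gs://<some bucket>/hadoop-1.2.1.tar.gz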
To build Hadoop 2.4.1 with Snappy instead, SSH to the build instance and run the following commands (the first few are the same setup steps as above), checking for any errors along the way:
sudo apt-get update
sudo apt-get install pkg-config libsnappy-dev libz-dev libssl-dev gcc make cmake automake autoconf libtool g++ openjdk-7-jdk maven ant
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64/
# Protobuf 2.5.0 is required and not in Debian-backports
wget http://protobuf.googlecode.com/files/protobuf-2.5.0.tar.gz
tar xvf protobuf-2.5.0.tar.gz
pushd protobuf-2.5.0/ && ./configure && make && sudo make install && popd
sudo ldconfig
wget http://apache.mirrors.lucidnetworks.net/hadoop/common/hadoop-2.4.1/hadoop-2.4.1-src.tar.gz
# Unpack source
tar zxvf hadoop-2.4.1-src.tar.gz
pushd hadoop-2.4.1-src
# Build Hadoop
mvn package -Pdist,native -DskipTests -Dtar
pushd hadoop-dist/target/
pushd hadoop-2.4.1/
# Bundle libsnappy so we don't have to apt-get install it on each machine
cp /usr/lib/libsnappy* lib/native/
# Test that everything is working:
bin/hadoop jar share/hadoop/common/hadoop-common-2.4.1-tests.jar org.apache.hadoop.io.TestSequenceFile -seed 0 -count 1000 -compressType RECORD xxx -codec org.apache.hadoop.io.compress.SnappyCodec -check
popd
# Create a new tarball with libsnappy:
rm hadoop-2.4.1.tar.gz
tar zcf hadoop-2.4.1.tar.gz hadoop-2.4.1/
# Store the new tarball on GCS:
gsutil cp hadoop-2.4.1.tar.gz gs://<some bucket>/hadoop-2.4.1.tar.gz
popd
popd
Once you have a Hadoop tarball with the correct native libraries bundled, point bdutil at it by updating either bdutil_env.sh for Hadoop 1 or hadoop2_env.sh for Hadoop 2. In either case, open the appropriate file and look for a block along the lines of:
# URI of Hadoop tarball to be deployed. Must begin with gs:// or http(s)://
# Use 'gsutil ls gs://hadoop-dist/hadoop-*.tar.gz' to list Google supplied options
HADOOP_TARBALL_URI='gs://hadoop-dist/hadoop-1.2.1-bin.tar.gz'
and change HADOOP_TARBALL_URI to point at the URI where you stored the tarball above, e.g.:
HADOOP_TARBALL_URI='gs://<some bucket>/hadoop-1.2.1.tar.gz'
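With the env file edited, deploy as usual; for Hadoop 2, the env file is passed with -e just like an extension (a sketch, assuming the standard bdutil invocations):
# Hadoop 1 (bdutil_env.sh is read by default):
./bdutil deploy
# Hadoop 2:
./bdutil -e hadoop2_env.sh deploy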