Have any of you tried Hadoop? Can it be used without the distributed filesystem that goes with it, in a shared-nothing architecture? Would that make sense?
I'm also inte
If you're just getting your feet wet, start out by downloading CDH4 and running it. You can easily install it into a local virtual machine and run it in "pseudo-distributed mode", which closely mimics how it would run on a real cluster.
The best way to wrap your head around Hadoop is to download it and start exploring the included examples. Use a Linux box/VM and your setup will be much easier than on Mac or Windows. Once you feel comfortable with the samples and concepts, start to see how your problem space might map onto the framework.
A couple of resources you might find useful for more info on Hadoop:
Hadoop Summit Videos and Presentations
Hadoop: The Definitive Guide: Rough Cuts Version - This is one of the few (only?) books available on Hadoop at this point. I'd say it's worth the price of the electronic download option even now (the book is ~40% complete).
Yes, you can use Hadoop on a local filesystem by using file URIs instead of hdfs URIs in various places. I think a lot of the examples that come with Hadoop do this.
This is probably fine if you just want to learn how Hadoop works and the basic map-reduce paradigm, but you will need multiple machines and a distributed filesystem to get the real benefits of the scalability inherent in the architecture.
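To make the basic map-reduce paradigm concrete, here is a minimal single-process sketch in plain Python. It does not use Hadoop at all; the function names (map_phase, shuffle, reduce_phase, word_count) are made up for illustration and only mirror the three stages a real job goes through.

```python
from collections import defaultdict


def map_phase(line):
    """Mapper: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all values by key, as the framework does between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reducer: sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in a single process."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    sample = ["the quick brown fox", "the lazy dog", "The fox"]
    print(word_count(sample))
```

On a real cluster the map calls run on many machines at once and the shuffle moves data over the network, which is where the distributed filesystem starts to matter.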
As Joe said, you can indeed use Hadoop without HDFS. However, throughput depends on the cluster's ability to do computation near where the data is stored. Using HDFS has two main benefits, IMHO: 1) computation is spread more evenly across the cluster (reducing the amount of inter-node communication), and 2) the cluster as a whole is more resistant to failure due to data unavailability.
If your data is already partitioned or trivially partitionable, you may want to look into supplying your own partitioning function for your map-reduce task.
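As a rough illustration of what a partitioning function does, here is a sketch in plain Python. In Hadoop itself this logic would live in a Java Partitioner subclass; the code below only shows the idea. The key format "<region>:<id>" and both function names are assumptions invented for this example.

```python
import zlib


def hash_partition(key, num_reducers):
    """Default-style partitioner: a stable hash of the key, modulo the reducer count."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def region_partition(key, num_reducers):
    """Hypothetical custom partitioner for keys shaped like "<region>:<id>".

    Only the region prefix is hashed, so every record for one region
    is routed to the same reducer.
    """
    region = key.split(":", 1)[0]
    return hash_partition(region, num_reducers)


if __name__ == "__main__":
    keys = ["eu:1001", "eu:2002", "us:1001", "asia:7"]
    for key in keys:
        print(key, "-> reducer", region_partition(key, 4))
```

If the data is already split by region, a function like this keeps each existing partition together on one reducer instead of scattering it by full-key hash.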
Great theoretical answers above.
To switch Hadoop to the local file system, set the following property in the "core-site.xml" configuration file. For Hadoop versions 2.x.x:
<property>
<name>fs.defaultFS</name>
<value>file:///</value>
</property>
For Hadoop versions 1.x.x:
<property>
<name>fs.default.name</name>
<value>file:///</value>
</property>
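Note that in the actual file the property sits inside a top-level configuration element. As a small sketch, the whole file body can be generated with Python's standard library; the helper name build_core_site is made up for this example, and the default key assumes a 2.x.x install.

```python
import xml.etree.ElementTree as ET


def build_core_site(fs_key="fs.defaultFS", fs_value="file:///"):
    """Return core-site.xml content with a single file-system property.

    Pass fs_key="fs.default.name" for Hadoop 1.x.x.
    """
    configuration = ET.Element("configuration")
    prop = ET.SubElement(configuration, "property")
    ET.SubElement(prop, "name").text = fs_key
    ET.SubElement(prop, "value").text = fs_value
    return ET.tostring(configuration, encoding="unicode")


if __name__ == "__main__":
    print(build_core_site())
```

The output is a single unindented line of XML; write it to core-site.xml in your Hadoop configuration directory (the exact location depends on your install).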
Yes, you can use the local file system by specifying file:// for the input file and so on, and this works fine with small data sets. But the real power of Hadoop comes from its distributed, shared storage. Hadoop is meant for processing huge amounts of data, more than a single local machine can handle, or at least more than it can handle in a reasonable time. Because the input file is in a shared location (HDFS), multiple mappers can read it simultaneously, which reduces the time needed to finish the job. In a nutshell: you can use Hadoop with the local file system, but to meet business requirements you should use it with a shared file system.