I'm using hdfs -put to load a large 20GB file into HDFS. Currently the process takes about 4 minutes. I'm trying to improve the write time of loading data into HDFS. I tried utilizing
It depends a lot on the details of your setup. First, know that 20GB in 4 minutes works out to roughly 85MB/s.
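To make the arithmetic explicit, here is a minimal sketch of the throughput calculation. The function name `throughput_mb_per_s` and the `binary` flag are just illustrative choices; the only inputs taken from the question are the 20GB size and the 4 minute duration.

```python
def throughput_mb_per_s(size_gb: float, seconds: float, binary: bool = True) -> float:
    """Average transfer rate in MB/s for size_gb gigabytes moved in `seconds`."""
    # 1 GB is 1024 MB in binary units, 1000 MB in decimal units.
    mb_per_gb = 1024 if binary else 1000
    return size_gb * mb_per_gb / seconds


# 20GB in 4 minutes, as stated in the question.
print(f"{throughput_mb_per_s(20, 4 * 60):.1f} MB/s")                # 85.3 MB/s
print(f"{throughput_mb_per_s(20, 4 * 60, binary=False):.1f} MB/s")  # 83.3 MB/s
```

Either way you count the units, the rate lands in the low-to-mid 80s of MB/s.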
The bottleneck is most likely your local machine's hardware or its ethernet connection. I doubt playing with block size will improve your throughput by much.
If your local machine has a typical 7200rpm hard drive, its disk-to-buffer transfer rate is about 128MB/s, meaning that it could load that 20GB file into memory in about 2:40, assuming you have 20GB of memory to spare. However, you're not just copying it to memory; you're also streaming it from memory into network packets, so it's understandable that these extra steps add overhead.
Also see the Wikipedia entry on wire speed, which puts a Fast Ethernet setup at 100Mbit/s (~12.5MB/s). Note that in this case Fast Ethernet is the name of a particular group of Ethernet standards, not a general description of a quick connection. You are clearly getting a faster rate than this. Wire speed is a good measure, because it accounts for all the factors on your local machine.
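As a rough illustration of the comparison, the sketch below converts nominal link rates from Mbit/s to MB/s and checks them against the rate observed in the question. The link names and the helper `mbit_to_mbyte` are my own; the figures are nominal line rates and ignore protocol overhead, so real throughput on each link will be somewhat lower.

```python
def mbit_to_mbyte(mbit_per_s: float) -> float:
    """Convert a rate in megabits per second to megabytes per second."""
    return mbit_per_s / 8


# Observed rate from the question: 20GB in 4 minutes, in MB/s.
observed = 20 * 1024 / (4 * 60)

# Nominal line rates in Mbit/s (no protocol overhead accounted for).
links = {"Fast Ethernet": 100, "Gigabit Ethernet": 1000}

for name, mbit in links.items():
    cap = mbit_to_mbyte(mbit)
    verdict = "exceeds" if observed > cap else "fits within"
    print(f"{observed:.1f} MB/s {verdict} {name} ({cap:.1f} MB/s)")
```

The observed rate is well above what Fast Ethernet could carry and below the nominal gigabit ceiling of 125MB/s, which is consistent with the upload running over a gigabit-class link.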
So let's break down the different steps in the streaming process on your local machine:
Without knowing more about your local machine, it is hard to specify which of these components is the bottleneck. However, these are the places to start investigating bitrate.
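One place to start is the local disk. The following is a minimal sketch, not a proper benchmark, that times a sequential read of a file and reports MB/s. The function name, the 8MB chunk size, and the example path are assumptions of mine. Note that if the file is already in the operating system's page cache, the number will reflect memory speed rather than disk speed, so test with a file that has not been read recently.

```python
import time


def read_throughput_mb_per_s(path: str, chunk_size: int = 8 * 1024 * 1024) -> float:
    """Read the file at `path` sequentially and return the average rate in MB/s."""
    total_bytes = 0
    start = time.perf_counter()
    # buffering=0 avoids Python's own buffer layer; the OS page cache still applies.
    with open(path, "rb", buffering=0) as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            total_bytes += len(chunk)
    elapsed = time.perf_counter() - start
    if elapsed == 0:
        return float("inf")  # file too small to time meaningfully
    return total_bytes / (1024 * 1024) / elapsed


# Example usage (the path is hypothetical):
# print(f"{read_throughput_mb_per_s('/data/bigfile.dat'):.1f} MB/s")
```

If the disk reads at well over the rate you see during the upload, the disk is probably not the limiting factor and the network path is the next thing to measure.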