I'm using hdfs -put to load a large 20GB file into HDFS. Currently the process takes about 4 minutes. I'm trying to improve the write time of loading data into HDFS. I tried utilizing different block sizes, but that didn't help.
20GB / 4 minutes comes out to about 85MB/sec. That's pretty reasonable throughput to expect from a single drive with all the overhead of the HDFS protocol and the network. I'm betting that is your bottleneck. Without changing your ingest process, you're not going to make this magically faster.
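For reference, the back-of-the-envelope math (using 1 GB = 1024 MB):

```
20 GB  = 20 × 1024 MB ≈ 20,480 MB
4 min  = 240 s
20,480 MB / 240 s ≈ 85 MB/s
```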
The core problem is that 20GB is a decent amount of data, and that data is getting pushed into HDFS as a single stream. You are limited by disk I/O, which is pretty lame given that you have a large number of disks in a Hadoop cluster. You've got a long way to go to saturate a 10GigE network (and probably a 1GigE, too).
Changing the block size shouldn't change this behavior, as you saw. It's still the same amount of data coming off one disk into HDFS over one stream.
I suggest you split the file up into 1GB files, spread them over multiple disks, and then push them up with -put in parallel, as in the sketch below. You might even want to consider splitting these files over multiple nodes if the network becomes a bottleneck. Can you change the way you receive your data to make this faster? Obviously, splitting the file and moving it around will take time, too.
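A minimal sketch of that approach, assuming GNU split and xargs are available; the local and HDFS paths (/data/bigfile.dat, /data/chunks/, /user/me/ingest/) and the parallelism of 8 are just placeholders:

```bash
# Split the 20GB file into ~1GB chunks (ideally spread the chunk directory across disks)
mkdir -p /data/chunks
split -b 1G /data/bigfile.dat /data/chunks/bigfile.part.

# Push the chunks into HDFS with up to 8 uploads running at once
ls /data/chunks/bigfile.part.* | xargs -P 8 -I {} hdfs dfs -put {} /user/me/ingest/
```

If whatever reads this data downstream can handle a directory of part files, you can leave the chunks as-is on the HDFS side and skip reassembling them.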