Hadoop put performance - large file (20GB)

佛祖请我去吃肉 2021-02-04 10:03

I'm using hdfs dfs -put to load a large 20GB file into HDFS. Currently the process runs in about 4 minutes. I'm trying to improve the write time of loading data into HDFS. I tried utilizing different block sizes, but it did not improve the write speed.
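
For reference, the command takes roughly this form (the paths and the block-size value here are placeholders):

    hdfs dfs -D dfs.blocksize=268435456 -put /local/path/bigfile.dat /user/data/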

3 Answers
  • 2021-02-04 10:20

    It depends a lot on the details of your setup. First, know that 20GB in 4 minutes works out to roughly 85MB/s.

    The bottleneck is most likely your local machine's hardware or its ethernet connection. I doubt playing with block size will improve your throughput by much.

    If your local machine has a typical 7200rpm hard drive, its disk-to-buffer transfer rate is about 128MB/s, meaning it could read that 20GB file into memory in roughly 2:40, assuming you have 20GB to spare. However, you're not just copying it into memory, you're streaming it from memory into network packets, so it's understandable that these steps add overhead of their own.
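
    If you want to sanity-check that figure on your own hardware, one rough approach (a sketch, with placeholder paths) is to time a raw sequential read of the file with dd; drop the page cache first or the number will be optimistic:

        # drop the Linux page cache so the read actually hits the disk (needs root)
        sudo sh -c 'sync; echo 3 > /proc/sys/vm/drop_caches'

        # time a sequential read of the 20GB file
        dd if=/local/path/bigfile.dat of=/dev/null bs=128M status=progress

    If dd itself can't do much better than ~85MB/s, no HDFS-side tuning will.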

    Also see the Wikipedia entry on wire speed, which puts a Fast Ethernet setup at 100Mbit/s (~12MB/s). Note that Fast Ethernet here is the name of a particular group of Ethernet standards, and you are clearly getting a faster rate than that. Wire speed is a useful yardstick because it caps what you can achieve regardless of everything else on your local machine.

    So let's break down the different steps in the streaming process on your local machine:

    • Read a chunk of the file from disk and load it into memory. Components: hard drive, memory
    • Split and translate that chunk into packets. Last I heard, Hadoop doesn't use DMA features out of the box, so this work is done by your CPU rather than the NIC. Components: memory, CPU
    • Transmit the packets to the Hadoop DataNodes. Components: NIC, network

    Without knowing more about your local machine, it is hard to say which of these components is the bottleneck, but these are the places to start investigating your throughput.
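
    A rough way to isolate the network leg, assuming iperf3 is installed on both ends and that "datanode01" stands in for one of your DataNodes:

        # on one of the DataNodes
        iperf3 -s

        # on the machine doing the put: run a 30-second throughput test
        iperf3 -c datanode01 -t 30

    If iperf3 reports much more than ~85MB/s (roughly 680Mbit/s), the network is probably not the limiting component and the disk or CPU side deserves a closer look.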

  • 2021-02-04 10:30

    You may want to use distcp to perform a parallel copy:

        hadoop distcp -Ddfs.block.size=$[256*1024*1024] /path/to/inputdata /path/to/outputdata
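
    One caveat worth checking: by default distcp assigns whole files to map tasks, so a single 20GB file still ends up copied by one mapper; it helps most when the input is already split into multiple files. Also, to pull data in from outside HDFS the source has to be a URI the map tasks can reach, e.g. a file:// path on a shared mount (paths below are placeholders):

        hadoop distcp -Ddfs.block.size=$[256*1024*1024] file:///shared/path/inputdata hdfs:///path/to/outputdata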

  • 2021-02-04 10:34

    20GB / 4 minutes comes out to about 85MB/sec. That's pretty reasonable throughput to expect from a single drive with all the overhead of the HDFS protocol and the network. I'm betting that is your bottleneck. Without changing your ingest process, you're not going to be able to make this magically faster.

    The core problem is that 20GB is a decent amount of data and that it is getting pushed into HDFS as a single stream. You are limited by disk I/O, which is pretty lame given that you have a large number of disks in the Hadoop cluster. You've got a while to go before you saturate a 10GigE network (and probably a 1GigE, too).

    Changing block size shouldn't change this behavior, as you saw. It's still the same amount of data off disk into HDFS.

    I suggest you split the file up into 1GB files and spread them over multiple disks, then push them up with -put in parallel. You might even want to consider splitting these files over multiple nodes if the network becomes a bottleneck. Can you change the way you receive your data to make this faster? Obviously, splitting the file and moving it around will take time, too.
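
    A minimal sketch of that approach, assuming GNU split and xargs, enough local space for the chunks, and placeholder paths throughout:

        # split the 20GB file into 1GB chunks (ideally spread over different local disks)
        mkdir -p /local/path/chunks
        split -b 1G /local/path/bigfile.dat /local/path/chunks/bigfile.part_

        # push the chunks four at a time
        ls /local/path/chunks/bigfile.part_* | xargs -P 4 -I{} hdfs dfs -put {} /user/data/incoming/

    Downstream MapReduce or Hive jobs can usually read the whole /user/data/incoming/ directory as one dataset, so the pieces never need to be reassembled inside HDFS.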
