Hadoop put performance - large file (20GB)

佛祖请我去吃肉 2021-02-04 10:03

I'm using hdfs -put to load a large 20GB file into HDFS. Currently the process takes about 4 minutes. I'm trying to improve the write time of loading data into HDFS. I tried utilizing

3 Answers
  •  孤独总比滥情好
    2021-02-04 10:20

    It depends a lot on the details of your setup. First, note that 20 GB in 4 minutes works out to roughly 85 MB/s.
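    As a quick sanity check on that arithmetic (the 20 GB and 4-minute figures are from your question; the snippet is just an illustrative sketch):

        # Rough throughput implied by the question: 20 GB uploaded in 4 minutes.
        size_bytes = 20 * 1024**3       # 20 GiB
        elapsed_s = 4 * 60              # 4 minutes

        throughput_mb_s = size_bytes / elapsed_s / 1024**2
        print(f"Effective upload rate: {throughput_mb_s:.1f} MB/s")   # ~85.3 MB/s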

    The bottleneck is most likely your local machine's hardware or its Ethernet connection. I doubt playing with block size will improve your throughput by much.

    If your local machine has a typical 7,200 RPM hard drive, its disk-to-buffer transfer rate is about 128 MB/s, meaning that it could load that 20GB file into memory in about 2:35, assuming you have 20 GB of memory to spare. However, you're not just copying it to memory, you're streaming it from memory into network packets, so it's understandable that you incur additional overhead for processing these tasks.
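    The same back-of-the-envelope estimate for the local read, assuming the ~128 MB/s figure above:

        # Time just to read a 20 GB file sequentially from a ~128 MB/s disk.
        size_mb = 20 * 1000             # 20 GB (decimal), as in the 2:35 estimate
        disk_rate_mb_s = 128            # typical 7,200 RPM sequential read rate

        read_s = size_mb / disk_rate_mb_s
        print(f"Pure disk read time: {read_s // 60:.0f}:{read_s % 60:02.0f}")   # ~2:36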

    Also see the Wikipedia entry on wire speed, which puts a Fast Ethernet setup at 100 Mbit/s (~12 MB/s). Note that in this case Fast Ethernet is the name of a particular group of Ethernet standards. You are clearly getting a faster rate than this. Wire speed is a good measure, because it accounts for all the factors on your local machine.
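    For reference, the bit-to-byte conversion behind that figure, compared against the rate computed above:

        # Fast Ethernet wire speed in MB/s versus the observed upload rate.
        fast_ethernet_mbit_s = 100
        wire_speed_mb_s = fast_ethernet_mbit_s / 8     # ~12.5 MB/s

        observed_mb_s = 85                             # from the 20 GB / 4 min figure
        print(f"Fast Ethernet: {wire_speed_mb_s:.1f} MB/s, observed: ~{observed_mb_s} MB/s")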

    So let's break down the different steps in the streaming process on your local machine:

    • Read a chunk from the file and load it into memory. Components: hard drive, memory
    • Split and translate that chunk into packets. Last I heard, Hadoop doesn't use DMA features out of the box, so these operations will be performed by your CPU rather than the NIC. Components: memory, CPU
    • Transmit packets to the Hadoop file servers. Components: NIC, network

    Without knowing more about your local machine, it is hard to specify which of these components is the bottleneck. However, these are the places to start investigating bitrate.
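    If you want to isolate the first component (the local disk read), a minimal sketch along these lines can help; the path and chunk size below are placeholders, and it measures only the read side, not packetization or the network:

        import time

        # Measure raw sequential read throughput of the local file, to see whether
        # the disk (step 1 above) or something further downstream is the limit.
        # Note: a repeat run may be served from the OS page cache, so use a file
        # larger than RAM (the 20GB file likely qualifies) for a realistic number.
        path = "/path/to/20gb_file"       # placeholder path
        chunk_size = 64 * 1024 * 1024     # 64 MB reads; large enough for sequential throughput

        total = 0
        start = time.monotonic()
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                total += len(chunk)
        elapsed = time.monotonic() - start

        print(f"Read {total / 1024**3:.1f} GiB in {elapsed:.1f} s "
              f"({total / elapsed / 1024**2:.1f} MiB/s)")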
