Writing to a file in HDFS in Hadoop


Question


I was looking for a disk-intensive Hadoop application to test I/O activity in Hadoop, but I couldn't find one that keeps disk utilization above, say, 50%, or that otherwise keeps the disk busy. I tried randomwriter, but surprisingly it is not disk-I/O intensive.

So I wrote a tiny program that creates a file in the Mapper and writes some text into it. The application works well, but utilization is high only on the master node, which is also the namenode, the jobtracker, and one of the slaves. Disk utilization is nil or negligible on the other tasktrackers. I can't understand why disk I/O is so low on the tasktrackers. Could anyone please nudge me in the right direction if I'm doing something wrong? Thanks in advance.

Here is the sample code segment I added to WordCount.java to create a file and write a UTF string into it:

// Runs inside the map() method of the WordCount mapper; itr, word, one and
// context come from the surrounding mapper code.
Configuration conf = new Configuration();
FileSystem fs = FileSystem.get(conf);
Path outFile;
while (itr.hasMoreTokens()) {
    word.set(itr.nextToken());
    context.write(word, one);

    // Create a scratch file named after the task attempt, write a short
    // UTF string into it, then delete it again.
    outFile = new Path("./dummy" + context.getTaskAttemptID());
    FSDataOutputStream out = fs.create(outFile);
    out.writeUTF("helloworld");
    out.close();
    fs.delete(outFile, false);   // non-recursive delete; the path is a plain file
}
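
For reference, writeUTF("helloworld") emits only about a dozen bytes per token before the file is deleted, so each iteration touches the disk very lightly. A hypothetical variant that streams a larger buffer per task would put far more pressure on the disk; a minimal sketch, with arbitrary sizes, reusing fs and context from the snippet above:

// Hypothetical variant: one big sequential write per task attempt instead of
// a tiny write per token. The sizes below are arbitrary illustration values.
byte[] buffer = new byte[1024 * 1024];                  // 1 MB of dummy bytes
Path bigFile = new Path("./dummy" + context.getTaskAttemptID());
FSDataOutputStream bigOut = fs.create(bigFile);
for (int i = 0; i < 128; i++) {                         // ~128 MB in total
    bigOut.write(buffer);
}
bigOut.close();
fs.delete(bigFile, false);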

Answer 1:


Any mechanism that creates Java objects per cell of each row and then serializes those objects before writing them to disk has little chance of saturating I/O. In my experience serialization runs at a few MB per second, maybe a bit more, but not at 100 MB per second. So bypassing the Hadoop layers on the output path, as you did, is quite right.

Now consider how a write to HDFS works: the data is written to the local disk via the local datanode, and then synchronously replicated to other nodes in the network, according to your replication factor. So you cannot write data into HDFS faster than your network bandwidth allows. If your cluster is relatively small, things get worse: with a 3-node cluster and triple replication, all the data is pushed to all the nodes, so the whole cluster's HDFS write bandwidth is about 1 Gbit, assuming you have such a network.

So I would suggest:

a) Reduce the replication factor to 1, so you stop being bound by the network (see the sketch below).

b) Write bigger chunks of data in each call to the mapper.
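
For suggestion (a), one way to drop the replication factor from inside the job is the FileSystem.create overload that takes a replication count; a minimal sketch, assuming the fs and context variables from the question (the buffer and block sizes are just placeholder defaults):

// Minimal sketch: create the scratch file with replication factor 1 so the
// write stays on the local datanode and never crosses the network.
// The 4096-byte buffer and 64 MB block size are placeholder defaults.
short replication = 1;
long blockSize = 64L * 1024 * 1024;
FSDataOutputStream out = fs.create(
        new Path("./dummy" + context.getTaskAttemptID()),
        true,            // overwrite if the file already exists
        4096,            // I/O buffer size in bytes
        replication,
        blockSize);
out.writeUTF("helloworld");
out.close();

Setting dfs.replication to 1 in the job Configuration (or in hdfs-site.xml) should achieve the same effect without changing the create call.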




Answer 2:


OK, I must have been really stupid for not checking earlier. The actual problem was that not all of my datanodes were actually running. I reformatted the namenode and everything fell back into place; I was getting a utilization of 15-20%, which is not bad for WordCount. I will run TestDFSIO and see if I can push disk utilization even higher.



Source: https://stackoverflow.com/questions/13457934/writing-to-a-file-in-hdfs-in-hadoop
