Question
In the following example:
small.ints = to.dfs(1:1000)
mapreduce(
  input = small.ints,
  map = function(k, v) cbind(v, v^2))
The input to the mapreduce function is an object named small.ints, which refers to blocks in HDFS.
Now I have a CSV file already stored in HDFS as
"hdfs://172.16.1.58:8020/tmp/test_short.csv"
How do I get such an object for it?
As far as I know (which may be wrong), if I want to use data from a CSV file as input for mapreduce, I first have to generate a table in R that contains all the values in the CSV file. I do have a method like:
data = from.dfs("hdfs://172.16.1.58:8020/tmp/test_short.csv",
                make.input.format(format = "csv", sep = ","))
mydata = data$val
Getting mydata this way and then calling object = to.dfs(mydata) seems to work, but the problem is that test_short.csv is huge (around a terabyte), and memory can't hold the output of from.dfs!
Actually, I'm wondering: if I use "hdfs://172.16.1.58:8020/tmp/test_short.csv" directly as the mapreduce input, and do the from.dfs() work inside the map function, will I get the data blocks?
Please give me some advice, whatever!
Answer 1:
mapreduce(input = path, input.format = make.input.format(...), map ...)
from.dfs is for small data sets. In most cases you won't use from.dfs in the map function; the map function's arguments already hold a portion of the input data.
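A minimal sketch of that pattern, assuming the rmr2 package is loaded and a Hadoop cluster is reachable. The column names and types given to the CSV reader here (v1, numeric) are hypothetical; substitute whatever test_short.csv actually contains:

```r
library(rmr2)

# Describe the CSV layout once; extra arguments are passed through
# to the underlying reader. Column names/types below are assumptions.
csv.format <- make.input.format(format = "csv", sep = ",",
                                col.names = c("v1"),
                                colClasses = "numeric")

# Pass the HDFS path directly as input: each map call receives
# one chunk of the file as a data frame, never the whole file.
result <- mapreduce(
  input        = "hdfs://172.16.1.58:8020/tmp/test_short.csv",
  input.format = csv.format,
  map          = function(k, v) keyval(NULL, cbind(v$v1, v$v1^2)))
```

Because the framework streams the file to the mappers chunk by chunk, this avoids ever materializing the terabyte-sized table in R's memory.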
Answer 2:
You can do something like below:
r.file <- hdfs.file(hdfsFilePath, "r")
from.dfs(
  mapreduce(
    input = as.matrix(hdfs.read.text.file(r.file)),
    input.format = "csv",
    map = ...
  ))
Please give points; I hope somebody finds it useful.
Note: For details, refer to the Stack Overflow post:
How to input HDFS file into R mapreduce for processing and get the result into HDFS file
Source: https://stackoverflow.com/questions/18093107/rhadoop-how-to-read-csv-file-from-hdfs-and-execute-mapreduce