Question
In the following example:
small.ints = to.dfs(1:1000)
mapreduce(
  input = small.ints,
  map = function(k, v) cbind(v, v^2))
The input to the mapreduce function is an object named small.ints, which refers to blocks in HDFS.
Now I have a CSV file already stored in HDFS as
"hdfs://172.16.1.58:8020/tmp/test_short.csv"
How do I get such an object for it?
As far as I know (which may be wrong), if I want to use data from a CSV file as input for mapreduce, I first have to generate a table in R that contains all the values in the CSV file. I do have a method like:
data = from.dfs("hdfs://172.16.1.58:8020/tmp/test_short.csv",
                make.input.format(format = "csv", sep = ","))
mydata = data$val
Getting mydata this way and then calling object = to.dfs(mydata) seems to work, but the problem is that test_short.csv is huge (around a terabyte), and memory can't hold the output of from.dfs!
Actually, I'm wondering: if I use "hdfs://172.16.1.58:8020/tmp/test_short.csv" directly as the mapreduce input, and do the from.dfs() work inside the map function, will I get the data blocks?
Please give me some advice, whatever!
Answer 1:
mapreduce(input = path, input.format = make.input.format(...), map ...)
from.dfs is for small data sets. In most cases you won't use from.dfs in the map function; the map function's arguments already hold a portion of the input data.
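A minimal sketch of that pattern, assuming the rmr2 package is loaded and a Hadoop cluster is reachable. The column names and types given to the CSV reader here (v1, numeric) are hypothetical; substitute whatever test_short.csv actually contains:

```r
library(rmr2)

# Describe the CSV layout once; extra arguments are passed through
# to the underlying reader. Column names/types below are assumptions.
csv.format <- make.input.format(format = "csv", sep = ",",
                                col.names = c("v1"),
                                colClasses = "numeric")

# Pass the HDFS path directly as input: each map call receives
# one chunk of the file as a data frame, never the whole file.
result <- mapreduce(
  input        = "hdfs://172.16.1.58:8020/tmp/test_short.csv",
  input.format = csv.format,
  map          = function(k, v) keyval(NULL, cbind(v$v1, v$v1^2)))
```

Because the framework streams the file to the mappers chunk by chunk, this avoids ever materializing the terabyte-sized table in R's memory.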
Answer 2:
You can do something like below:
r.file <- hdfs.file(hdfsFilePath, "r")
from.dfs(
  mapreduce(
    input = as.matrix(hdfs.read.text.file(r.file)),
    input.format = "csv",
    map = ...
  ))
Please give points; I hope somebody finds it useful.
Note: For details, refer to the Stack Overflow post:
How to input HDFS file into R mapreduce for processing and get the result into HDFS file
Source: https://stackoverflow.com/questions/18093107/rhadoop-how-to-read-csv-file-from-hdfs-and-execute-mapreduce