Load Data using Apache-Spark on AWS

Asked by 攒了一身酷 on 2021-01-28 17:31

I am using Apache Spark on Amazon Web Services (AWS) EC2 to load and process data. I've created one master and two slave nodes. On the master node, I have a directory data

2 Answers
  •  长情又很酷
    2021-01-28 17:57

    Just to clarify for others that may come across this post.

    I believe your confusion comes from not providing a protocol (URI scheme) in the file location. When you run the following line:

    ### Create an RDD of (filename, content) pairs for files in directory "data"
    datafile = sc.wholeTextFiles("/root/data")  ### Read data directory

    Spark assumes the file path /root/data is in HDFS. In other words, it looks for the files at hdfs:///root/data.

    You only need the files in one location: either locally on every node (not the most efficient in terms of storage) or in HDFS, which is distributed across the nodes.
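    For the HDFS route, one way to make the data visible to every worker is to copy it from the master's local filesystem into HDFS first. A sketch using the standard Hadoop filesystem shell (the destination path /data is illustrative):

    ```shell
    # Create a directory in HDFS and copy the local files into it
    # (source and destination paths are illustrative).
    hadoop fs -mkdir -p /data
    hadoop fs -put /root/data/* /data/

    # Confirm the files are now visible cluster-wide
    hadoop fs -ls /data
    ```

    After this, every executor can read the same files via hdfs:///data without the data being duplicated on each node's local disk.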

    If you wish to read files from the local filesystem, use file:///path/to/local/file. If you wish to use HDFS, use hdfs:///path/to/hdfs/file.
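    As a minimal sketch of the two schemes (assuming a standard PySpark setup; the /root/data path is illustrative):

    ```python
    # URI schemes Spark understands; the /root/data path is illustrative.
    local_uri = "file:///root/data"  # local filesystem on every worker node
    hdfs_uri = "hdfs:///root/data"   # HDFS, distributed across the cluster

    try:
        from pyspark import SparkContext

        sc = SparkContext(appName="uri-scheme-demo")
        # wholeTextFiles returns an RDD of (filename, content) pairs;
        # it is lazy, so nothing is read until an action is called.
        local_files = sc.wholeTextFiles(local_uri)
        hdfs_files = sc.wholeTextFiles(hdfs_uri)
        sc.stop()
    except Exception:
        pass  # pyspark (or a running JVM/cluster) is not available here
    ```

    Note that with file:// the path must exist on every node that runs a task, which is why HDFS is usually the better fit on a multi-node cluster.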

    Hope this helps.
