Using Spark DataFrame to load data from HDFS

囚心锁ツ 2021-01-13 11:09

Can we use DataFrame while reading data from HDFS? I have tab-separated data in HDFS.

I googled, but only saw examples of it being used with NoSQL data sources.

2 Answers
  • 2021-01-13 11:36

    If I understand correctly, you essentially want to read data from HDFS and have it automatically converted to a DataFrame.

    If that is the case, I would recommend the spark-csv library. Check it out; it has very good documentation.

  • 2021-01-13 11:54

    DataFrame is certainly not limited to NoSQL data sources. Parquet, ORC, and JSON support is natively provided in 1.4 to 1.6.1; text-delimited files are supported using the spark-csv package.

    If you have your TSV file in HDFS at /demo/data, then the following code will read the file into a DataFrame:

    sqlContext.read.
      format("com.databricks.spark.csv"). // use the spark-csv data source
      option("delimiter","\t").           // tab-separated values
      option("header","true").            // first line holds column names
      load("hdfs:///demo/data/tsvtest.tsv").show
    

    To run the code from spark-shell, launch the shell with the spark-csv package on the classpath:

    spark-shell --packages com.databricks:spark-csv_2.10:1.4.0
    

    In Spark 2.0, CSV is natively supported, so you should be able to do something like this:

    spark.read.
      option("delimiter","\t").
      option("header","true").
      csv("hdfs:///demo/data/tsvtest.tsv").show
    
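    With header inference alone, every column is read as a string. As a sketch of one way around that (assuming a Spark 2.x `SparkSession` named `spark` as in the snippet above; the column names `id` and `name` are placeholders, not from the question), you can pass an explicit schema to the reader:

    ```scala
    import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

    // Placeholder schema: adjust field names and types to match your TSV.
    // An explicit schema also avoids the extra pass over the data that
    // option("inferSchema", "true") would trigger.
    val schema = StructType(Seq(
      StructField("id", IntegerType, nullable = true),
      StructField("name", StringType, nullable = true)
    ))

    spark.read.
      schema(schema).
      option("delimiter", "\t").
      option("header", "true").
      csv("hdfs:///demo/data/tsvtest.tsv").
      show()
    ```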