Get CSV to Spark dataframe

前端未结

关注

 9  1208

忘了有多久

I\'m using python on Spark and would like to get a csv into a dataframe.

The documentation for Spark SQL strangely does not provide explanations for CSV as a source.

相关标签:

9条回答

青春惊慌失措

2020-12-05 15:13
With more recent versions of Spark (as of, I believe, 1.4) this has become a lot easier. The expression sqlContext.read gives you a DataFrameReader instance, with a .csv() method:
```
df = sqlContext.read.csv("/path/to/your.csv")
```
Note that you can also indicate that the csv file has a header by adding the keyword argument header=True to the .csv() call. A handful of other options are available, and described in the link above.
0 讨论(0)
发布评论:

提交评论
- 加载中...

栀梦

2020-12-05 15:15

Read the csv file in to a RDD and then generate a RowRDD from the original RDD.

Create the schema represented by a StructType matching the structure of Rows in the RDD created in Step 1.

Apply the schema to the RDD of Rows via createDataFrame method provided by SQLContext.

lines = sc.textFile("examples/src/main/resources/people.txt")
parts = lines.map(lambda l: l.split(","))
# Each line is converted to a tuple.
people = parts.map(lambda p: (p[0], p[1].strip()))

# The schema is encoded in a string.
schemaString = "name age"

fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
schema = StructType(fields)

# Apply the schema to the RDD.
schemaPeople = spark.createDataFrame(people, schema)

source: SPARK PROGRAMMING GUIDE

0 讨论(0)

感情败类

2020-12-05 15:15

from pyspark.sql.types import StringType
from pyspark import SQLContext
sqlContext = SQLContext(sc)

Employee_rdd = sc.textFile("\..\Employee.csv")
               .map(lambda line: line.split(","))

Employee_df = Employee_rdd.toDF(['Employee_ID','Employee_name'])

Employee_df.show()

0 讨论(0)

上一页 1 2