I'm using Python on Spark and would like to get a CSV into a dataframe.
Strangely, the Spark SQL documentation does not explain how to use CSV as a source.
I ran into a similar problem. The solution is to add an environment variable named "PYSPARK_SUBMIT_ARGS" and set its value to "--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell". This works with Spark's Python interactive shell.
Make sure you match the version of spark-csv with the version of Scala installed. With Scala 2.11, it is spark-csv_2.11 and with Scala 2.10 or 2.10.5 it is spark-csv_2.10.
Hope it works.
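For illustration, a minimal sketch of that setup (this assumes Spark 1.x with the spark-csv package; the app name and file path are placeholders):

import os

# Must be set before the SparkContext is created; match the artifact to your Scala version
os.environ["PYSPARK_SUBMIT_ARGS"] = "--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell"

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", "csv-example")   # placeholder app name
sqlContext = SQLContext(sc)

# spark-csv exposes the com.databricks.spark.csv data source
df = (sqlContext.read
      .format("com.databricks.spark.csv")
      .option("header", "true")
      .option("inferSchema", "true")
      .load("file.csv"))                     # placeholder path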
For PySpark, assuming that the first row of the CSV file contains a header:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('chosenName').getOrCreate()
df = spark.read.csv('fileNameWithPath', mode="DROPMALFORMED", inferSchema=True, header=True)
Based on the answer by Aravind, but much shorter, e.g.:
lines = sc.textFile("/path/to/file").map(lambda x: x.split(","))
df = lines.toDF(["year", "month", "day", "count"])
With the current implementation (Spark 2.x) you don't need to add the packages argument; you can use the built-in CSV implementation.
Additionally, unlike the accepted answer, you don't need to create an RDD and then enforce a schema, which has one potential problem:
when you read the CSV that way, all fields are marked as strings, and enforcing a schema with an integer column will raise an exception.
A better way to do the above would be:
spark.read.format("csv").schema(schema).option("header", "true").load(input_path).show()
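For completeness, the schema and input_path used above are not defined in that line; here is a minimal sketch of building a schema with StructType (the column names and types are placeholders):

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

spark = SparkSession.builder.appName("csvWithSchema").getOrCreate()

# Placeholder schema; replace the fields with your file's actual columns
schema = StructType([
    StructField("id", IntegerType(), True),
    StructField("name", StringType(), True),
])

input_path = "file.csv"  # placeholder path
df = spark.read.format("csv").schema(schema).option("header", "true").load(input_path)
df.show()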
If you do not mind the extra package dependency, you could use Pandas to parse the CSV file. It handles internal commas just fine.
Dependencies:
from pyspark import SparkContext
from pyspark.sql import SQLContext
import pandas as pd
Read the whole file at once into a Spark DataFrame:
sc = SparkContext('local','example') # if using locally
sql_sc = SQLContext(sc)
pandas_df = pd.read_csv('file.csv') # assuming the file contains a header
# If no header:
# pandas_df = pd.read_csv('file.csv', names = ['column 1','column 2'])
s_df = sql_sc.createDataFrame(pandas_df)
Or, to be more memory-conscious, you can chunk the data into a Spark RDD and then convert it to a DataFrame:
chunk_100k = pd.read_csv('file.csv', chunksize=100000)
for chunky in chunk_100k:
    # Parallelize each pandas chunk and accumulate the pieces with RDD union
    Spark_temp_rdd = sc.parallelize(chunky.values.tolist())
    try:
        Spark_full_rdd += Spark_temp_rdd
    except NameError:
        Spark_full_rdd = Spark_temp_rdd
    del Spark_temp_rdd

Spark_DF = Spark_full_rdd.toDF(['column 1','column 2'])
Since Spark 2.0, it is recommended to use a SparkSession:
from pyspark.sql import SparkSession
from pyspark.sql import Row
# Create a SparkSession
spark = SparkSession \
    .builder \
    .appName("basic example") \
    .config("spark.some.config.option", "some-value") \
    .getOrCreate()
def mapper(line):
    fields = line.split(',')
    return Row(ID=int(fields[0]), field1=fields[1], field2=int(fields[2]), field3=int(fields[3]))
lines = spark.sparkContext.textFile("file.csv")
df = lines.map(mapper)
# Infer the schema, and register the DataFrame as a table.
schemaDf = spark.createDataFrame(df).cache()
schemaDf.createOrReplaceTempView("tablename")
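As a quick usage example, the registered view can then be queried with Spark SQL (the column names follow the mapper above):

top_rows = spark.sql("SELECT ID, field1 FROM tablename LIMIT 5")
top_rows.show()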