Get CSV to Spark dataframe

忘了有多久 2020-12-05 14:45

I'm using Python on Spark and would like to get a CSV into a dataframe.

The documentation for Spark SQL strangely does not provide explanations for CSV as a source.

9 answers
  • 2020-12-05 14:50

    I ran into a similar problem. The solution is to add an environment variable named "PYSPARK_SUBMIT_ARGS" and set its value to "--packages com.databricks:spark-csv_2.10:1.4.0 pyspark-shell". This works with Spark's Python interactive shell.

    Make sure the spark-csv artifact matches the Scala version your Spark build uses: spark-csv_2.11 for Scala 2.11, and spark-csv_2.10 for Scala 2.10 (including 2.10.5).
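
    The same variable can also be set from inside a Python script instead of the shell. Below is a minimal sketch, assuming it runs before the first SparkContext is created (PySpark reads the variable when it launches the JVM). The helper name spark_csv_submit_args is hypothetical; the package coordinates are the ones quoted above.

```python
import os


def spark_csv_submit_args(scala_version="2.10", package_version="1.4.0"):
    """Build the PYSPARK_SUBMIT_ARGS value for spark-csv (hypothetical helper)."""
    coordinate = "com.databricks:spark-csv_{}:{}".format(scala_version, package_version)
    # The trailing "pyspark-shell" token is part of the value quoted in the answer.
    return "--packages {} pyspark-shell".format(coordinate)


# Set this before creating a SparkContext, otherwise it has no effect.
os.environ["PYSPARK_SUBMIT_ARGS"] = spark_csv_submit_args("2.10")
```

    For a Spark build on Scala 2.11, pass "2.11" instead so the artifact becomes spark-csv_2.11.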

    Hope it works.

  • 2020-12-05 14:53

    For PySpark, assuming that the first row of the CSV file contains a header:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName('chosenName').getOrCreate()
    df = spark.read.csv('fileNameWithPath', mode="DROPMALFORMED", inferSchema=True, header=True)
    
  • 2020-12-05 15:00

    Based on the answer by Aravind, but much shorter, e.g.:

    lines = sc.textFile("/path/to/file").map(lambda x: x.split(","))
    df = lines.toDF(["year", "month", "day", "count"])
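
    Note that a plain split(",") breaks fields that contain quoted commas, and it leaves every column as a string. Here is a sketch of a safer per-line parser using Python's standard csv module. The name parse_csv_line is a hypothetical helper, and the sample line is made up for illustration.

```python
import csv


def parse_csv_line(line):
    """Parse one CSV line, honouring double-quoted fields (hypothetical helper)."""
    # csv.reader accepts any iterable of strings, so wrap the single line in a list.
    return next(csv.reader([line]))


sample = '2015,"Smith, John",7'
naive = sample.split(",")        # splits inside the quoted field as well
parsed = parse_csv_line(sample)  # keeps "Smith, John" as one field
```

    It could replace the lambda above, as in sc.textFile("/path/to/file").map(parse_csv_line). It still cannot handle fields with embedded newlines, because textFile has already split the input into lines.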
    
  • 2020-12-05 15:03

    With the current implementation (Spark 2.x) you don't need to add the packages argument; you can use the built-in CSV reader.

    Additionally, unlike the accepted answer, you don't need to create an RDD and then enforce a schema, an approach that has one potential problem:

    When you read the CSV as text, every field is marked as a string, so enforcing a schema with an integer column raises an exception.

    A better way to do the above would be:

    spark.read.format("csv").schema(schema).option("header", "true").load(input_path).show()
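
    The schema variable above is not defined in the answer. One way to supply it is sketched below, assuming Spark 2.3 or later, where DataFrameReader.schema also accepts a DDL-formatted string. The column names are borrowed from the year/month/day/count example in another answer, and both helper names are hypothetical.

```python
def build_ddl_schema(columns):
    """Turn [(name, sql_type), ...] into a DDL string such as '`year` INT, `month` INT'."""
    # Backticks keep names like `count` from being read as SQL keywords.
    return ", ".join("`{}` {}".format(name, sql_type) for name, sql_type in columns)


def read_csv_with_schema(spark, input_path, ddl_schema):
    """Read a CSV that has a header row, applying an explicit schema.

    `spark` is assumed to be an existing SparkSession.
    """
    return (spark.read.format("csv")
            .schema(ddl_schema)
            .option("header", "true")
            .load(input_path))


schema = build_ddl_schema([("year", "INT"), ("month", "INT"),
                           ("day", "INT"), ("count", "INT")])
```

    Because the types are applied while the file is parsed, there is no separate step in which string fields have to be coerced into an integer column.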
    
  • 2020-12-05 15:04

    If you do not mind the extra package dependency, you could use Pandas to parse the CSV file. It handles internal commas just fine.

    Dependencies:

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    import pandas as pd
    

    Read the whole file at once into a Spark DataFrame:

    sc = SparkContext('local','example')  # if using locally
    sql_sc = SQLContext(sc)
    
    pandas_df = pd.read_csv('file.csv')  # assuming the file contains a header
    # If no header:
    # pandas_df = pd.read_csv('file.csv', names = ['column 1','column 2']) 
    s_df = sql_sc.createDataFrame(pandas_df)
    

    Or, to avoid holding the whole file in a single pandas DataFrame, you can read it in chunks and combine them into a Spark RDD, then a DataFrame:

    chunk_100k = pd.read_csv('file.csv', chunksize=100000)
    
    # Turn each pandas chunk into an RDD, then union them all into one RDD.
    chunk_rdds = [sc.parallelize(chunky.values.tolist()) for chunky in chunk_100k]
    Spark_full_rdd = sc.union(chunk_rdds)
    
    Spark_DF = Spark_full_rdd.toDF(['column 1','column 2'])
    
  • 2020-12-05 15:08

    Since Spark 2.0, it is recommended to use a SparkSession:

    from pyspark.sql import SparkSession
    from pyspark.sql import Row
    
    # Create a SparkSession
    spark = SparkSession \
        .builder \
        .appName("basic example") \
        .config("spark.some.config.option", "some-value") \
        .getOrCreate()
    
    def mapper(line):
        # Split one CSV line and convert each field to its target type.
        fields = line.split(',')
        return Row(ID=int(fields[0]), field1=fields[1], field2=int(fields[2]), field3=int(fields[3]))
    
    lines = spark.sparkContext.textFile("file.csv")
    rows = lines.map(mapper)  # an RDD of Row objects, not yet a DataFrame
    
    # Infer the schema from the Row objects, and register the DataFrame as a temp view.
    schemaDf = spark.createDataFrame(rows).cache()
    schemaDf.createOrReplaceTempView("tablename")
    