Get CSV to Spark dataframe

忘了有多久 2020-12-05 14:45

I'm using Python on Spark and would like to load a CSV file into a DataFrame.

Strangely, the Spark SQL documentation does not explain how to use CSV as a source.

9 Answers
  • 2020-12-05 15:13

    With more recent versions of Spark this has become a lot easier: DataFrameReader has been available since 1.4, and its built-in .csv() method since 2.0. The expression sqlContext.read gives you a DataFrameReader instance, with a .csv() method:

    df = sqlContext.read.csv("/path/to/your.csv")
    

    Note that you can also indicate that the CSV file has a header by adding the keyword argument header=True to the .csv() call. A handful of other options are available; they are described in the DataFrameReader documentation.

  • 2020-12-05 15:15

    1. Read the CSV file into an RDD, then generate an RDD of rows from the original RDD.

    2. Create the schema, represented by a StructType, matching the structure of the rows in the RDD created in step 1.

    3. Apply the schema to the RDD of rows via the createDataFrame method provided by SparkSession (or by SQLContext in older versions).

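    from pyspark.sql import SparkSession
    from pyspark.sql.types import StringType, StructField, StructType

    # Assumes Spark 2.0+; reuses the active session if one already exists.
    spark = SparkSession.builder.getOrCreate()
    sc = spark.sparkContext

    lines = sc.textFile("examples/src/main/resources/people.txt")
    parts = lines.map(lambda l: l.split(","))
    # Each line is converted to a tuple.
    people = parts.map(lambda p: (p[0], p[1].strip()))

    # The schema is encoded in a string.
    schemaString = "name age"

    # Every column is declared as a nullable string.
    fields = [StructField(field_name, StringType(), True) for field_name in schemaString.split()]
    schema = StructType(fields)

    # Apply the schema to the RDD.
    schemaPeople = spark.createDataFrame(people, schema)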
    

    Source: Spark Programming Guide

  • 2020-12-05 15:15
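    from pyspark import SQLContext

    # Assumes an existing SparkContext named sc. Creating the SQLContext
    # is what makes toDF() available on RDDs.
    sqlContext = SQLContext(sc)

    # Parentheses let the method chain span several lines;
    # the raw string keeps the backslashes in the path literal.
    Employee_rdd = (sc.textFile(r"\..\Employee.csv")
                      .map(lambda line: line.split(",")))

    # Name the two columns; all values remain strings.
    Employee_df = Employee_rdd.toDF(['Employee_ID','Employee_name'])

    Employee_df.show()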
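
    One caveat with line.split(","): it also splits on commas inside quoted fields. A minimal sketch of a safer per-line parser, using Python's standard csv module, is shown below. The helper name parse_csv_line and the sample line are hypothetical, and the approach still assumes that no field contains an embedded newline, since textFile splits the input by line.

```python
import csv


def parse_csv_line(line):
    """Parse one CSV line into a list of fields, honoring double-quoted fields."""
    # csv.reader accepts any iterable of lines, so wrap the single line in a list.
    return next(csv.reader([line]))


sample = '7,"Doe, Jane"'

naive = sample.split(",")        # splits inside the quoted field
parsed = parse_csv_line(sample)  # keeps the quoted field intact

print(naive)   # ['7', '"Doe', ' Jane"']
print(parsed)  # ['7', 'Doe, Jane']
```

    Such a helper could be passed to .map() in place of the lambda above, for example .map(parse_csv_line).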
    