How to export a table dataframe in PySpark to csv?

半阙折子戏 2020-11-27 02:33

I am using Spark 1.3.1 (PySpark) and I have generated a table using a SQL query. I now have an object that is a DataFrame. I want to export this DataFrame to CSV. How can I do that?

5 Answers
  • 2020-11-27 03:12

    For Apache Spark 2+, to save the DataFrame into a single CSV file, use the following command:

    query.repartition(1).write.csv("cc_out.csv", sep='|')
    

    Here 1 indicates that only one partition of the CSV output is needed; you can change it according to your requirements.
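
    Note that Spark writes the output as a directory named cc_out.csv containing a part file, not a single plain file. If you want one ordinary CSV file on a local filesystem, a minimal sketch (the paths here are illustrative):

    import glob
    import shutil

    # repartition(1) guarantees exactly one part file inside the output directory
    part_file = glob.glob("cc_out.csv/part-*")[0]
    shutil.copyfile(part_file, "cc_out_single.csv")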

  • 2020-11-27 03:13

    You need to repartition the DataFrame into a single partition, then define the format, path, and other parameters for the file in Unix file-system format, and here you go:

    df.repartition(1).write.format('com.databricks.spark.csv').save("/path/to/file/myfile.csv",header = 'true')
    

    Read more about the repartition function. Read more about the save function.

    However, repartition is a costly function and toPandas() is worse. Try using .coalesce(1) instead of .repartition(1) in the previous syntax for better performance, as in the sketch below.

    Read more on repartition vs coalesce functions.
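
    For example, the earlier write becomes (same path and options, only the repartition call swapped out):

    # coalesce(1) merges existing partitions without a full shuffle
    df.coalesce(1).write.format('com.databricks.spark.csv').save("/path/to/file/myfile.csv", header = 'true')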

  • 2020-11-27 03:15

    If the data frame fits in driver memory and you want to save it to the local file system, you can convert the Spark DataFrame to a local pandas DataFrame using the toPandas method and then simply use to_csv:

    df.toPandas().to_csv('mycsv.csv')
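
    Note that pandas also writes the row index by default; pass index=False to omit it:

    df.toPandas().to_csv('mycsv.csv', index=False)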
    

    Otherwise you can use spark-csv:

    • Spark 1.3

      df.save('mycsv.csv', 'com.databricks.spark.csv')
      
    • Spark 1.4+

      df.write.format('com.databricks.spark.csv').save('mycsv.csv')
      

    In Spark 2.0+ you can use the csv data source directly:

    df.write.csv('mycsv.csv')
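
    The csv writer also accepts common options such as header and sep directly, for example:

    df.write.csv('mycsv.csv', header=True, sep=',')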
    
  • 2020-11-27 03:15

    How about this (if you don't want a one-liner)?

    # collect() pulls every row to the driver, so this only suits small results
    for row in df.collect():
        d = row.asDict()
        s = "%d\t%s\t%s\n" % (d["int_column"], d["string_column"], d["string_column"])
        f.write(s)
    

    f is an open file object. Here the separator is a TAB character, but it's easy to change it to whatever you want, as in the sketch below.
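
    For example, with the separator changed to a comma and the file opened explicitly (the path and column names are placeholders):

    with open("output.csv", "w") as f:
        for row in df.collect():
            d = row.asDict()
            f.write("%d,%s,%s\n" % (d["int_column"], d["string_column"], d["string_column"]))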

  • 2020-11-27 03:27

    If you cannot use spark-csv, you can do the following:

    df.rdd.map(lambda x: ",".join(map(str, x))).coalesce(1).saveAsTextFile("file.csv")
    

    That will not work if you need to handle strings containing linebreaks or commas. Use this instead:

    import csv
    import cStringIO
    
    def row2csv(row):
        buffer = cStringIO.StringIO()
        writer = csv.writer(buffer)
        writer.writerow([str(s).encode("utf-8") for s in row])
        buffer.seek(0)
        return buffer.read().strip()
    
    df.rdd.map(row2csv).coalesce(1).saveAsTextFile("file.csv")
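
    Note that cStringIO exists only on Python 2. On Python 3, an equivalent sketch uses io.StringIO and skips the manual encode:

    import csv
    import io

    def row2csv(row):
        # csv.writer quotes fields that contain commas or linebreaks
        buffer = io.StringIO()
        writer = csv.writer(buffer)
        writer.writerow([str(s) for s in row])
        return buffer.getvalue().strip()

    df.rdd.map(row2csv).coalesce(1).saveAsTextFile("file.csv")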
    