What are the files generated by Spark when using “saveAsTextFile”?

余生分开走 · 2021-01-01 15:17

When I run a Spark job and save the output as a text file using the method "saveAsTextFile" as specified at https://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.

1 Answer
  • 2021-01-01 15:48

    Those are files generated by the underlying Hadoop API that Spark calls when you invoke saveAsTextFile().

    • part- files: These are your output data files.

      You will have one part- file per partition of the RDD you called saveAsTextFile() on. Each of these files is written out in parallel, up to a certain limit (typically, the number of cores on the workers in your cluster). This means the output is written much faster than it would be if it all went into a single file, assuming your storage layer can handle the bandwidth. If you want fewer files, you can shrink the partition count before saving (see the sketch after this list).

      You can check the number of partitions in your RDD, which should tell you how many part- files to expect, as follows:

      # PySpark
      # Get the number of partitions of my_rdd (public API since Spark 1.1.0):
      my_rdd.getNumPartitions()
      # On older releases such as the 0.9.x linked above, go through the Java RDD:
      my_rdd._jrdd.splits().size()
      
    • _SUCCESS file: The presence of an empty _SUCCESS file simply means that the operation completed normally; downstream jobs commonly check for it before reading the directory (see the sketch below).

    • .crc files: These are checksum files that Hadoop's filesystem layer writes alongside each output file (including _SUCCESS) so that readers can detect data corruption; you will typically see them when saving to a local filesystem.
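
    If you would rather end up with a single output file, the usual trick is to shrink the RDD to one partition before saving. A minimal PySpark sketch, assuming an existing SparkContext sc and a hypothetical output path (the target directory must not already exist):

      # 8 partitions would normally produce 8 part- files on save.
      my_rdd = sc.parallelize(range(100), 8)
      # Coalesce to one partition so saveAsTextFile() writes a single part-00000.
      # This funnels all data through one task, so only do it for small outputs.
      my_rdd.coalesce(1).saveAsTextFile("/tmp/single-part-output")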

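    Downstream jobs typically look for the _SUCCESS marker before trusting a directory, and a plain listing makes the hidden .crc siblings visible. A minimal sketch for a local-filesystem path, reusing the hypothetical directory and the SparkContext sc from the sketch above:

      import os

      output_dir = "/tmp/single-part-output"
      # Show everything the save produced, including the hidden .crc files.
      for name in sorted(os.listdir(output_dir)):
          print(name)  # e.g. ._SUCCESS.crc, .part-00000.crc, _SUCCESS, part-00000
      # Only consume the output if the writing job finished cleanly.
      if os.path.exists(os.path.join(output_dir, "_SUCCESS")):
          result = sc.textFile(output_dir)  # Hadoop skips files starting with '_' or '.'
          print(result.count())
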