When I run a Spark job and save the output as a text file using the method "saveAsTextFile", as specified at https://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.
Those are files generated by the underlying Hadoop API that Spark calls when you invoke saveAsTextFile().
part- files: These are your output data files. You will have one part- file per partition in the RDD you called saveAsTextFile() on. Each of these files is written out in parallel, up to a certain limit (typically, the number of cores on the workers in your cluster). This means your output is written much faster than it would be if it were all put in a single file, assuming your storage layer can handle the bandwidth.
You can check the number of partitions in your RDD, which tells you how many part- files to expect, as follows:

# PySpark
# Get the number of partitions of my_rdd.
my_rdd._jrdd.splits().size()

# In Spark 1.1.0 and later, there is a public API for this:
# my_rdd.getNumPartitions()
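The one-file-per-partition behaviour can be sketched in plain Python, no Spark required: each partition's records go to their own part-NNNNN file, and the files are written concurrently. The function name, directory, and sample data below are made up for illustration; this is not Spark's actual implementation.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def save_as_text_file(partitions, out_dir):
    """Mimic Spark's naming scheme: one part-NNNNN file per
    partition, each written by its own worker thread."""
    os.makedirs(out_dir, exist_ok=True)

    def write_partition(index_and_records):
        index, records = index_and_records
        path = os.path.join(out_dir, "part-%05d" % index)
        with open(path, "w") as f:
            for record in records:
                f.write(str(record) + "\n")
        return path

    # Threads stand in for Spark's parallel writers here.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(write_partition, enumerate(partitions)))

out_dir = os.path.join(tempfile.mkdtemp(), "output")
paths = save_as_text_file([["a", "b"], ["c"], ["d", "e"]], out_dir)
print(sorted(os.listdir(out_dir)))  # ['part-00000', 'part-00001', 'part-00002']
```

Three partitions in, three part- files out; coalescing or repartitioning the RDD before saving is how you change that count in real Spark.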
_SUCCESS file: The presence of an empty _SUCCESS file simply means that the operation completed normally.
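That marker is useful on the reading side, too: a downstream job can refuse to touch a directory that lacks it, and so never consumes a half-written output. A minimal stdlib sketch of the pattern (the directory and file names are illustrative):

```python
import os
import tempfile

def output_is_complete(out_dir):
    # Hadoop/Spark drop an empty _SUCCESS file into the output
    # directory only after every part- file has been committed.
    return os.path.exists(os.path.join(out_dir, "_SUCCESS"))

out_dir = tempfile.mkdtemp()
with open(os.path.join(out_dir, "part-00000"), "w") as f:
    f.write("data\n")
print(output_is_complete(out_dir))  # False: job not finished yet

open(os.path.join(out_dir, "_SUCCESS"), "w").close()  # empty marker
print(output_is_complete(out_dir))  # True
```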
.crc files: I have not seen the .crc files before, but yes, presumably they are checksums of the part- files.
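For what it's worth, those hidden .crc files come from Hadoop's checksummed filesystem layer, which stores a CRC32 per fixed-size chunk of each data file and verifies it on read. The idea can be illustrated with Python's stdlib zlib.crc32; the chunk size and layout below are illustrative, not Hadoop's exact on-disk format.

```python
import zlib

def chunk_checksums(data, chunk_size=512):
    # One CRC32 per fixed-size chunk, similar in spirit to what
    # Hadoop stores in the hidden .crc sidecar files.
    return [zlib.crc32(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]

def verify(data, checksums, chunk_size=512):
    # Recompute and compare; any flipped byte changes its chunk's CRC.
    return chunk_checksums(data, chunk_size) == checksums

data = b"line1\nline2\n" * 100        # 1200 bytes -> 3 chunks of <= 512
sums = chunk_checksums(data)
assert verify(data, sums)

corrupted = data[:5] + b"X" + data[6:]  # single-byte corruption
assert not verify(corrupted, sums)
```

Per-chunk checksums mean a read only has to re-verify the chunks it actually touches, which is why Hadoop keeps them alongside each part- file rather than one checksum per file.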