When I run a Spark job and save the output as a text file using the method "saveAsTextFile", as specified at https://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.
Those are files generated by the underlying Hadoop API that Spark calls when you invoke saveAsTextFile().
part- files: These are your output data files. You will have one part- file per partition in the RDD you called saveAsTextFile() on. Each of these files is written out in parallel, up to a certain limit (typically, the number of cores on the workers in your cluster). This means your output is written much faster than it would be if it were all put in a single file, assuming your storage layer can handle the bandwidth.
You can check the number of partitions in your RDD, which tells you how many part- files to expect, as follows:

# PySpark
# Get the number of partitions of my_rdd.
my_rdd._jrdd.splits().size()

# In Spark 1.1.0 and later, there is a public API for this:
# my_rdd.getNumPartitions()
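The one-file-per-partition behaviour can be sketched in plain Python, no Spark required: each partition's records go to their own part-NNNNN file, and the files are written concurrently. The function name, directory, and sample data below are made up for illustration; this is not Spark's actual implementation.

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def save_as_text_file(partitions, out_dir):
    """Mimic Spark's naming scheme: one part-NNNNN file per
    partition, each written by its own worker thread."""
    os.makedirs(out_dir, exist_ok=True)

    def write_partition(index_and_records):
        index, records = index_and_records
        path = os.path.join(out_dir, "part-%05d" % index)
        with open(path, "w") as f:
            for record in records:
                f.write(str(record) + "\n")
        return path

    # Threads stand in for Spark's parallel writers here.
    with ThreadPoolExecutor() as pool:
        return list(pool.map(write_partition, enumerate(partitions)))

out_dir = os.path.join(tempfile.mkdtemp(), "output")
paths = save_as_text_file([["a", "b"], ["c"], ["d", "e"]], out_dir)
print(sorted(os.listdir(out_dir)))  # ['part-00000', 'part-00001', 'part-00002']
```

Three partitions in, three part- files out; coalescing or repartitioning the RDD before saving is how you change that count in real Spark.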
_SUCCESS file: The presence of an empty _SUCCESS file simply means that the operation completed normally.
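That marker is useful on the reading side, too: a downstream job can refuse to touch a directory that lacks it, and so never consumes a half-written output. A minimal stdlib sketch of the pattern (the directory and file names are illustrative):

```python
import os
import tempfile

def output_is_complete(out_dir):
    # Hadoop/Spark drop an empty _SUCCESS file into the output
    # directory only after every part- file has been committed.
    return os.path.exists(os.path.join(out_dir, "_SUCCESS"))

out_dir = tempfile.mkdtemp()
with open(os.path.join(out_dir, "part-00000"), "w") as f:
    f.write("data\n")
print(output_is_complete(out_dir))  # False: job not finished yet

open(os.path.join(out_dir, "_SUCCESS"), "w").close()  # empty marker
print(output_is_complete(out_dir))  # True
```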
.crc files: I have not seen the .crc files before, but yes, presumably they are checksums of the part- files.
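For what it's worth, those hidden .crc files come from Hadoop's checksummed filesystem layer, which stores a CRC32 per fixed-size chunk of each data file and verifies it on read. The idea can be illustrated with Python's stdlib zlib.crc32; the chunk size and layout below are illustrative, not Hadoop's exact on-disk format.

```python
import zlib

def chunk_checksums(data, chunk_size=512):
    # One CRC32 per fixed-size chunk, similar in spirit to what
    # Hadoop stores in the hidden .crc sidecar files.
    return [zlib.crc32(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]

def verify(data, checksums, chunk_size=512):
    # Recompute and compare; any flipped byte changes its chunk's CRC.
    return chunk_checksums(data, chunk_size) == checksums

data = b"line1\nline2\n" * 100        # 1200 bytes -> 3 chunks of <= 512
sums = chunk_checksums(data)
assert verify(data, sums)

corrupted = data[:5] + b"X" + data[6:]  # single-byte corruption
assert not verify(corrupted, sums)
```

Per-chunk checksums mean a read only has to re-verify the chunks it actually touches, which is why Hadoop keeps them alongside each part- file rather than one checksum per file.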