How to convert multiple Parquet files into TFRecord files using Spark?

Question

I would like to produce stratified TFRecord files from a large DataFrame based on a certain condition, for which I use write.partitionBy(). I'm also using the tensorflow-connector in Spark, but this apparently does not work together with a write.partitionBy() operation. I have therefore found no alternative to working in two steps:

1. Repartition the DataFrame according to my condition using partitionBy(), and write the resulting partitions to Parquet files.
2. Read those Parquet files and convert them into TFRecord files with the tensorflow-connector plugin.
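The first step and the bookkeeping it needs can be sketched as follows. This is a minimal sketch, not the asker's actual code: the function names `write_stratified_parquet` and `partition_dirs` are hypothetical, and the directory listing assumes the Parquet root is on a local (or locally mounted) filesystem, as the `glob` call in the question suggests.

```python
import os


def write_stratified_parquet(df, column, parquet_root):
    """Step 1: write one sub-directory per distinct value of `column`.

    `df` is assumed to be a pyspark.sql.DataFrame. partitionBy() produces
    directories named <column>=<value> under parquet_root.
    """
    df.write.partitionBy(column).mode("overwrite").parquet(parquet_root)


def partition_dirs(parquet_root, column):
    """List the <column>=<value> directories created by step 1.

    Marker files such as _SUCCESS and unrelated directories are skipped.
    """
    prefix = column + "="
    found = []
    for name in os.listdir(parquet_root):
        full = os.path.join(parquet_root, name)
        if name.startswith(prefix) and os.path.isdir(full):
            found.append(full)
    return sorted(found)
```

Note that when one of these sub-directories is read on its own, the partition column is not part of the file contents, because its value is encoded only in the directory name. Passing `option("basePath", parquet_root)` to the reader keeps the column.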

It is the second step that I'm unable to do efficiently. My idea was to read the individual Parquet files on the executors and immediately write them out as TFRecord files. But this needs access to the SQLContext, which is only available on the driver (discussed here), so it cannot be done in parallel. I would like to do something like this:

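# List all Parquet files to be converted
import glob
from pyspark.sql import SparkSession

files = glob.glob('/path/*.parquet')

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext  # parallelize() lives on the SparkContext, not the SparkSession
# convert_parquet_to_tfrecord is the function I would like to write (not defined yet)
sc.parallelize(files, 2).foreach(lambda parquetFile: convert_parquet_to_tfrecord(parquetFile))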

Could I construct the function convert_parquet_to_tfrecord that would be able to do this on the executors?
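Since the session cannot be used inside `foreach` on the executors, one possible workaround is to keep the loop on the driver and submit one Spark job per file from a small thread pool. Each job is still executed by the executors, and Spark's scheduler accepts jobs submitted from several driver threads. The sketch below is an assumption-laden illustration: `convert_all` and the output layout (one directory per input file under a hypothetical `tfrecord_root`) are not from the question, and it presumes the spark-tensorflow-connector is on the classpath.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def convert_parquet_to_tfrecord(spark, parquet_file, tfrecord_root):
    """Read one Parquet file and write its rows as TFRecords.

    Called on the driver; the read and write themselves run on the executors.
    """
    name = os.path.splitext(os.path.basename(parquet_file))[0]
    out_dir = os.path.join(tfrecord_root, name)
    (spark.read.parquet(parquet_file)
          .write.format("tfrecords")
          .option("recordType", "Example")
          .save(out_dir))
    return out_dir


def convert_all(convert, files, max_workers=2):
    """Run `convert` on every file from driver-side threads.

    Results come back in the same order as `files`.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(convert, files))
```

With the `spark` and `files` variables from the snippet above, a call could look like `convert_all(lambda f: convert_parquet_to_tfrecord(spark, f, '/path/tfrecords'), files)`, where `/path/tfrecords` is a placeholder.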

I've also tried just using the wildcard when reading all the parquet files:

SQLContext(sc).read.parquet('/path/*.parquet')

This does read all the Parquet files, but unfortunately not into one partition per file. The original structure appears to get lost, so it doesn't help if I want the exact contents of each individual Parquet file converted into a TFRecord file.
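One way to keep track of the original structure after a wildcard read is to tag every row with the file it came from, using PySpark's `input_file_name()`. The following is a minimal sketch: the column name `source_file` and the helper names are hypothetical, and it does not by itself restore a one-partition-per-file layout.

```python
import os


def source_stem(uri):
    """Turn a file URI such as 'file:///path/part-0001.parquet' into 'part-0001'."""
    return os.path.splitext(os.path.basename(uri))[0]


def read_with_source_file(spark, pattern):
    """Read all files matching `pattern` and record each row's origin."""
    # Imported here so that source_stem stays usable without Spark installed.
    from pyspark.sql.functions import input_file_name

    return spark.read.parquet(pattern).withColumn("source_file", input_file_name())
```

The tagged DataFrame can then be filtered on `source_file` to recover the rows of a single input file, and `source_stem` can be used to derive an output name for it.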

Any other suggestions?

Answer 1:

If I understood your question correctly, you want to write the partitions locally to the workers' disks.

If that is the case then I would recommend looking at spark-tensorflow-connector's instructions on how to do so.

This is the code that you are looking for (as stated in the documentation linked above):

myDataFrame.write.format("tfrecords").option("writeLocality", "local").save("/path")
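According to the connector's documentation, local write mode stores each DataFrame partition as a single TFRecord file on the worker that holds it. Combined with a `repartition()` on the stratification column, this can approximate the stratified output the question asks for. The sketch below is an illustration under assumptions: `write_stratified_tfrecords` is a hypothetical helper, and because `repartition()` hashes the column, several distinct values may still share one partition and therefore one file.

```python
def write_stratified_tfrecords(df, column, out_dir, record_type="Example"):
    """Group rows by `column`, then write each partition locally as TFRecords.

    `df` is assumed to be a pyspark.sql.DataFrame and the
    spark-tensorflow-connector is assumed to be available.
    """
    (df.repartition(column)  # rows with equal values land in the same partition
       .write.format("tfrecords")
       .option("recordType", record_type)
       .option("writeLocality", "local")  # files end up on the workers' disks
       .save(out_dir))
```

Because the files are written to the workers' local disks rather than to a shared filesystem, they have to be collected from each worker afterwards.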

On a side note, if you are worried about efficiency, why are you using PySpark? It would be better to use Scala instead.

Source: https://stackoverflow.com/questions/54312284/how-to-convert-multiple-parquet-files-into-tfrecord-files-using-spark

Tags

apache-spark

pyspark

pyspark-sql

parquet

tfrecord