How to convert multiple parquet files into TFrecord files using SPARK?

不羁岁月 submitted on 2020-02-28 17:24:08

Question


I would like to produce stratified TFRecord files from a large DataFrame based on a certain condition, for which I use write.partitionBy(). I'm also using the tensorflow-connector in Spark, but this apparently does not work together with a write.partitionBy() operation. Therefore, I have not found any way other than working in two steps:

  1. Repartition the DataFrame according to my condition using partitionBy() and write the resulting partitions to parquet files (see the sketch after this list).
  2. Read those parquet files and convert them into TFRecord files with the tensorflow-connector plugin.
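
A minimal sketch of what step 1 could look like. The condition column name "label" and the paths are illustrative assumptions, not part of the original question:

# Hypothetical sketch of step 1: partition the DataFrame by a condition column
# and write each partition out as parquet.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.read.parquet('/path/input')                 # the large source DataFrame
df.write.partitionBy('label').parquet('/path/partitioned')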

It is the second step that I'm unable to do efficiently. My idea was to read in the individual parquet files on the executors and immediately write them into TFRecord files. But this needs access to the SQLContext, which is only available on the driver (discussed here), so it cannot run in parallel. I would like to do something like this:

# List all parquet files to be converted
import glob
from pyspark.sql import SparkSession

files = glob.glob('/path/*.parquet')

# Distribute the file list and convert each file on an executor
spark = SparkSession.builder.getOrCreate()
spark.sparkContext.parallelize(files, 2).foreach(convert_parquet_to_tfrecord)

Could I construct the function convert_parquet_to_tfrecord that would be able to do this on the executors?
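
One possible sketch (not part of the original post) of what such a function might look like: it sidesteps the SQLContext problem by reading the parquet file with pyarrow and writing the TFRecord file with TensorFlow directly on the executor, assuming the file paths are on a filesystem visible to the executors. The feature-type mapping and output path are illustrative assumptions that would need to be adapted to the actual schema:

# Hypothetical sketch: convert one parquet file into one TFRecord file without
# touching Spark on the executor. Type handling and paths are assumptions.
import os
import numpy as np
import pyarrow.parquet as pq
import tensorflow as tf

def convert_parquet_to_tfrecord(parquet_file):
    # Read the single parquet file without any Spark/SQLContext involvement
    df = pq.read_table(parquet_file).to_pandas()
    out_path = os.path.splitext(parquet_file)[0] + '.tfrecord'
    with tf.io.TFRecordWriter(out_path) as writer:
        for row in df.itertuples(index=False):
            # Build one tf.train.Example per row; the type mapping is a rough guess
            feature = {}
            for name, value in zip(df.columns, row):
                if isinstance(value, (int, np.integer)):
                    feature[name] = tf.train.Feature(
                        int64_list=tf.train.Int64List(value=[int(value)]))
                elif isinstance(value, (float, np.floating)):
                    feature[name] = tf.train.Feature(
                        float_list=tf.train.FloatList(value=[float(value)]))
                else:
                    feature[name] = tf.train.Feature(
                        bytes_list=tf.train.BytesList(value=[str(value).encode('utf-8')]))
            example = tf.train.Example(features=tf.train.Features(feature=feature))
            writer.write(example.SerializeToString())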

I've also tried just using the wildcard when reading all the parquet files:

spark.read.parquet('/path/*.parquet')

This indeed reads all the parquet files, but unfortunately not into individual partitions: the original file structure appears to get lost, so it doesn't help when I want the exact contents of each individual parquet file converted into a TFRecord file.

Any other suggestions?


Answer 1:


If I understood your question correctly, you want to write the partitions locally on the workers' disk.

If that is the case then I would recommend looking at spark-tensorflow-connector's instructions on how to do so.

This is the code that you are looking for (as stated in the documentation linked above):

myDataFrame.write.format("tfrecords").option("writeLocality", "local").save("/path")  
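
As a rough sketch of how that could be combined with the two-step approach from the question (the partition directory and column value are illustrative assumptions carried over from the earlier sketch):

# Hypothetical usage: convert one partition directory produced in step 1 into
# TFRecord files written to the workers' local disks. Paths are assumptions.
df = spark.read.parquet('/path/partitioned/label=A')
df.write.format("tfrecords").option("writeLocality", "local").save('/path/tfrecords/label=A')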

On a side note, if you are worried about efficiency, why are you using PySpark? It would be better to use Scala instead.



Source: https://stackoverflow.com/questions/54312284/how-to-convert-multiple-parquet-files-into-tfrecord-files-using-spark
