How to avoid empty files while writing parquet files?

囚心锁ツ · 2021-01-16 07:15

I am reading from a Kafka queue using Spark Structured Streaming. After reading from Kafka I am applying a filter on the dataframe. I am saving this filtered dataframe as parquet files, but many of the output files are empty. Is there any way I can stop writing an empty file?

4 Answers
  •  星月不相逢
    2021-01-16 07:52

    Is there any way I can stop writing an empty file?

    Yes, but you would rather not do it.

    The reason for the many empty parquet files is that Spark SQL (the underlying infrastructure for Structured Streaming) tries to guess the number of partitions to use when loading a dataset (the records from Kafka in each batch), and it often guesses "poorly", i.e. many partitions end up with no data.

    When you save a partition with no data you will get an empty file.

    You can use repartition or coalesce operators to set the proper number of partitions and reduce (or even completely avoid) empty files. See Dataset API.
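    Below is a minimal sketch of that approach. The topic name "events", the broker "localhost:9092", and the output paths are all placeholders I made up; substitute your own. coalesce(1) collapses each micro-batch into a single partition, so a batch that has any data produces one non-empty parquet file.

    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.length

    object AvoidEmptyParquetFiles {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("avoid-empty-parquet-files")
          .getOrCreate()
        import spark.implicits._

        // Read the Kafka stream and apply the filter from the question
        // (the predicate here is just a placeholder).
        val filtered = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load()
          .selectExpr("CAST(value AS STRING) AS value")
          .filter($"value".isNotNull && length($"value") > 0)

        // coalesce(1) collapses every micro-batch into a single partition,
        // so each batch that has any data writes exactly one parquet file.
        val query = filtered
          .coalesce(1)
          .writeStream
          .format("parquet")
          .option("path", "/tmp/filtered-output")
          .option("checkpointLocation", "/tmp/filtered-checkpoint")
          .start()

        query.awaitTermination()
      }
    }
    ```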

    Why would you not do it? repartition and coalesce may incur a performance hit because of the extra step of shuffling data between partitions (and possibly nodes in your Spark cluster). That can be expensive and often not worth it (which is why I said you would rather not do it).

    You may then be asking yourself how to know the right number of partitions, and that's a very good question in any Spark project. The answer is fairly simple (and obvious once you understand what Spark does and how it does it): "Know your data" so you can calculate exactly how many partitions are right for it.
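    One possible way to apply that, as a rough sketch only, is to size the partition count per micro-batch yourself with foreachBatch: count the rows, repartition to roughly a target number of rows per file, and skip the write entirely when the batch is empty. targetRowsPerFile is a made-up tuning knob; pick it from your own data.

    ```scala
    // Reuses the `filtered` streaming DataFrame from the sketch above;
    // paths and the row target are hypothetical.
    import org.apache.spark.sql.DataFrame

    val targetRowsPerFile = 500000L

    val query = filtered.writeStream
      .foreachBatch { (batch: DataFrame, batchId: Long) =>
        val rows = batch.count()
        if (rows > 0) {
          // At least one partition, roughly targetRowsPerFile rows per file.
          val numFiles = math.max(1L, rows / targetRowsPerFile).toInt
          batch.repartition(numFiles)
            .write
            .mode("append")
            .parquet("/tmp/filtered-output")
        }
        // An empty batch writes nothing, so no empty files are produced.
      }
      .option("checkpointLocation", "/tmp/filtered-checkpoint")
      .start()
    ```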
