I am reading from a Kafka queue using Spark Structured Streaming. After reading from Kafka I am applying a filter on the dataframe. I am saving this fi…
I recommend using repartition(partitioningColumns) on the DataFrame/Dataset, and after that partitionBy(partitioningColumns) on the writeStream operation, to avoid writing empty files.
Reason: With a lot of data, the bottleneck is often Spark's read performance when there are many small (or even empty) files and no partitioning. So you should definitely make use of file/directory partitioning (which is not the same as RDD partitioning). This is especially a problem when using AWS S3. The partitioning columns should match your common read queries, e.g. timestamp/day, message type/Kafka topic, ...
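Here is a minimal sketch of that combination, assuming a hypothetical Kafka topic "events" on localhost:9092 and year/month as the partitioning columns derived from the Kafka timestamp; paths, topic and column names are placeholders you would adapt to your job:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder.appName("kafka-to-parquet").getOrCreate()
import spark.implicits._

// Read from Kafka and apply the filter from the question.
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")   // assumed broker
  .option("subscribe", "events")                          // assumed topic
  .load()
  .selectExpr("CAST(value AS STRING) AS json", "timestamp")
  .withColumn("year", year($"timestamp"))
  .withColumn("month", month($"timestamp"))
  .filter($"json".isNotNull)

// repartition(...) groups the rows of each micro-batch by the same columns
// that partitionBy(...) later uses for the directory layout.
val query = events
  .repartition($"year", $"month")
  .writeStream
  .format("parquet")
  .option("path", "s3a://bucket/events")                        // hypothetical output path
  .option("checkpointLocation", "s3a://bucket/checkpoints/events")
  .partitionBy("year", "month")                                 // year=.../month=.../ directories
  .start()

query.awaitTermination()
```

The idea is that repartition controls how the rows of a micro-batch are grouped in memory (and thus how many files land in each directory), while partitionBy only controls the directory layout; using the same columns for both keeps a trigger from scattering a handful of rows, or nothing at all, into every output directory.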
See also the partitionBy documentation at http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter:
Partitions the output by the given columns on the file system. If specified, the output is laid out on the file system similar to Hive's partitioning scheme. As an example, when we partition a dataset by year and then month, the directory layout would look like:
year=2016/month=01/, year=2016/month=02/
Partitioning is one of the most widely used techniques to optimize physical data layout. It provides a coarse-grained index for skipping unnecessary data reads when queries have predicates on the partitioned columns. In order for partitioning to work well, the number of distinct values in each column should typically be less than tens of thousands.
This is applicable for all file-based data sources (e.g. Parquet, JSON) starting with Spark 2.1.0.
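To show the read side of this, a small sketch (reusing the Spark session and the hypothetical s3a://bucket/events path from the example above) of how a predicate on the partition columns lets Spark prune directories instead of listing and reading every file:

```scala
// The predicate on year/month is resolved against the directory names
// (year=.../month=...), so only matching directories are listed and read,
// i.e. the coarse-grained index described in the quoted documentation.
val january = spark.read
  .parquet("s3a://bucket/events")              // hypothetical path from the write sketch
  .where($"year" === 2016 && $"month" === 1)   // only year=2016/month=1/ is scanned
```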