Spark: Order of column arguments in repartition vs partitionBy

时光说笑 2021-01-02 05:40

Methods taken into consideration (Spark 2.2.1):

  1. DataFrame.repartition (the two implementations that take partitionExprs: Column* arguments)
  2. DataFrameWriter.partitionBy
2 Answers
    说谎 2021-01-02 06:02

    Before answering this question, let me clarify some concepts in Spark.

    partition directory: a partitioned write physically maps these to folders in HDFS; each folder can hold nested sub-folders and Parquet files. (These are folders in the file system, not HDFS blocks, which are a lower-level storage unit.)

    Parquet: a compressed, columnar file format, commonly used in HDFS clusters to store data.

    Now, coming to the answer:

    repartition(number_of_partitions, *columns): this shuffles the data into number_of_partitions partitions by hashing each row's combination of values in the given columns, and those partitions are what later get written out as Parquet files. Rows that share the same combination of values always land in the same partition, so the order of the columns makes no difference here. Spark does not sort on these columns; it simply hashes the combination of values to pick a partition, as the sketch below illustrates.
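    A minimal PySpark sketch of this behaviour (the session setup, data, and column names are illustrative assumptions, not from the original question):

        from pyspark.sql import SparkSession
        from pyspark.sql.functions import spark_partition_id

        spark = SparkSession.builder.master("local[4]").getOrCreate()

        df = spark.createDataFrame(
            [(1, "a"), (1, "b"), (2, "a"), (2, "b")],
            ["colA", "colB"],
        )

        # Rows sharing the same (colA, colB) combination always hash to the
        # same partition. Swapping the column order changes the hash values
        # (and possibly which partition a given group lands in), but never
        # splits a combination across partitions.
        df.repartition(4, "colA", "colB").withColumn("pid", spark_partition_id()).show()
        df.repartition(4, "colB", "colA").withColumn("pid", spark_partition_id()).show()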

    partitionBy(*columns): this is quite different from repartition. It is a method on the DataFrameWriter, and it creates folders in HDFS, one directory per distinct value of the columns provided, nested in the order the columns are listed. So suppose:

    Col A = [1,2,3,4,5]

    While writing the table to HDFS, Spark will create one folder per value:

        colA=1/
        colA=2/
        colA=3/
        ...

    and if you provide two columns, the second is nested inside the first:

        colA=1/
            colB=1/
            colB=2/
            colB=3/
            ...
        colA=2/
            ...
        colA=3/
            ...

    Here the order of the columns determines how the directories are nested, so for partitionBy the order does matter.
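    A hedged sketch of the kind of write that produces this layout (the session setup, output paths, data, and column names are illustrative assumptions):

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.master("local[2]").getOrCreate()

        df = spark.createDataFrame(
            [(1, "x"), (1, "y"), (2, "x")],
            ["colA", "colB"],
        )

        # Creates colA=1/colB=x/..., colA=1/colB=y/..., colA=2/colB=x/...
        df.write.mode("overwrite").partitionBy("colA", "colB").parquet("/tmp/example_out")

        # Reversing the argument order reverses the nesting (colB=x/colA=1/...),
        # so for partitionBy the column order is significant.
        df.write.mode("overwrite").partitionBy("colB", "colA").parquet("/tmp/example_out_rev")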

    Inside each leaf folder Spark stores Parquet files; the partition columns themselves are not written into those files, because their values are already encoded in the directory names. The number of files in each folder depends on how many write tasks hold data for that value. The bucketBy attribute can additionally split each folder's data into a fixed number of buckets, which bounds the file count per write task. bucketBy has been available in the Scala API since Spark 2.0 and in PySpark since 2.3.
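    A minimal bucketBy sketch, assuming Spark 2.3+ for the PySpark API. Note that in Spark 2.x, bucketBy is only supported through saveAsTable, not through path-based writes such as .parquet(path); the table name below is hypothetical:

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.master("local[2]").getOrCreate()

        df = spark.createDataFrame(
            [(1, "x", 10), (1, "y", 20), (2, "x", 30)],
            ["colA", "colB", "value"],
        )

        # One directory per colA value; within each directory, rows are
        # hashed on colB into at most 4 buckets per write task, sorted by
        # colB within each bucket.
        (df.write
           .partitionBy("colA")
           .bucketBy(4, "colB")
           .sortBy("colB")
           .saveAsTable("bucketed_example"))  # hypothetical table name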
