SPARK DataFrame: How to efficiently split dataframe for each group based on same column values

Asked 2020-12-31 10:41 by 花落未央

I have a DataFrame generated as follows:

df.groupBy($"Hour", $"Category")
  .agg(sum($"value").alias("TotalValue"))
  .sort($"Hour".asc, $"TotalValue".desc)


        
3 Answers
  • 2020-12-31 11:26

    As noted in my comments, one potentially easy approach to this problem would be to use:

    df.write.partitionBy("hour").saveAsTable("myparquet")
    

    As noted, the folder structure would be myparquet/hour=1, myparquet/hour=2, ..., myparquet/hour=24 as opposed to myparquet/1, myparquet/2, ..., myparquet/24.
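
    For illustration, a minimal sketch (hypothetical output path and a standard spark session assumed) of the write-then-read round trip, using a path rather than a table; with the default Hour=N folder names, Spark reconstitutes the partition column on read:

        // Write one folder per distinct Hour value (hypothetical path).
        df.write.partitionBy("Hour").parquet("/tmp/myparquet")

        // Spark derives Hour from the Hour=N folder names on read.
        val restored = spark.read.parquet("/tmp/myparquet")
        restored.printSchema() // Hour appears as a partition column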

    To change the folder structure, you could:

    1. Use the Hive configuration setting hcat.dynamic.partitioning.custom.pattern within an explicit HiveContext; more information at HCatalog DynamicPartitions.
    2. Rename the folders on the file system directly after running the df.write.partitionBy(...).saveAsTable(...) command, with something like for f in *; do mv "$f" "${f/${f:0:5}/}"; done, which strips the leading hour= text from each folder name.

    Note, however, that once you change the folder naming pattern, running spark.read.parquet(...) on that directory will no longer discover the dynamic partitions automatically, since the partition key (i.e. Hour) information is missing from the paths.
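
    If the Hour= prefix has already been stripped, one workaround (a sketch, assuming folders renamed to myparquet/1 through myparquet/24 and an integer hour) is to read each folder separately and re-attach the Hour value yourself:

        import org.apache.spark.sql.functions.lit

        // Rebuild the Hour column that the renamed folders no longer encode.
        val restored = (1 to 24)
          .map(h => spark.read.parquet(s"myparquet/$h").withColumn("Hour", lit(h)))
          .reduce(_ union _)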

  • 2020-12-31 11:30
    // If you want to divide a dataset into n equal parts:
    double[] arraySplit = {1, 1, /* ... n weights in total ... */ 1}; // use unequal weights to split by ratio instead
    
    List<Dataset<Row>> datasetList = dataset.randomSplitAsList(arraySplit, 1L);
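
    For reference, a rough Scala equivalent (n and the seed are assumed values) is randomSplit; note that either call splits rows randomly by weight rather than by column value, so it does not give one dataset per group:

        // Split df into n randomly sampled parts of roughly equal size.
        val n = 24
        val parts: Array[org.apache.spark.sql.DataFrame] =
          df.randomSplit(Array.fill(n)(1.0), seed = 1L)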
    
  • 2020-12-31 11:39

    This has been answered here for Spark (Scala):

    How can I split a dataframe into dataframes with same column values in SCALA and SPARK

    and here for PySpark:

    PySpark - Split/Filter DataFrame by column's values
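
    The approach those answers describe boils down to collecting the distinct key values and filtering once per value; a minimal sketch in Scala (the Hour column and its integer type are assumed from the question):

        // Build one DataFrame per distinct Hour value; each filter is lazy.
        val hours = df.select("Hour").distinct().collect().map(_.getInt(0))
        val byHour: Map[Int, org.apache.spark.sql.DataFrame] =
          hours.map(h => h -> df.filter(df("Hour") === h)).toMap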
