I have a DataFrame generated as follows:

df.groupBy($"Hour", $"Category")
  .agg(sum($"value").alias("TotalValue"))
  .sort($"Hour".asc, $"TotalValue".desc)
As noted in my comments, one potentially easy approach to this problem would be to use:
df.write.partitionBy("Hour").saveAsTable("myparquet")

As noted, the folder structure would be myparquet/Hour=1, myparquet/Hour=2, ..., myparquet/Hour=24, as opposed to myparquet/1, myparquet/2, ..., myparquet/24.
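For completeness, here is a minimal sketch of the full write, assuming the aggregated DataFrame above is called df; saveAsTable needs a Hive-enabled session, which is an assumption about the setup here, and the app name is purely illustrative:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("partitioned-write")      // hypothetical app name
  .enableHiveSupport()               // saveAsTable persists the table through the Hive metastore
  .getOrCreate()

df.write
  .partitionBy("Hour")
  .mode("overwrite")                 // assumption: replace any existing table of the same name
  .saveAsTable("myparquet")
// on-disk layout: myparquet/Hour=1, myparquet/Hour=2, ..., myparquet/Hour=24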
To change the folder structure, you could use hcat.dynamic.partitioning.custom.pattern within an explicit HiveContext; more information is available at HCatalog DynamicPartitions. Alternatively, you could change the file system directly after running the df.write.partitionBy(...).saveAsTable(...) command with something like

for f in *; do mv $f ${f/${f:0:5}/}; done

which removes the Hour= text from the folder names.
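If you would rather do that rename from Scala instead of the shell, a minimal sketch using the Hadoop FileSystem API follows; the warehouse path is a hypothetical example, and it assumes the partition folders are prefixed with Hour=:

import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val tableDir = new Path("/user/hive/warehouse/myparquet")   // hypothetical table location
fs.listStatus(tableDir)
  .filter(_.isDirectory)
  .foreach { status =>
    val name = status.getPath.getName                       // e.g. "Hour=1"
    if (name.startsWith("Hour=")) {
      fs.rename(status.getPath, new Path(tableDir, name.stripPrefix("Hour=")))  // -> "1"
    }
  }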
It is important to note that once you change the naming pattern of the folders, running spark.read.parquet(...) on that directory will no longer let Spark automatically understand the dynamic partitions, since the partition key (i.e. Hour) information is missing from the paths.
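One way to get the hour back when reading the renamed layout is to re-derive it from the file path. This is only a sketch, assuming the folders are now myparquet/1, ..., myparquet/24 under the same hypothetical warehouse path as above:

import org.apache.spark.sql.functions.{input_file_name, regexp_extract}

val reloaded = spark.read
  .parquet("/user/hive/warehouse/myparquet/*")
  .withColumn("Hour", regexp_extract(input_file_name(), "myparquet/(\\d+)/", 1).cast("int"))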
// If you want to divide a dataset into n equal datasets
double[] arraySplit = {1, 1, 1, 1};  // n equal weights (here n = 4); you can also divide by ratio if you change the numbers
List<Dataset<String>> datasetList = dataset.randomSplitAsList(arraySplit, 1);  // the second argument is the random seed
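The Scala equivalent is Dataset.randomSplit; a minimal sketch, assuming df is the DataFrame from above and n = 4 equal parts:

val n = 4
val parts = df.randomSplit(Array.fill(n)(1.0), 1L)   // equal weights give approximately equal-sized splits; 1L is the seed
// parts(0) through parts(n - 1) are the resulting DataFrames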
This has been answered here for Spark (Scala):
How can I split a dataframe into dataframes with same column values in SCALA and SPARK
and here for PySpark:
PySpark - Split/Filter DataFrame by column's values
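For reference, the core idea behind both of those links can be sketched as one filter per distinct value of the column; this collects the distinct keys to the driver, which is reasonable for a small key set such as 24 hours (assuming df from above with an integer Hour column):

import org.apache.spark.sql.functions.col

val hours = df.select("Hour").distinct().collect().map(_.getInt(0))
val byHour = hours.map(h => h -> df.filter(col("Hour") === h)).toMap   // Map[Int, DataFrame], one DataFrame per hour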