Question
Background: I am working with clinical data spread across a lot of different .csv/.txt files. All these files are patientID based, but with different fields. I am importing these files into DataFrames, which I will join at a later stage after first processing each of these DataFrames individually. I have shown examples of two DataFrames below (df_A and df_B). Similarly, I have multiple DataFrames - df_A, df_B, df_C .... df_J - and I will join all of them at a later stage.
df_A = spark.read.schema(schema).format("csv").load(...).... # Just an example
df_A.show(3)
#Example 1:
+----------+-----------------+
| patientID| diagnosis_code|
+----------+-----------------+
| A51| XIII|
| B22| VI|
| B13| XV|
+----------+-----------------+
df_B.show(3)
#Example 2:
+-----------+----------+-------+-------------+--------+
| patientID| hospital| city| doctor_name| Bill|
+-----------+----------+-------+-------------+--------+
| A51| Royal H| London|C.Braithwaite| 451.23|
| B22|Surgery K.| Leeds| J.Small| 88.00|
| B22|Surgery K.| Leeds| J.Small| 102.01|
+-----------+----------+-------+-------------+--------+
print("Number of partitions: {}".format(df_A.rdd.getNumPartitions()))# Num of partitions: 1
print("Partitioner: {}".format(df_A.rdd.partitioner)) # Partitioner: None
Number of partitions: 1 #With other DataFrames I get more partitions.
Partitioner: None
After reading all these .csv/.txt files into DataFrames, I can see that for some DataFrames the data is distributed over just 1 partition (like above), while for other DataFrames there can be more partitions, depending upon the size of the corresponding .csv/.txt file, which in turn determines the number of blocks created (128 MB default block size in HDFS). We also don't have a partitioner at the moment.
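For completeness, this is roughly how I am checking the partition layout of each DataFrame after the reads (the dict below is just for illustration; the DataFrames are the ones created above, plus the remaining df_C .... df_J in the real code):

frames = {"df_A": df_A, "df_B": df_B}  # plus df_C ... df_J in the real code
for name, df in frames.items():
    print(name, "-> partitions:", df.rdd.getNumPartitions(),
          "| partitioner:", df.rdd.partitioner)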
Question: Now, would it not be a good idea to redistribute these DataFrames over multiple partitions, hashed on the basis of patientID, so that we can avoid as much shuffling as possible when we join() these multiple DataFrames? If that is indeed what is desired, should I repartition on patientID and have the same partitioner for all DataFrames (not sure if that's possible)? I have also read that a DataFrame does everything on its own, but should we not specify hashing according to the patientID column? A sketch of what I have in mind is shown below.
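This is a minimal sketch of the repartition-before-join approach I am asking about (the partition count of 200 and the column-based repartition are my own assumptions, not something I have verified to help):

NUM_PARTITIONS = 200  # assumed value, would need tuning

# Hash-partition each DataFrame on patientID before the join, so that rows
# with the same patientID end up in the same partition.
df_A_part = df_A.repartition(NUM_PARTITIONS, "patientID")
df_B_part = df_B.repartition(NUM_PARTITIONS, "patientID")

joined = df_A_part.join(df_B_part, on="patientID", how="inner")
joined.explain()  # look for Exchange (shuffle) steps in the physical plan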
I would really appreciate it if someone could provide some useful links or cues on what optimization strategy one should employ when dealing with these multiple DataFrames, all patientID based.
Source: https://stackoverflow.com/questions/53431989/pyspark-partitioning-and-hashing-multiple-dataframes-then-joining