Question
Background: I am working with clinical data spread across a lot of different .csv/.txt files. All these files are patientID based, but with different fields. I am importing these files into DataFrames, which I will join at a later stage after first processing each of these DataFrames individually. I have shown examples of two DataFrames below (df_A and df_B). Similarly, I have multiple DataFrames - df_A, df_B, df_C .... df_J - and I will join all of them at a later stage.
df_A = spark.read.schema(schema).format("csv").load(...).... # Just an example
df_A.show(3)
#Example 1:
+----------+-----------------+
| patientID| diagnosis_code|
+----------+-----------------+
| A51| XIII|
| B22| VI|
| B13| XV|
+----------+-----------------+
df_B.show(3)
#Example 2:
+-----------+----------+-------+-------------+--------+
| patientID| hospital| city| doctor_name| Bill|
+-----------+----------+-------+-------------+--------+
| A51| Royal H| London|C.Braithwaite| 451.23|
| B22|Surgery K.| Leeds| J.Small| 88.00|
| B22|Surgery K.| Leeds| J.Small| 102.01|
+-----------+----------+-------+-------------+--------+
print("Number of partitions: {}".format(df_A.rdd.getNumPartitions()))# Num of partitions: 1
print("Partitioner: {}".format(df_A.rdd.partitioner)) # Partitioner: None
Number of partitions: 1 #With other DataFrames I get more partitions.
Partitioner: None
After reading all these .csv/.txt files into DataFrames, I can see that for some DataFrames the data is distributed over just 1 partition (like above), while for other DataFrames there can be more partitions, depending upon the size of the corresponding .csv/.txt file, which in turn determines the number of blocks created (128 MB default block size in HDFS). We also don't have a partitioner at the moment.
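For completeness, this is roughly how I am checking the partition layout of each DataFrame after the reads (the dict below is just for illustration; the DataFrames are the ones created above, plus the remaining df_C .... df_J in the real code):

frames = {"df_A": df_A, "df_B": df_B}  # plus df_C ... df_J in the real code
for name, df in frames.items():
    print(name, "-> partitions:", df.rdd.getNumPartitions(),
          "| partitioner:", df.rdd.partitioner)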
Question: Now, would it not be a good idea to redistribute these DataFrames over multiple partitions, hashed on the basis of patientID, so that we can avoid as much shuffling as possible when we join() these multiple DataFrames? If that is indeed what is desired, should I repartition on patientID and have the same partitioner for all DataFrames (not sure if that's possible)? I have also read that a DataFrame does everything on its own, but should we not specify hashing according to the patientID column? A sketch of what I have in mind is shown below.
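This is a minimal sketch of the repartition-before-join approach I am asking about (the partition count of 200 and the column-based repartition are my own assumptions, not something I have verified to help):

NUM_PARTITIONS = 200  # assumed value, would need tuning

# Hash-partition each DataFrame on patientID before the join, so that rows
# with the same patientID end up in the same partition.
df_A_part = df_A.repartition(NUM_PARTITIONS, "patientID")
df_B_part = df_B.repartition(NUM_PARTITIONS, "patientID")

joined = df_A_part.join(df_B_part, on="patientID", how="inner")
joined.explain()  # look for Exchange (shuffle) steps in the physical plan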
I would really appreciate it if someone could provide some useful links or cues on what optimization strategy one should employ when dealing with these multiple DataFrames, all patientID based.
Source: https://stackoverflow.com/questions/53431989/pyspark-partitioning-and-hashing-multiple-dataframes-then-joining