问题
I'm perplexed between the behaviour of numPartitions
parameter in the following methods:
DataFrameReader.jdbc
Dataset.repartition
The official docs of DataFrameReader.jdbc
say following regarding numPartitions
parameter
numPartitions: the number of partitions. This, along with lowerBound (inclusive), upperBound (exclusive), form partition strides for generated WHERE clause expressions used to split the column columnName evenly.
And official docs of Dataset.repartition
say
Returns a new Dataset that has exactly
numPartitions
partitions.
My current understanding:
- The
numPartition
parameter inDataFrameReader.jdbc
method controls the degree of parallelism in reading the data from database - The
numPartition
parameter inDataset.repartition
controls the number of output files that will be generated when thisDataFrame
would be written to disk
My questions:
- If I read
DataFrame
viaDataFrameReader.jdbc
and then write it to disk (without invokingrepartition
method), then would there still be as many files in output as there would've been had I written out aDataFrame
to disk after having invokedrepartition
on it? - If the answer to the above question is:
- Yes: Then is it redundant to invoke
repartition
method on aDataFrame
that was read usingDataFrameReader.jdbc
method (withnumPartitions
parameter)? - No: Then please correct the lapses in my understanding. Also in that case shouldn't the
numPartitions
parameter ofDataFrameReader.jdbc
method be called something like 'parallelism'?
- Yes: Then is it redundant to invoke
回答1:
Short answer: There is (almost) no difference in behaviour of numPartitions
parameter in the two methods
read.jdbc(..numPartitions..)
Here, the numPartitions
parameter controls:
- number of parallel connections that would be made to the
MySQL
(or any otherRDBM
) for reading the data intoDataFrame
. - Degree of parallelism on all subsequent operations on the read
DataFrame
including writing to disk untilrepartition
method is invoked on it
repartition(..numPartitions..)
Here numPartitions
parameter controls the degree of parallelism that would be exhibited in performing any operation of the DataFrame
, including writing to disk.
So basically the DataFrame
obtained on reading MySQL
table using spark.read.jdbc(..numPartitions..)
method behaves the same (exhibits the same degree of parallelism in operations performed over it) as if it was read without parallelism and the repartition(..numPartitions..)
method was invoked on it afterwards (obviously with same value of numPartitions
)
To answer to exact questions:
If I read DataFrame via DataFrameReader.jdbc and then write it to disk (without invoking repartition method), then would there still be as many files in output as there would've been had I written out a DataFrame to disk after having invoked repartition on it?
Yes
Assuming that the read task had been parallelized by providing appropriate parameters (columnName
, lowerBound
, upperBound
& numPartitions
), all operations on the resulting DataFrame
including write will be performed in parallel. Quoting the official docs here:
numPartitions: The maximum number of partitions that can be used for parallelism in table reading and writing. This also determines the maximum number of concurrent JDBC connections. If the number of partitions to write exceeds this limit, we decrease it to this limit by calling coalesce(numPartitions) before writing.
Yes: Then is it redundant to invoke repartition method on a DataFrame that was read using DataFrameReader.jdbc method (with numPartitions parameter)?
Yes
Unless you invoke the other variations of repartition
method (the ones that take columnExprs
param), invoking repartition
on such a DataFrame
(with same numPartitions
) parameter is redundant. However, I'm not sure if forcing same degree of parallelism on an already-parallelized DataFrame
also invokes shuffling of data among executors
unnecessarily. Will update the answer once I come across it.
来源:https://stackoverflow.com/questions/48276241/spark-difference-between-numpartitions-in-read-jdbc-numpartitions-and-repa