What is the meaning of partitionColumn, lowerBound, upperBound, numPartitions parameters?

后端 未结 4 1736
名媛妹妹
名媛妹妹 2020-11-27 05:05

While fetching data from SQL Server via a JDBC connection in Spark, I found that I can set some parallelization parameters like partitionColumn, lowerBoun

相关标签:
4条回答
  • 2020-11-27 05:12

    Would just like to add to the verified answer since the words,

    Without them you would loose some data is misleading..

    From the documentation, Notice that lowerBound and upperBound are just used to decide the partition stride, not for filtering the rows in table. So all rows in the table will be partitioned and returned. This option applies only to reading.

    Which means say your table has a 1100 rows, and you specify

    lowerBound 0

    upperBound 1000 and

    numPartitions: 10 , you won't loose the 1000 to 1100 rows. You'll just end up with some of the partitions having more rows than intended instead.(the stride value is 100).

    0 讨论(0)
  • 2020-11-27 05:31

    It is simple:

    • partitionColumn is a column which should be used to determine partitions.
    • lowerBound and upperBound determine range of values to be fetched. Complete dataset will use rows corresponding to the following query:

      SELECT * FROM table WHERE partitionColumn BETWEEN lowerBound AND upperBound
      
    • numPartitions determines number of partitions to be created. Range between lowerBound and upperBound is divided into numPartitions each with stride equal to:

      upperBound / numPartitions - lowerBound / numPartitions
      

      For example if:

      • lowerBound: 0
      • upperBound: 1000
      • numPartitions: 10

      Stride is equal to 100 and partitions correspond to following queries:

      • SELECT * FROM table WHERE partitionColumn BETWEEN 0 AND 100
      • SELECT * FROM table WHERE partitionColumn BETWEEN 100 AND 200
      • ...
      • SELECT * FROM table WHERE partitionColumn BETWEEN 900 AND 1000
    0 讨论(0)
  • 2020-11-27 05:31

    Creating partitions doesn't result in loss of data due to filtering. The upperBound, lowerbound along with numPartitions just defines how the partitions are to be created. The upperBound and lowerbound don't define the range (filter) for the values of the partitionColumn to be fetched.

    For a given input of lowerBound (l), upperBound (u) and numPartitions (n) 
    The partitions are created as follows:
    
    stride, s= (u-l)/n
    
    **SELECT * FROM table WHERE partitionColumn < l+s or partitionColumn is null**
    SELECT * FROM table WHERE partitionColumn >= l+s AND <2s  
    SELECT * FROM table WHERE partitionColumn >= l+2s AND <3s
    ...
    **SELECT * FROM table WHERE partitionColumn >= l+(n-1)s**
    

    For instance, for upperBound = 500, lowerBound = 0 and numPartitions = 5. The partitions will be as per the following queries:

    SELECT * FROM table WHERE partitionColumn < 100 or partitionColumn is null
    SELECT * FROM table WHERE partitionColumn >= 100 AND <200 
    SELECT * FROM table WHERE partitionColumn >= 200 AND <300
    SELECT * FROM table WHERE partitionColumn >= 300 AND <400
    ...
    SELECT * FROM table WHERE partitionColumn >= 400
    

    Depending on the actual range of values of the partitionColumn, the result size of each partition will vary.

    0 讨论(0)
  • 2020-11-27 05:33

    Actually the list above misses a couple of things, specifically the first and the last query.

    Without them you would loose some data (the data before the lowerBound and that after upperBound). From the example is not clear because the lower bound is 0.

    The complete list should be:

    SELECT * FROM table WHERE partitionColumn < 100
    
    SELECT * FROM table WHERE partitionColumn BETWEEN 0 AND 100  
    SELECT * FROM table WHERE partitionColumn BETWEEN 100 AND 200  
    

    ...

    SELECT * FROM table WHERE partitionColumn > 9000
    
    0 讨论(0)
提交回复
热议问题