Dataframe sample in Apache spark | Scala

后端未结

关注

 7  2066

I\'m trying to take out samples from two dataframes wherein I need the ratio of count maintained. eg

df1.count() = 10
df2.count() = 1000

noOfSamples = 10


                      
              相关标签:


      
      
        
          7条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  一生所求        
                
              
                            
                2020-12-05 07:44
              
            
            
                                                                       
May be you want to try below code..

val splits = data.randomSplit(Array(0.7, 0.3))
val (trainingData, testData) = (splits(0), splits(1))

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  粉色の甜心        
                
              
                            
                2020-12-05 07:46
              
            
            
                                                                       
I too find lack of sample by count functionality disturbing. If you are not picky about creating a temp view I find the code below useful (df is your dataframe, count is sample size):

val tableName = s"table_to_sample_${System.currentTimeMillis}"
df.createOrReplaceTempView(tableName)
val sampled = sqlContext.sql(s"select *, rand() as random from ${tableName} order by random limit ${count}")
sqlContext.dropTempTable(tableName)
sampled.drop("random")


It returns an exact count as long as your current row count is as large as your sample size. 
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  春和景丽        
                
              
                            
                2020-12-05 07:56
              
            
            
                                                                       
I use this function for random sampling when exact number of records are desirable:

def row_count_sample (df, row_count, with_replacement=False, random_seed=113170):

    ratio = 1.08 * float(row_count) / df.count()  # random-sample more as dataframe.sample() is not a guaranteed to give exact record count
                                                  # it could be more or less actual number of records returned by df.sample()

    if ratio>1.0:
        ratio = 1.0

    result_df = (df
                    .sample(with_replacement, ratio, random_seed)
                    .limit(row_count)                                   # since we oversampled, make exact row count here
                )

    return result_df 

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  陌清茗        
                
              
                            
                2020-12-05 07:57
              
            
            
                                                                       
To answer if the fraction can be greater than 1. Yes, it can be if we have replace as yes. If a value greater than 1 is provided with replace false, then following exception will occur:

java.lang.IllegalArgumentException: requirement failed: Upper bound (2.0) must be <= 1.0.

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  灰色年华        
                
              
                            
                2020-12-05 07:59
              
            
            
                                                                       
The below code works if you want to do a random split of 70% & 30% of a data frame df,

val Array(trainingDF, testDF) = df.randomSplit(Array(0.7, 0.3), seed = 12345)

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  孤独总比滥情好        
                
              
                            
                2020-12-05 08:01
              
            
            
                                                                       
The fraction parameter represents the aproximate fraction of the dataset that will be returned. For instance, if you set it to 0.1, 10% (1/10) of the rows will be returned. For your case, I believe you want to do the following:

val newSample = df1.sample(true, 1D*noOfSamples/df1.count)


However, you may notice that newSample.count will return a different number each time you run it, and that's because the fraction will be a threshold for a random-generated value (as you can see here), so the resulting dataset size can vary. An workaround can be:

val newSample = df1.sample(true, 2D*noOfSamples/df1.count).limit(df1.count/noOfSamples)


Some scalability observations

You may note that doing a df1.count might be expensive as it evaluates the whole DataFrame, and you'll lose one of the benefits of sampling in the first place. 

Therefore depending on the context of your application, you may want to use an already known number of total samples, or an approximation.

val newSample = df1.sample(true, 1D*noOfSamples/knownNoOfSamples)


Or assuming the size of your DataFrame as huge, I would still use a fraction and use limit to force the number of samples.

val guessedFraction = 0.1
val newSample = df1.sample(true, guessedFraction).limit(noOfSamples)


As for your questions:


  can it be greater than 1?


No. It represents a fraction between 0 and 1. If you set it to 1 it will bring 100% of the rows, so it wouldn't make sense to set it to a number larger than 1.


  Also is there anyway we can specify the number of rows to be sampled?


You can specify a larger fraction than the number of rows you want and then use limit, as I show in the second example. Maybe there is another way, but this is the approach I use.
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
   
          
     1
2
下一页
           
           
        
                                  
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复