PySpark: select a specific column by its position

时光取名叫无心 2021-01-18 08:42

I would like to know how to select a specific column in a DataFrame by its position (index) rather than by its name.

Like this in Pandas:

df = df.iloc[:,2]


        
2 Answers
  • 2021-01-18 09:18

    You can always get the name of the column with df.columns[n] and then select it:

    df = spark.createDataFrame([[1,2], [3,4]], ['a', 'b'])
    

    To select the column at position n:

    n = 1
    df.select(df.columns[n]).show()
    +---+                                                                           
    |  b|
    +---+
    |  2|
    |  4|
    +---+
    

    To select all but column n:

    n = 1
    

    You can either use drop:

    df.drop(df.columns[n]).show()
    +---+
    |  a|
    +---+
    |  1|
    |  3|
    +---+
    

    Or use select with a manually constructed list of column names:

    df.select(df.columns[:n] + df.columns[n+1:]).show()
    +---+
    |  a|
    +---+
    |  1|
    |  3|
    +---+
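
    The same indexing idea extends to selecting several columns at once, since select accepts a list of column names. A minimal sketch, assuming the df created above and an illustrative list of positions:

    # illustrative: keep the columns at positions 0 and 1
    positions = [0, 1]
    df.select([df.columns[i] for i in positions]).show()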
    
  • 2021-01-18 09:27

    Same solution as mirkhosro:

    For a DataFrame df, you can select the column at position n using df[n], where n is the (zero-based) index of the column.

    Example:

    df = df.filter(df[3] != 0)
    

    will remove the rows of df where the value in the fourth column is 0.
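
    The same positional indexing also works inside select. A minimal sketch, assuming the two-column df from the first answer:

    # df[1] refers to the column at position 1 (here 'b')
    df.select(df[1]).show()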

    Note that you can check the columns (and their order) using df.printSchema().
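
    A minimal sketch of inspecting the columns before indexing, again assuming the two-column df from the first answer:

    df.printSchema()    # one line per column, listed in positional order
    print(df.columns)   # e.g. ['a', 'b']; positions follow this list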
