Pyspark alter column with substring

前端未结

关注

 4  1882

Pyspark n00b... How do I replace a column with a substring of itself? I\'m trying to remove a select number of characters from the start and end of string.


                      
              相关标签:


      
      
        
          4条回答        

        
                         				            
            
           
            
                              
                
              
              
                
                  执笔经年        
                
              
                            
                2021-01-04 08:50
              
            
            
                                                                       
The accepted answer uses a udf (user defined function), which is usually (much) slower than native spark code. Grant Shannon's answer does use native spark code, but as noted in the comments by citynorman, it is not 100% clear how this works for variable string lengths.
Answer with native spark code (no udf) and variable string length
From the documentation of substr in pyspark, we can see that the arguments: startPos and length can be either int or Column types (both must be the same type). So we just need to create a column that contains the string length and use that as argument.
import pyspark.sql.functions as sf

result = (
    df
    .withColumn('length', sf.length('COLUMN_NAME'))
    .withColumn('fixed_in_spark', col('COLUMN_NAME').substr(sf.lit(2), col('length') - sf.lit(2)))
)

# result:
+----------------+---------------+----+--------------+
|     COLUMN_NAME|COLUMN_NAME_fix|size|fixed_in_spark|
+----------------+---------------+----+--------------+
|        _string_|         string|   8|        string|
|_another string_| another string|  16|another string|
+----------------+---------------+----+--------------+

Note:

We use length - 2 because we start from the second character (and need everything up to the 2nd last).
We need to use sf.lit because we cannot add (or subtract) a number to a Column object. We need to first convert that number into a Column.

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  南笙        
                
              
                            
                2021-01-04 09:01
              
            
            
                                                                       
try:

df.withColumn('COLUMN_NAME_fix', df['COLUMN_NAME'].substr(1, 10)).show()


where 1 = start position in the string and 
     10 = number of characters to include from start position (inclusive) 
                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  庸人自扰        
                
              
                            
                2021-01-04 09:07
              
            
            
                                                                       

  pyspark.sql.functions.substring(str, pos, len)
  
  Substring starts at pos and is of length len when str is String type or returns the slice of byte array that starts at pos in byte and is of length len when str is Binary type


In your code,

df.withColumn('COLUMN_NAME_fix', substring('COLUMN_NAME', 1, -1))
1 is pos and -1 becomes len, length can't be -1 and so it returns null


Try this, (with fixed syntax)

from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

udf1 = udf(lambda x:x[1:-1],StringType())
df.withColumn('COLUMN_NAME_fix',udf1('COLUMN_NAME')).show()

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
            
           
            
                              
                
              
              
                
                  被撕碎了的回忆        
                
              
                            
                2021-01-04 09:09
              
            
            
                                                                       
if the goal is to remove '_' from the column names then I would use list comprehension instead:
df.select(
    [ col(c).alias(c.replace('_', '') ) for c in df.columns ]
)

                                                                        
                                                        
            
            
              
                
                0
              
                 
                
               讨论(0)
              
              
                                                   
              
                                                            
            
                      
                    


               
            
    发布评论:
    
         
                        
    
    提交评论 
  
  

                    
                    
                    
                        
                        
                         加载中...
                        
                    
                
          
          	          
                             
        
        
          
            
            
              
              
            
    


                                 
              
            
                          
    

        
         
                验证码
                
                  
                
                
                   看不清?
                
              
                                  
                    
   
                 
             
              提交回复