Pyspark alter column with substring

Submitted by 北城以北 on 2020-04-29 12:13:06

Question


PySpark n00b... How do I replace a column with a substring of itself? I'm trying to remove a set number of characters from the start and end of each string.

from pyspark.sql.functions import substring
import pandas as pd
pdf = pd.DataFrame({'COLUMN_NAME':['_string_','_another string_']})
# this is what i'm looking for...
pdf['COLUMN_NAME_fix']=pdf['COLUMN_NAME'].str[1:-1] 

df = sqlContext.createDataFrame(pdf)
# following not working... COLUMN_NAME_fix is blank
df.withColumn('COLUMN_NAME_fix', substring('COLUMN_NAME', 1, -1)).show() 

This is pretty close but slightly different: "Spark Dataframe column with last character of other column". And then there is this: "LEFT and RIGHT function in PySpark SQL".


Answer 1:


pyspark.sql.functions.substring(str, pos, len)

Substring starts at pos and is of length len when str is String type, or returns the slice of the byte array that starts at pos (in bytes) and is of length len when str is Binary type. The position is 1-based, not 0-based.

In your code,

df.withColumn('COLUMN_NAME_fix', substring('COLUMN_NAME', 1, -1))
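Here 1 is pos and -1 is len. Unlike a Python slice, the third argument is a length, not an end index, and a negative length selects nothing, so the result is an empty string (the blank column you are seeing).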
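
The blank result is easier to see with a plain-Python imitation of the (pos, len) semantics. This is only a sketch: spark_like_substring is a hypothetical helper written for illustration, not a Spark API, and it only models the case pos >= 1 (1-based start, a non-positive len selects nothing).

```python
def spark_like_substring(s: str, pos: int, length: int) -> str:
    """Imitate substring(str, pos, len) for pos >= 1 (hypothetical helper)."""
    if pos < 1:
        raise ValueError("this sketch only models pos >= 1")
    if length <= 0:
        return ""  # a non-positive length selects nothing
    start = pos - 1  # convert the 1-based position to a 0-based index
    return s[start:start + length]


value = "_string_"

# What the question tried: len = -1 selects nothing, hence the blank column.
blank = spark_like_substring(value, 1, -1)

# Equivalent of the Python slice value[1:-1]: start at the 2nd character
# and keep len(value) - 2 characters.
trimmed = spark_like_substring(value, 2, len(value) - 2)

print(repr(blank))    # ''
print(repr(trimmed))  # 'string'
```

So to mirror [1:-1], the start has to be 2 and the length has to be computed from the string itself.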

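Try this (with fixed syntax):

from pyspark.sql.types import StringType
from pyspark.sql.functions import udf

# Python slice drops the first and last character; nulls are passed through unchanged
udf1 = udf(lambda x: x[1:-1] if x is not None else None, StringType())
df.withColumn('COLUMN_NAME_fix', udf1('COLUMN_NAME')).show()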



Answer 2:


try:

df.withColumn('COLUMN_NAME_fix', df['COLUMN_NAME'].substr(1, 10)).show()

where 1 = start position in the string (1-based) and 10 = number of characters to include, counting from the start position (inclusive)



Source: https://stackoverflow.com/questions/46750201/pyspark-alter-column-with-substring
