Pyspark replace strings in Spark dataframe column

Asked by 轮回少年 on 2020-12-02 20:23 · 2 answers · 1479 views

I'd like to perform some basic stemming on a Spark DataFrame column by replacing substrings. What's the quickest way to do this?

In my current use case, I have a

2 Answers
  • 2020-12-02 20:43

    For Scala:

    import org.apache.spark.sql.functions.regexp_replace
    import org.apache.spark.sql.functions.col

    // Strip literal "*" characters from addr_line; the asterisk is escaped
    // because regexp_replace treats the pattern as a regular expression.
    data.withColumn("addr_new", regexp_replace(col("addr_line"), "\\*", ""))
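The escaping in the Scala snippet matters because `*` is a regex quantifier, not a literal character. A minimal plain-Python sketch with `re.sub` (the sample address string is made up for illustration) shows the same per-value behavior:

```python
import re

# "*" must be escaped to match a literal asterisk, just as the
# Scala answer does with "\\*" in regexp_replace.
cleaned = re.sub(r"\*", "", "123 Main St **Apt 4**")
# cleaned == "123 Main St Apt 4"
```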
    
  • 2020-12-02 20:49

    For Spark 1.5 or later, you can use the functions package:

    from pyspark.sql.functions import regexp_replace

    newDf = df.withColumn('address', regexp_replace('address', 'lane', 'ln'))
    

    Quick explanation:

    • The function withColumn adds a column to the DataFrame (or replaces it, if a column with that name already exists).
    • The function regexp_replace generates the new column by replacing every substring that matches the regex pattern.
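The replacement applied to each row can be sketched with plain Python's `re.sub`, which mirrors what `regexp_replace('address', 'lane', 'ln')` does to every value in the column (the sample addresses are made up for illustration):

```python
import re

# Per-value equivalent of regexp_replace: each string in the column
# gets a regex substitution applied to it.
addresses = ["21 jump street lane", "5 memory lane", "12 main st"]
stemmed = [re.sub("lane", "ln", a) for a in addresses]
# stemmed == ["21 jump street ln", "5 memory ln", "12 main st"]
```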