In pyspark, how do you add/concat a string to a column?

生来不讨喜 2020-12-17 18:05

I would like to add a string to an existing column. For example, df['col1'] has values '1', '2', '3', etc., and I would like to concat the string '000' on the left of col1 so the values become '0001', '0002', '0003'.

2 Answers
  • 2020-12-17 19:01
    from pyspark.sql.functions import concat, col, lit
    
    
    df.select(concat(col("firstname"), lit(" "), col("lastname"))).show(5)
    +------------------------------+
    |concat(firstname,  , lastname)|
    +------------------------------+
    |                Emanuel Panton|
    |              Eloisa Cayouette|
    |                   Cathi Prins|
    |             Mitchel Mozdzierz|
    |               Angla Hartzheim|
    +------------------------------+
    only showing top 5 rows
    

    http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#module-pyspark.sql.functions
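    To prepend a literal string (such as "000") to an existing column, as asked in the question, the same concat can be combined with lit. A minimal sketch, assuming a SparkSession named spark and a string column col1:

    from pyspark.sql.functions import concat, col, lit

    df = spark.createDataFrame([("1",), ("2",), ("3",)], ["col1"])
    # prepend the literal "000" to every value of col1
    df.withColumn("col1", concat(lit("000"), col("col1"))).show()
    #+----+
    #|col1|
    #+----+
    #|0001|
    #|0002|
    #|0003|
    #+----+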

  • 2020-12-17 19:02

    Another option is to use pyspark.sql.functions.format_string(), which lets you use C printf-style formatting.

    Here's an example where the values in the column are integers.

    import pyspark.sql.functions as f
    df = sqlCtx.createDataFrame([(1,), (2,), (3,), (10,), (100,)], ["col1"])
    df.withColumn("col2", f.format_string("%03d", "col1")).show()
    #+----+----+
    #|col1|col2|
    #+----+----+
    #|   1| 001|
    #|   2| 002|
    #|   3| 003|
    #|  10| 010|
    #| 100| 100|
    #+----+----+
    

    Here the format "%03d" means pad the integer with leading zeros to a total width of 3 digits. This is why 10 becomes 010 and 100 does not change at all.

    Or if you wanted to add exactly 3 zeros in the front:

    df.withColumn("col2", f.format_string("000%d", "col1")).show()
    #+----+------+
    #|col1|  col2|
    #+----+------+
    #|   1|  0001|
    #|   2|  0002|
    #|   3|  0003|
    #|  10| 00010|
    #| 100|000100|
    #+----+------+
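
    If the goal is simply to zero-pad col1 to a fixed width, pyspark.sql.functions.lpad is another option worth noting; treat this as a sketch using the same df as above. It left-pads the (implicitly string-cast) column to a given length with a given character:

    import pyspark.sql.functions as f

    # left-pad col1 with "0" up to a total width of 3 characters
    df.withColumn("col2", f.lpad(f.col("col1"), 3, "0")).show()
    #+----+----+
    #|col1|col2|
    #+----+----+
    #|   1| 001|
    #|   2| 002|
    #|   3| 003|
    #|  10| 010|
    #| 100| 100|
    #+----+----+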
    