Split one column based on the value of another column in pyspark [duplicate]

懵懂的女人 submitted on 2020-01-03 00:56:08

Question


I have the following data frame

+----+-------+
|item|   path|
+----+-------+
|   a|  a/b/c|
|   b|  e/b/f|
|   d|e/b/d/h|
|   c|  g/h/c|
+----+-------+

I want to find the relative path for each value in column 'item' by locating that value in column 'path' and extracting the left-hand side of the path, up to and including the value, as shown below

+----+-------+--------+
|item|   path|rel_path|
+----+-------+--------+
|   a|  a/b/c|       a|
|   b|  e/b/f|     e/b|
|   d|e/b/d/h|   e/b/d|
|   c|  g/h/c|   g/h/c|
+----+-------+--------+

I tried to use the functions split(str, pattern) and regexp_extract(str, pattern, idx), but I am not sure how to pass the value of column 'item' into their pattern argument. Any idea how that could be done without writing a function?


Answer 1:


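You can use pyspark.sql.functions.expr to pass a column value as a parameter to regexp_replace. Here you concatenate a positive lookbehind for item, (?<=item), with .+ so that the pattern matches everything after the first occurrence of item, and then replace that match with an empty string. Note that the value of item is inserted into the pattern unescaped, so any regex metacharacters in it are treated as pattern syntax.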

from pyspark.sql.functions import expr

df.withColumn(
    "rel_path", 
    expr("regexp_replace(path, concat('(?<=',item,').+'), '')")
).show()
#+----+-------+--------+
#|item|   path|rel_path|
#+----+-------+--------+
#|   a|  a/b/c|       a|
#|   b|  e/b/f|     e/b|
#|   d|e/b/d/h|   e/b/d|
#|   c|  g/h/c|   g/h/c|
#+----+-------+--------+
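
If you want to check what the pattern does row by row without a Spark session, here is a minimal sketch in plain Python using the re module. The helper name rel_path is hypothetical, and re.escape is my addition (the Spark expression above does not escape item). Spark uses Java regular expressions rather than Python's, but a simple lookbehind followed by .+ should behave the same way in both for inputs like these.

```python
import re


def rel_path(item, path):
    """Hypothetical helper mirroring
    regexp_replace(path, concat('(?<=', item, ').+'), '').

    re.escape is an addition for safety; the Spark expression in the
    answer passes item into the pattern unescaped.
    """
    # Positive lookbehind: match starts right after the first occurrence
    # of item, and .+ consumes the rest of the string.
    pattern = "(?<=" + re.escape(item) + ").+"
    return re.sub(pattern, "", path)


# Sample rows from the question
rows = [("a", "a/b/c"), ("b", "e/b/f"), ("d", "e/b/d/h"), ("c", "g/h/c")]
results = [rel_path(item, path) for item, path in rows]
print(results)  # ['a', 'e/b', 'e/b/d', 'g/h/c']
```

The last row comes back unchanged because nothing follows 'c', so .+ has nothing to match. A path that does not contain item at all is also returned unchanged.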



Answer 2:


You can get the desired result with a combination of substring and instr:

substring - returns a slice of a column/string, given a 1-based start position and a length

instr - returns the 1-based position of the first occurrence of a substring in the search string (0 if it is not found)

df = spark.createDataFrame([('a','a/b/c'),
                            ('b','e/b/f'),
                            ('d','e/b/d/h'),
                            ('c','g/h/c')],'item : string , path : string')

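from pyspark.sql.functions import expr

# instr(path, item) is the 1-based position where item starts, so taking that
# many characters from position 1 keeps the prefix through the first character
# of item. That is sufficient here because every item is a single character.
df.withColumn("rel_path", expr("substring(path, 1, instr(path, item))")).show()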

##+----+-------+--------+
##|item|   path|rel_path|
##+----+-------+--------+
##|   a|  a/b/c|       a|
##|   b|  e/b/f|     e/b|
##|   d|e/b/d/h|   e/b/d|
##|   c|  g/h/c|   g/h/c|
##+----+-------+--------+
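
One caveat: the length passed to substring above only reaches the first character of item, so it relies on item being a single character. For longer items, the Spark SQL expression would presumably need to be substring(path, 1, instr(path, item) + length(item) - 1). Below is a rough plain-Python sketch of that position arithmetic. The helper name rel_path_by_position is hypothetical, and returning the path unchanged when item is missing is my assumption, chosen to match the regex answer (the Spark expression above would give an empty string in that case).

```python
def rel_path_by_position(item, path):
    """Hypothetical helper: keep path up to and including the first
    occurrence of item.

    str.find is 0-based and returns -1 when item is absent, whereas
    Spark's instr is 1-based and returns 0.
    """
    start = path.find(item)
    if start == -1:
        # Assumption: leave the path untouched when item is not present.
        return path
    # End of the prefix = start of item + length of item
    return path[: start + len(item)]


# Sample rows from the question, plus one multi-character item
rows = [("a", "a/b/c"), ("b", "e/b/f"), ("d", "e/b/d/h"), ("c", "g/h/c")]
print([rel_path_by_position(i, p) for i, p in rows])  # ['a', 'e/b', 'e/b/d', 'g/h/c']
print(rel_path_by_position("bb", "e/bb/f"))  # e/bb
```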


Source: https://stackoverflow.com/questions/55577043/split-one-column-based-the-value-of-another-column-in-pyspark
