Question
I have the following data frame
+----+-------+
|item|   path|
+----+-------+
|   a|  a/b/c|
|   b|  e/b/f|
|   d|e/b/d/h|
|   c|  g/h/c|
+----+-------+
I want to find the relative path for each value in the column "item"
by locating that value in the column 'path'
and extracting the left-hand side of the path up to and including it, as shown below:
+----+-------+--------+
|item|   path|rel_path|
+----+-------+--------+
|   a|  a/b/c|       a|
|   b|  e/b/f|     e/b|
|   d|e/b/d/h|   e/b/d|
|   c|  g/h/c|   g/h/c|
+----+-------+--------+
I tried to use the functions split(str, pattern)
or regexp_extract(str, pattern, idx),
but I am not sure how to pass the value of column 'item'
into their pattern argument. Any idea how that could be done without writing a function?
Answer 1:
You can use pyspark.sql.functions.expr to pass a column value as a parameter to regexp_replace. Here you concatenate a positive lookbehind for item
with .+
to match everything after it, and replace that match with an empty string.
from pyspark.sql.functions import expr

df.withColumn(
    "rel_path",
    expr("regexp_replace(path, concat('(?<=',item,').+'), '')")
).show()
#+----+-------+--------+
#|item|   path|rel_path|
#+----+-------+--------+
#|   a|  a/b/c|       a|
#|   b|  e/b/f|     e/b|
#|   d|e/b/d/h|   e/b/d|
#|   c|  g/h/c|   g/h/c|
#+----+-------+--------+
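To see what the lookbehind pattern does outside Spark, here is a minimal plain-Python sketch using the standard re module as a stand-in for Spark's Java regex engine. The helper name rel_path is hypothetical, and the re.escape call is an addition that is not part of the answer above: it assumes item may contain regex metacharacters such as '.' that should be matched literally.

```python
import re

def rel_path(item: str, path: str) -> str:
    """Drop everything that follows the first occurrence of item in path."""
    # (?<=item) is a positive lookbehind: the match begins right after item.
    # re.escape (an assumption, not in the original answer) makes characters
    # such as '.' or '+' inside item match literally.
    pattern = "(?<=" + re.escape(item) + ").+"
    return re.sub(pattern, "", path)

rows = [("a", "a/b/c"), ("b", "e/b/f"), ("d", "e/b/d/h"), ("c", "g/h/c")]
results = [rel_path(item, path) for item, path in rows]
print(results)
```

When item is the last component, as in g/h/c, nothing follows it, so .+ finds no match and the path is returned unchanged.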
Answer 2:
You can get the desired result with a combination of substring
and instr:
substring
- Returns a portion of a column/string, given a start position and a length.
instr
- Returns the 1-based position of the first occurrence of a substring in the search string (0 if it is not found).
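from pyspark.sql.functions import expr

df = spark.createDataFrame([('a','a/b/c'),
                            ('b','e/b/f'),
                            ('d','e/b/d/h'),
                            ('c','g/h/c')],'item : string , path : string')
# instr gives the 1-based position where item starts in path; each item here
# is a single character, so that position is also the prefix length to keep.
df.withColumn("rel_path", expr("substring(path, 1, instr(path, item))")).show()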
##+----+-------+--------+
##|item|   path|rel_path|
##+----+-------+--------+
##|   a|  a/b/c|       a|
##|   b|  e/b/f|     e/b|
##|   d|e/b/d/h|   e/b/d|
##|   c|  g/h/c|   g/h/c|
##+----+-------+--------+
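The position arithmetic can be checked in plain Python. The sketch below uses simplified, hypothetical stand-ins for instr and substring (1-based, only handling start positions of 1 or more) rather than Spark's actual implementations. It also shows one possible generalization, which is an assumption beyond the answer above: for items longer than one character, the prefix length would need to be the start position plus the item length minus 1.

```python
def instr(text: str, sub: str) -> int:
    """1-based position of the first occurrence of sub, or 0 if absent."""
    # str.find returns -1 when sub is absent, so adding 1 yields 0.
    return text.find(sub) + 1

def substring(text: str, pos: int, length: int) -> str:
    """Take length characters starting at 1-based position pos (pos >= 1)."""
    return text[pos - 1 : pos - 1 + length]

def rel_path(item: str, path: str) -> str:
    """Prefix of path up to and including the first occurrence of item."""
    start = instr(path, item)
    if start == 0:
        return ""  # item does not occur in path
    return substring(path, 1, start + len(item) - 1)

# Single-character items: same result as substring(path, 1, instr(path, item)).
single = [rel_path(i, p) for i, p in [("a", "a/b/c"), ("b", "e/b/f"),
                                      ("d", "e/b/d/h"), ("c", "g/h/c")]]
# Multi-character item: the unadjusted formula cuts the item short.
unadjusted = substring("a/bc/d", 1, instr("a/bc/d", "bc"))
adjusted = rel_path("bc", "a/bc/d")
print(single, unadjusted, adjusted)
```

With the sample data the adjustment changes nothing, since every item has length 1.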
Source: https://stackoverflow.com/questions/55577043/split-one-column-based-the-value-of-another-column-in-pyspark