Question
I've read several posts on using the "like" operator to filter a Spark DataFrame for rows containing a string/expression, but was wondering whether the following is a "best practice" for using %s string formatting in the condition:
input_path = <s3_location_str>
my_keyword = "Arizona.*hot" # a regex expression
dx = sqlContext.read.parquet(input_path) # "keyword" is a field in dx
# is the following correct?
substr = "'%%%s%%'" %my_keyword # escape % via %% to get "%"
dk = dx.filter("keyword like %s" %substr)
# dk should contain rows with keyword values such as "Arizona is hot."
Note
I'm trying to get all rows in dx that contain the expression my_keyword. Otherwise, for exact matches we wouldn't need surrounding percent signs '%'.
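To see what the formatting in the question actually produces, the two steps can be run on their own, without Spark. This is only a minimal sketch; my_keyword is assumed to hold the same string as in the question. Note that SQL LIKE treats only % and _ as wildcards, so the .* inside the value would be matched as literal characters rather than as a regex.

```python
my_keyword = "Arizona.*hot"

# "%%" in a %-format string yields a literal "%", so the keyword ends up
# wrapped in SQL wildcards and single quotes.
substr = "'%%%s%%'" % my_keyword
condition = "keyword like %s" % substr

print(substr)     # '%Arizona.*hot%'
print(condition)  # keyword like '%Arizona.*hot%'
```

So the string handed to filter() is syntactically fine, but it would only keep rows whose keyword literally contains the text Arizona.*hot.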
Answer 1:
From neeraj's hint, it seems like the correct way to do this in pyspark is:
expr = "Arizona.*hot"
dk = dx.filter(dx["keyword"].rlike(expr))
Note that dx.filter($"keyword" ...) did not work for me: the $"..." column shorthand is Scala syntax and is not available in PySpark.
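For comparison, here is a small sketch that puts the regex filter next to a plain "contains" filter, which avoids building a quoted SQL string by hand. The helper names filter_by_regex and filter_by_substring, the demo() function, and the sample rows are all made up for illustration; demo() assumes pyspark is installed and can start a local session.

```python
def filter_by_regex(df, column, pattern):
    """Keep rows where `pattern` (a regex) is found in `column`."""
    return df.filter(df[column].rlike(pattern))


def filter_by_substring(df, column, text):
    """Keep rows where `column` contains `text` literally (no wildcards, no regex)."""
    return df.filter(df[column].contains(text))


def demo():
    # Imported here so the helpers above carry no Spark import of their own.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[1]").appName("rlike-demo").getOrCreate()
    # Hypothetical sample data standing in for the parquet file in the question.
    dx = spark.createDataFrame(
        [("Arizona is hot.",), ("Alaska is cold.",), ("Arizona.*hot",)],
        ["keyword"],
    )
    # Regex: ".*" is a wildcard here.
    filter_by_regex(dx, "keyword", "Arizona.*hot").show(truncate=False)
    # Literal substring: ".*" must appear as-is in the value.
    filter_by_substring(dx, "keyword", "Arizona.*hot").show(truncate=False)
    spark.stop()
```

Calling demo() prints both filtered frames. The point of the split is that rlike is the right tool when the pattern really is a regex, while contains is enough when the search term should be taken literally.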
Answer 2:
Try the rlike function, as shown below:
df.filter(<column_name> rlike "<regex_pattern>")
For example, in Scala syntax:
dk = dx.filter($"keyword" rlike "<pattern>")
Answer 3:
I used the following for a timestamp regex:
expression = r'[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[1-2][0-9]|3[0-1]) (2[0-3]|[01][0-9]):[0-5][0-9]:[0-5][0-9]'
df1 = df.filter(df['eta'].rlike(expression))
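A quick way to sanity-check a pattern like this before running it on a cluster is to try it on a few strings locally. The sketch below uses Python's re module as a stand-in for Spark's Java regex engine, which is an assumption on my part, although this pattern only uses syntax common to both. The sample strings are made up. As far as I know, rlike succeeds when the pattern is found anywhere in the value, so re.search is the closest analogue.

```python
import re

expression = r'[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[1-2][0-9]|3[0-1]) (2[0-3]|[01][0-9]):[0-5][0-9]:[0-5][0-9]'
pattern = re.compile(expression)

samples = [
    "2017-08-09 13:45:00",      # well-formed timestamp
    "2017-08-09 13:45:00.123",  # trailing fraction; still contains a match
    "2017-13-09 13:45:00",      # month 13 is rejected
    "2017-08-09 24:00:00",      # hour 24 is rejected
    "2017-02-31 00:00:00",      # right shape; the calendar date is not validated
]

# re.search looks for the pattern anywhere in the string (unanchored).
results = {s: bool(pattern.search(s)) for s in samples}
for s, ok in results.items():
    print(s, ok)
```

Because the pattern is unanchored, values with extra text around the timestamp also pass; adding ^ and $ to the expression would restrict the match to the whole value.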
Source: https://stackoverflow.com/questions/45580057/pyspark-filter-dataframe-by-regex-with-string-formatting