Pyspark: filter dataframe by regex with string formatting?


I've read several posts on using the "like" operator to filter a Spark dataframe by the condition of containing a string/expression, but was wondering if the following is a ...


3 Answers

抹茶落季

2021-02-01 06:39

I used the following regex to match timestamps:
```
expression = r'[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[1-2][0-9]|3[0-1]) (2[0-3]|[01][0-9]):[0-5][0-9]:[0-5][0-9]'
df1 = df.filter(df['eta'].rlike(expression))
```
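
To sanity-check a pattern like this before running it on a cluster, it can be tried locally with Python's `re` module. This is only a rough sketch: the sample strings are made up, and Spark evaluates `rlike` with Java regular expressions, so the two engines agree on a simple pattern like this one but are not guaranteed to agree on every pattern. Like `rlike`, `re.search` matches anywhere in the value, so anchors are needed if the whole value must be a timestamp.

```python
import re

# Same pattern as in the answer above.
expression = r'[0-9]{4}-(0[1-9]|1[0-2])-(0[1-9]|[1-2][0-9]|3[0-1]) (2[0-3]|[01][0-9]):[0-5][0-9]:[0-5][0-9]'

# Hypothetical sample values standing in for the 'eta' column.
samples = [
    "2021-02-01 06:39:00",
    "2021-13-01 06:39:00",           # month 13 is rejected
    "not a timestamp",
    "eta: 2021-02-01 23:59:59 UTC",  # timestamp embedded in other text
]

# Unanchored: matches if a timestamp appears anywhere in the value.
matches = [s for s in samples if re.search(expression, s)]

# Anchored: matches only if the whole value is a timestamp.
anchored = "^" + expression + "$"
strict_matches = [s for s in samples if re.search(anchored, s)]

print(matches)
print(strict_matches)
```

The anchored string can be passed to `rlike` in the same way as the original `expression`.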


情话喂你

2021-02-01 06:48
Try the `rlike` function, as shown below.
```
df.filter(<column_name> rlike "<regex_pattern>")
```
For example (Scala syntax):
```
dk = dx.filter($"keyword" rlike "<pattern>")
```
北恋

2021-02-01 07:01
From neeraj's hint, it seems like the correct way to do this in pyspark is:
```
expr = "Arizona.*hot"
dk = dx.filter(dx["keyword"].rlike(expr))
```
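Note that `dx.filter($"keyword" ...)` did not work, since my version of pyspark didn't seem to support the `$` nomenclature out of the box. The `$"..."` column shorthand is Scala syntax; in Python the column is referenced as `dx["keyword"]`.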
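
Since the question title asks about string formatting, here is a minimal sketch of building the pattern from variables before passing it to `rlike`. The helper name `build_pattern` and its arguments are hypothetical, and the `re.escape` step is an assumption: it treats the inputs as literal text, and relies on Java's regex engine (which Spark uses) accepting the same backslash escapes for punctuation.

```python
import re

def build_pattern(first: str, second: str) -> str:
    """Return a regex matching `first`, then any characters, then `second`."""
    # re.escape keeps characters such as '.' or '+' in the inputs literal.
    return "{}.*{}".format(re.escape(first), re.escape(second))

expr = build_pattern("Arizona", "hot")
print(expr)  # Arizona.*hot

# With a DataFrame dx as in the answer above, the filter would then be:
# dk = dx.filter(dx["keyword"].rlike(expr))
```

Because the pattern is an ordinary Python string, an f-string or `%` formatting would work just as well as `str.format`.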