Speed up millions of regex replacements in Python 3

醉酒成梦 2020-11-22 05:44

I'm using Python 3.5.2.

I have two lists:

  • a list of about 750,000 "sentences" (long strings)
  • a list of about 20,000 "words" that I would like to delete from the sentences
9 Answers
  • 悲&欢浪女 2020-11-22 06:09

    Perhaps Python is not the right tool here. Here is one approach using the Unix toolchain:

        sed G file         |   # double-space the input
        tr ' ' '\n'        |   # split: one word per line
        grep -vf blacklist |   # delete every blacklisted word
        awk -v RS= -v OFS=' ' '{$1=$1}1'   # re-join each sentence onto one line

    This assumes your blacklist file has been preprocessed with word boundaries added (see below). The steps: double-space the file, split each sentence into one word per line, mass-delete the blacklisted words, and merge the lines back together.

    This should run at least an order of magnitude faster.
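
    For illustration, here is what the pipeline produces on a tiny made-up input (the file names match the pipeline above; GNU grep is assumed, since \b in patterns is a GNU extension):

        $ cat file
        the quick brown fox
        a quick brown dog
        $ cat blacklist
        \bquick\b
        \bbrown\b
        $ sed G file | tr ' ' '\n' | grep -vf blacklist | awk -v RS= -v OFS=' ' '{$1=$1}1'
        the fox
        a dog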

    To preprocess the blacklist file from words (one word per line):

        sed 's/.*/\\b&\\b/' words > blacklist
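
    For example (a made-up two-word words file), this wraps each word in \b anchors, producing one grep pattern per line:

        $ printf 'quick\nbrown\n' > words
        $ sed 's/.*/\\b&\\b/' words > blacklist
        $ cat blacklist
        \bquick\b
        \bbrown\b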
    
