Speed up millions of regex replacements in Python 3

醉酒成梦 2020-11-22 05:44

I'm using Python 3.5.2.

I have two lists:

  • a list of about 750,000 "sentences" (long strings)
  • a list of about 20,000 "words" that I would like to delete from the sentences
9 Answers
  • 悲&欢浪女 2020-11-22 06:09

    Perhaps Python is not the right tool here. Here is one approach using the Unix toolchain:

        sed G file         |   # double-space the input
        tr ' ' '\n'        |   # split: one word per line
        grep -vf blacklist |   # delete every blacklisted word
        awk -v RS= -v OFS=' ' '{$1=$1}1'   # re-join each sentence onto one line

    This assumes your blacklist file has been preprocessed with word boundaries added (see below). The steps: double-space the file, split each sentence into one word per line, mass-delete the blacklisted words, and merge the lines back together.

    This should run at least an order of magnitude faster.
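
    For illustration, here is what the pipeline produces on a tiny made-up input (the file names match the pipeline above; GNU grep is assumed, since \b in patterns is a GNU extension):

        $ cat file
        the quick brown fox
        a quick brown dog
        $ cat blacklist
        \bquick\b
        \bbrown\b
        $ sed G file | tr ' ' '\n' | grep -vf blacklist | awk -v RS= -v OFS=' ' '{$1=$1}1'
        the fox
        a dog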

    To preprocess the blacklist file from words (one word per line):

        sed 's/.*/\\b&\\b/' words > blacklist
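
    For example (a made-up two-word words file), this wraps each word in \b anchors, producing one grep pattern per line:

        $ printf 'quick\nbrown\n' > words
        $ sed 's/.*/\\b&\\b/' words > blacklist
        $ cat blacklist
        \bquick\b
        \bbrown\b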
    
