Question
I'm very new to programming and regex, so apologies if this has been asked before (I didn't find an existing question, though).
I want to use Python to summarise word frequencies in a literary text. Let's assume the text is formatted like
Chapter 1
blah blah blah
Chapter 2
blah blah blah
....
Now I read the text in as a string, and I want to use re.findall to get every word in it, so my code is
wordlist = re.findall(r'\b\w+\b', text)
But the problem is that it also matches the word Chapter in every chapter title, which I don't want to include in my stats. So I want to ignore whatever matches Chapter\s*\d+. What should I do?
Thanks in advance, guys.
Answer 1:
Solutions
You could remove every Chapter + whitespace + digits sequence first with re.sub:
wordlist = re.findall(r'\b\w+\b', re.sub(r'Chapter\s*\d+\s*','',text))
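For instance, on a small illustrative sample string (the sample text here is my own, not from the question):
import re
text = "Chapter 1\nIt was a dark night in 1984\nChapter 2\nIt rained"
# Strip the chapter headings first, then collect every remaining word.
wordlist = re.findall(r'\b\w+\b', re.sub(r'Chapter\s*\d+\s*', '', text))
print(wordlist)  # ['It', 'was', 'a', 'dark', 'night', 'in', '1984', 'It', 'rained']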
If you want to use just one search, you can use a negative lookahead to skip any word that starts a Chapter X sequence, and require words to begin with a letter so the chapter numbers themselves are excluded too:
wordlist = re.findall(r'\b(?!Chapter\s+\d+)[A-Za-z]\w*\b',text)
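With the same illustrative sample as above, the single-pass version gives a similar result; note that because of the leading [A-Za-z], standalone numbers in the body text are dropped as well:
import re
text = "Chapter 1\nIt was a dark night in 1984\nChapter 2\nIt rained"
print(re.findall(r'\b(?!Chapter\s+\d+)[A-Za-z]\w*\b', text))
# ['It', 'was', 'a', 'dark', 'night', 'in', 'It', 'rained']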
If performance is an issue, loading a huge string and parsing it with a regex isn't the right approach anyway. Just read the file line by line, discard any line that matches r'^Chapter\s*\d+', and parse each remaining line separately with r'\b\w+\b':
import re
# Read the whole file into a list of lines (the readlines timing below refers to this).
lines = open("huge_file.txt", "r").readlines()
wordlist = []
chapter = re.compile(r'^Chapter\s*\d+')  # chapter-title lines to discard
words = re.compile(r'\b\w+\b')           # individual words
for line in lines:
    if not chapter.match(line):          # skip chapter titles entirely
        wordlist.extend(words.findall(line))
print(len(wordlist))
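Since the question is ultimately about word frequencies rather than just a word list, here is a minimal follow-up sketch (my addition, not part of the original answer) using collections.Counter:
from collections import Counter
# wordlist as built above; Counter maps each word to its number of occurrences.
frequencies = Counter(wordlist)
print(frequencies.most_common(10))  # the ten most frequent words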
Performance
I wrote a small Ruby script to generate a huge test file:
# Collect every word from the system dictionaries.
all_dicts = Dir["/usr/share/dict/*"].map { |dict|
  File.readlines(dict)
}.flatten
File.open('huge_file.txt', 'w+') do |txt|
  newline = true
  txt.puts "Chapter #{rand(1000)}"
  50_000_000.times do
    # Occasionally start a new chapter.
    if rand < 0.05
      txt.puts
      txt.puts
      txt.puts "Chapter #{rand(1000)}"
      newline = true
    end
    txt.write " " unless newline
    newline = false
    txt.write all_dicts.sample.chomp
    # Occasionally break the line.
    if rand < 0.10
      txt.puts
      newline = true
    end
  end
end
The resulting file has more than 50 million words and is about 483 MB in size:
Chapter 154
schoolyard trashcan's holly's continuations
Chapter 814
assure sect's Trippe's bisexuality inexperience
Dumbledore's cafeteria's rubdown hamlet Xi'an guillotine tract concave afflicts amenity hurriedly whistled
Carranza
loudest cloudburst's
Chapter 142
spender's
vests
Ladoga
Chapter 896
petition's Vijayawada Lila faucets
addendum Monticello swiftness's plunder's outrage Lenny tractor figure astrakhan etiology's
coffeehouse erroneously Max platinum's catbird succumbed nonetheless Nissan Yankees solicitor turmeric's regenerate foulness firefight
spyglass
disembarkation athletics drumsticks Dewey's clematises tightness tepid kaleidoscope Sadducee Cheerios's
On average, the two-step process took 12.2 s to extract the word list, the lookahead method took 13.5 s, and Wiktor's answer also took 13.5 s. The lookahead variant I first wrote used re.IGNORECASE and took around 18 s.
There's basically no difference in performance between the regex-based methods when reading the whole file.
What surprised me, though, is that the readlines script took around 20.5 s and didn't use much less memory than the other scripts. If you have any idea how to improve it, please comment!
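One possible tweak (a sketch only, not benchmarked here): readlines() loads every line into a list up front, whereas iterating over the file object streams the input line by line, although the word list itself will still dominate memory use:
import re
chapter = re.compile(r'^Chapter\s*\d+')
words = re.compile(r'\b\w+\b')
wordlist = []
with open("huge_file.txt", "r") as f:
    for line in f:  # streams the file instead of loading it all at once
        if not chapter.match(line):
            wordlist.extend(words.findall(line))
print(len(wordlist))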
Answer 2:
Match what you do not need and capture what you need, using the fact that re.findall only returns the captured values when the pattern contains a capture group:
re.findall(r'\bChapter\s*\d+\b|\b(\w+)\b',s)
Details:
\bChapter\s*\d+\b - match Chapter as a whole word, followed by zero or more whitespace chars and one or more digits (this part is not captured)
| - or
\b(\w+)\b - match and capture into Group 1 one or more word chars
To avoid getting empty values in the resulting list, filter them out:
import re
s = "Chapter 1: Black brown fox 45"
# filter(None, ...) drops the empty strings produced when the non-captured alternative matches;
# list() is needed in Python 3, where filter returns an iterator.
print(list(filter(None, re.findall(r'\bChapter\s*\d+\b|\b(\w+)\b', s))))
# ['Black', 'brown', 'fox', '45']
Source: https://stackoverflow.com/questions/40690720/how-to-make-exceptions-for-certain-words-in-regex