How do I count all occurrences of a phrase in a text file using regular expressions?

Question

I am reading in multiple files from a directory and attempting to find how many times a specific phrase (in this instance "at least") occurs in each file (not just whether it occurs, but how many times it occurs in each text file). My code is as follows:

import glob
import os

path = 'D:/Test'

k = 0

for filename in glob.glob(os.path.join(path, '*.txt')):
    if filename.endswith('.txt'):
        f = open(filename)
        data = f.read()
        data.split()
        data.lower()
        S = re.findall(r' at least ', data, re.MULTILINE)
        count = []
        if S == True:
         for S in data:
          count.append(data.count(S))
          k= k + 1
          print("'{}' match".format(filename), count)
        else:
         print("'{}' no match".format(filename))
print("Total number of matches", k)

At this moment I get no matches at all. I can count whether or not there is an occurrence of the phrase but am not sure why I can't get a count of all occurrences in each text file.

Any help would be appreciated.

regards

Answer 1:

You can get rid of the regex entirely: the count method of string objects is enough. Much of the other code can be simplified as well.

You're also not changing data to lower case. data.lower() returns a new lowercased string and leaves data itself untouched, so the result is simply thrown away. Note how I use data = data.lower() to actually change the variable.

Try this code:

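import glob
import os

# Raw string so the backslashes are not treated as escape sequences
path = r'c:\script\lab\Tests'

k = 0  # total number of matches across all files

substring = ' at least '
# glob already restricts the results to .txt files
for filename in glob.glob(os.path.join(path, '*.txt')):
    with open(filename) as f:  # the file is closed automatically
        data = f.read()
    data = data.lower()  # lower() returns a new string, so reassign it
    S = data.count(substring)  # number of non-overlapping occurrences
    if S:
        k += S
        print("'{}' match".format(filename), S)
    else:
        print("'{}' no match".format(filename))
print("Total number of matches", k)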
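One caveat with the substring ' at least ': because of the surrounding spaces, an occurrence at the very start or end of the text, or directly before a line break or punctuation, is not counted. If those should count too, a regex with word boundaries is one option. The following is only a sketch: the helper name count_phrase and the sample string are made up for illustration.

```python
import re

def count_phrase(text, phrase="at least"):
    """Count whole-word, case-insensitive occurrences of phrase in text.

    count_phrase is a hypothetical helper, not part of the original code.
    """
    pattern = r"\b" + re.escape(phrase) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Made-up sample: one occurrence at the start, one mid-line, one before a newline
sample = "At least once.\nWe need at least two, or at least\nthree."

print(sample.lower().count(" at least "))  # 1: only the mid-line occurrence
print(count_phrase(sample))                # 3: all three occurrences
```

Which of the two is right depends on whether "at least" at a line boundary should count for your purposes.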

If anything is unclear feel free to ask!

Answer 2:

You make multiple mistakes in your code. data.split() and data.lower() have no effect at all, since neither of them modifies data; each returns a modified copy instead. Because you don't assign the return value to anything, it is lost. Also, you should always close a resource (e.g. a file) when you don't need it anymore.
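To illustrate the point about return values: Python strings are immutable, so lower() and split() hand back new objects and leave the original untouched. Here is a short sketch of that, which also shows a with block as one way to make sure the file gets closed. The sample string and the temporary file name are made up for illustration.

```python
import os
import tempfile

data = "At Least one"
data.lower()                 # result is discarded; data is unchanged
unchanged = data             # still "At Least one"
data = data.lower()          # reassign to keep the lowercased text
words = data.split()         # split() returns a list; data is still a str

# Write a throwaway file so the example is self-contained
tmp_path = os.path.join(tempfile.mkdtemp(), "example.txt")
with open(tmp_path, "w") as f:
    f.write(data)

with open(tmp_path) as f:    # closed automatically when the block ends
    text = f.read()
closed_after_block = f.closed
```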

Also, you store every string found by re.findall in a list S, which you then don't use for anything useful. Its contents would be of little help anyway, because the list just contains the string you are looking for, repeated once per match. Instead, take the list returned by re.findall and compute its length. This gives you the number of times the phrase occurs in the text. Then increase your counter variable k by that amount and move on to the next file. You can still have your print statements by simply printing the temporary num_found variable.

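import re
import glob
import os

path = 'D:/Test'

k = 0  # total number of matches across all files

for filename in glob.glob(os.path.join(path, '*.txt')):
    if filename.endswith('.txt'):
        f = open(filename)
        text = f.read()
        f.close()
        # findall returns a list of all matches; its length is the count.
        # re.MULTILINE only changes how ^ and $ match, so it has no effect here.
        num_found = len(re.findall(r' at least ', text, re.MULTILINE))
        print("'{}'".format(filename), num_found)
        k += num_found
print("Total number of matches", k)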
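As for why the original code reports no matches at all: re.findall returns a list, and a list never compares equal to True, so the test S == True fails even when matches were found. Checking the list's truthiness or its length works instead. A small sketch with a made-up sample string:

```python
import re

# Made-up sample text containing the phrase twice
text = "we need at least one and at least two"
S = re.findall(r" at least ", text)

print(S == True)   # False: a list is never equal to True
print(bool(S))     # True: a non-empty list is truthy
print(len(S))      # 2: number of non-overlapping matches
```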

Source: https://stackoverflow.com/questions/65295940/how-do-i-count-all-occurrences-of-a-phrase-in-a-text-file-using-regular-expressi

Tags

python-3.x

regex

nlp