Parallel file matching, Python

I am trying to improve on a script which scans files for malicious code. We have a list of regex patterns in a file, one pattern on each line. These regex are for grep as our current implementation is basically a bash script find\grep combo. The bash script takes 358 seconds on my benchmark directory. I was able to write a python script that did this in 72 seconds but want to improve more. First I will post the base-code and then tweaks I have tried:

import os, sys, Queue, threading, re

fileList = []
rootDir = sys.argv[1]

class Recurser(threading.Thread):

    def __init__(self, queue, dir):
    self.queue = queue
    self.dir = dir
    threading.Thread.__init__(self)

    def run(self):
    self.addToQueue(self.dir)

    ## HELPER FUNCTION FOR INTERNAL USE ONLY
    def addToQueue(self,  rootDir):
      for root, subFolders, files in os.walk(rootDir):
    for file in files:
       self.queue.put(os.path.join(root,file))
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)
      self.queue.put(-1)

class Scanner(threading.Thread):

    def __init__(self, queue, patterns):
    self.queue = queue
    self.patterns = patterns
    threading.Thread.__init__(self)

    def run(self):
    nextFile = self.queue.get()
    while nextFile is not -1:
       #print "Trying " + nextFile
       self.scanFile(nextFile)
       nextFile = self.queue.get()


    #HELPER FUNCTION FOR INTERNAL UES ONLY
    def scanFile(self, file):
       fp = open(file)
       contents = fp.read()
       i=0
       #for patt in self.patterns:
       if self.patterns.search(contents):
      print "Match " + str(i) + " found in " + file

############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################
############MAIN MAIN MAIN MAIN##################


fileQueue = Queue.Queue()

#Get the shell scanner patterns
patterns = []
fPatt = open('/root/patterns')
giantRE = '('
for line in fPatt:
   #patterns.append(re.compile(line.rstrip(), re.IGNORECASE))
   giantRE = giantRE + line.rstrip() + '|'

giantRE = giantRE[:-1] + ')'
giantRE = re.compile(giantRE, re.IGNORECASE)

#start recursing the directories
recurser = Recurser(fileQueue,rootDir)
recurser.start()

print "starting scanner"
#start checking the files
for scanner in xrange(0,8):
   scanner = Scanner(fileQueue, giantRE)
   scanner.start()

This is obviously debugging\ugly code, do not mind the million queue.put(-1), I will clean this up later. Some indentations are not showing up properly, paticularly in scanFile.

Anyway some things I've noticed. Using 1, 4, and even 8 threads (for scanner in xrange(0,???):) does not make a difference. I still get ~72 seconds regardless. I assume this is due to python's GIL.

As opposed to making a giant regex I tried placing each line (pattern) as a compilex RE in a list and iterating through this list in my scanfile function. This resulted in longer execution time.

In an effort to avoid python's GIL I tried having each thread fork to grep as in:

#HELPER FUNCTION FOR INTERNAL UES ONLY
def scanFile(self, file):
      s = subprocess.Popen(("grep", "-El", "--file=/root/patterns", file), stdout = subprocess.PIPE)
      output = s.communicate()[0]
      if output != '':
         print 'Matchfound in ' + file

This resulted in longer execution time.

Any suggestions on improving performance.

:::::::::::::EDIT::::::::

I can not post answers to my own questions yet however here are the answers to several points raised:

@David Nehme - Just to let people know I am aware of the fact that I have a million queue.put(-1)'s

@Blender - To mark the bottom of the queue. My scanner threads keep dequeing until they hit -1 which is at the bottom (while nextFile is not -1:). The processor cores is 8 however due to the GIL using 1 thread, 4 threads, or 8 threads does NOT make a difference. Spawning 8 subprocesses resulted in significantly slower code (142 sec vs 72)

@ed - Yes that and it's just as slow as the find\grep combo, actually slower because it indiscriminately greps file that aren't needed

@Ron - Can't upgrade, this must be universal. Do you think this will speed up > 72 seconds? The bash grepper does 358 seconds. My python giant RE method does 72 seconds w\ 1-8 threads. The popen method w\ 8 thrads (8 subprocesses) ran at 142 seconds. So far the giant RE python only method is a clear winner by far

@intuted

Here's the meat of our current find\grep combo (Not my script). It's pretty simple. There are some additional things in there like ls, but nothing that should result in a 5x slowdown. Even if grep -r is slightly more efficient 5x is a HUGE slowdown.

 find "${TARGET}" -type f -size "${SZLIMIT}" -exec grep -Eaq --file="${HOME}/patterns" "{}" \; -and -ls | tee -a "${HOME}/found.txt"

The python code is more efficient, I don't know why, but I experimentally tested it. I prefer to do this in python. I already achieved a speedup of 5x with python, I would like to get it sped up more.

:::::::::::::WINNER WINNER WINNER:::::::::::::::::

Looks like we have a winner.

intued's shell script comes in 2nd place with 34 seconds however @steveha's came in first with 24 seconds. Due to the fact that a lot of our boxes do not have python2.6 I had to cx_freeze it. I can write a shell script wrapper to wget a tar and unpack it. I do like intued's for simplicity however.

Thanks you for all your help guys, I now have an efficient tool for sysadmining

I think that, rather than using the threading module, you should be using the multiprocessing module for your Python solution. Python threads can run afoul of the GIL; the GIL is not a problem if you simply have multiple Python processes going.

I think that for what you are doing a pool of worker processes is just what you want. By default, the pool will default to one process for each core in your system processor. Just call the .map() method with a list of filenames to check and the function that does the checking.

http://docs.python.org/library/multiprocessing.html

If this is not faster than your threading implementation, then I don't think the GIL is your problem.

EDIT: Okay, I'm adding a working Python program. This uses a pool of worker processes to open each file and search for the pattern in each. When a worker finds a filename that matches, it simply prints it (to standard output) so you can redirect the output of this script into a file and you have your list of files.

EDIT: I think this is a slightly easier to read version, easier to understand.

I timed this, searching through the files in /usr/include on my computer. It completes the search in about half a second. Using find piped through xargs to run as few grep processes as possible, it takes about 0.05 seconds, about a 10x speedup. But I hate the baroque weird language you must use to get find to work properly, and I like the Python version. And perhaps on really big directories the disparity would be smaller, as part of the half-second for Python must have been startup time. And maybe half a second is fast enough for most purposes!

import multiprocessing as mp
import os
import re
import sys

from stat import S_ISREG


# uncomment these if you really want a hard-coded $HOME/patterns file
#home = os.environ.get('HOME')
#patterns_file = os.path.join(home, 'patterns')

target = sys.argv[1]
size_limit = int(sys.argv[2])
assert size_limit >= 0
patterns_file = sys.argv[3]


# build s_pat as string like:  (?:foo|bar|baz)
# This will match any of the sub-patterns foo, bar, or baz
# but the '?:' means Python won't bother to build a "match group".
with open(patterns_file) as f:
    s_pat = r'(?:{})'.format('|'.join(line.strip() for line in f))

# pre-compile pattern for speed
pat = re.compile(s_pat)


def walk_files(topdir):
    """yield up full pathname for each file in tree under topdir"""
    for dirpath, dirnames, filenames in os.walk(topdir):
        for fname in filenames:
            pathname = os.path.join(dirpath, fname)
            yield pathname

def files_to_search(topdir):
    """yield up full pathname for only files we want to search"""
    for fname in walk_files(topdir):
        try:
            # if it is a regular file and big enough, we want to search it
            sr = os.stat(fname)
            if S_ISREG(sr.st_mode) and sr.st_size >= size_limit:
                yield fname
        except OSError:
            pass

def worker_search_fn(fname):
    with open(fname, 'rt') as f:
        # read one line at a time from file
        for line in f:
            if re.search(pat, line):
                # found a match! print filename to stdout
                print(fname)
                # stop reading file; just return
                return

mp.Pool().map(worker_search_fn, files_to_search(target))

I'm a bit confused as to how your Python script ended up being faster than your find/grep combo. If you want to use grep in a way somewhat similar to what's suggested by Ron Smith in his answer, you can do something like

find -type f | xargs -d \\n -P 8 -n 100 grep --file=/root/patterns

to launch grep processes which will process 100 files before exiting, keeping up to 8 such processes active at any one time. Having them process 100 files should make the process startup overhead time of each one negligible.

note: The -d \\n option to xargs is a GNU extension which won't work on all POSIX-ish systems. It specifies that the *d*elimiter between filenames is a newline. Although technically filenames can contain newlines, in practice nobody does this and keeps their jobs. For compatibility with non-GNU xargs you need to add the -print0 option to find and use -0 instead of -d \\n with xargs. This will arrange for the null byte \0 (hex 0x00) to be used as the delimiter both by find and xargs.

You could also take the approach of first counting the number of files to be grepped

NUMFILES="$(find -type f | wc -l)";

and then using that number to get an even split among the 8 processes (assuming bash as shell)

find -type f | xargs -d \\n -P 8 -n $(($NUMFILES / 8 + 1)) grep --file=/root/patterns

I think this might work better because the disk I/O of find won't be interfering with the disk I/O of the various greps. I suppose it depends in part on how large the files are, and whether they are stored contiguously — with small files, the disk will be seeking a lot anyway, so it won't matter as much. Note also that, especially if you have a decent amount of RAM, subsequent runs of such a command will be faster because some of the files will be saved in your memory cache.

Of course, you can parameterize the 8 to make it easier to experiment with different numbers of concurrent processes.

As ed. mentions in the comments, it's quite possible that the performance of this approach will still be less impressive than that of a single-process grep -r. I guess it depends on the relative speed of your disk [array], the number of processors in your system, etc.

If you are willing to upgrade to version 3.2 or better, you can take advantage of the concurrent.futures.ProcessPoolExecutor. I think it will improve performance over the popen method you attempted because it will pre-create a pool of processes where your popen method creates a new process every time. You could write your own code to do the same thing for an earlier version if you can't move to 3.2 for some reason.

Let me also show you how to do this in Ray, which is an open-source framework for writing parallel Python applications. The advantage of this approach is that it is fast, easy to write and extend (say you want to pass a lot of data between the tasks or do some stateful accumulation), and can also be run on a cluster or the cloud without modifications. It's also very efficient at utilizing all cores on a single machine (even for very large machines like 100 cores) and data transfer between tasks.

import os
import ray
import re

ray.init()

patterns_file = os.path.expanduser("~/patterns")
topdir = os.path.expanduser("~/folder")

with open(patterns_file) as f:
    s_pat = r'(?:{})'.format('|'.join(line.strip() for line in f))

regex = re.compile(s_pat)

@ray.remote
def match(pattern, fname):
    results = []
    with open(fname, 'rt') as f:
        for line in f:
            if re.search(pattern, line):
                results.append(fname)
    return results

results = []
for dirpath, dirnames, filenames in os.walk(topdir):
    for fname in filenames:
        pathname = os.path.join(dirpath, fname)
        results.append(match.remote(regex, pathname))

print("matched files", ray.get(results))

More information including how to run this on a cluster or the cloud is available in the documentatation

来源：https://stackoverflow.com/questions/7623211/parallel-file-matching-python

标签

python

multithreading

string

parallel-processing

gil