Fast Text Search Over Logs

Submitted by 被刻印的时光 ゝ on 2019-12-23 07:59:07

Question


Here's the problem I'm having: I've got a set of logs that can grow fairly quickly. They're split into individual files every day, and the files can easily grow up to a gig in size. To help keep the size down, entries older than 30 days or so are cleared out.

The problem is when I want to search these files for a certain string. Right now, a Boyer-Moore search is unfeasibly slow. I know that applications like dtSearch can provide a really fast search using indexing, but I'm not really sure how to implement that without taking up twice the space a log already takes up.

Are there any resources I can check out that can help? I'm really looking for a standard algorithm that'll explain what I should do to build an index and use it to search.

Edit:
Grep won't work, as this search needs to be integrated into a cross-platform application. There's no way I'll be able to swing including any external program in it.

The way it works is that there's a web front end that has a log browser. This talks to a custom C++ web server backend. This server needs to search the logs in a reasonable amount of time. Currently searching through several gigs of logs takes ages.

Edit 2: Some of these suggestions are great, but I have to reiterate that I can't integrate another application; it's part of the contract. But to answer some questions, the data in the logs ranges from received messages in a health-care-specific format to messages relating to those. I'm looking to rely on an index because, while it may take up to a minute to rebuild the index, searching currently takes a very long time (I've seen it take up to 2.5 minutes). Also, a lot of the data IS discarded before it's even recorded. Unless some debug logging options are turned on, more than half of the log messages are ignored.

The search basically goes like this: a user on the web form is presented with a list of the most recent messages (streamed from disk as they scroll, yay for AJAX). Usually they'll want to search for messages with some information in them, maybe a patient ID or some string they've sent, so they can enter the string into the search. The search gets sent asynchronously, and the custom web server linearly searches through the logs 1 MB at a time for results. This process can take a very long time when the logs get big, and it's what I'm trying to optimize.
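For reference, here is a rough sketch (not the asker's actual server code) of the kind of chunked linear scan described above: read the log 1 MB at a time, keep a small overlap between chunks so matches spanning a boundary are not missed, and run a Boyer-Moore-style search over each chunk. It assumes C++17 for std::boyer_moore_horspool_searcher; the function name scan_log and the chunk size are illustrative only.

```cpp
#include <algorithm>
#include <cstddef>
#include <fstream>
#include <functional>
#include <string>
#include <vector>

// Return the byte offsets of every occurrence of `needle` in the file at `path`.
std::vector<std::size_t> scan_log(const std::string& path, const std::string& needle)
{
    std::vector<std::size_t> hits;
    if (needle.empty()) return hits;

    constexpr std::size_t kChunk = 1 << 20;          // read 1 MB at a time
    const std::size_t overlap = needle.size() - 1;   // tail carried into the next chunk

    std::ifstream in(path, std::ios::binary);
    std::string buf;            // current window of the file
    std::size_t base = 0;       // file offset of buf[0]

    std::boyer_moore_horspool_searcher searcher(needle.begin(), needle.end());

    while (in) {
        std::string chunk(kChunk, '\0');
        in.read(&chunk[0], static_cast<std::streamsize>(kChunk));
        chunk.resize(static_cast<std::size_t>(in.gcount()));
        if (chunk.empty()) break;
        buf += chunk;

        // Find every match in the current window.
        for (auto it = std::search(buf.begin(), buf.end(), searcher);
             it != buf.end();
             it = std::search(it + 1, buf.end(), searcher)) {
            hits.push_back(base + static_cast<std::size_t>(it - buf.begin()));
        }

        // Keep only the last needle.size()-1 bytes so a match straddling
        // the chunk boundary is still found on the next iteration.
        if (buf.size() > overlap) {
            base += buf.size() - overlap;
            buf.erase(0, buf.size() - overlap);
        }
    }
    return hits;
}
```

Even with a fast matcher, this still touches every byte of every log on every query, which is why the whole scan scales with log size.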


Answer 1:


Check out the algorithms that Lucene uses to do its thing. They aren't likely to be very simple, though. I had to study some of these algorithms once upon a time, and some of them are very sophisticated.

If you can identify the "words" in the text you want to index, just build a large hash table of the words, mapping a hash of each word to its occurrences in each file. When a search is run, you can then check each location to confirm the search term actually falls there, rather than just a word with a matching hash. If users repeat the same search frequently, cache the search results.
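A minimal sketch of that word-index idea, assuming words can be split on non-alphanumeric characters; the Posting struct, index_file, and lookup names are illustrative, and std::unordered_map does the hashing of each word internally. The caller is expected to seek to each returned offset and verify the surrounding text really contains the full search term.

```cpp
#include <cctype>
#include <cstddef>
#include <fstream>
#include <iterator>
#include <string>
#include <unordered_map>
#include <vector>

struct Posting {
    std::string file;     // which log file the word appears in
    std::size_t offset;   // byte offset of the word within that file
};

using Index = std::unordered_map<std::string, std::vector<Posting>>;

// Tokenise one log file into alphanumeric "words" and record each occurrence.
void index_file(Index& index, const std::string& path)
{
    std::ifstream in(path, std::ios::binary);
    std::string data((std::istreambuf_iterator<char>(in)),
                     std::istreambuf_iterator<char>());

    std::size_t i = 0;
    while (i < data.size()) {
        while (i < data.size() && !std::isalnum(static_cast<unsigned char>(data[i]))) ++i;
        std::size_t start = i;
        while (i < data.size() && std::isalnum(static_cast<unsigned char>(data[i]))) ++i;
        if (i > start)
            index[data.substr(start, i - start)].push_back({path, start});
    }
}

// Look up one word; returns nullptr if it never occurs.
// Each posting should then be verified against the actual file contents.
const std::vector<Posting>* lookup(const Index& index, const std::string& word)
{
    auto it = index.find(word);
    return it == index.end() ? nullptr : &it->second;
}
```

With this shape, a query becomes a hash lookup plus a handful of targeted reads, instead of a scan over every byte of every log.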

Also, who really cares if the index is larger than the files themselves? If your system is really this big, with so much activity, is a few dozen gigs for an index the end of the world?




Answer 2:


grep usually works pretty well for me with big logs (sometimes 12G+). You can find a version for Windows here as well.




Answer 3:


You'll most likely want to integrate some type of indexing search engine into your application. There are dozens out there; Lucene seems to be very popular. Check these two questions for some more suggestions:

Best text search engine for integrating with custom web app?

How do I implement Search Functionality in a website?




Answer 4:


More details on the kind of search you're performing could definitely help. Why, in particular, do you want to rely on an index, since you'll have to rebuild it every day when the logs roll over? What kind of information is in these logs? Can some of it be discarded before it is ever even recorded?

How long are these searches taking now?




Answer 5:


You may want to check out the source for BSD grep. You may not be able to rely on grep being there for you, but nothing says you can't recreate similar functionality, right?
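As one hedged illustration of "recreating similar functionality" inside the application rather than shelling out, a grep-like line scan is only a few lines of C++; the function name grep_file is illustrative, and this does plain substring matching with no regular expressions.

```cpp
#include <fstream>
#include <iostream>
#include <string>

// Stream the log line by line and print every line containing `pattern`,
// prefixed with the file name and line number (grep-style output).
void grep_file(const std::string& path, const std::string& pattern)
{
    std::ifstream in(path);
    std::string line;
    std::size_t lineno = 0;
    while (std::getline(in, line)) {
        ++lineno;
        if (line.find(pattern) != std::string::npos)
            std::cout << path << ':' << lineno << ':' << line << '\n';
    }
}
```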




Answer 6:


Splunk is great for searching through lots of logs. May be overkill for your purpose. You pay according to the amount of data (size of the logs) you want to process. I'm pretty sure they have an API so you don't have to use their front-end if you don't want to.



Source: https://stackoverflow.com/questions/163783/fast-text-search-over-logs
