Searching for a string in a large text file - profiling various methods in python

傲寒 2020-12-02 05:54

This question has been asked many times. After spending some time reading the answers, I did some quick profiling to try out the various methods mentioned previously...

6 Answers
  • 2020-12-02 06:21

    Without building an index file, your searching will be too slow, and building one is not a simple task. So it is better to use already-developed software; the best option would be the Sphinx Search Engine.

  • 2020-12-02 06:27

    Variant 1 is great if you need to run many searches one after another. Since a set is internally a hash table, it is quite good at lookups. It takes time to build, though, and only works well if your data fit into RAM.

    Variant 3 is good for very big files, because you have plenty of address space to map them and the OS caches the data it touches. You do a full scan, though, and it can become rather slow once your data stops fitting into RAM.
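
    For reference, assuming Variant 1 is the set-based membership test and Variant 3 the mmap-backed full scan from the question, the two approaches look roughly like this (a sketch, not the exact code being profiled):

    import mmap

    # Variant 1 (assumed): load every line into a set and test membership.
    def build_set(path):
        with open(path) as f:
            return set(f)              # entries keep their trailing newline

    # Variant 3 (assumed): memory-map the file and do a full substring scan;
    # search for b'\n' + needle + b'\n' to approximate whole-line matches.
    def mmap_contains(path, needle):
        with open(path, 'rb') as f:
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            try:
                return mm.find(needle) != -1
            finally:
                mm.close()

    print('Some test string\n' in build_set('input.txt'))
    print(mmap_contains('input.txt', b'Some test string'))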

    SQLite is definitely a nice idea if you need several searches in a row and you can't fit the data into RAM. Load your strings into a table, build an index, and SQLite builds a nice B-tree for you. The tree can fit into RAM even if the data don't (it's a bit like what @alienhard proposed), and even if it doesn't, the amount of I/O needed is dramatically lower. Of course, you need to create a disk-based SQLite database; I doubt that an in-memory SQLite database will beat Variant 1 significantly.
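
    A minimal sketch of that disk-based SQLite approach with the standard-library sqlite3 module (table, file, and search-string names are just placeholders):

    import sqlite3

    def build_db(text_path, db_path):
        conn = sqlite3.connect(db_path)
        # The UNIQUE constraint gives us the B-tree index that makes lookups fast.
        conn.execute('CREATE TABLE IF NOT EXISTS lines (line TEXT UNIQUE)')
        with open(text_path) as f:
            conn.executemany('INSERT OR IGNORE INTO lines VALUES (?)',
                             ((line.rstrip('\n'),) for line in f))
        conn.commit()
        return conn

    def contains(conn, s):
        cur = conn.execute('SELECT 1 FROM lines WHERE line = ? LIMIT 1', (s,))
        return cur.fetchone() is not None

    conn = build_db('input.txt', 'lines.db')
    print(contains(conn, 'Some test string'))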

  • 2020-12-02 06:35

    You could also try

    with open('input.txt') as f:
        # search_str is matched against each line in turn; returns on the first match:
        print(search_str in f)
    

    with search_str ending with the proper newline sequence ('\n' or '\r\n'). This should use little memory, as the file is read progressively. It should also be quite fast, since the scan stops as soon as a match is found, so only part of the file is read unless there is no match.

  • 2020-12-02 06:37

    What about a text-indexing solution?

    I would use Lucene in the Java world, but there is a Python engine called Whoosh:

    https://bitbucket.org/mchaput/whoosh/wiki/Home
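
    A rough sketch of how indexing each line with Whoosh and querying the index might look (directory, field, and file names are just placeholders):

    import os
    from whoosh.index import create_in
    from whoosh.fields import Schema, ID
    from whoosh.query import Term

    # One document per line; an ID field stores the line as a single untokenized term.
    schema = Schema(line=ID(stored=True))
    if not os.path.exists('indexdir'):
        os.mkdir('indexdir')
    ix = create_in('indexdir', schema)

    writer = ix.writer()
    with open('input.txt') as f:
        for line in f:
            writer.add_document(line=line.rstrip('\n'))
    writer.commit()

    with ix.searcher() as searcher:
        results = searcher.search(Term('line', 'Some test string'), limit=1)
        print(len(results) > 0)

    An index like this pays off mostly when you also want tokenized or more complex queries, not just a single exact-line lookup.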

  • 2020-12-02 06:38

    I would guess many of the paths on DMOZ start out the same. You should use a trie data structure and store the individual characters on the nodes.

    Tries have O(m) lookup time (where m is the key length) and also save a lot of space when storing large dictionaries or tree-like data.
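
    A minimal character-level trie for exact membership tests might look like this (an illustrative sketch, not one of the packaged implementations mentioned below):

    class Trie(object):
        def __init__(self):
            self.root = {}

        def add(self, key):
            node = self.root
            for ch in key:
                node = node.setdefault(ch, {})
            node[None] = True             # None key marks the end of a stored string

        def __contains__(self, key):      # O(m) in the length of the key
            node = self.root
            for ch in key:
                if ch not in node:
                    return False
                node = node[ch]
            return None in node

    trie = Trie()
    with open('input.txt') as f:
        for line in f:
            trie.add(line.rstrip('\n'))
    print('Some test string' in trie)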

    You could also store path segments on the nodes to reduce the node count; this variant is called a Patricia trie (radix tree). The trade-off is that lookups get slower by the cost of comparing the stored segments, roughly proportional to the average segment length. See the SO question Trie (Prefix Tree) in Python for more info about implementations.

    There are a couple of trie implementations on the Python Package Index, but they are not very good. I have written one in Ruby and in Common Lisp, which is especially well suited for this task – if you ask nicely, I could maybe publish it as open source... :-)

  • 2020-12-02 06:41

    Custom hash table search with externalized strings

    To get fast access times and lower memory consumption, you could do the following:

    • for each line, compute a string hash and add it to a hash table, e.g., index[hash] = position (do not store the string). If there is a collision, store all file positions for that key in a list.
    • to look up a string, compute its hash and look it up in the table. If the key is found, read the string at that position from the file to verify you really have a match. If there are multiple positions, check each one until you find a match or run out of candidates.

    Edit 1: replaced line_number by position (as pointed out by a commenter, one obviously needs the actual position and not line numbers)

    Edit 2: provided code for an implementation with a custom hash table, which shows that this approach is more memory-efficient than the other approaches mentioned:

    from collections import namedtuple
    Node = namedtuple('Node', ['pos', 'next'])

    def build_table(f, size):
        table = [ None ] * size
        while True:
            pos = f.tell()              # byte offset of the line we are about to read
            line = f.readline()
            if not line: break
            i = hash(line) % size
            if table[i] is None:
                table[i] = pos          # first entry in this slot: store the bare position
            else:
                table[i] = Node(pos, table[i])   # collision: prepend a node to the chain
        return table

    def search(string, table, f):
        i = hash(string) % len(table)
        entry = table[i]
        while entry is not None:
            pos = entry.pos if isinstance(entry, Node) else entry
            f.seek(pos)
            if f.readline() == string:  # re-read the line to verify a real match
                return True
            entry = entry.next if isinstance(entry, Node) else None
        return False

    SIZE = 2**24
    with open('data.txt', 'r') as f:
        table = build_table(f, SIZE)
        print(search('Some test string\n', table, f))


    The hash of a line is only used to index into the table (if we used a normal dict, the hashes would also be stored as keys). The file position of the line is stored at the given index. Collisions are resolved with chaining, i.e., we create a linked list. However, the first entry is never wrapped in a node (this optimization makes the code a bit more complicated, but it saves quite a bit of space).

    For a file with 6 million lines I chose a hash table size of 2^24. With my test data I got 933132 collisions. (A hash table of half the size was comparable in memory consumption, but resulted in more collisions. Since more collisions mean more file accesses during searches, I would rather use a large table.)

    Hash table: 128MB (sys.getsizeof([None]*(2**24)))
    Nodes:       64MB (sys.getsizeof(Node(None, None)) * 933132)
    Pos ints:   138MB (6000000 * 24)
    -----------------
    TOTAL:      330MB (real memory usage of python process was ~350MB)
    