Searching for a string in an input stream

拟墨画扇 提交于 2019-12-08 16:22:54

问题


I have a large binary file (many gigabytes, so loading it into memory is not an option) that I want to search for all occurrences of the string "icpf".

I tried using std::search for this, but just got bitten by the fact that std::search only works for forward iterators, not input iterators.

Does the standard library provide a fast alternative for this? Or do I need to hand-code the search (either reading in chunks at a time then std::search on those, or ignore everything until an 'i' and then manually check the next three characters)?


回答1:


Does the standard library provide a fast alternative for this?

Although the standard C++ library offers ways to search text streams, it does not offer comparable algorithms for binary streams.

Or do I need to hand-code the search (either reading in chunks at a time then std::search on those, or ignore everything until an 'i' and then manually check the next three characters)?

Coding the "skip and search" approach could be tricky, because it is easy to code a solution that skips entries. For example, if you are looking for "icpf" in a file containing "icpicpf", a simple program that processes one character at a time will fail to find "icpf" suffix after discarding "icpi" prefix.

If you are going to code this yourself, consider implementing Knuth–Morris–Pratt algorithm. There are many implementations available online, and it operates correctly on streams, because it considers one character at a time, and never goes back.




回答2:


The fastest method is to load the entire file into memory, then search the memory.

The next best alternative is to keep the hard drive in motion. Perhaps have one thread that reads chunks of data into a buffer and another thread that searches the buffer.

Going down the list, reading in large chunks of data into a buffer, then searching the buffer is a good technique, although not as efficient as the previous methods.

You could read line by line, using std::getline and std::string. This is not as fast as block reading because the input function is searching for the newline character (and allocating memory in the std::string).

The worst case is probably reading character by character. The function overhead is bad for reading a single characters (usually the overhead is the same for reading a large block of data).

No, there is no standard C++ library function for searching files. Some Operating Systems have utilities for searching files; perhaps you can use one of those.

Edit 1:
The bottleneck is inputting the data. Once you get the data into a buffer, then are many efficient search algorithms rather than the brute force (searching for the first letter, then searching for the next letters, etc.).

Search the internet for "string search algorithm".




回答3:


I don't know of any pure standard library solution, but the kernel already implements prefetching, so it should be possible to mmap() the file to get the required forward iterators: (error handling omitted)

size_t search(int fd, size_t fileSize) {
    auto start = reinterpret_cast<char*>(
        ::mmap(nullptr, fileSize, PROT_READ, MAP_PRIVATE | MAP_NORESERVE, fd, 0));
    ::madvise(start, fileSize, MADV_SEQUENTIAL);
    auto pattern = "icpf";
    auto offset = std::search(start, start+fileSize, pattern, pattern+4);
    return offset - start;
}

Its a small leap of faith, trusting your kernel to do the lazy loading, prefetching and discarding correctly. On the other hand, if you can trust anyone with this, it would probably be kernel developers.

Disclaimer: I didn't actually test this on a multi-gigabyte file.



来源:https://stackoverflow.com/questions/35560124/searching-for-a-string-in-an-input-stream

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!