The standard way to do this is to implement the Aho-Corasick algorithm. It reads the file one time and finds all occurrences of all the strings you give it. See https://www.informit.com/guides/content.aspx?g=dotnet&seqNum=869 for an article that provides an implementation and some examples.
Update after more info
Assuming that the list of numbers in your file A is small enough to fit in memory, here's what you'd do, using the implementation in the above-linked article:
// Construct the automaton
AhoCorasickStringSearcher matcher = new AhoCorasickStringSearcher();
foreach (var searchWord in File.ReadLines(File_a)
{
matcher.AddItem(searchWord);
}
matcher.CreateFailureFunction();
// And then do the search on each file
foreach (var fileName in listOfFiles)
{
foreach (var line in File.ReadLines(filename))
{
var matches = matcher.Search(line);
foreach (m in matches)
{
// output match
}
}
}
Note that it only makes one pass through each file, and it never has to load more than one line of the file into memory at any time. The limiting factor here is the memory it takes to build the automaton.
I've used this to search files that totaled over 100 gigabytes, for about 15 million different strings. It takes a few minutes to build the automaton, but then it searches very quickly. One really nice property of the algorithm is that its complexity is O(n + m), where n is the size of the input files, and m is the number of matched items. The number of strings it's searching for doesn't matter. It can search for a million different strings just as quickly as it can search for one or two.
100 gigabytes will take you ... something on the order of about 40 minutes to read. I'd be really surprised if it took an hour for this to find all occurrences of 15 million different strings in 100 gigabytes of data.
Matching whole words
Another option, if you're searching for whole words is to ditch the Aho-Corasick algorithm. Instead, load all of the numbers you're looking for into a HashSet<string>
. Then read each line and use a regular expression to find all of the numbers in the line and check to see if they exist in the hash set. For example:
Regex re = new Regex("\w+");
foreach (var line in File.ReadLines(filename))
{
var matches = re.Matchs(line);
foreach (var m in matches)
{
if (hashSetOfValues.Contains(m))
{
// output match
}
}
}
This will likely be somewhat slower than the Aho-Corasick algorithm, but it still makes only one pass through the data. This assumes, of course, that you have enough memory to hold all of those numbers in a hash set.
There are other options for whole words, as I mention in the comments.
Another option, if you know that the words you're looking for are always separated by spaces, is to add spaces to the start and end of the words that you add to the automaton. Or, with some modification to the implementation itself, you could force the matcher's Search
method to only return matches that occur in whole words. That could more easily handle matches at the start and end of lines, and additional non-word characters.