Question
Is there a faster way to search for each line of one text file in another text file than going line by line through both files?
I have two text files: one has ~2,500 lines (call it TxtA), the other has ~86,000 lines (TxtB). I want to search TxtB for each line in TxtA, and return the line in TxtB for each match found.
I currently have this set up as: for each line in TxtA, search TxtB line by line for a match. However, this takes a very long time to process - it looks like it would take 1-3 hours to find all the matches.
Here is my code...
private static void getGUIDAndType()
{
    try
    {
        Console.WriteLine("Begin.");
        System.Threading.Thread.Sleep(4000);

        String dbFilePath = @"C:\WindowsApps\CRM\crm_interface\data\";
        StreamReader dbsr = new StreamReader(dbFilePath + "newdbcontents.txt");

        String newDataPath = @"C:\WindowsApps\CRM\crm_interface\data\";
        StreamReader nsr = new StreamReader(newDataPath + "HolidayList1.txt");

        string dbline;
        string newline;
        List<string> results = new List<string>();

        while ((newline = nsr.ReadLine()) != null)
        {
            // Reset the db reader to the start of the file for each new line
            dbsr.BaseStream.Position = 0;
            dbsr.DiscardBufferedData();
            while ((dbline = dbsr.ReadLine()) != null)
            {
                newline = newline.Trim();
                if (dbline.IndexOf(newline) != -1)
                {
                    // if found... get all info for now
                    Console.WriteLine("FOUND: " + newline);
                    System.Threading.Thread.Sleep(1000);
                    results.Add(newline);
                    break;
                }
                else
                {
                    // this db line does not contain the new line - go to the next dbline
                    Console.WriteLine("Lines do not match - continuing");
                    continue;
                }
            }
            Console.WriteLine("Going to next new Line");
            System.Threading.Thread.Sleep(1000);
        }
        nsr.Close();
        dbsr.Close();

        Console.WriteLine("Writing to dbc3.txt");
        System.IO.File.WriteAllLines(@"C:\WindowsApps\CRM\crm_interface\data\dbc3.txt", results.ToArray());
        Console.WriteLine("Finished. Press ENTER to continue.");
        Console.WriteLine("End.");
        Console.ReadLine();
    }
    catch (Exception ex)
    {
        Console.WriteLine("Error: " + ex);
        Console.ReadLine();
    }
}
Please let me know if there is a faster way - preferably something that would take 5-10 minutes. I've heard of indexing but didn't find much on it for txt files. I've tested regex and it's no faster than IndexOf. A plain Contains (exact equality) check won't work because the lines will never be exactly the same.
Thanks.
Answer 1:
EDIT: Note that I'm assuming it's reasonable to read at least one file into memory. You may want to swap the queries below around to avoid loading the "big" file into memory, but even 86,000 lines at (say) 1K per line is only around 100-200 MB of memory - relatively little for doing something significant.
You're reading the "inner" file each time. There's no need for that. Load both files into memory and go from there. Heck, for exact matches you can do the whole thing in LINQ easily:
var query = from line1 in File.ReadLines(newDataPath + "HolidayList1.txt")
            join line2 in File.ReadLines(dbFilePath + "newdbcontents.txt")
                on line1 equals line2
            select line1;
var commonLines = query.ToList();
But for non-joins it's still simple; just read one file completely first (explicitly) and then stream the other:
// Eagerly read the "inner" file
var lines2 = File.ReadAllLines(dbFilePath + "newdbcontents.txt");
var query = from line1 in File.ReadLines(newDataPath + "HolidayList1.txt")
            from line2 in lines2
            where line2.Contains(line1)
            select line1;
var commonLines = query.ToList();
There's nothing clever here - it's just a really simple way of writing code to read all the lines in one file, then iterate over the lines in the other file and for each line check against all the lines in the first file. But even without anything clever, I strongly suspect it would perform well enough for you. Concentrate on simplicity, eliminate unnecessary IO, and see whether that's good enough before trying to do anything fancier.
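For the swap mentioned in the edit above - keeping only the small file in memory and streaming the big one - a sketch might look like this (using the same path variables as the question, and selecting line2 since the question asks for the matching TxtB line):
// Eagerly read the small file instead, and stream the big one
var lines1 = File.ReadAllLines(newDataPath + "HolidayList1.txt");
var query = from line2 in File.ReadLines(dbFilePath + "newdbcontents.txt")
            from line1 in lines1
            where line2.Contains(line1)
            select line2;
var matchingDbLines = query.ToList();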
Note that in your original code, you should be using using statements for your StreamReader variables, to ensure they get disposed properly. Using the above code makes it simple to not even need that, though...
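For reference, a minimal sketch of that pattern applied to the original readers (same paths as in the question):
using (StreamReader dbsr = new StreamReader(dbFilePath + "newdbcontents.txt"))
using (StreamReader nsr = new StreamReader(newDataPath + "HolidayList1.txt"))
{
    // read lines exactly as before; both readers are disposed
    // automatically when this block exits, even if an exception is thrown
}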
Answer 2:
There might be a faster way, but this LINQ approach should be faster than 3 hours and is a sight better to read and maintain:
var f1Lines = File.ReadAllLines(f1Path);
var f2LineInf1 = File.ReadLines(f2Path)
                     .Where(line => f1Lines.Contains(line))
                     .ToList();
Edit: tested, and it required less than 1 second for 400,000 lines in file2 and 17,000 lines in file1. I can use File.ReadLines for the big file, which does not load everything into memory at once. For the smaller file I need to use File.ReadAllLines, since Contains needs the complete list of lines of file1.
If you want to log the result in a third file:
File.WriteAllLines(logPath, f2LineInf1);
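As a side note (my own variation, not part of the timing test above): Contains on a string[] scans every element, so wrapping the small file's lines in a HashSet<string> makes each lookup O(1):
var f1Lines = new HashSet<string>(File.ReadAllLines(f1Path));
var f2LineInf1 = File.ReadLines(f2Path)
                     .Where(line => f1Lines.Contains(line))
                     .ToList();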
Answer 3:
Quick and dirty because I've got to go... If you can do it in memory, try working with this snippet:
// string[] searchIn = File.ReadAllLines("File1.txt");
// string[] searchFor = File.ReadAllLines("File2.txt");
string[] searchIn = new string[] { "A", "AB", "ABC", "ABCD", null, "", " " };
string[] searchFor = new string[] { "A", "BC", "BCD", null, "", " " };

// map each search term to the entries it matches in either direction
var matchDictionary = new Dictionary<string, string[]>();
foreach (string item in searchFor)
{
    string[] matchingItems = Array.FindAll(searchIn,
        x => (x == item) ||
             (!string.IsNullOrEmpty(x) && !string.IsNullOrEmpty(item)
                 ? (x.Contains(item) || item.Contains(x))
                 : false));
    if (item != null)
        matchDictionary[item] = matchingItems;
}
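A quick usage sketch for the dictionary built above (a hypothetical output loop, just to show the shape of the result):
foreach (var pair in matchDictionary)
{
    Console.WriteLine(pair.Key + " -> " + string.Join(", ", pair.Value));
}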
Source: https://stackoverflow.com/questions/9491181/searching-for-line-of-one-text-file-in-another-text-file-faster