Read large txt file multithreaded?

忘了有多久 2020-11-30 02:17

I have a large text file with 100,000 lines. I need to start n threads and give each thread a unique line from this file.

What is the best way to do this? I thin

6 answers
  • 2020-11-30 02:35

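    If you want to limit the number of threads to n, the easiest way is to use AsParallel() with WithDegreeOfParallelism(n), which caps the number of concurrently running tasks. Use ForAll() so the body runs on the worker threads; a plain foreach over the parallel query would execute its body sequentially on the calling thread:

    string filename = "C:\\TEST\\TEST.DATA";
    int n = 5;
    
    File.ReadLines(filename)
        .AsParallel()
        .WithDegreeOfParallelism(n)
        .ForAll(line =>
        {
            // Process line.
        });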
    
  • 2020-11-30 02:38

    You can use the File.ReadLines Method to read the file line-by-line without loading the whole file into memory at once, and the Parallel.ForEach Method to process the lines in multiple threads in parallel:

    Parallel.ForEach(File.ReadLines("file.txt"), (line, _, lineNumber) =>
    {
        // your code here
    });
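To cap the loop at n concurrent workers, as the question asks, one option is to pass a ParallelOptions instance to the same Parallel.ForEach call. The sketch below is only an illustration: the class and method names (BoundedParallelRead, ProcessLines) and the "record the line length" work are hypothetical placeholders. Note that MaxDegreeOfParallelism is an upper bound, not a guarantee that exactly n threads run.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;

public static class BoundedParallelRead
{
    // Processes lines with at most n concurrent workers and returns
    // each line's length keyed by its zero-based line index.
    public static ConcurrentDictionary<long, int> ProcessLines(IEnumerable<string> lines, int n)
    {
        var lengths = new ConcurrentDictionary<long, int>();
        var options = new ParallelOptions { MaxDegreeOfParallelism = n };

        Parallel.ForEach(lines, options, (line, state, lineNumber) =>
        {
            // Placeholder work (assumption): store the length of the line.
            lengths[lineNumber] = line.Length;
        });

        return lengths;
    }

    public static void Main()
    {
        // Write a small temporary file so the example is self-contained.
        string path = Path.GetTempFileName();
        File.WriteAllLines(path, new[] { "alpha", "be", "gamma!" });

        var result = ProcessLines(File.ReadLines(path), 2);
        Console.WriteLine("Lines processed: " + result.Count);

        File.Delete(path);
    }
}
```

Because each line arrives together with its index, results can be matched back to their position in the file even though the processing order is not deterministic.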
    
  • 2020-11-30 02:38

    As @dtb mentioned above, the fastest way to read a file and then process its individual lines is to (1) read the file into an array with File.ReadAllLines() and (2) iterate over the array with a Parallel.For loop.

    You can read more performance benchmarks here.

    The basic gist of the code you would have to write is:

    string[] AllLines = File.ReadAllLines(fileName);
    Parallel.For(0, AllLines.Length, x =>
    {
        DoStuff(AllLines[x]);
        //whatever you need to do
    });
    

    With the introduction of bigger array sizes in .Net4, as long as you have plenty of memory, this shouldn't be an issue.
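When the work per line is very cheap, the per-index delegate call in Parallel.For can dominate. A possible variation, sketched here under that assumption, hands each worker a contiguous index range via Partitioner.Create. The names RangePartitionExample and TotalLength are hypothetical, and summing line lengths stands in for DoStuff.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

public static class RangePartitionExample
{
    // Sums the lengths of all lines, giving each worker a contiguous index range.
    public static long TotalLength(string[] allLines)
    {
        // Partitioner.Create rejects an empty range, so handle that case up front.
        if (allLines.Length == 0) return 0;

        long total = 0;
        Parallel.ForEach(Partitioner.Create(0, allLines.Length), range =>
        {
            long local = 0;
            for (int i = range.Item1; i < range.Item2; i++)
            {
                local += allLines[i].Length; // stand-in for DoStuff(allLines[i])
            }
            // Combine each range's partial result exactly once.
            Interlocked.Add(ref total, local);
        });
        return total;
    }

    public static void Main()
    {
        string[] lines = { "one", "two", "three" };
        Console.WriteLine(TotalLength(lines));
    }
}
```

Whether this helps depends on the workload; for expensive per-line processing the plain Parallel.For is usually simpler and just as good.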

  • 2020-11-30 02:39

    Read the file on one thread, adding its lines to a blocking queue. Start N tasks reading from that queue. Set max size of the queue to prevent out of memory errors.
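The idea above can be sketched with a bounded BlockingCollection, whose Add call blocks once the queue is full. This is a minimal illustration, not a tuned implementation: BoundedQueueExample, Run, workerCount, and queueCapacity are hypothetical names, and counting lines stands in for the real per-line work.

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedQueueExample
{
    // Feeds lines from one producer into workerCount consumers through a
    // queue holding at most queueCapacity items; returns how many lines were processed.
    public static int Run(IEnumerable<string> lines, int workerCount, int queueCapacity)
    {
        int processed = 0;
        using (var queue = new BlockingCollection<string>(queueCapacity))
        {
            // Start the consumers first so the producer never waits on a full queue forever.
            Task[] workers = Enumerable.Range(0, workerCount)
                .Select(_ => Task.Factory.StartNew(() =>
                {
                    foreach (string line in queue.GetConsumingEnumerable())
                    {
                        // Placeholder work (assumption): just count the line.
                        Interlocked.Increment(ref processed);
                    }
                }, TaskCreationOptions.LongRunning))
                .ToArray();

            try
            {
                // Producer: Add blocks while the queue is at capacity.
                foreach (string line in lines)
                {
                    queue.Add(line);
                }
            }
            finally
            {
                // Lets the consumers' loops end once the queue drains.
                queue.CompleteAdding();
            }

            Task.WaitAll(workers);
        }
        return processed;
    }

    public static void Main()
    {
        IEnumerable<string> lines = Enumerable.Range(0, 1000).Select(i => "line " + i);
        Console.WriteLine(Run(lines, 4, 100));
    }
}
```

With a real file, File.ReadLines(path) could be passed as the lines argument so that only about queueCapacity lines are held in memory at a time.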

  • 2020-11-30 02:42

    After performing my own benchmarks for loading 61,277,203 lines into memory and shoving values into a Dictionary / ConcurrentDictionary() the results seem to support @dtb's answer above that using the following approach is the fastest:

    Parallel.ForEach(File.ReadLines(catalogPath), line =>
    {
    
    }); 
    

    My tests also showed the following:

    1. File.ReadAllLines() and File.ReadAllLines().AsParallel() appear to run at almost exactly the same speed on a file of this size. Judging by my CPU activity, both seem to use only two of my 8 cores.
    2. Reading all the data first using File.ReadAllLines() appears to be much slower than using File.ReadLines() in a Parallel.ForEach() loop.
    3. I also tried a producer / consumer or MapReduce style pattern where one thread was used to read the data and a second thread was used to process it. This also did not seem to outperform the simple pattern above.

    I have included an example of this pattern for reference, since it is not included on this page:

    var inputLines = new BlockingCollection<string>();
    ConcurrentDictionary<int, int> catalog = new ConcurrentDictionary<int, int>();
    
    var readLines = Task.Factory.StartNew(() =>
    {
        foreach (var line in File.ReadLines(catalogPath))
        {
            inputLines.Add(line);
        }
    
        // Signal the consumers that no more lines will arrive.
        inputLines.CompleteAdding();
    });
    
    var processLines = Task.Factory.StartNew(() =>
    {
        Parallel.ForEach(inputLines.GetConsumingEnumerable(), line =>
        {
            string[] lineFields = line.Split('\t');
            int genomicId = int.Parse(lineFields[3]);
            int taxId = int.Parse(lineFields[0]);
            catalog.TryAdd(genomicId, taxId);   
        });
    });
    
    Task.WaitAll(readLines, processLines);
    

    Here are my benchmarks:

    (benchmark chart; image not available)

    I suspect that under certain processing conditions, the producer / consumer pattern might outperform the simple Parallel.ForEach(File.ReadLines()) pattern. However, it did not in this situation.

  • 2020-11-30 02:49

    Something like:

    using System.Collections.Generic;
    using System.IO;
    using System.Threading.Tasks;
    
    public class ParallelReadExample
    {
        // Lazily yields one line at a time from the reader.
        public static IEnumerable<string> LineGenerator(StreamReader sr)
        {
            string line;
            while ((line = sr.ReadLine()) != null)
            {
                yield return line;
            }
        }
    
        static void Main()
        {
            // The reader is disposed once Parallel.ForEach has finished.
            using (StreamReader sr = new StreamReader("yourfile.txt"))
            {
                Parallel.ForEach(LineGenerator(sr), currentLine =>
                {
                    // Do your thing with currentLine here...
                });
            }
        }
    }
    

    Think it would work. (No C# compiler/IDE here)
