I have large txt file with 100000 lines. I need to start n-count of threads and give every thread unique line from this file.
What is the best way to do this? I thin
If you want to limit the number of threads to n
, the easiest way is to use AsParallel()
along with WithDegreeOfParallelism(n)
to limit the thread count:
string filename = "C:\\TEST\\TEST.DATA";
int n = 5;
foreach (var line in File.ReadLines(filename).AsParallel().WithDegreeOfParallelism(n))
{
// Process line.
}
You can use the File.ReadLines Method to read the file line-by-line without loading the whole file into memory at once, and the Parallel.ForEach Method to process the lines in multiple threads in parallel:
Parallel.ForEach(File.ReadLines("file.txt"), (line, _, lineNumber) =>
{
// your code here
});
As @dtb mentioned above, the fastest way to read a file and then process the individual lines in a file is to: 1) do a File.ReadAllLines() into an array 2) Use a Parallel.For loop to iterate over the array.
You can read more performance benchmarks here.
The basic gist of the code you would have to write is:
string[] AllLines = File.ReadAllLines(fileName);
Parallel.For(0, AllLines.Length, x =>
{
DoStuff(AllLines[x]);
//whatever you need to do
});
With the introduction of bigger array sizes in .Net4, as long as you have plenty of memory, this shouldn't be an issue.
Read the file on one thread, adding its lines to a blocking queue. Start N
tasks reading from that queue. Set max size of the queue to prevent out of memory errors.
After performing my own benchmarks for loading 61,277,203 lines into memory and shoving values into a Dictionary / ConcurrentDictionary() the results seem to support @dtb's answer above that using the following approach is the fastest:
Parallel.ForEach(File.ReadLines(catalogPath), line =>
{
});
My tests also showed the following:
I have included an example of this pattern for reference, since it is not included on this page:
var inputLines = new BlockingCollection<string>();
ConcurrentDictionary<int, int> catalog = new ConcurrentDictionary<int, int>();
var readLines = Task.Factory.StartNew(() =>
{
foreach (var line in File.ReadLines(catalogPath))
inputLines.Add(line);
inputLines.CompleteAdding();
});
var processLines = Task.Factory.StartNew(() =>
{
Parallel.ForEach(inputLines.GetConsumingEnumerable(), line =>
{
string[] lineFields = line.Split('\t');
int genomicId = int.Parse(lineFields[3]);
int taxId = int.Parse(lineFields[0]);
catalog.TryAdd(genomicId, taxId);
});
});
Task.WaitAll(readLines, processLines);
Here are my benchmarks:
I suspect that under certain processing conditions, the producer / consumer pattern might outperform the simple Parallel.ForEach(File.ReadLines()) pattern. However, it did not in this situation.
Something like:
public class ParallelReadExample
{
public static IEnumerable LineGenerator(StreamReader sr)
{
while ((line = sr.ReadLine()) != null)
{
yield return line;
}
}
static void Main()
{
// Display powers of 2 up to the exponent 8:
StreamReader sr = new StreamReader("yourfile.txt")
Parallel.ForEach(LineGenerator(sr), currentLine =>
{
// Do your thing with currentLine here...
} //close lambda expression
);
sr.Close();
}
}
Think it would work. (No C# compiler/IDE here)