I am about to start a project which will take blocks of text, parse a lot of data out of them into some sort of object which can then be serialized, stored, and have statistics computed over it.
How much slower is using regexes compared to substring operations?
If you are looking for an exact string, a substring search will be faster. Regular expressions, however, are highly optimized: they (or at least parts of them) are compiled to IL, and you can even store these compiled versions in a separate assembly using Regex.CompileToAssembly. See http://msdn.microsoft.com/en-us/library/9ek5zak6.aspx for more information.
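As a minimal sketch of the in-process variant (the pattern here is just a placeholder), you can compile a regex to IL with RegexOptions.Compiled and reuse one instance across all calls:

```csharp
using System;
using System.Text.RegularExpressions;

class CompiledRegexDemo
{
    // Compiling the regex to IL costs more at startup but makes repeated
    // matching faster; reuse one static instance rather than re-creating
    // the Regex for every input.
    static readonly Regex DatePattern =
        new Regex(@"\b\d{4}-\d{2}-\d{2}\b", RegexOptions.Compiled);

    static void Main()
    {
        Match m = DatePattern.Match("Logged on 2010-11-27 at 09:15.");
        if (m.Success)
            Console.WriteLine("Found date: " + m.Value);
    }
}
```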
What you really need to do is perform measurements. Using something like Stopwatch is by far the easiest way to verify whether one code construct works faster than the other.
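A minimal benchmarking sketch along those lines; the input and pattern are placeholders, so substitute the candidates you actually want to compare:

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

class ParseBenchmark
{
    static void Main()
    {
        // Placeholder input and pattern: substitute your real parsing code.
        string input = "id=42;name=widget;status=ok";
        Regex pattern = new Regex("status=ok", RegexOptions.Compiled);
        const int iterations = 1000000;

        // Run each candidate many times so per-call noise averages out.
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
            pattern.IsMatch(input);
        sw.Stop();
        Console.WriteLine("Regex.IsMatch:   {0} ms", sw.ElapsedMilliseconds);

        sw.Restart();
        for (int i = 0; i < iterations; i++)
            input.Contains("status=ok");
        sw.Stop();
        Console.WriteLine("String.Contains: {0} ms", sw.ElapsedMilliseconds);
    }
}
```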
What sort of optimizations (if any) can I do to maximize parallelism?
With Task.Factory.StartNew, you can schedule tasks to run on the thread pool. You may also have a look at the TPL (Task Parallel Library, of which Task is a part). This has lots of constructs that help you parallelize work, such as Parallel.ForEach(), which executes an iteration across multiple threads. See http://msdn.microsoft.com/en-us/library/dd460717.aspx for more information.
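A minimal sketch of that approach; LoadDocuments, Parse, and Store are hypothetical placeholders for your own pipeline:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

class ParallelParseDemo
{
    static void Main()
    {
        IEnumerable<string> documents = LoadDocuments();

        // Parallel.ForEach partitions the input across the thread pool;
        // each document is parsed independently, so there is no shared
        // mutable state to synchronize.
        Parallel.ForEach(documents, doc =>
        {
            ParsedResult result = Parse(doc);
            Store(result);
        });
    }

    // Placeholders: replace with your real input source, parser, and store.
    static IEnumerable<string> LoadDocuments() { yield return "example"; }
    static ParsedResult Parse(string doc) { return new ParsedResult(); }
    static void Store(ParsedResult r) { }
}

class ParsedResult { }
```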
Anything else I haven't considered?
One of the things that will hurt you with this volume of data is memory management. A few things to take into account:

- Limit memory allocation: try to re-use the same buffers for a single document instead of copying them when you only need a part. Say you need to work on a range starting at char 1000 and running to char 2000: don't copy that range into a new buffer, but construct your code to work only within that range (see the sketch after this list). This will make your code more complex, but it saves you memory allocations.
- StringBuilder is an important class. If you don't know it yet, have a look.
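A minimal sketch of both points; the window offsets and the "needle" search string are made-up placeholders:

```csharp
using System;
using System.Text;

class BufferReuseDemo
{
    static void Main()
    {
        string document = new string('x', 5000); // stand-in for a real document

        // Work within a window of the original string instead of calling
        // Substring, which would copy the range into a new string.
        int start = 1000, length = 1000;
        int hit = document.IndexOf("needle", start, length, StringComparison.Ordinal);
        Console.WriteLine(hit >= 0 ? "found at " + hit : "not found in range");

        // Reuse one StringBuilder across records instead of concatenating
        // strings in a loop; Clear() keeps the underlying buffer allocated.
        var sb = new StringBuilder(4096);
        for (int i = 0; i < 3; i++)
        {
            sb.Clear();
            sb.Append("record ").Append(i);
            Console.WriteLine(sb.ToString());
        }
    }
}
```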
Google recently announced its internal text-processing language, Sawzall (which looks like a Python/Perl subset made for heavily parallel processing): http://code.google.com/p/szl/
I don't know what kind of processing you're doing here, but if you're talking hundreds of thousands of strings per day, that's a pretty small number. Let's assume that you get 1 million new strings to process every day and that you can fully task 10 of those 12 Xeon cores. That's 100,000 strings per core per day. There are 86,400 seconds in a day, so we're talking 0.864 seconds per string. That's a lot of time for parsing.
I'll echo the recommendations made by @Pieter, especially where he suggests making measurements to see how long it takes to do your processing. Your best bet is to get something up and working, then figure out how to make it faster if you need to. I think you'll be surprised at how often you don't need to do any optimization. (I know that's heresy to the optimization wizards, but processor time is cheap and programmer time is expensive.)
How much slower is using regexes compared to substring operations?
That depends entirely on how complex your regexes are. As @Pieter said, if you're looking for a single string, String.Contains will probably be faster. You might also consider String.IndexOfAny if you're looking for any of a set of characters (note that it takes a char[], not strings). Regular expressions aren't necessary unless you're looking for patterns that can't be represented as constant strings.
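To illustrate when each tool fits (the input line is made up):

```csharp
using System;
using System.Text.RegularExpressions;

class SearchChoices
{
    static void Main()
    {
        string line = "WARN 2010-11-27 disk quota at 91%";

        // Exact constant substring: String.Contains is the cheap option.
        bool hasWarn = line.Contains("WARN");

        // Any of a fixed set of *characters*: String.IndexOfAny takes a
        // char[], so it only helps for single-character alternatives.
        int firstDigit = line.IndexOfAny(
            new[] { '0', '1', '2', '3', '4', '5', '6', '7', '8', '9' });

        // A real pattern (here: a percentage) is where a regex earns its cost.
        Match m = Regex.Match(line, @"\d{1,3}%");

        Console.WriteLine("{0} / {1} / {2}", hasWarn, firstDigit, m.Value);
    }
}
```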
Is .NET going to be significantly slower than other languages?
In processor-intensive applications, .NET can be slower than native apps. Sometimes. If so, it's typically in the range of 5 to 20 percent, and most often between 7 and 12 percent. That's just the code executing in isolation. You have to take into account other factors like how long it takes you to build the program in that other language and how difficult it is to share data between the native app and the rest of your system.
If you want to do fast string parsing in C#, you might want to have a look at the new NLib project. It contains string extensions that facilitate searching strings rapidly in various ways, such as IndexOfAny(string[]) and IndexOfNotAny, including overloads that take a StringComparison argument.
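I haven't verified NLib's exact signatures, so here is a self-contained stand-in for what such an extension might look like, based only on the member names mentioned above; NLib's real implementation is likely more optimized:

```csharp
using System;

// Rough stand-in for the kind of extension the answer describes; check
// NLib's own documentation for the actual API.
static class StringSearchExtensions
{
    // Returns the index of the first occurrence of any of the given
    // strings, or -1 if none is found.
    public static int IndexOfAny(this string s, string[] values,
                                 StringComparison comparison)
    {
        int best = -1;
        foreach (string v in values)
        {
            int i = s.IndexOf(v, comparison);
            if (i >= 0 && (best < 0 || i < best))
                best = i;
        }
        return best;
    }
}

class Demo
{
    static void Main()
    {
        string line = "warn: disk quota exceeded";
        int i = line.IndexOfAny(new[] { "ERROR", "WARN" },
                                StringComparison.OrdinalIgnoreCase);
        Console.WriteLine(i); // prints 0
    }
}
```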