I need to split a string at all whitespace; the result should contain ONLY the words themselves.
How can I do this in VB.NET?
Tabs, newlines, etc. must all be treated as delimiters.
So, after seeing Adam Ralph's post, I suspected his solution would be faster than the Regex solution. I thought I'd share the results of my testing, since I did find it was faster.
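For reference, here is a minimal sketch of the two approaches being compared, built from the same calls the test code uses (Regex.Split with \s+ and Split().Where()). The class name, method names, and sample input are hypothetical and only for illustration. ToArray() is added here so that the deferred Where() filter actually runs.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class WhitespaceSplitDemo
{
    // Regex approach: split on runs of one or more whitespace characters.
    // Leading or trailing whitespace in the input yields empty strings
    // at the ends of the result.
    public static string[] SplitWithRegex(string s)
    {
        return Regex.Split(s, @"\s+");
    }

    // Split().Where() approach: Split() with no arguments splits on every
    // single whitespace character, so a run of whitespace produces empty
    // entries, which Where() then filters out. ToArray() forces the
    // deferred LINQ query to execute.
    public static string[] SplitWithWhere(string s)
    {
        return s.Split().Where(x => x != string.Empty).ToArray();
    }

    public static void Main()
    {
        string input = "one \t two\r\nthree"; // hypothetical sample input

        Console.WriteLine(string.Join("|", SplitWithRegex(input))); // one|two|three
        Console.WriteLine(string.Join("|", SplitWithWhere(input))); // one|two|three
    }
}
```

Note that the two are not identical on every input: for a string that starts or ends with whitespace, the Regex version keeps an empty string at that end, while the Split().Where() version drops it.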
There are really two factors at play (ignoring system variables): the number of sub-strings extracted (determined by the number of delimiters) and the total string length. The very simple scenario plotted below uses "A" as the sub-string, delimited by two whitespace characters (a space followed by a tab). This accentuates the effect of the number of sub-strings extracted. I also did some multi-variable testing to arrive at the following general equations for my system.
Regex()
t = (28.33*SSL + 572)(SSN/10^6)
Split().Where()
t = (6.23*SSL + 250)(SSN/10^6)
Where t is execution time in milliseconds, SSL is the average sub-string length, and SSN is the number of sub-strings delimited in the string.
These equations can also be written as
t = (28.33*SL + 572*SSN)/10^6
and
t = (6.23*SL + 250*SSN)/10^6
where SL is total string length (SL = SSL * SSN)
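As a quick check that the two forms agree, substitute SL = SSL * SSN into the first form. The numeric case at the end (SSL = 1, SSN = 10^6) is a hypothetical plug-in chosen only to illustrate the arithmetic; it is not a measured result.

```latex
\begin{align*}
t_{\text{Regex}} &= (28.33\,\mathrm{SSL} + 572)\,\frac{\mathrm{SSN}}{10^{6}}
  = \frac{28.33\,\mathrm{SSL}\cdot\mathrm{SSN} + 572\,\mathrm{SSN}}{10^{6}}
  = \frac{28.33\,\mathrm{SL} + 572\,\mathrm{SSN}}{10^{6}} \\[4pt]
t_{\text{Split}} &= (6.23\,\mathrm{SSL} + 250)\,\frac{\mathrm{SSN}}{10^{6}}
  = \frac{6.23\,\mathrm{SL} + 250\,\mathrm{SSN}}{10^{6}} \\[8pt]
\text{For } \mathrm{SSL} = 1,\ \mathrm{SSN} = 10^{6}: \quad
t_{\text{Regex}} &= 28.33 + 572 = 600.33\ \text{ms}, \qquad
t_{\text{Split}} = 6.23 + 250 = 256.23\ \text{ms}
\end{align*}
```

In that case the model predicts Regex() to take about 2.3 times as long as Split().Where(), because with one-character sub-strings the per-sub-string term dominates.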
Conclusion: The Split().Where() solution is faster than Regex(). The major factor is the number of sub-strings, while string length plays a minor role. Comparing coefficients, Split().Where() is about 4.5x faster on the string-length term (28.33 vs. 6.23) and about 2.3x faster on the sub-string-count term (572 vs. 250).
Here's my testing code (probably way more material than necessary, but it's set up for getting the multi-variable data I talked about):
using System;
using System.Linq;
using System.Diagnostics;
using System.Text.RegularExpressions;
using System.Windows.Forms;

namespace ConsoleApplication1
{
    class Program
    {
        public enum TestMethods { regex, split };

        [STAThread]
        static void Main(string[] args)
        {
            //Compare TestMethod execution times and output result information
            //to the console at runtime and to the clipboard at program finish (so that data is ready to paste into analysis environment)
            #region Config_Variables
            //Choose test method from TestMethods enumerator (regex or split)
            TestMethods TestMethod = TestMethods.split;
            //Configure RepetitionString
            String RepetitionString = string.Join(" \t", Enumerable.Repeat("A", 100));
            //Configure initial and maximum count of string repetitions (final count may not equal max)
            int RepCountInitial = 100;
            int RepCountMax = 1000 * 100;
            //Step increment to next RepCount (calculated as 20% increase from current value)
            Func<int, int> Step = x => (int)Math.Round(x / 5.0, 0);
            //Execution count used to determine average speed (calculated to adjust down to 1 execution at long execution times)
            Func<double, int> ExecutionCount = x => (int)(1 + Math.Round(500.0 / (x + 1), 0));
            #endregion

            #region NonConfig_Variables
            string s;
            string Results = "";
            string ResultInfo;
            double ResultTime = 1;
            #endregion

            for (int RepCount = RepCountInitial; RepCount < RepCountMax; RepCount += Step(RepCount))
            {
                s = string.Join("", Enumerable.Repeat(RepetitionString, RepCount));
                //Store the execution count so the logged value is the one actually used for this run
                int Executions = ExecutionCount(ResultTime);
                ResultTime = Test(s, Executions, TestMethod);
                ResultInfo = ResultTime.ToString() + "\t" + RepCount.ToString() + "\t" + Executions.ToString() + "\t" + TestMethod.ToString();
                Console.WriteLine(ResultInfo);
                Results += ResultInfo + "\r\n";
            }
            Clipboard.SetText(Results);
        }

        //Run the selected method iMax times and return the average time per execution in milliseconds
        public static double Test(string s, int iMax, TestMethods Method)
        {
            switch (Method)
            {
                case TestMethods.regex:
                    return Math.Round(RegexRunTime(s, iMax), 2);
                case TestMethods.split:
                    return Math.Round(SplitRunTime(s, iMax), 2);
                default:
                    return -1;
            }
        }

        private static double RegexRunTime(string s, int iMax)
        {
            Stopwatch sw = new Stopwatch();
            sw.Restart();
            for (int i = 0; i < iMax; i++)
            {
                System.Collections.Generic.IEnumerable<string> ens = Regex.Split(s, @"\s+");
            }
            sw.Stop();
            return Math.Round(sw.ElapsedMilliseconds / (double)iMax, 2);
        }

        private static double SplitRunTime(string s, int iMax)
        {
            Stopwatch sw = new Stopwatch();
            sw.Restart();
            for (int i = 0; i < iMax; i++)
            {
                //Note: Where() is deferred, so the filter only runs if ens is enumerated (e.g. via .ToArray());
                //as written, this loop times Split() plus building the query
                System.Collections.Generic.IEnumerable<string> ens = s.Split().Where(x => x != string.Empty);
            }
            sw.Stop();
            return Math.Round(sw.ElapsedMilliseconds / (double)iMax, 2);
        }
    }
}