C# 3.0 (with LINQ)
Here's my solution. It makes use of some pretty nice features of LINQ/extension methods to keep the code short.
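    // Requires: using System.Collections.Generic; using System.Linq;
    //           using System.Text.RegularExpressions;
    public static Dictionary<string, int> GetKeywords(string text, int minCount, int minLength)
    {
        var commonWords = new string[] { "and", "is", "the", "as", "of", "to", "or", "in",
            "for", "by", "an", "be", "may", "has", "can", "its"};
        // Lower-case the text, strip punctuation, then split on spaces.
        var words = Regex.Replace(text.ToLower(), @"[,.?\/;:\(\)]", string.Empty).Split(' ');
        // For each distinct non-common word, count its occurrences (rescans the whole array per word).
        var occurrences = words.Distinct().Except(commonWords).Select(w =>
            new { Word = w, Count = words.Count(s => s == w) });
        // Keep only words that are frequent enough and long enough.
        return occurrences.Where(wo => wo.Count >= minCount && wo.Word.Length >= minLength)
            .ToDictionary(wo => wo.Word, wo => wo.Count);
    }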
This is, however, far from the most efficient method: it is O(n^2) in the number of words, rather than O(n), which I believe is optimal in this case. I'll see if I can create a slightly longer method that is more efficient.
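As a rough sketch of what the O(n) version mentioned above might look like: count each word in a single pass with a dictionary instead of rescanning the word list for every distinct word. The names `KeywordCounter` and `GetKeywordsLinear` are made up for illustration, and the "O(n)" assumes hash lookups take constant time on average. It sticks to C# 3.0 features and keeps the same punctuation stripping and common-word list as the original.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class KeywordCounter // hypothetical container class
{
    // Same common words as the original method, stored in a set for O(1) average lookups.
    private static readonly HashSet<string> CommonWords = new HashSet<string>
    {
        "and", "is", "the", "as", "of", "to", "or", "in",
        "for", "by", "an", "be", "may", "has", "can", "its"
    };

    // Hypothetical name; intended to return the same counts as GetKeywords.
    public static Dictionary<string, int> GetKeywordsLinear(string text, int minCount, int minLength)
    {
        // Same normalisation as the original: lower-case, strip punctuation, split on spaces.
        var words = Regex.Replace(text.ToLower(), @"[,.?\/;:\(\)]", string.Empty).Split(' ');

        // Single pass: one dictionary lookup and one update per word.
        var counts = new Dictionary<string, int>();
        foreach (var word in words)
        {
            if (CommonWords.Contains(word))
                continue;

            int count;
            counts.TryGetValue(word, out count); // count is left at 0 for a new word
            counts[word] = count + 1;
        }

        // Apply the same thresholds as the original method.
        return counts.Where(kv => kv.Value >= minCount && kv.Key.Length >= minLength)
                     .ToDictionary(kv => kv.Key, kv => kv.Value);
    }
}
```

Swapping `GetKeywords` for `KeywordCounter.GetKeywordsLinear` in a caller should give the same word counts, though the order of the entries is not guaranteed to match. The results and test program that follow are for the original `GetKeywords`.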
Here are the results of GetKeywords run on the sample text (min occurrences: 3, min length: 2).
      3 x such
      4 x code
      4 x which
      4 x declarations
      5 x function
      4 x statements
      3 x new
      3 x types
      3 x keywords
      7 x statement
      3 x language
      3 x expression
      3 x execution
      3 x programming
      4 x operators
      3 x variables
And my test program:
    // Requires: using System; using System.Net;
    static void Main(string[] args)
    {
        string sampleText;
        // Download the sample text, disposing the client afterwards.
        using (var client = new WebClient())
            sampleText = client.DownloadString("http://sampsonresume.com/labs/c.txt");
        var keywords = GetKeywords(sampleText, 3, 2);
        foreach (var entry in keywords)
            Console.WriteLine("{0} x {1}", entry.Value.ToString().PadLeft(3), entry.Key);
        Console.ReadKey(true);
    }