To achieve this, take the initial text and split it on spaces into a string array using String.Split(' ').
Next, you need to iterate over every string in the array.
This is easy for single words, but more complex for groups.
For this reason, you need to define a group size, and you must control how many places in the array the pointer advances on each iteration.
Once you are able to iterate the array, you need to grab the group of words (however long you have chosen it to be), and store it somewhere.
Your dictionary in the example is a good approach.
If the dictionary contains the word group, you increment its value by one.
If it doesn't contain the group, just add it with a default value of 1.
    if (wordList.ContainsKey(theKey)) {
        wordList[theKey]++;
    } else {
        wordList.Add(theKey, 1);
    }
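The same count-or-add step can also be written with a single dictionary lookup using TryGetValue. This is only a minimal sketch; the names WordCounter and Increment are made up for illustration and are not part of the code in the question.

```csharp
using System;
using System.Collections.Generic;

public static class WordCounter
{
    // Hypothetical helper: bumps the count for key, starting at 1 when the key is new.
    public static void Increment(Dictionary<string, int> counts, string key)
    {
        // TryGetValue leaves current at 0 (the default int) when the key is missing.
        counts.TryGetValue(key, out int current);
        counts[key] = current + 1;
    }
}
```

Calling WordCounter.Increment(wordList, theKey) would then replace the if/else above.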
You rightly mention that your research showed regex does not perform well. For this task, regex is completely the wrong tool: you're not looking for patterns, you're examining groups.
For that, you have to go through the text from start to finish, checking values.
Any task that involves iterating through text and running a repeating function on it should never use regex.
EDIT: My initial assumption about the performance of Regex was not correct. In C# it seems to be a great deal faster than in Java, yet I would still maintain that a pure regex approach is not as quick as using regex to tokenise the text and then using either a loop or a LINQ expression to find the groups.
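The tokenise-then-loop idea could look roughly like the sketch below. It is an illustration under assumptions, not a measured or definitive implementation: the class name RegexTokenGroups, the method name Count, and the pattern [\w']+ (treating a word as a run of word characters or apostrophes) are all hypothetical choices.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class RegexTokenGroups
{
    // Hypothetical helper: counts overlapping groups of groupSize words.
    public static Dictionary<string, int> Count(string text, int groupSize)
    {
        if (groupSize < 1) throw new ArgumentOutOfRangeException(nameof(groupSize));

        // Assumption: a "word" is a run of word characters or apostrophes.
        var words = new List<string>();
        foreach (Match m in Regex.Matches(text.ToLowerInvariant(), @"[\w']+"))
        {
            words.Add(m.Value);
        }

        var counts = new Dictionary<string, int>();
        // Slide a window of groupSize words across the tokens, one word at a time.
        for (int i = 0; i + groupSize <= words.Count; i++)
        {
            string key = string.Join(" ", words.GetRange(i, groupSize));
            counts.TryGetValue(key, out int current);
            counts[key] = current + 1;
        }
        return counts;
    }
}
```

Here regex is used only to find the words; the grouping itself is a plain loop.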
You stated in a comment:
"@galakt As I mentioned above, let's say 3. Does it matter?"
Yes, it does matter.
The idea of a word group is entirely abstract. Yes, it's a group of words, but the whole block of text is a group of words.
Rules have to be applied to govern how you act on that group of words.
Below is a sample method which will return a dictionary of all the word groups based on a size passed via the method call.
It does not strip any non-alphanumeric chars from the text, but it is fast, even with larger group sizes.
To call it, use:

    Dictionary<String, int> SingleWordGroups = GetWordGroupInstances(1);
    // Requires: using System; using System.Collections.Generic; using System.Text;
    private Dictionary<String, int> GetWordGroupInstances(int GroupSize) {
        Dictionary<String, int> WordGroupInstances = new Dictionary<string, int>();
        //Grab the string to work from...
        String[] sourceText = GetSourceText().Split(' ');
        int pointer = 0;
        StringBuilder groupBuilder = new StringBuilder();
        //Use <= so that the final complete group in the text is not skipped.
        while (pointer <= sourceText.Length - GroupSize) {
            groupBuilder.Clear();
            int offset = pointer + GroupSize;
            for (int i = pointer; i < offset; i++) {
                //prepend a space to allow separation between words in groups.
                //We can make a substring from this later starting from char 1
                //to lose the initial whitespace from the string.
                groupBuilder.Append(" ").Append(sourceText[i]);
            }
            String key = groupBuilder.ToString().Substring(1);
            if (!WordGroupInstances.ContainsKey(key)) {
                WordGroupInstances.Add(key, 1);
            } else {
                WordGroupInstances[key]++;
            }
            //Setting the pointer to increase by group size grabs a group, moves on
            //to the end of the group, and grabs the next.
            pointer += GroupSize;
            //Setting the pointer to increment by 1 grabs a group, advances by 1 word, then
            //grabs the next, so from the phrase - "Hello world, I'm some text", the groups of size 2 would be
            //"Hello world,", "world, I'm", "I'm some" etc...
            //pointer++;
        }
        return WordGroupInstances;
    }
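To make the difference between the two pointer increments concrete, here is a small sketch that takes the step as a parameter. The class WordGroups, the method Groups and the step parameter are hypothetical and only illustrate the idea; they are not part of the method above.

```csharp
using System;
using System.Collections.Generic;

public static class WordGroups
{
    // Hypothetical variant: step == groupSize gives back-to-back groups,
    // step == 1 gives overlapping groups.
    public static List<string> Groups(string text, int groupSize, int step)
    {
        if (groupSize < 1 || step < 1) throw new ArgumentOutOfRangeException();

        string[] words = text.Split(' ');
        var groups = new List<string>();
        // Stop as soon as a full group no longer fits in the remaining words.
        for (int i = 0; i + groupSize <= words.Length; i += step)
        {
            groups.Add(string.Join(" ", words, i, groupSize));
        }
        return groups;
    }
}
```

For "Hello world, I'm some text" with a group size of 2, a step of 1 gives "Hello world,", "world, I'm", "I'm some" and "some text", while a step of 2 gives only "Hello world," and "I'm some".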
The GetAllGroups method further down is modified to produce every group of every size in turn, for example:
The
The Green
Green Algae
The Green Algae
etc...
It's worth noting that the entire text must be converted to either lower or upper case so that words are not case dependent.
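One way to do that normalisation before splitting is sketched here. The names TextNormaliser and ToLowerWords are hypothetical, and treating tabs and newlines as separators in addition to spaces is an assumption that goes beyond the Split(' ') used in the methods in this answer.

```csharp
using System;

public static class TextNormaliser
{
    // Hypothetical helper: lower-cases the text and splits it into words,
    // dropping the empty entries that repeated whitespace would otherwise create.
    public static string[] ToLowerWords(string text)
    {
        char[] separators = { ' ', '\t', '\r', '\n' };
        return text.ToLowerInvariant()
                   .Split(separators, StringSplitOptions.RemoveEmptyEntries);
    }
}
```

With this, "The Green" and "the green" end up under the same dictionary key.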
I've refined the method a little to improve the performance (and remove the need for the break instruction).
    private Dictionary<String, int> GetAllGroups() {
        Dictionary<string, int> WordGroupInstances = new Dictionary<string, int>();
        StringBuilder groupBuilder = new StringBuilder();
        String[] sourceText = GetSourceText().Split(' ');
        for (int i = 0; i < sourceText.Length; i++) {
            groupBuilder.Clear();
            for (int j = i; j < sourceText.Length; j++) {
                groupBuilder.Append(" ").Append(sourceText[j]);
                String key = groupBuilder.ToString().Substring(1);
                if (!WordGroupInstances.ContainsKey(key)) {
                    WordGroupInstances.Add(key, 1);
                } else {
                    WordGroupInstances[key]++;
                }
            }
        }
        return WordGroupInstances;
    }
In performance testing with the corpus of text (288 words), it created the 41773 word groups in 0.171886 seconds.
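If you want to take a similar measurement yourself, System.Diagnostics.Stopwatch is one option. The helper below is a hypothetical sketch (the names Timing and SecondsFor are made up) and is not necessarily how the figure above was obtained.

```csharp
using System;
using System.Diagnostics;

public static class Timing
{
    // Hypothetical helper: runs work once and returns the elapsed wall-clock seconds.
    public static double SecondsFor(Action work)
    {
        Stopwatch watch = Stopwatch.StartNew();
        work();
        watch.Stop();
        return watch.Elapsed.TotalSeconds;
    }
}
```

For example, double seconds = Timing.SecondsFor(() => GetAllGroups()); times a single run. A single run includes JIT warm-up, so repeating the call and averaging gives a steadier number.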