How to find recurring word groups in text with C#? [closed]

前端未结

关注

 3  969

清歌不尽

相关标签:

3条回答

夕颜

2021-01-06 01:02

Here is a streaming approach that recursively builds up groups of size N (3 in this example) from an enumerable of words. It doesn't matter how you tokenize your input into words (I've used a simple regex in this example).

//tokenize input (enumerable of string)
var words = Regex.Matches(input, @"\w+").Cast<Match>().Select(m => m.Value);

//get word groups (enumerable of string[])
var groups = GetWordGroups(words, 3);

//do what you want with your groups; suppose you want to count them
var counts = new Dictionary<string, int>(StringComparer.CurrentCultureIgnoreCase);
foreach (var group in groups.Select(g => string.Join(" ", g)))
{
    int count;
    counts.TryGetValue(group, out count);
    counts[group] = ++count;
}


IEnumerable<string[]> GetWordGroups(IEnumerable<string> words, int size)
{
    if (size <= 0) throw new ArgumentOutOfRangeException();
    if (size == 1)
    {
        foreach (var word in words)
        {
            yield return new string[] { word };
        }

        yield break;
    }

    var prev = new string[0];

    foreach (var next in GetWordGroups(words, size - 1))
    {
        yield return next;

        //stream of groups includes all groups up to size - 1, but we only combine groups of size - 1
        if (next.Length == size - 1)
        {
            if (prev.Length == size - 1)
            {
                var group = new string[size];
                Array.Copy(prev, 0, group, 0, prev.Length);
                group[group.Length - 1] = next[next.Length - 1];
                yield return group;
            }

            prev = next;
        }
    }
}

One advantage with a streaming approach such as this is you minimize the number of strings that must be held in memory at any time (this reduces memory use for large bodies of text). Depending on how you receive your input, another optimization may be to use a TextReader to produce an enumeration of tokens as you read the input.

An example of the intermediate grouping output follows (each item is actually array of tokens, joined with a white space for output here):

The 
green 
The green 
algae 
green algae 
The green algae 
singular 
algae singular 
green algae singular 
green 
singular green 
algae singular green 
alga 
green alga 
singular green alga

0 讨论(0)

孤城傲影

2021-01-06 01:04
I think that this works fairly well.
```
var text = @"The green algae (singular: green alga) are ..."; // include all your text

var remove = "().,:[]0123456789".Select(x => x.ToString()).ToArray();

var words =
    Regex
        .Matches(text, @"(\S+)")
        .Cast<Match>()
        .SelectMany(x => x.Captures.Cast<Capture>())
        .Select(x => remove.Aggregate(x.Value, (t, r) => t.Replace(r, "")))
        .Select(x => x.Trim().ToLowerInvariant())
        .Where(x => !String.IsNullOrWhiteSpace(x))
        .ToArray();

var groups =
    from n1 in Enumerable.Range(0, words.Length)
    from n2 in Enumerable.Range(1, words.Length - n1)
    select String.Join(" ", words.Skip(n1).Take(n2));

var frequencies =
    groups
        .GroupBy(x => x)
        .Select(x => new { wordgroup = x.Key, count = x.Count() })
        .OrderByDescending(x => x.count)
        .ThenBy(x => x.wordgroup.Count(y => y == ' '))
        .ThenBy(x => x.wordgroup)
        .ToArray();
```
This gives me the frequency of every single word grouping of contiguous sequences of words including up to a single word group of all the words.

The number of words is 288. The total number of word groups is 288 x (288 + 1) / 2 = 41,616. The final number of word groups (after grouping duplicate word groups and removing empty/whitespace strings) is 41,449.

Here are the first 100 of these 41,449:

20 x "the", 13 x "and", 12 x "algae", 12 x "in", 11 x "green", 10 x "of", 9 x "green algae", 8 x "are", 6 x "as", 6 x "species", 5 x "a", 4 x "is", 4 x "or", 4 x "to", 3 x "embryophytes", 3 x "form", 3 x "found", 3 x "lichens", 3 x "live", 3 x "on", 3 x "plants", 3 x "that", 3 x "algae and", 3 x "and in", 3 x "as the", 3 x "in the", 3 x "of the", 2 x "alga", 2 x "can", 2 x "clade", 2 x "class", 2 x "colonial", 2 x "filamentous", 2 x "from", 2 x "higher", 2 x "macroscopic", 2 x "most", 2 x "other", 2 x "seaweeds", 2 x "their", 2 x "trentepohlia", 2 x "while", 2 x "with", 2 x "algae are", 2 x "are a", 2 x "green alga", 2 x "higher plants", 2 x "in lichens", 2 x "of green", 2 x "species of", 2 x "the clade", 2 x "the green", 2 x "green algae and", 2 x "green algae are", 2 x "of green algae", 2 x "species of green", 2 x "the green algae", 2 x "species of green algae", 1 x "about", 1 x "acquired", 1 x "algal", 1 x "also", 1 x "associations", 1 x "bark", 1 x "be", 1 x "both", 1 x "cannot", 1 x "cell", 1 x "cells", 1 x "cellular", 1 x "charales", 1 x "charophyte", 1 x "charophytes", 1 x "chlorarachniophytes", 1 x "chlorophyte", 1 x "chloroplasts", 1 x "ciliate", 1 x "closest", 1 x "coccoid", 1 x "coenobia", 1 x "colonies", 1 x "conduct", 1 x "consisting", 1 x "differentiated", 1 x "differentiation", 1 x "divisions", 1 x "emerged", 1 x "euglenids", 1 x "excluded", 1 x "family", 1 x "few", 1 x "filaments", 1 x "flagella", 1 x "flagellates", 1 x "flatworms", 1 x "for", 1 x "forms", 1 x "full", 1 x "fungal", 1 x "fungi"
0 讨论(0)
发布评论:

提交评论
- 加载中...

面向向阳花

2021-01-06 01:07

The way to achieve this is to take the initial text, and split it by whitespace into a string array using string.split(' ');

Next, you need to iterate over every string in the array. This is easy for single words, but more complex for groups. For this reason, you need to define a group size. You must control the number of places in the array the pointer advances for each iteration.

Once you are able to iterate the array, you need to grab the group of words (however long you have chosen it to be), and store it somewhere. Your dictionary in the example is a good approach.

If the dictionary contains the word group, you increment its value by one. If it doesn't contain the group, just add it with a default value of 1.

 if (wordList.ContainsKey(theKey)) {
   wordList[theKey]++;
 } else {
   wordList.Add(theKey, 1);
 }

You do rightly mention that your research showed that regex is not high performance. For this task, regex is completely the wrong tool - you're not looking for patterns, you're examining groups. For that, you have to go through the text from start to finish, checking values.

Any task that involves iterating through text and running a repeating function on it should never use regex.

EDIT: My initial assumption of the performance of Regex was not correct - in C#, it seems to be a great deal faster than in Java, yet I would still maintain that a pure regex approach is not as quick as using regex to tokenise the text then use either a loop or linq expression to find the groups.

Stating

@galakt As I mentioned above, let's say 3. Does it matter?

The idea of a word group is entirely abstract. Yes, it's a group of words, but the whole block of text is a group of words. Rules have to be applied to govern how you act on that group of words.

Below is a sample method which will return a dictionary of all the word groups based on a size passed via the method call. It does not strip any non-alphanumeric chars from the text, but it is fast, even with larger group sizes.

To call it, use Dictionary<String, int> SingleWordGroups = GetWordGroupInstances(1);

    private Dictionary<String, int> GetWordGroupInstances(int GroupSize) {

        Dictionary<String, int> WordGroupInstances = new Dictionary<string, int>();

        //Grab the string to work from...
        String[] sourceText = GetSourceText().Split(' ');
        int pointer = 0;
        StringBuilder groupBuilder = new StringBuilder();
        while (pointer < sourceText.Length - GroupSize) {
            groupBuilder.Clear();
            int offset = pointer + GroupSize;
            for (int i = pointer; i < offset; i++) {
                //prepend a space to allow separation between words in groups. 
                //We can make a substring from this later starting from char 1 
                //to lose the initial whitespace from the string.
                groupBuilder.Append(" ").Append(sourceText[i]);
            }

            String key = groupBuilder.ToString().Substring(1);
            if (!WordGroupInstances.ContainsKey(key)) {
                WordGroupInstances.Add(key, 1);
            } else {
                WordGroupInstances[key]++;
            }

            /**
             * Setting the pointer to increase by group size grabs a group, moves on
             * to the end of the group, and grabs the next.
             * 
             */
            pointer += GroupSize;

            /**
             * Setting the point to increment by 1 grabs a group, advances by 1 word, then
             * grabs the next, so from the phrase - "Hello world, I'm some text", the groups of size 2 would be
             * "Hello world,", "world, I'm", "I'm some" etc...
             */
            //pointer++;
        }

        return WordGroupInstances;

    }

The method below is modified to produce all the group output in turn, so The The Green Green Algae The Green Algae etc...

It's worth noting that the entire text must be converted to either lower or upper case so that words are not case dependent. I've refined this a little to improve the performance (and remove the need for the break instruction).

   private Dictionary<String, int> GetAllGroups() {
        Dictionary<string, int> WordGroupInstances = new Dictionary<string, int>();
        StringBuilder groupBuilder = new StringBuilder();
        String[] sourceText = GetSourceText().Split(' ');

        for (int i = 0; i < sourceText.Length; i++) {
            groupBuilder.Clear();
            for (int j = i; j < sourceText.Length; j++) {
                groupBuilder.Append(" ").Append(sourceText[j]);
                String key = groupBuilder.ToString().Substring(1);
                if (!WordGroupInstances.ContainsKey(key)) {
                    WordGroupInstances.Add(key, 1);
                } else {
                    WordGroupInstances[key]++;
                }
            }
        }
        return WordGroupInstances;
    }

After performance testing with the corpus of text (288 words), it will create the 41773 word groups in 0.171886 seconds.

0 讨论(0)

热议问题