C# - Split Fully Uppercase String Into Separate Words (No Spaces)

Question

I'm currently working on a project where I need to separate individual words from a string. The catch is that all the words in the string are fully uppercase and there are no spaces between them. The following is an example of the kind of input the program receives:

"COMPUTERFIVECODECOLOR"

This should be split into the following result:

"COMPUTER" "FIVE" "CODE" "COLOR"

So far, I have been using the following method to split my strings (and it has worked for all scenarios except this edge case):

private static List<string> NormalizeSections(List<string> wordList)
{
    var modifiedList = new List<string>();
    foreach (var word in wordList)
    {
        // Split on each capitalized word (an uppercase letter followed by lowercase letters);
        // the capture group keeps the matched words in the result.
        var split = Regex.Split(word, @"(\p{Lu}\p{Ll}+)").ToList();
        split.RemoveAll(i => i == "");

        modifiedList.AddRange(split);
    }
    return modifiedList;
}
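
For reference, a small sketch of why this is an edge case: the pattern only matches an uppercase letter followed by at least one lowercase letter, so a string with no lowercase letters produces no matches and comes back as a single piece. The class name `RegexSplitDemo` and the method `SplitPascalCase` are made up for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class RegexSplitDemo
{
    // Same pattern as in the method above: one uppercase letter followed by
    // one or more lowercase letters. Empty pieces are dropped.
    public static List<string> SplitPascalCase(string word)
    {
        return Regex.Split(word, @"(\p{Lu}\p{Ll}+)")
                    .Where(piece => piece != "")
                    .ToList();
    }

    public static void Main()
    {
        // Mixed case: each capitalized word is captured separately.
        Console.WriteLine(string.Join(" | ", SplitPascalCase("ComputerFiveCode")));

        // All caps: there are no lowercase letters, so the pattern never
        // matches and the input is returned as one piece.
        Console.WriteLine(string.Join(" | ", SplitPascalCase("COMPUTERFIVECODECOLOR")));
    }
}
```

Since letter case carries no boundary information in the all-caps input, some other source of word boundaries (such as a list of known words) is needed.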

If anyone has any ideas on how to handle this, I would be more than happy to hear them. Also, please let me know if I can provide additional information.

Answer 1:

I am making some assumptions about how you want to search for matching words. First, at a given character index, preference is given to the longest matching word in the dictionary. Second, if no word is found at a given character index, we move on to the next character and search again.

The implementation below uses a Trie to index the dictionary of all valid words. Rather than looping through each word in the dictionary, we then progress through each character in the input string, looking for the longest word.

I lifted the implementation of the Trie in C# from this very handy SO answer: https://stackoverflow.com/a/6073004

Edit: fixed a bug in the Trie when adding a word that is a prefix of an existing word, such as Emergency then Emerge.

The code is available on DotNetFiddle.

using System;
using System.Collections.Generic;

public class Program
{
    public static void Main()
    {

        var words = new[] { "COMPUTE", "FIVE", "CODE", "COLOR", "PUT", "EMERGENCY", "MERGE", "EMERGE" };

        var trie = new Trie(words);

        var input = "COMPUTEEMERGEFIVECODECOLOR";

        for (var charIndex = 0; charIndex < input.Length; charIndex++)
        {
            var longestWord = FindLongestWord(trie.Root, input, charIndex);

            if (longestWord == null)
            {
                Console.WriteLine("No word found at char index {0}", charIndex);
            }
            else
            {
                Console.WriteLine("Found {0} at char index {1}", longestWord, charIndex);

                charIndex += longestWord.Length - 1;
            }
        }

    }

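    // Returns the longest word reachable from node that matches the input
    // from charIndex onward, or null if there is none.
    private static string FindLongestWord(Trie.Node node, string input, int charIndex)
    {
        // Guard against reading past the end of the input while following a longer candidate.
        if (charIndex >= input.Length) return null;

        var character = char.ToUpper(input[charIndex]);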

        string longestWord = null;

        foreach (var edge in node.Edges)
        {
            if (edge.Key.ToChar() == character)
            {
                var foundWord = edge.Value.Word;

                if (!edge.Value.IsTerminal)
                {
                    var longerWord = FindLongestWord(edge.Value, input, charIndex + 1);

                    if (longerWord != null) foundWord = longerWord;
                }

                // Compare against foundWord; edge.Value.Word may be null on a non-word node.
                if (foundWord != null && (longestWord == null || foundWord.Length > longestWord.Length))
                {
                    longestWord = foundWord;
                }
            }
        }

        return longestWord;
    }
}

//Trie taken from: https://stackoverflow.com/a/6073004
public struct Letter
{
    public const string Chars = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
    public static implicit operator Letter(char c)
    {
        return new Letter() { Index = Chars.IndexOf(c) };
    }
    public int Index;
    public char ToChar()
    {
        return Chars[Index];
    }
    public override string ToString()
    {
        return Chars[Index].ToString();
    }
}

public class Trie
{
    public class Node
    {
        public string Word;
        public bool IsTerminal { get { return Edges.Count == 0 && Word != null; } }
        public Dictionary<Letter, Node> Edges = new Dictionary<Letter, Node>();
    }

    public Node Root = new Node();

    public Trie(string[] words)
    {
        for (int w = 0; w < words.Length; w++)
        {
            var word = words[w];
            var node = Root;
            for (int len = 1; len <= word.Length; len++)
            {
                var letter = word[len - 1];
                Node next;
                if (!node.Edges.TryGetValue(letter, out next))
                {
                    next = new Node();

                    node.Edges.Add(letter, next);
                }

                if (len == word.Length)
                {
                    next.Word = word;
                }

                node = next;
            }
        }
    }

}

Output is:

Found COMPUTE at char index 0
Found EMERGE at char index 7
Found FIVE at char index 13
Found CODE at char index 17
Found COLOR at char index 21
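
The loop above prints each match; to get the list of words the question asks for, the same longest-match scan can collect them instead. Below is a compact sketch under the same two assumptions (longest word wins, skip a character when nothing matches). It uses a simplified trie keyed by `char` rather than the `Trie` and `Letter` types from the linked answer, and the names `LongestMatchSplitter`, `BuildTrie`, and `Split` are hypothetical.

```csharp
using System;
using System.Collections.Generic;

public static class LongestMatchSplitter
{
    // Simplified trie node for illustration (not the Trie class shown above).
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsWord;
    }

    private static Node BuildTrie(IEnumerable<string> words)
    {
        var root = new Node();
        foreach (var word in words)
        {
            var node = root;
            foreach (var c in word)
            {
                Node next;
                if (!node.Children.TryGetValue(c, out next))
                {
                    next = new Node();
                    node.Children.Add(c, next);
                }
                node = next;
            }
            node.IsWord = true; // marks the end of a complete word
        }
        return root;
    }

    // Scans left to right; at each position takes the longest dictionary word
    // and skips one character when no word starts there.
    public static List<string> Split(string input, IEnumerable<string> words)
    {
        var root = BuildTrie(words);
        var result = new List<string>();
        var start = 0;
        while (start < input.Length)
        {
            var node = root;
            var longest = 0; // length of the longest word found at 'start'
            for (var i = start; i < input.Length; i++)
            {
                if (!node.Children.TryGetValue(input[i], out node)) break;
                if (node.IsWord) longest = i - start + 1;
            }

            if (longest == 0)
            {
                start++; // no word begins here; move on to the next character
            }
            else
            {
                result.Add(input.Substring(start, longest));
                start += longest;
            }
        }
        return result;
    }

    public static void Main()
    {
        var words = new[] { "COMPUTE", "FIVE", "CODE", "COLOR", "PUT", "EMERGENCY", "MERGE", "EMERGE" };
        // Prints: COMPUTE EMERGE FIVE CODE COLOR
        Console.WriteLine(string.Join(" ", Split("COMPUTEEMERGEFIVECODECOLOR", words)));
    }
}
```

Tracking the longest match while walking the trie iteratively also avoids recursion, which may be easier to follow for long inputs.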

Answer 2:

Assuming the words in the dictionary do not contain each other (e.g. "TOO" and "TOOK"), I fail to see why this problem requires a solution that is any more complicated than this one-line function:

public static List<string> Normalize(string input, List<string> dictionary)
{
    return dictionary.Where(a => input.Contains(a)).ToList();
}

(If the words DO contain each other, see below.)

Full example:

using System;
using System.Linq;
using System.Collections.Generic;

public class Program
{
    public static List<string> Normalize(string input, List<string> dictionary)
    {
        return dictionary.Where(a => input.Contains(a)).ToList();
    }

    public static void Main()
    {
        List<string> dictionary = new List<string>
        {
            "COMPUTER","FIVE","CODE","COLOR","FOO"
        };
        string input = "COMPUTERFIVECODECOLORBAR";
        var normalized = Normalize(input, dictionary);
        foreach (var s in normalized)
        {
            Console.WriteLine(s);
        }
    }
}

Output:

COMPUTER
FIVE
CODE
COLOR

Code on DotNetFiddle
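
One thing to keep in mind: `Normalize` returns the matches in dictionary order, which only happens to equal the input order in the example above. If the order of appearance matters and the no-overlap assumption still holds, a possible variation is to sort by the position of the first occurrence. This is only a sketch; `OrderedContains` and `NormalizeInInputOrder` are hypothetical names, and a word that occurs more than once is still reported only once.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class OrderedContains
{
    public static List<string> NormalizeInInputOrder(string input, List<string> dictionary)
    {
        return dictionary
            .Select(word => new { Word = word, Index = input.IndexOf(word, StringComparison.Ordinal) })
            .Where(match => match.Index >= 0)   // keep only words present in the input
            .OrderBy(match => match.Index)      // order by first occurrence in the input
            .Select(match => match.Word)
            .ToList();
    }

    public static void Main()
    {
        // The dictionary is deliberately listed in a different order than the input.
        var dictionary = new List<string> { "COLOR", "FOO", "CODE", "FIVE", "COMPUTER" };
        var result = NormalizeInInputOrder("COMPUTERFIVECODECOLORBAR", dictionary);
        // Prints: COMPUTER FIVE CODE COLOR
        Console.WriteLine(string.Join(" ", result));
    }
}
```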

On the other hand, if you've determined that your keywords DO in fact overlap, you're not totally out of luck. If you are certain that the input string contains only words that are in the dictionary, and that they are contiguous, you can use a more complicated function.

    static public List<string> Normalize2(string input, List<string> dictionary)
    {
        var sorted = dictionary.OrderByDescending( a => a.Length).ToList();
        var results = new List<string>();
        bool found = false;

        do
        {
            found = false;
            foreach (var s in sorted)
            {
                // Ordinal comparison avoids culture-sensitive matching surprises.
                if (input.StartsWith(s, StringComparison.Ordinal))
                {
                    found = true;
                    results.Add(s);
                    input = input.Substring(s.Length);
                    break;
                }
            }
        }
        while (input != "" && found);

        return results;
    }

    public static void Main()
    {
        List<string> dictionary = new List<string>
        {
            "SHORT","LONG","LONGER","FOO","FOOD"
        };
        string input = "FOODSHORTLONGERFOO";
        var normalized = Normalize2(input, dictionary);
        foreach (var s in normalized)
        {
            Console.WriteLine(s);
        }
    }

The way this works is that it only looks at the beginning of the string and looks for the longest keywords first. When one is found, it removes it from the input string and continues searching.

Output:

FOOD
SHORT
LONGER
FOO

Notice that "LONG" is not included because we included "LONGER", but "FOO" is included because it is in the string separate from "FOOD".

Also, with this second solution, the keywords will appear in the results list in the same order they appeared in the original string. So if the requirement is to actually split the phrase rather than just detect the keywords in any order, you should use the second function.
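
One caveat with taking the longest keyword first: it can dead-end even when a valid split exists. With the dictionary FOO, FOOD, DOG and the input FOODOG, `Normalize2` takes FOOD, finds nothing that matches OG, and returns only FOOD. A possible way around this is to backtrack and try a shorter keyword when the rest of the string cannot be split. The sketch below is one such variant, not part of the original answer; `BacktrackingSplitter`, `Split`, and `TrySplit` are hypothetical names, and it assumes the whole input must be covered by dictionary words.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class BacktrackingSplitter
{
    // Returns a full segmentation of input into dictionary words, or null if
    // none exists. Longer words are tried first, so the result equals the
    // greedy result whenever the greedy approach succeeds.
    public static List<string> Split(string input, IEnumerable<string> dictionary)
    {
        var sorted = dictionary.Where(w => w.Length > 0)
                               .OrderByDescending(w => w.Length)
                               .ToList();
        var deadEnds = new HashSet<int>(); // positions already proven unsplittable
        var result = new List<string>();
        return TrySplit(input, 0, sorted, deadEnds, result) ? result : null;
    }

    private static bool TrySplit(string input, int start, List<string> sorted,
                                 HashSet<int> deadEnds, List<string> result)
    {
        if (start == input.Length) return true;      // consumed the whole input
        if (deadEnds.Contains(start)) return false;  // known dead end, skip the work

        foreach (var word in sorted)
        {
            // Does 'word' occur at position 'start'? (ordinal, no allocation)
            if (string.CompareOrdinal(input, start, word, 0, word.Length) != 0) continue;

            result.Add(word);
            if (TrySplit(input, start + word.Length, sorted, deadEnds, result)) return true;
            result.RemoveAt(result.Count - 1);       // undo and try a shorter word
        }

        deadEnds.Add(start);
        return false;
    }

    public static void Main()
    {
        var dictionary = new List<string> { "FOO", "FOOD", "DOG" };
        var words = Split("FOODOG", dictionary);
        // Prints: FOO DOG
        Console.WriteLine(words == null ? "(no split)" : string.Join(" ", words));
    }
}
```

Remembering dead-end positions keeps the search from re-exploring the same suffix repeatedly, which matters when many keywords share prefixes.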


Source: https://stackoverflow.com/questions/47276641/c-sharp-split-fully-uppercase-string-into-separate-words-no-spaces

Tags

string

split

text-parsing