Best way to parse Space Separated Text

谁说胖子不能爱 提交于 2019-12-03 13:13:37
Todd Myhre

The computer term for what you're doing is lexical analysis; read that for a good summary of this common task.

Based on your example, I'm guessing that you want whitespace to separate your words, but stuff in quotation marks should be treated as a "word" without the quotes.

The simplest way to do this is to define a word as a regular expression:

([^"^\s]+)\s*|"([^"]+)"\s*

This expression states that a "word" is either (1) non-quote, non-whitespace text surrounded by whitespace, or (2) non-quote text surrounded by quotes (followed by some whitespace). Note the use of capturing parentheses to highlight the desired text.

Armed with that regex, your algorithm is simple: search your text for the next "word" as defined by the capturing parentheses, and return it. Repeat that until you run out of "words".

Here's the simplest bit of working code I could come up with, in VB.NET. Note that we have to check both groups for data since there are two sets of capturing parentheses.

Dim token As String
Dim r As Regex = New Regex("([^""^\s]+)\s*|""([^""]+)""\s*")
Dim m As Match = r.Match("this is a ""test string""")

While m.Success
    token = m.Groups(1).ToString
    If token.length = 0 And m.Groups.Count > 1 Then
        token = m.Groups(2).ToString
    End If
    m = m.NextMatch
End While

Note 1: Will's answer, above, is the same idea as this one. Hopefully this answer explains the details behind the scene a little better :)

The Microsoft.VisualBasic.FileIO namespace (in Microsoft.VisualBasic.dll) has a TextFieldParser you can use to split on space delimeted text. It handles strings within quotes (i.e., "this is one token" thisistokentwo) well.

Note, just because the DLL says VisualBasic doesn't mean you can only use it in a VB project. Its part of the entire Framework.

There is the state machine approach.

    private enum State
    {
        None = 0,
        InTokin,
        InQuote
    }

    private static IEnumerable<string> Tokinize(string input)
    {
        input += ' '; // ensure we end on whitespace
        State state = State.None;
        State? next = null; // setting the next state implies that we have found a tokin
        StringBuilder sb = new StringBuilder();
        foreach (char c in input)
        {
            switch (state)
            {
                default:
                case State.None:
                    if (char.IsWhiteSpace(c))
                        continue;
                    else if (c == '"')
                    {
                        state = State.InQuote;
                        continue;
                    }
                    else
                        state = State.InTokin;
                    break;
                case State.InTokin:
                    if (char.IsWhiteSpace(c))
                        next = State.None;
                    else if (c == '"')
                        next = State.InQuote;
                    break;
                case State.InQuote:
                    if (c == '"')
                        next = State.None;
                    break;
            }
            if (next.HasValue)
            {
                yield return sb.ToString();
                sb = new StringBuilder();
                state = next.Value;
                next = null;
            }
            else
                sb.Append(c);
        }
    }

It can easily be extended for things like nested quotes and escaping. Returning as IEnumerable<string> allows your code to only parse as much as you need. There aren't any real downsides to that kind of lazy approach as strings are immutable so you know that input isn't going to change before you have parsed the whole thing.

See: http://en.wikipedia.org/wiki/Automata-Based_Programming

You also might want to look into regular expressions. That might help you out. Here is a sample ripped off from MSDN...

using System;
using System.Text.RegularExpressions;

public class Test
{

    public static void Main ()
    {

        // Define a regular expression for repeated words.
        Regex rx = new Regex(@"\b(?<word>\w+)\s+(\k<word>)\b",
          RegexOptions.Compiled | RegexOptions.IgnoreCase);

        // Define a test string.        
        string text = "The the quick brown fox  fox jumped over the lazy dog dog.";

        // Find matches.
        MatchCollection matches = rx.Matches(text);

        // Report the number of matches found.
        Console.WriteLine("{0} matches found in:\n   {1}", 
                          matches.Count, 
                          text);

        // Report on each match.
        foreach (Match match in matches)
        {
            GroupCollection groups = match.Groups;
            Console.WriteLine("'{0}' repeated at positions {1} and {2}",  
                              groups["word"].Value, 
                              groups[0].Index, 
                              groups[1].Index);
        }

    }

}
// The example produces the following output to the console:
//       3 matches found in:
//          The the quick brown fox  fox jumped over the lazy dog dog.
//       'The' repeated at positions 0 and 4
//       'fox' repeated at positions 20 and 25
//       'dog' repeated at positions 50 and 54
harpo

Craig is right — use regular expressions. Regex.Split may be more concise for your needs.

[^\t]+\t|"[^"]+"\t

using the Regex definitely looks like the best bet, however this one just returns the whole string. I'm trying to tweak it, but not much luck so far.

string[] tokens = System.Text.RegularExpressions.Regex.Split(this.BuildArgs, @"[^\t]+\t|""[^""]+""\t");
标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!