What is the best algorithm for arbitrary delimiter/escape character processing?

I'm a little surprised that there isn't some information on this on the web, and I keep finding that the problem is a little stickier than I thought.

Here's the r

7 answers

悲&欢浪女

2021-01-02 08:12

You're looking for something like a "string tokenizer". There's a version I found quickly that's similar. Or look at getopt.

旧巷少年郎

2021-01-02 08:15

Here's a more idiomatic and readable way to do it:

```
// Requires: using System.Collections.Generic; using System.Text;
public IEnumerable<string> SplitAndUnescape(
    string encodedString,
    char separator,
    char escape)
{
    var inEscapeSequence = false;
    var currentToken = new StringBuilder();

    foreach (var currentCharacter in encodedString)
    {
        if (inEscapeSequence)
        {
            // The character after an escape is always taken literally.
            currentToken.Append(currentCharacter);
            inEscapeSequence = false;
        }
        else if (currentCharacter == escape)
        {
            inEscapeSequence = true;
        }
        else if (currentCharacter == separator)
        {
            // End of token: emit it (even if empty) and start a new one.
            yield return currentToken.ToString();
            currentToken.Clear();
        }
        else
        {
            currentToken.Append(currentCharacter);
        }
    }

    // Emit the final token, even if it is empty.
    yield return currentToken.ToString();
}
```

Note that this doesn't remove empty elements. I don't think that should be the responsibility of the parser. If you want to remove them, just call Where(item => item.Any()) on the result.

I think this is too much logic for a single method; it gets hard to follow. If someone has time, I think it would be better to break it up into multiple methods and maybe its own class.
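As a rough illustration of the note above about leaving empty-token removal to the caller, here is a minimal Python sketch of the same idea. The names `split_and_unescape` and `non_empty` are hypothetical and chosen for this example; the parser keeps empty tokens and a separate small helper filters them, so each function has one job.

```python
from typing import Iterable, Iterator


def split_and_unescape(encoded: str, separator: str, escape: str) -> Iterator[str]:
    """Yield tokens from `encoded`, keeping empty tokens (hypothetical helper)."""
    token = []
    in_escape = False
    for ch in encoded:
        if in_escape:
            # The character after an escape is taken literally.
            token.append(ch)
            in_escape = False
        elif ch == escape:
            in_escape = True
        elif ch == separator:
            yield "".join(token)
            token = []
        else:
            token.append(ch)
    # The final token is always emitted, even if it is empty.
    yield "".join(token)


def non_empty(tokens: Iterable[str]) -> Iterator[str]:
    """Drop empty tokens; this is the caller's choice, not the parser's."""
    return (t for t in tokens if t)


if __name__ == "__main__":
    raw = "a,,b\\,c,"  # the characters: a , , b \ , c ,
    print(list(split_and_unescape(raw, ",", "\\")))             # ['a', '', 'b,c', '']
    print(list(non_empty(split_and_unescape(raw, ",", "\\"))))  # ['a', 'b,c']
```

Keeping the filter outside the parser means callers who do care about empty fields (for example, positional columns) can still get them.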

名媛妹妹

2021-01-02 08:16

A simple state machine is usually the easiest and fastest way. Example in Python:

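```
def extract(text, delim, escape):
    # Parser states
    parsing = 0
    escaped = 1

    state = parsing
    found = []
    parsed = ""

    for c in text:
        if state == parsing:
            if c == delim:
                # End of token: store it (even if empty) and start a new one
                found.append(parsed)
                parsed = ""
            elif c == escape:
                # Take the next character literally
                state = escaped
            else:
                parsed += c
        else:  # state == escaped
            parsed += c
            state = parsing

    # Note: a trailing empty token is dropped here, unlike empty tokens in the middle
    if parsed:
        found.append(parsed)

    return found
```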
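One edge case the state machine above passes over silently is input that ends right after an escape character. The following is a minimal sketch of a stricter variant, assuming you would rather fail than drop the dangling escape; the name `extract_strict` and the choice of `ValueError` are assumptions made for this example, not part of the answer above.

```python
def extract_strict(text, delim, escape):
    """Split `text` on `delim`; `escape` makes the next character literal.

    Hypothetical stricter variant: rejects ambiguous arguments and input
    that ends in the middle of an escape sequence.
    """
    if len(delim) != 1 or len(escape) != 1 or delim == escape:
        raise ValueError("delim and escape must be two different single characters")

    found = []
    parsed = []
    escaped = False

    for c in text:
        if escaped:
            parsed.append(c)
            escaped = False
        elif c == escape:
            escaped = True
        elif c == delim:
            found.append("".join(parsed))
            parsed = []
        else:
            parsed.append(c)

    if escaped:
        # The last character was an escape with nothing after it.
        raise ValueError("input ends with an unfinished escape sequence")

    # Unlike the version above, the last token is kept even when empty.
    found.append("".join(parsed))
    return found


if __name__ == "__main__":
    print(extract_strict("a,b\\,c", ",", "\\"))  # ['a', 'b,c']
    print(extract_strict("a,", ",", "\\"))       # ['a', '']
```

Whether a dangling escape should be an error, a literal escape character, or ignored is a design decision; this sketch simply picks the loudest option.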

执笔经年

2021-01-02 08:17

```
#include <string>
#include <vector>

using std::string;
using std::vector;

// Splits text on delim; esc makes the next character literal.
// Empty tokens are skipped.
void smartSplit(string const& text, char delim, char esc, vector<string>& tokens)
{
    enum State { NORMAL, IN_ESC };
    State state = NORMAL;
    string frag;

    for (size_t i = 0; i < text.length(); ++i)
    {
        char c = text[i];
        switch (state)
        {
        case NORMAL:
            if (c == delim)
            {
                if (!frag.empty())
                    tokens.push_back(frag);
                frag.clear();
            }
            else if (c == esc)
                state = IN_ESC;
            else
                frag += c;
            break;
        case IN_ESC:
            // The escaped character is always taken literally.
            frag += c;
            state = NORMAL;
            break;
        }
    }
    if (!frag.empty())
        tokens.push_back(frag);
}
```

逝去的感伤

2021-01-02 08:20

Here's my ported function in C#:

```
// Requires: using System.Collections.Generic; using System.Text;
// Note: `ref` is not strictly needed, since List<T> is a reference type;
// it is kept here to match the original signature.
public static void smartSplit(string text, char delim, char esc, ref List<string> listToBuild)
{
    bool currentlyEscaped = false;
    StringBuilder fragment = new StringBuilder();

    for (int i = 0; i < text.Length; i++)
    {
        char c = text[i];
        if (currentlyEscaped)
        {
            // The escaped character is always taken literally.
            fragment.Append(c);
            currentlyEscaped = false;
        }
        else if (c == delim)
        {
            // Empty fragments are skipped.
            if (fragment.Length > 0)
            {
                listToBuild.Add(fragment.ToString());
                fragment.Clear();
            }
        }
        else if (c == esc)
        {
            currentlyEscaped = true;
        }
        else
        {
            fragment.Append(c);
        }
    }

    if (fragment.Length > 0)
    {
        listToBuild.Add(fragment.ToString());
    }
}
```

Hope this helps someone in the future. Thanks to KenE for pointing me in the right direction.

春和景丽

2021-01-02 08:31
Implementing this kind of tokenizer as a finite state machine (FSM) is fairly straightforward.

You do have a few decisions to make, such as what to do with leading delimiters: strip them, or emit empty (NULL) tokens.

Here is an abstract version which ignores leading and multiple delimiters, and doesn't allow escaping the newline:
```
state(input)     action
========================
BEGIN(*):         token.clear(); state=START;
END(*):           return;
*(\n\0):          token.emit(); state=END;
START(DELIMITER): ; // NB: the input is *not* added to the token!
START(ESCAPE):    state=ESC; // NB: the input is *not* added to the token!
START(*):         token.append(input); state=NORM;
NORM(DELIMITER):  token.emit(); token.clear(); state=START;
NORM(ESCAPE):     state=ESC; // NB: the input is *not* added to the token!
NORM(*):          token.append(input);
ESC(*):           token.append(input); state=NORM;
```
This kind of implementation has the advantage of dealing with consecutive escapes naturally, and can easily be extended to give special meaning to more escape sequences (e.g. add a rule like ESC(t): token.append(TAB)).
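The transition table above can be sketched in Python roughly as follows. This is one possible reading, not a definitive implementation: the function name `tokenize` and the `special` mapping (for rules such as ESC(t)) are hypothetical, end of input is treated like a newline, and `token.emit()` is assumed to skip an empty token.

```python
from typing import Dict, List, Optional

START, NORM, ESC = "START", "NORM", "ESC"
TERMINATORS = ("\n", "\0")  # cannot be escaped, as in the table above


def tokenize(line: str, delimiter: str, escape: str,
             special: Optional[Dict[str, str]] = None) -> List[str]:
    """Hypothetical FSM tokenizer that skips leading and repeated delimiters."""
    special = special or {}
    tokens: List[str] = []
    token: List[str] = []
    state = START

    for ch in line:
        if ch in TERMINATORS:
            break  # *(\n\0): stop in any state, including ESC
        if state == ESC:
            # ESC(*): append the input (or its special meaning), back to NORM
            token.append(special.get(ch, ch))
            state = NORM
        elif ch == escape:
            # START(ESCAPE) / NORM(ESCAPE): the escape itself is not added
            state = ESC
        elif ch == delimiter:
            if state == NORM:
                # NORM(DELIMITER): emit and start over
                tokens.append("".join(token))
                token = []
                state = START
            # START(DELIMITER): ignored
        else:
            # START(*) / NORM(*): ordinary character
            token.append(ch)
            state = NORM

    # Assumption: emit the pending token only if it is non-empty.
    if token:
        tokens.append("".join(token))
    return tokens


if __name__ == "__main__":
    text = ",,a,b\\,c,,d\\te\nignored"
    print(tokenize(text, ",", "\\", {"t": "\t"}))  # ['a', 'b,c', 'd\te']
```

Because the escape handling lives in a single branch, adding another escape sequence only means adding an entry to `special`.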