What is the best algorithm for arbitrary delimiter/escape character processing?

心在旅途 2021-01-02 07:57

I'm a little surprised that there isn't some information on this on the web, and I keep finding that the problem is a little stickier than I thought.

Here's the r

7 answers
  • 2021-01-02 08:12

    You're looking for something like a "string tokenizer". There's a version I found quickly that's similar. Or look at getopt.

  • 2021-01-02 08:15

    Here's a more idiomatic and readable way to do it:

    // Requires: using System.Collections.Generic; using System.Text;
    public IEnumerable<string> SplitAndUnescape(
        string encodedString,
        char separator,
        char escape)
    {
        var inEscapeSequence = false;
        var currentToken = new StringBuilder();
    
        foreach (var currentCharacter in encodedString)
        {
            if (inEscapeSequence)
            {
                // The character after an escape is always taken literally.
                currentToken.Append(currentCharacter);
                inEscapeSequence = false;
            }
            else if (currentCharacter == escape)
            {
                inEscapeSequence = true;
            }
            else if (currentCharacter == separator)
            {
                yield return currentToken.ToString();
                currentToken.Clear();
            }
            else
            {
                currentToken.Append(currentCharacter);
            }
        }
    
        // Emit the last token (a trailing, unpaired escape is dropped).
        yield return currentToken.ToString();
    }
    

    Note that this doesn't remove empty elements. I don't think that should be the responsibility of the parser. If you want to remove them, just call Where(item => item.Any()) on the result.
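    To illustrate that separation of concerns, here is a minimal Python sketch of the same idea: the parser yields every token, including empty ones, and the caller filters afterwards. The function name `split_and_unescape` and the sample input are assumptions made for illustration, not part of the answer above.

```python
from typing import Iterator


def split_and_unescape(encoded: str, separator: str, escape: str) -> Iterator[str]:
    """Yield tokens split on `separator`; `escape` makes the next character literal."""
    in_escape = False
    token = []  # characters of the token being built

    for ch in encoded:
        if in_escape:
            token.append(ch)  # an escaped character is always taken literally
            in_escape = False
        elif ch == escape:
            in_escape = True  # the escape character itself is not kept
        elif ch == separator:
            yield "".join(token)  # emit the token even if it is empty
            token = []
        else:
            token.append(ch)

    yield "".join(token)  # final token, possibly empty


all_tokens = list(split_and_unescape("a,,b\\,c,", ",", "\\"))
non_empty = [t for t in all_tokens if t]  # the caller decides to drop empties
```

    Here `all_tokens` is `['a', '', 'b,c', '']` and `non_empty` is `['a', 'b,c']`, so dropping empty elements stays a one-line decision on the caller's side.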

    I think this is too much logic for a single method; it gets hard to follow. If someone has time, I think it would be better to break it up into multiple methods and maybe its own class.

  • 2021-01-02 08:16

    A simple state machine is usually the easiest and fastest way. Example in Python:

    def extract(text, delim, escape):
      # states
      parsing = 0
      escaped = 1
    
      state = parsing
      found = []
      parsed = ""
    
      for c in text:
        if state == parsing:
          if c == delim:
            found.append(parsed)
            parsed = ""
          elif c == escape:
            state = escaped
          else:
            parsed += c
        else: # state == escaped: take the character literally
          parsed += c
          state = parsing
    
      # Only the trailing token is dropped when empty; empty tokens
      # between two delimiters are still appended above.
      if parsed:
        found.append(parsed)
    
      return found
    
  • 2021-01-02 08:17
    #include <string>
    #include <vector>
    
    using std::string;
    using std::vector;
    
    void smartSplit(string const& text, char delim, char esc, vector<string>& tokens)
    {
        enum State { NORMAL, IN_ESC };
        State state = NORMAL;
        string frag;
    
        for (size_t i = 0; i<text.length(); ++i)
        {
            char c = text[i];
            switch (state)
            {
            case NORMAL:
                if (c == delim)
                {
                    if (!frag.empty())
                        tokens.push_back(frag);
                    frag.clear();
                }
                else if (c == esc)
                    state = IN_ESC;
                else
                    frag.append(1, c);
                break;
            case IN_ESC:
                frag.append(1, c);
                state = NORMAL;
                break;
            }
        }
        if (!frag.empty())
            tokens.push_back(frag);
    }
    
  • 2021-01-02 08:20

    Here's my ported function in C#:

        public static void smartSplit(string text, char delim, char esc, ref List<string> listToBuild)
        {
            bool currentlyEscaped = false;
            StringBuilder fragment = new StringBuilder();
    
            for (int i = 0; i < text.Length; i++)
            {
                char c = text[i];
                if (currentlyEscaped)
                {
                    fragment.Append(c);
                    currentlyEscaped = false;
                }
                else 
                {
                    if (c == delim)
                    {
                        if (fragment.Length > 0)
                        {
                            listToBuild.Add(fragment.ToString());
                            fragment.Clear();
                        }
    
                    }
                    else if (c == esc)
                        currentlyEscaped = true;
                    else
                        fragment.Append(c);
                }
            }
    
            if (fragment.Length > 0)
            {
                listToBuild.Add(fragment.ToString());
            }
        }
    

    Hope this helps someone in the future. Thanks to KenE for pointing me in the right direction.

  • 2021-01-02 08:31

    Implementing this kind of tokenizer as an FSM is fairly straightforward.

    You do have a few decisions to make, such as what to do with leading delimiters: strip them or emit NULL tokens.


    Here is an abstract version which ignores leading and multiple delimiters, and doesn't allow escaping the newline:

    state(input)     action
    ========================
    BEGIN(*):         token.clear(); state=START;
    END(*):           return;
    *(\n\0):          token.emit(); state=END;
    START(DELIMITER): ; // NB: the input is *not* added to the token!
    START(ESCAPE):    state=ESC; // NB: the input is *not* added to the token!
    START(*):         token.append(input); state=NORM;
    NORM(DELIMITER):  token.emit(); token.clear(); state=START;
    NORM(ESCAPE):     state=ESC; // NB: the input is *not* added to the token!
    NORM(*):          token.append(input);
    ESC(*):           token.append(input); state=NORM;
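
    The table above could be turned into code roughly as follows. This is a sketch in Python under a few assumptions that the table leaves open: the end of the string is treated like the `\0` terminator, nothing is emitted at the terminator if the token is empty, and the names `State`, `TERMINATORS` and `tokenize` are made up for illustration.

```python
from enum import Enum, auto
from typing import List


class State(Enum):
    START = auto()  # between tokens, nothing collected yet
    NORM = auto()   # inside a token
    ESC = auto()    # previous character was the escape character


TERMINATORS = ("\n", "\0")


def tokenize(text: str, delimiter: str, escape: str) -> List[str]:
    tokens: List[str] = []
    token: List[str] = []
    state = State.START  # BEGIN(*): token.clear(); state=START

    for ch in text:
        if ch in TERMINATORS:  # *(\n\0): terminators win in every state, even ESC
            break
        if state is State.ESC:  # ESC(*): take the character literally
            token.append(ch)
            state = State.NORM
        elif ch == escape:  # START(ESCAPE) / NORM(ESCAPE): escape is not stored
            state = State.ESC
        elif ch == delimiter:
            if state is State.NORM:  # NORM(DELIMITER): emit and reset
                tokens.append("".join(token))
                token = []
                state = State.START
            # START(DELIMITER): leading or repeated delimiter, ignored
        else:  # START(*) / NORM(*): ordinary character
            token.append(ch)
            state = State.NORM

    if token:  # token.emit() at the terminator; assumption: skip if empty
        tokens.append("".join(token))
    return tokens
```

    With `,` as the delimiter and a backslash as the escape, `tokenize(",,a,,b\\,c\nrest", ",", "\\")` returns `['a', 'b,c']`: the leading and doubled delimiters are skipped, the escaped comma stays inside the token, and everything after the newline is ignored.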
    

    This kind of implementation has the advantage of dealing with consecutive escapes naturally, and can easily be extended to give special meaning to more escape sequences (e.g. add a rule like ESC(t) token.append(TAB)).
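
    One possible way to add such rules is a lookup table consulted only in the escaped state. The sketch below is in Python; the `SPECIAL_ESCAPES` mapping and the function name `split_with_escapes` are hypothetical, and unlike the table above this version keeps empty tokens to stay short.

```python
from typing import Dict, List

# Hypothetical mapping from the character after the escape to its replacement.
SPECIAL_ESCAPES: Dict[str, str] = {"t": "\t", "n": "\n"}


def split_with_escapes(text: str, delimiter: str, escape: str) -> List[str]:
    tokens: List[str] = []
    token: List[str] = []
    escaped = False

    for ch in text:
        if escaped:
            # ESC(t) -> TAB and so on; any character that is not a key of
            # the mapping is kept literally (escaped delimiter, escaped escape).
            token.append(SPECIAL_ESCAPES.get(ch, ch))
            escaped = False
        elif ch == escape:
            escaped = True
        elif ch == delimiter:
            tokens.append("".join(token))
            token = []
        else:
            token.append(ch)

    tokens.append("".join(token))
    return tokens
```

    Two consecutive escapes still produce one literal escape character, so an escape followed by `t` yields a tab while a doubled escape followed by `t` yields the escape character and a plain `t`. The mapping keys should not collide with the delimiter or the escape character, otherwise those can no longer be escaped literally.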
