Split string by delimiter, but not if it is escaped

旧巷少年郎 2020-11-27 10:58

How can I split a string by a delimiter, but not if it is escaped? For example, I have a string:

1|2\\|2|3\\\\|4\\\\\\|4

The delimiter is |.

5 Answers
  • 2020-11-27 11:01

    For future readers, here is a universal solution. It is based on NikiC's idea with (*SKIP)(*FAIL):

    function split_escaped($delimiter, $escaper, $text)
    {
        $d = preg_quote($delimiter, "~");
        $e = preg_quote($escaper, "~");
        // Split on the delimiter, skipping any escaped escaper or escaped delimiter.
        $tokens = preg_split(
            '~' . $e . '(' . $e . '|' . $d . ')(*SKIP)(*FAIL)|' . $d . '~',
            $text
        );
        // Unescape in a single pass. Two sequential replacements would unescape
        // twice when the escaper equals the delimiter (e.g. "''''" would become "'").
        return preg_replace('~' . $e . '(' . $e . '|' . $d . ')~', '$1', $tokens);
    }
    

    Try it:

    // the base situation:
    $text = "asdf\\,fds\\,ddf,\\\\,f\\,,dd";
    $delimiter = ",";
    $escaper = "\\";
    print_r(split_escaped($delimiter, $escaper, $text));
    
    // other signs:
    $text = "dk!%fj%slak!%df!!jlskj%%dfl%isr%!%%jlf";
    $delimiter = "%";
    $escaper = "!";
    print_r(split_escaped($delimiter, $escaper, $text));
    
    // delimiter with multiple characters:
    $text = "aksd()jflaksd())jflkas(('()j()fkl'()()as()d('')jf";
    $delimiter = "()";
    $escaper = "'";
    print_r(split_escaped($delimiter, $escaper, $text));
    
    // escaper is same as delimiter:
    $text = "asfl''asjf'lkas'''jfkl''d'jsl";
    $delimiter = "'";
    $escaper = "'";
    print_r(split_escaped($delimiter, $escaper, $text));
    

    Output:

    Array
    (
        [0] => asdf,fds,ddf
        [1] => \
        [2] => f,
        [3] => dd
    )
    Array
    (
        [0] => dk%fj
        [1] => slak%df!jlskj
        [2] => 
        [3] => dfl
        [4] => isr
        [5] => %
        [6] => jlf
    )
    Array
    (
        [0] => aksd
        [1] => jflaksd
        [2] => )jflkas((()j
        [3] => fkl()
        [4] => as
        [5] => d(')jf
    )
    Array
    (
        [0] => asfl'asjf
        [1] => lkas'
        [2] => jfkl'd
        [3] => jsl
    )
    

    Note: there is a theoretical problem: implode('::', ['a:', ':b']) and implode('::', ['a', '', 'b']) both produce the same string, 'a::::b'. Imploding with escaping is therefore an interesting problem in its own right.
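
    As a rough illustration of the imploding side, the sketch below escapes each token before joining. join_escaped is a hypothetical helper, not part of the answer above, and it assumes that both the delimiter and the escaper are single characters. It does not resolve the multi-character '::' case from the note.

```php
<?php
// Hypothetical helper (an assumption for illustration): joins tokens so that
// an escape-aware split can recover them later.
// Assumes $delimiter and $escaper are single characters.
function join_escaped(string $delimiter, string $escaper, array $tokens): string
{
    // strtr() with an array replaces in one pass, so an escaper that was
    // just inserted is never escaped a second time.
    $map = [
        $escaper   => $escaper . $escaper,
        $delimiter => $escaper . $delimiter,
    ];
    $escaped = [];
    foreach ($tokens as $token) {
        $escaped[] = strtr($token, $map);
    }
    return implode($delimiter, $escaped);
}

$joined = join_escaped(':', '\\', ['a:', ':b']);   // a\::\:b
$other  = join_escaped(':', '\\', ['a', '', 'b']); // a::b
```

    With a single-character delimiter the two inputs no longer collide, because every delimiter inside a token is marked with the escaper.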

  • 2020-11-27 11:14

    Regex can be slow. An alternative is to replace the escaped sequences with placeholder markers before splitting, then restore them afterwards:

    $foo = 'a,b|,c,d||,e';
    
    function splitEscaped($str, $delimiter, $escapeChar = '\\') {
        // Temporary markers; this assumes they never appear in the input string
        $double = "\0\0\0_doub";
        $escaped = "\0\0\0_esc";
        // Hide escaped escape characters first, then escaped delimiters
        $str = str_replace($escapeChar . $escapeChar, $double, $str);
        $str = str_replace($escapeChar . $delimiter, $escaped, $str);
    
        $split = explode($delimiter, $str);
        // Restore each marker as its unescaped character
        foreach ($split as &$val) {
            $val = str_replace([$double, $escaped], [$escapeChar, $delimiter], $val);
        }
        unset($val); // break the reference left behind by the foreach
        return $split;
    }
    
    print_r(splitEscaped($foo, ',', '|'));
    

    This splits on "," unless it is escaped with "|". Double escaping is also supported, so "||" becomes a single "|" in the result:

    Array
    (
        [0] => a
        [1] => b,c
        [2] => d|
        [3] => e
    )
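
    If the placeholder markers are a concern, a single pass over the string avoids both regex and markers. The sketch below is a different technique from the one above, not the author's method. split_scan is a hypothetical name, and it assumes a single-byte delimiter and escape character.

```php
<?php
// Hypothetical single-pass splitter (illustration only).
// Assumes $delimiter and $escapeChar are single-byte characters.
function split_scan(string $str, string $delimiter, string $escapeChar = '\\'): array
{
    $parts = [];
    $current = '';
    $len = strlen($str);
    for ($i = 0; $i < $len; $i++) {
        $ch = $str[$i];
        $next = $i + 1 < $len ? $str[$i + 1] : null;
        if ($ch === $escapeChar && ($next === $escapeChar || $next === $delimiter)) {
            // Escaped escape character or escaped delimiter: keep it literally.
            $current .= $next;
            $i++;
        } elseif ($ch === $delimiter) {
            // Unescaped delimiter: close the current part.
            $parts[] = $current;
            $current = '';
        } else {
            $current .= $ch;
        }
    }
    $parts[] = $current;
    return $parts;
}

print_r(split_scan('a,b|,c,d||,e', ',', '|'));
```

    For this input it yields the same four parts: a, b,c, d| and e. An escape character followed by anything else is kept as-is, which matches the behaviour of the str_replace version.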
    
  • 2020-11-27 11:19

    Instead of split(...), it's IMO more intuitive to use some sort of "scan" function that operates like a lexical tokenizer. In PHP that would be the preg_match_all function. You simply say you want to match:

    1. something other than a \ or |
    2. or a \ followed by any character (in particular a \ or |)
    3. repeat #1 or #2 at least once

    The following demo:

    $input = "1|2\\|2|3\\\\|4\\\\\\|4";
    echo $input . "\n\n";
    preg_match_all('/(?:\\\\.|[^\\\\|])+/', $input, $parts);
    print_r($parts[0]);
    

    will print:

    1|2\|2|3\\|4\\\|4
    
    Array
    (
        [0] => 1
        [1] => 2\|2
        [2] => 3\\
        [3] => 4\\\|4
    )
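
    One difference from a split is worth checking before relying on this: because the pattern needs at least one character, empty fields between two adjacent delimiters are not reported. The sketch below compares the scan pattern from this answer with a preg_split using (*SKIP)(*FAIL) on a made-up input.

```php
<?php
// Made-up input for illustration: a||b\|c (an empty field, then an escaped |)
$input = "a||b\\|c";

// Scan approach: only non-empty runs are matched.
preg_match_all('/(?:\\\\.|[^\\\\|])+/', $input, $parts);
$scanned = $parts[0];   // ['a', 'b\|c']

// Split approach: the empty field between the two | is preserved.
$split = preg_split('~\\\\.(*SKIP)(*FAIL)|\|~s', $input);   // ['a', '', 'b\|c']

print_r($scanned);
print_r($split);
```

    Which behaviour is preferable depends on whether empty fields carry meaning in the data.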
    
  • 2020-11-27 11:24

    Use dark magic:

    $array = preg_split('~\\\\.(*SKIP)(*FAIL)|\|~s', $string);
    

    \\\\. matches a backslash followed by a character, (*SKIP)(*FAIL) skips it and \| matches your delimiter.
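
    (*SKIP) and (*FAIL) are PCRE backtracking control verbs: once the first alternative has consumed an escaped pair, (*FAIL) rejects the match and (*SKIP) makes the engine resume after the consumed text instead of retrying inside it. A minimal sketch applied to the string from the question:

```php
<?php
// The question's string as a PHP literal; its actual value is 1|2\|2|3\\|4\\\|4
$string = "1|2\\|2|3\\\\|4\\\\\\|4";

// Escaped pairs (\ followed by any character) are skipped; a bare | splits.
$array = preg_split('~\\\\.(*SKIP)(*FAIL)|\|~s', $string);

print_r($array);
// Array
// (
//     [0] => 1
//     [1] => 2\|2
//     [2] => 3\\
//     [3] => 4\\\|4
// )
```

    The escape sequences are left in place in the resulting parts; removing them is a separate step.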

  • 2020-11-27 11:26

    Recently I devised a solution:

    $array = preg_split('~ ((?<!\\\\)|(?<=[^\\\\](\\\\\\\\)+)) \| ~x', $string);
    

    But the black magic solution is still three times faster.
