How to parse a comma-delimited string when commas and parentheses exist in a field

感情迁移 submitted on 2019-12-02 01:28:54
string str = "adj_con(CL2,1,3,0),adj_cont(CL1,1,3,0),NG, NG/CL, 5 value of CL(JK), HO";
var resultStrings = new List<string>();
int? firstIndex = null;
int scopeLevel = 0;
for (int i = 0; i < str.Length; i++)
{
    if (str[i] == ',' && scopeLevel == 0)
    {
        resultStrings.Add(str.Substring(firstIndex.GetValueOrDefault(), i - firstIndex.GetValueOrDefault()));
        firstIndex = i + 1;
    }
    else if (str[i] == '(') scopeLevel++;
    else if (str[i] == ')') scopeLevel--;
}
resultStrings.Add(str.Substring(firstIndex.GetValueOrDefault()));
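
The loop above can be wrapped in a reusable method. The following is only a sketch: the class and method names (TopLevelSplitter, SplitTopLevel) are made up for illustration, and trimming each token and throwing on unbalanced parentheses are assumptions that the original snippet does not make.

```csharp
using System;
using System.Collections.Generic;

public static class TopLevelSplitter
{
    // Hypothetical helper: splits on commas that are not inside parentheses.
    public static List<string> SplitTopLevel(string input)
    {
        var parts = new List<string>();
        int start = 0; // index where the current token begins
        int depth = 0; // current parenthesis nesting level

        for (int i = 0; i < input.Length; i++)
        {
            char c = input[i];
            if (c == '(')
            {
                depth++;
            }
            else if (c == ')')
            {
                // Assumption: unbalanced input is treated as an error.
                if (depth == 0) throw new FormatException("Unmatched ')' at index " + i);
                depth--;
            }
            else if (c == ',' && depth == 0)
            {
                // A top-level comma ends the current token.
                parts.Add(input.Substring(start, i - start).Trim());
                start = i + 1;
            }
        }

        if (depth != 0) throw new FormatException("Unmatched '(' in input");
        parts.Add(input.Substring(start).Trim()); // last token
        return parts;
    }
}
```

Because the method only counts nesting depth, it should also cope with nested parentheses such as "f(a,(b,c)),x".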

Even faster:

([^,]*\x28[^\x29]*\x29|[^,]+)

That should do the trick. Basically, look for either a "function thumbprint" or anything without a comma.

adj_con(CL2,1,3,0),adj_cont(CL1,1,3,0),NG, NG/CL, 5 value of CL(JK), HO
                  ^                   ^  ^      ^                  ^

The carets mark where each group stops.
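
As a sketch of how the pattern above might be applied, the snippet below runs it through Regex.Matches. The class and method names are hypothetical, and the Trim() call is an added assumption, since the raw matches keep the space that follows each comma.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class GroupingDemo
{
    // Either "name(...)" up to the first closing paren, or a run without commas.
    private static readonly Regex Token =
        new Regex(@"([^,]*\x28[^\x29]*\x29|[^,]+)");

    // Hypothetical helper name, used only for this illustration.
    public static List<string> Tokens(string input)
    {
        return Token.Matches(input)
            .Cast<Match>()
            .Select(m => m.Value.Trim()) // drop the space left after each comma
            .ToList();
    }
}
```

Note that the first alternative stops at the first closing parenthesis, so this approach assumes the parentheses are not nested.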

Just this regex:

[^,()]+(\([^()]*\))?

A test example:

var s= "adj_con(CL2,1,3,0),adj_cont(CL1,1,3,0),NG, NG/CL, 5 value of CL(JK), HO";
Regex regex = new Regex(@"[^,()]+(\([^()]*\))?");
var matches = regex.Matches(s)
    .Cast<Match>()
    .Select(m => m.Value);

returns

adj_con(CL2,1,3,0)
adj_cont(CL1,1,3,0)
NG
 NG/CL
 5 value of CL(JK)
 HO

If you simply must use Regex, then you can split the string on the following:

,                # match a comma
(?=              # that is followed by
  (?:            # either
    [^\(\)]*     #  no parens at all
    |            # or
    (?:          #  
      [^\(\)]*   #  ...
      \(         #  (
      [^\(\)]*   #     stuff in parens
      \)         #  )
      [^\(\)]*   #  ...
    )+           #  any number of times
  )$             # until the end of the string
)

It breaks your input into the following (once the leading space left after each comma is trimmed):

adj_con(CL2,1,3,0)
adj_cont(CL1,1,3,0)
NG
NG/CL
5 value of CL(JK)
HO
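
A minimal sketch of using this pattern with Regex.Split follows. It writes the lookahead on a single line; the commented, multi-line form shown above would need RegexOptions.IgnorePatternWhitespace. The class and method names are hypothetical, and trimming the pieces is an added assumption.

```csharp
using System.Linq;
using System.Text.RegularExpressions;

public static class LookaheadSplitDemo
{
    // A comma followed, up to the end of the string, by either no parens at all
    // or by text in which every paren pair is complete (non-nested).
    private const string Pattern =
        @",(?=(?:[^()]*|(?:[^()]*\([^()]*\)[^()]*)+)$)";

    // Hypothetical helper name, used only for this illustration.
    public static string[] Split(string input)
    {
        // The pattern has no capturing groups, so Regex.Split returns
        // only the pieces between the matched commas.
        return Regex.Split(input, Pattern)
            .Select(piece => piece.Trim())
            .ToArray();
    }
}
```

The lookahead rescans the remainder of the string for every comma, so on long inputs the loop-based approach is likely to be cheaper.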

You can also use .NET's balanced grouping constructs to create a version that works with nested parens, but you're probably just as well off with one of the non-Regex solutions.

David Wallace

Another way to implement what Snowbear was doing:

    public static string[] SplitNest(this string s, char src, string nest, string trg)
    {
        int scope = 0;
        if (trg == null || nest == null) return null;
        if (trg.Length == 0 || nest.Length < 2) return null;
        if (trg.IndexOf(src) >= 0) return null;
        if (nest.IndexOf(src) >= 0) return null;

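        for (int i = 0; i < s.Length; i++)
        {
            if (s[i] == src && scope == 0)
            {
                s = s.Remove(i, 1).Insert(i, trg);
                i += trg.Length - 1; // skip the inserted delimiter so it is not rescanned
            }
            else if (s[i] == nest[0]) scope++;
            else if (s[i] == nest[1]) scope--;
        }

        // Split(string[], StringSplitOptions) exists on every .NET version
        return s.Split(new[] { trg }, StringSplitOptions.None);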
    }

The idea is to replace any non-nested delimiter with another delimiter that you can then use with an ordinary string.Split(). You can also choose what type of bracket to use - (), <>, [], or even something weird like \/, ][, or `'. For your purposes you would use

string str = "adj_con(CL2,1,3,0),adj_cont(CL1,1,3,0),NG, NG/CL, 5 value of CL(JK), HO";
string[] result = str.SplitNest(',',"()","~");

The function would first turn your string into

adj_con(CL2,1,3,0)~adj_cont(CL1,1,3,0)~NG~ NG/CL~ 5 value of CL(JK)~ HO

then split on the ~, ignoring the nested commas.

Assuming non-nested, matching parentheses, you can easily match the tokens you want instead of splitting the string:

MatchCollection matches = Regex.Matches(data, @"(?:[^(),]|\([^)]*\))+");

Alternatively, split on the commas directly:

var s = "adj_con(CL2,1,3,0),adj_cont(CL1,1,3,0),NG, NG/CL, 5 value of CL(JK), HO";
var result = string.Join("\n", Regex.Split(s, @",\s|(?<=\)),"));

The pattern splits on a comma followed by whitespace, or on a comma that is preceded by ) (a lookbehind, so the ) stays in the token). The comma-plus-whitespace alternative comes first so that the space after "CL(JK)," is consumed as well.

result =

adj_con(CL2,1,3,0)
adj_cont(CL1,1,3,0)
NG
NG/CL
5 value of CL(JK)
HO

sehe

The TextFieldParser class (documented on MSDN) seems to have the functionality built in:

TextFieldParser class: provides methods and properties for parsing structured text files.

Parsing a text file with the TextFieldParser is similar to iterating over a text file, while the ReadFields method to extract fields of text is similar to splitting the strings.

The TextFieldParser can parse two types of files: delimited or fixed-width. Some properties, such as Delimiters and HasFieldsEnclosedInQuotes are meaningful only when working with delimited files, while the FieldWidths property is meaningful only when working with fixed-width files.

See the article which helped me find that
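
For illustration, here is a minimal sketch of reading one delimited line with TextFieldParser. The class and method names are hypothetical. Note the assumption built into it: the parser groups fields by quotes (HasFieldsEnclosedInQuotes), not by parentheses, so the input from the question would have to be quoted first for this to apply.

```csharp
using System.IO;
using Microsoft.VisualBasic.FileIO;

public static class QuotedFieldDemo
{
    // Hypothetical helper: parses a single comma-delimited line in which
    // fields containing commas are wrapped in double quotes.
    public static string[] ParseLine(string line)
    {
        using (var parser = new TextFieldParser(new StringReader(line)))
        {
            parser.TextFieldType = FieldType.Delimited;
            parser.SetDelimiters(",");
            parser.HasFieldsEnclosedInQuotes = true;
            return parser.ReadFields(); // null when there is no more data
        }
    }
}
```

The class lives in the Microsoft.VisualBasic assembly, which a C# project may need to reference explicitly.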

Here's a stronger option, which parses the whole text, including nested parentheses:

string pattern = @"
\A
(?>
    (?<Token>
        (?:
            [^,()]              # Regular character
            |
            (?<Paren> \( )      # Opening paren - push to stack
            |
            (?<-Paren> \) )     # Closing paren - pop
            |
            (?(Paren),)         # If inside parentheses, match comma.
        )*?
    )
    (?(Paren)(?!))    # If we are not inside parentheses,
    (?:,|\Z)          # match a comma or the end
)*? # lazy just to avoid an extra empty match at the end,
    #  though it removes a last empty token.
\Z
";
Match match = Regex.Match(data, pattern, RegexOptions.IgnorePatternWhitespace);

You can get all matches by iterating over match.Groups["Token"].Captures.
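
Iterating over the captures could look like the sketch below. It repeats the pattern above on a single line, so RegexOptions.IgnorePatternWhitespace is not needed; the class and method names are hypothetical, and trimming each token and returning an empty list when the input does not match are added assumptions.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class BalancedTokenReader
{
    // The balanced-group pattern from above, with whitespace and comments removed.
    private const string Pattern =
        @"\A(?>(?<Token>(?:[^,()]|(?<Paren>\()|(?<-Paren>\))|(?(Paren),))*?)(?(Paren)(?!))(?:,|\Z))*?\Z";

    // Hypothetical helper name, used only for this illustration.
    public static List<string> ReadTokens(string data)
    {
        Match match = Regex.Match(data, Pattern);

        // Assumption: input the pattern rejects (e.g. unbalanced parens) yields no tokens.
        if (!match.Success) return new List<string>();

        // Every repetition of the Token group is recorded as a separate capture.
        return match.Groups["Token"].Captures
            .Cast<Capture>()
            .Select(c => c.Value.Trim())
            .ToList();
    }
}
```

A single Match is enough here because the whole string is consumed in one pass; the individual tokens are recovered from the capture history of the Token group.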
