Removing all whitespace lines from a multi-line string efficiently

名媛妹妹 2020-12-29 04:25

In C#, what's the best way to remove blank lines (i.e., lines that contain only whitespace) from a string? I'm happy to use a Regex if that's the best solution.

EDIT

19 Answers
  • 2020-12-29 04:51
    string outputString;
    using (StringReader reader = new StringReader(originalString))
    using (StringWriter writer = new StringWriter())
    {
        string line;
        // Copy across only the lines that contain non-whitespace characters.
        while ((line = reader.ReadLine()) != null)
        {
            if (line.Trim().Length > 0)
                writer.WriteLine(line);
        }
        outputString = writer.ToString();
    }
    
  • 2020-12-29 04:52

    In response to Will's bounty, which expects a solution that takes "test\r\n \r\nthis\r\n\r\n" and outputs "test\r\nthis", I've come up with a solution that makes use of atomic grouping (aka Nonbacktracking Subexpressions on MSDN). I recommend reading those articles for a better understanding of what's happening. Ultimately the atomic group helped match the trailing newline characters that were otherwise left behind.

    Use RegexOptions.Multiline with this pattern:

    ^\s+(?!\B)|\s*(?>[\r\n]+)$
    

    Here is an example with some test cases, including some I gathered from Will's comments on other posts, as well as my own.

    string[] inputs = 
    {
        "one\r\n \r\ntwo\r\n\t\r\n \r\n",
        "test\r\n \r\nthis\r\n\r\n",
        "\r\n\r\ntest!",
        "\r\ntest\r\n ! test",
        "\r\ntest \r\n ! "
    };
    string[] outputs = 
    {
        "one\r\ntwo",
        "test\r\nthis",
        "test!",
        "test\r\n ! test",
        "test \r\n ! "
    };
    
    string pattern = @"^\s+(?!\B)|\s*(?>[\r\n]+)$";
    
    for (int i = 0; i < inputs.Length; i++)
    {
        string result = Regex.Replace(inputs[i], pattern, "",
                                      RegexOptions.Multiline);
        Console.WriteLine(result == outputs[i]);
    }
    

    EDIT: To address the issue of the pattern failing to clean up text with a mix of whitespace and newlines, I added \s* to the last alternation portion of the regex. My previous pattern was redundant and I realized \s* would handle both cases.

  • 2020-12-29 04:53
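    // Collapses each run of consecutive \n into a single \n. Note that this does
    // not handle \r\n line endings or lines that contain spaces or tabs.
    string corrected =
        System.Text.RegularExpressions.Regex.Replace(input, @"\n+", "\n");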
    
  • 2020-12-29 04:56

    Eh. Well, after all that, I couldn't find one that would hit all the corner cases I could figure out. The following is my latest incarnation of a regex that strips:

    1. All empty lines from the start of a string
      • Not including any spaces at the beginning of the first non-whitespace line
    2. All empty lines after the first non-whitespace line and before the last non-whitespace line
      • Again, preserving all whitespace at the beginning of any non-whitespace line
    3. All empty lines after the last non-whitespace line, including the last newline

    (?<=(\r\n)|^)\s*\r\n|\r\n\s*$

    which essentially says:

    • Immediately after
      • The beginning of the string OR
      • The end of the last line
    • Match as much contiguous whitespace as possible that ends in a newline*
    • OR
    • Match a newline and as much contiguous whitespace as possible that ends at the end of the string

    The first half catches all whitespace at the start of the string until the first non-whitespace line, or all whitespace between non-whitespace lines. The second half snags the remaining whitespace in the string, including the last non-whitespace line's newline.

    Thanks to all who tried to help out; your answers helped me think through everything I needed to consider when matching.

    *(This regex considers a newline to be \r\n, and so will have to be adjusted depending on the source of the string. No options need to be set in order to run the match.)
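
    A small sketch of how the pattern above might be applied, assuming Windows-style \r\n line endings as noted in the footnote. The class and method names (BlankLineStripper, Strip) are made up for illustration and are not part of the original answer.

    ```csharp
    using System.Text.RegularExpressions;

    // Hypothetical helper for illustration: wraps the pattern described above.
    public static class BlankLineStripper
    {
        // No RegexOptions are set, so ^ matches only at the start of the input
        // and $ only at its end (or just before a final \n).
        private static readonly Regex BlankLines =
            new Regex(@"(?<=(\r\n)|^)\s*\r\n|\r\n\s*$");

        // Removes whitespace-only lines, including a trailing newline.
        public static string Strip(string text)
        {
            return BlankLines.Replace(text, "");
        }
    }
    ```

    For example, BlankLineStripper.Strip("test\r\n \r\nthis\r\n\r\n") returns "test\r\nthis", and BlankLineStripper.Strip("\r\n\r\ntest!") returns "test!". Leading whitespace on a non-blank line is left alone.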

  • 2020-12-29 04:57
    s = Regex.Replace(s, @"^[^\n\S]*\n", "", RegexOptions.Multiline);
    

    [^\n\S] matches any character that's not a linefeed or a non-whitespace character, so any whitespace character except \n. RegexOptions.Multiline is needed so that ^ matches at the start of every line rather than only at the start of the string. Most likely the only characters you have to worry about are space, tab and carriage return, so this should work too:

    s = Regex.Replace(s, @"^[ \t\r]*\n", "", RegexOptions.Multiline);
    

    And if you want it to catch the last line, without a final linefeed:

    s = Regex.Replace(s, @"^[ \t\r]*(?:\n|\z)", "", RegexOptions.Multiline);
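
    To make the double negative in [^\n\S] concrete, here is a minimal sketch that tests single characters against the class. The helper name (HorizontalWhitespace) is hypothetical and only serves the illustration.

    ```csharp
    using System.Text.RegularExpressions;

    // Hypothetical helper for illustration only.
    public static class HorizontalWhitespace
    {
        // [^\n\S] reads as "not a linefeed and not a non-whitespace character",
        // i.e. any whitespace character other than \n. \A and \z anchor the
        // match to the whole one-character input.
        private static readonly Regex SingleChar = new Regex(@"\A[^\n\S]\z");

        public static bool IsMatch(char c)
        {
            return SingleChar.IsMatch(c.ToString());
        }
    }
    ```

    With this sketch, IsMatch(' '), IsMatch('\t') and IsMatch('\r') are true, while IsMatch('\n') and IsMatch('x') are false.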
    
  • 2020-12-29 04:58

    If you want to remove lines containing only whitespace (tabs, spaces), try:

    string fix = Regex.Replace(original, @"^\s*$\n", string.Empty, RegexOptions.Multiline);
    

    Edit (for @Will): The simplest solution to trim trailing newlines would be to use TrimEnd on the resulting string, e.g.:

    string fix =
        Regex.Replace(original, @"^\s*$\n", string.Empty, RegexOptions.Multiline)
             .TrimEnd();
    