Best way to split string into lines

情书的邮戳 2020-11-27 13:53

How do you split a multi-line string into lines?

I know this way:

var result = input.Split("\n\r".ToCharArray(), StringSplitOptions.RemoveEmptyEntri


        
10 answers
  • 2020-11-27 14:25

    It's tricky to handle mixed line endings properly. The line termination characters can be "Line Feed" (ASCII 10, \n, \x0A, \u000A), "Carriage Return" (ASCII 13, \r, \x0D, \u000D), or some combination of them. Going back to DOS, Windows uses the two-character sequence CR-LF (\u000D\u000A), so this combination should terminate only a single line. Unix uses a single \u000A, and very old Macs used a single \u000D. The standard way to treat arbitrary mixtures of these characters within a single text file is as follows:

    • Each and every CR or LF character should skip to the next line, EXCEPT...
    • ...if a CR is immediately followed by an LF (\u000D\u000A), then these two together skip just one line.
    • String.Empty is the only input that returns no lines (any character entails at least one line).
    • The last line must be returned even if it has neither CR nor LF.

    The preceding rules describe the behavior of StringReader.ReadLine and related functions, and the function shown below produces identical results. It is an efficient C# line-breaking function that implements these guidelines to correctly handle any sequence or combination of CR/LF. The enumerated lines do not contain any CR/LF characters. Empty lines are preserved and returned as String.Empty.

    /// <summary>
    /// Enumerates the text lines from the string.
    ///   ⁃ Mixed CR-LF scenarios are handled correctly
    ///   ⁃ String.Empty is returned for each empty line
    ///   ⁃ No returned string ever contains CR or LF
    /// </summary>
    public static IEnumerable<String> Lines(this String s)
    {
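        int j = 0, c, i;    // j: scan position, c: length of s, i: start of the current line
        char ch;
        if ((c = s.Length) > 0)
            do
            {
                // Advance j to the next CR or LF, or to the end of the string.
                for (i = j; (ch = s[j]) != '\r' && ch != '\n' && ++j < c;)
                    ;

                yield return s.Substring(i, j - i);
            }
            // Step past the terminator; for a CR-LF pair, step past the LF as well.
            // Stop once nothing remains after the terminator.
            while (++j < c && (ch != '\r' || s[j] != '\n' || ++j < c));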
    }
    

    Note: If you don't mind the overhead of creating a StringReader instance on each call, you can use the following C# 7 code instead. As noted, while the example above may be slightly more efficient, both of these functions produce the exact same results.

    public static IEnumerable<String> Lines(this String s)
    {
        using (var tr = new StringReader(s))
            while (tr.ReadLine() is String L)
                yield return L;
    }
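
    To make the four rules above concrete, here is a small sketch that runs a StringReader-based Lines method over a mixed input. The LinesDemo class name and the sample strings are only illustrative, and the method is repeated so the snippet compiles on its own.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    public static class LinesDemo
    {
        // Same StringReader-based approach as above, repeated so the sketch is self-contained.
        public static IEnumerable<string> Lines(this string s)
        {
            using (var reader = new StringReader(s))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                    yield return line;
            }
        }

        public static void Main()
        {
            // CR-LF, lone CR, lone LF, an empty line, and a final line without a terminator.
            string mixed = "a\r\nb\rc\n\nd";
            Console.WriteLine(string.Join("|", mixed.Lines()));   // prints a|b|c||d

            Console.WriteLine("".Lines().Count());      // prints 0: String.Empty yields no lines
            Console.WriteLine("x\n".Lines().Count());   // prints 1: a trailing terminator adds no extra line
        }
    }
    ```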
    
  • 2020-11-27 14:26

    I had this other answer, but this one, based on Jack's answer, might be preferred since it works lazily (lines are produced on demand as the result is enumerated), although it is slightly slower.

    public static class StringExtensionMethods
    {
        public static IEnumerable<string> GetLines(this string str, bool removeEmptyLines = false)
        {
            using (var sr = new StringReader(str))
            {
                string line;
                while ((line = sr.ReadLine()) != null)
                {
                    if (removeEmptyLines && String.IsNullOrWhiteSpace(line))
                    {
                        continue;
                    }
                    yield return line;
                }
            }
        }
    }
    

    Usage:

    input.GetLines()      // keeps empty lines
    
    input.GetLines(true)  // removes empty lines
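
    Because GetLines is an iterator, a caller that needs only the first few lines does not pay for splitting the whole string. A minimal sketch of that, assuming a simplified GetLines without the removeEmptyLines option; the LazyLinesDemo class name and the sample input are illustrative only.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    public static class LazyLinesDemo
    {
        // Same shape as GetLines above, without the removeEmptyLines option.
        public static IEnumerable<string> GetLines(this string str)
        {
            using (var sr = new StringReader(str))
            {
                string line;
                while ((line = sr.ReadLine()) != null)
                {
                    yield return line;
                }
            }
        }

        public static void Main()
        {
            string input = "first\r\nsecond\nthird";

            // Only the first line is read; the rest of the string is never scanned.
            Console.WriteLine(input.GetLines().First());                     // prints first

            // Take(2) stops the iterator after two lines.
            Console.WriteLine(string.Join(",", input.GetLines().Take(2)));   // prints first,second
        }
    }
    ```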
    

    Test:

    Action<Action> measure = (Action func) =>
    {
        var start = DateTime.Now;
        for (int i = 0; i < 100000; i++)
        {
            func();
        }
        var duration = DateTime.Now - start;
        Console.WriteLine(duration);
    };
    
    var input = "";
    for (int i = 0; i < 100; i++)
    {
        input += "1 \r2\r\n3\n4\n\r5 \r\n\r\n 6\r7\r 8\r\n";
    }
    
    measure(() =>
        input.Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.None)
    );
    
    measure(() =>
        input.GetLines()
    );
    
    measure(() =>
        input.GetLines().ToList()
    );
    

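    Output (the second measurement only creates the iterator and never enumerates it, which is why it is close to zero; the third forces enumeration with ToList()):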

    00:00:03.9603894

    00:00:00.0029996

    00:00:04.8221971
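
    For timing code like the test above, System.Diagnostics.Stopwatch is generally a better fit than subtracting DateTime.Now values, since it is designed for measuring elapsed intervals. A minimal sketch of such a helper follows; the Measure name and the iterations parameter are assumptions for illustration and are not part of the original test.

    ```csharp
    using System;
    using System.Diagnostics;

    public static class MeasureDemo
    {
        // Hypothetical helper (not from the original test): runs func the given
        // number of times and returns the elapsed wall-clock time.
        public static TimeSpan Measure(Action func, int iterations = 100000)
        {
            var stopwatch = Stopwatch.StartNew();
            for (int i = 0; i < iterations; i++)
            {
                func();
            }
            stopwatch.Stop();
            return stopwatch.Elapsed;
        }

        public static void Main()
        {
            string input = "1 \r2\r\n3\n4\n\r5 \r\n\r\n 6\r7\r 8\r\n";
            string[] separators = { "\r\n", "\r", "\n" };

            // Prints the elapsed time as a TimeSpan; the value depends on the machine.
            Console.WriteLine(Measure(() => input.Split(separators, StringSplitOptions.None)));
        }
    }
    ```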

  • 2020-11-27 14:28
    using (StringReader sr = new StringReader(text)) {
        string line;
        while ((line = sr.ReadLine()) != null) {
            // do something
        }
    }
    
  • 2020-11-27 14:28

    If you want to keep empty lines, just remove the StringSplitOptions argument.

    var result = input.Split(System.Environment.NewLine.ToCharArray());
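
    One caveat worth checking with this approach: ToCharArray() turns the newline into individual separator characters, so where Environment.NewLine is "\r\n", every CR-LF pair produces an extra empty entry. The sketch below hard-codes the two characters so that it behaves the same on any platform; the CharSplitDemo class name is illustrative only.

    ```csharp
    using System;

    public static class CharSplitDemo
    {
        public static void Main()
        {
            string input = "a\r\nb";

            // Equivalent to Environment.NewLine.ToCharArray() where the newline is "\r\n".
            string[] byChars = input.Split(new[] { '\r', '\n' });
            Console.WriteLine(byChars.Length);    // prints 3: "a", "", "b"

            // Treating "\r\n" as a single separator avoids the spurious empty entry.
            string[] byString = input.Split(new[] { "\r\n" }, StringSplitOptions.None);
            Console.WriteLine(byString.Length);   // prints 2: "a", "b"
        }
    }
    ```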
    
  • 2020-11-27 14:36

    Update: See here for an alternative/async solution.


    This works great and is faster than Regex:

    input.Split(new[] {"\r\n", "\r", "\n"}, StringSplitOptions.None)
    

    It is important to have "\r\n" first in the array so that it's taken as one line break. The above gives the same results as either of these Regex solutions:

    Regex.Split(input, "\r\n|\r|\n")
    
    Regex.Split(input, "\r?\n|\r")
    

    Except that Regex turns out to be roughly 8 times slower in this test:

    Action<Action> measure = (Action func) => {
        var start = DateTime.Now;
        for (int i = 0; i < 100000; i++) {
            func();
        }
        var duration = DateTime.Now - start;
        Console.WriteLine(duration);
    };
    
    var input = "";
    for (int i = 0; i < 100; i++)
    {
        input += "1 \r2\r\n3\n4\n\r5 \r\n\r\n 6\r7\r 8\r\n";
    }
    
    measure(() =>
        input.Split(new[] {"\r\n", "\r", "\n"}, StringSplitOptions.None)
    );
    
    measure(() =>
        Regex.Split(input, "\r\n|\r|\n")
    );
    
    measure(() =>
        Regex.Split(input, "\r?\n|\r")
    );
    

    Output:

    00:00:03.8527616

    00:00:31.8017726

    00:00:32.5557128

    And here's the extension method:

    public static class StringExtensionMethods
    {
        public static IEnumerable<string> GetLines(this string str, bool removeEmptyLines = false)
        {
            return str.Split(new[] { "\r\n", "\r", "\n" },
                removeEmptyLines ? StringSplitOptions.RemoveEmptyEntries : StringSplitOptions.None);
        }
    }
    

    Usage:

    input.GetLines()      // keeps empty lines
    
    input.GetLines(true)  // removes empty lines
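
    To illustrate why "\r\n" should come first in the separator array, the sketch below splits the same input with two different orderings. The SeparatorOrderDemo class name is illustrative only.

    ```csharp
    using System;

    public static class SeparatorOrderDemo
    {
        public static void Main()
        {
            string input = "a\r\nb";

            // "\r\n" listed first: the pair is consumed as one line break.
            string[] crlfFirst = input.Split(new[] { "\r\n", "\r", "\n" }, StringSplitOptions.None);
            Console.WriteLine(crlfFirst.Length);   // prints 2: "a", "b"

            // "\r" listed first: CR and LF are matched separately, leaving an empty entry between them.
            string[] crlfLast = input.Split(new[] { "\r", "\n", "\r\n" }, StringSplitOptions.None);
            Console.WriteLine(crlfLast.Length);    // prints 3: "a", "", "b"
        }
    }
    ```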
    
  • 2020-11-27 14:37

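    Slightly twisted, but here is an iterator block to do it. Note that it splits only on Environment.NewLine, so mixed line endings are not handled:

    public static IEnumerable<string> Lines(this string text)
    {
        string newLine = Environment.NewLine;
        int start = 0;   // index where the current line begins
        int next;        // index of the next line break
        while ((next = text.IndexOf(newLine, start, StringComparison.Ordinal)) != -1)
        {
            yield return text.Substring(start, next - start);
            start = next + newLine.Length;   // skip the whole separator, not just one character
        }
        yield return text.Substring(start);  // remainder after the last line break
    }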
    

    You can then call:

    var result = input.Lines().ToArray();
    