Question
I have files that sometimes contain weird end-of-line sequences like \r\r\n. With the following, it works the way I want:
with open('test.txt', 'wb') as f:  # simulate a file with weird end-of-lines
    f.write(b'abc\r\r\ndef')

with open('test.txt', 'rb') as f:
    for l in f:
        print(l)
# b'abc\r\r\n'
# b'def'
I want to be able to get the same result from a string. I thought about splitlines, but it does not give the same result:
print(b'abc\r\r\ndef'.splitlines())
# [b'abc', b'', b'def']
Even with keepends=True, the result is not the same.
Question: how can I get the same behaviour as for l in f with splitlines()?
Linked: Changing str.splitlines to match file readlines and https://bugs.python.org/issue22232
Note: I don't want to put everything in a BytesIO or StringIO, because that roughly halves the speed (already benchmarked); I want to keep working on a plain string. So this is not a duplicate of How do I wrap a string in a file in Python?.
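For reference, a minimal sketch of the kind of benchmark meant here; the payload and repeat counts below are made up for illustration:
import io
import timeit

data = b'abc\r\r\ndef\n' * 100_000  # hypothetical payload

def via_bytesio():
    # Iterating a BytesIO splits on b'\n' and keeps the endings,
    # just like iterating a file opened in binary mode.
    return list(io.BytesIO(data))

def via_split():
    # Splitting the bytes directly (endings are dropped here).
    return data.split(b'\n')

print(timeit.timeit(via_bytesio, number=10))
print(timeit.timeit(via_split, number=10))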
Answer 1:
Why don't you just split it:
input = b'\nabc\r\r\r\nd\ref\nghi\r\njkl'
result = input.split(b'\n')
print(result)
[b'', b'abc\r\r\r', b'd\ref', b'ghi\r', b'jkl']
You will lose the trailing \n, which can be re-appended to every line later if you really need it. For the last line you have to check whether the newline is actually needed, like:
fixed = [bstr + b'\n' for bstr in result]
if not input.endswith(b'\n'):
    fixed[-1] = fixed[-1][:-1]
print(fixed)
[b'\n', b'abc\r\r\r\n', b'd\ref\n', b'ghi\r\n', b'jkl']
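A note on the check above: in Python 3, indexing a bytes object returns an int, so a comparison like input[-1] != b'\n' is always true; compare a one-byte slice or use endswith instead:
data = b'abc\n'
print(data[-1])               # 10 -- an int, never equal to b'\n'
print(data[-1:])              # b'\n' -- a one-byte slice compares as bytes
print(data.endswith(b'\n'))   # True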
Another variant uses a generator. This way it is memory-friendly on huge inputs, and the syntax stays similar to the original: for l in bin_split(input):
def bin_split(input_str):
    start = 0
    while start >= 0:
        # Position just past the next b'\n' (0 if there is none).
        found = input_str.find(b'\n', start) + 1
        if 0 < found < len(input_str):
            yield input_str[start:found]
            start = found
        else:
            # No further newline, or it is the last byte: emit the rest.
            yield input_str[start:]
            break
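For example, with the sample input from above:
data = b'\nabc\r\r\r\nd\ref\nghi\r\njkl'
for l in bin_split(data):
    print(l)
# b'\n'
# b'abc\r\r\r\n'
# b'd\ref\n'
# b'ghi\r\n'
# b'jkl'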
Answer 2:
There are a couple ways to do this, but none are especially fast.
If you want to keep the line endings, you might try the re module:
import re

lines = re.findall(r'[\r\n]+|[^\r\n]+[\r\n]*', text)
# or equivalently
line_split_regex = re.compile(r'[\r\n]+|[^\r\n]+[\r\n]*')
lines = line_split_regex.findall(text)
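On the question's sample input this gives the following (for bytes input, a bytes pattern such as rb'[\r\n]+|[^\r\n]+[\r\n]*' would be needed; also note that a run of consecutive blank lines comes back as a single element, which differs slightly from file iteration):
print(re.findall(r'[\r\n]+|[^\r\n]+[\r\n]*', 'abc\r\r\ndef'))
# ['abc\r\r\n', 'def']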
If you need the endings and the file is really big, you may want to iterate instead:
for r in re.finditer(r'[\r\n]+|[^\r\n]+[\r\n]*', text):
    line = r.group()
    # do stuff with line here
If you don't need the endings, then you can do it much more easily:
lines = list(filter(None, text.splitlines()))
You can omit the list() part if you just iterate over the results (or if using Python 2):
for line in filter(None, text.splitlines()):
    pass  # do stuff with line
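On the sample input, splitlines() yields ['abc', '', 'def'] and the filter drops the empty string:
text = 'abc\r\r\ndef'
print(list(filter(None, text.splitlines())))
# ['abc', 'def']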
Answer 3:
I would iterate through it like this:
text = b'abc\r\r\ndef'
results = text.split(b'\r\r\n')
for r in results:
    print(r)
# b'abc'
# b'def'
Answer 4:
This is a for l in f: solution.
The key to this is the newline argument on the open call. From the documentation:
[screenshot of the open() documentation describing the newline argument]
Therefore, you should use newline='' when writing to suppress newline translation, and then when reading use newline='\n', which will work if all your lines terminate with 0 or more '\r' characters followed by a '\n' character:
with open('test.txt', 'w', newline='') as f:
    f.write('abc\r\r\ndef')

with open('test.txt', 'r', newline='\n') as f:
    for line in f:
        print(repr(line))
Prints:
'abc\r\r\n'
'def'
A quasi-splitlines solution:
Strictly speaking, this is not a splitlines solution, since to handle arbitrary line endings a regular-expression version of split would have to be used, capturing the line endings and then re-assembling the lines with their endings. Instead, this solution just uses a regular expression to break up the input text, allowing line endings that consist of any number of '\r' characters followed by a '\n' character:
import re

input = '\nabc\r\r\ndef\nghi\r\njkl'
with open('test.txt', 'w', newline='') as f:
    f.write(input)

with open('test.txt', 'r', newline='') as f:
    text = f.read()

lines = re.findall(r'[^\r\n]*\r*\n|[^\r\n]+$', text)
for line in lines:
    print(repr(line))
Prints:
'\n'
'abc\r\r\n'
'def\n'
'ghi\r\n'
'jkl'
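Note that the file round-trip above only reproduces the question's setup; the findall call itself works on any in-memory string:
import re

lines = re.findall(r'[^\r\n]*\r*\n|[^\r\n]+$', '\nabc\r\r\ndef\nghi\r\njkl')
print(lines)
# ['\n', 'abc\r\r\n', 'def\n', 'ghi\r\n', 'jkl']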
Answer 5:
Clearly, the two split functionalities do two (slightly?) different things (the number of elements might even differ).
So what about a manual approach?
code00.py:
#!/usr/bin/env python

import sys
import os


def split_by_file(b):  # reference implementation: round-trip through a file
    file_name = "_temp.tmp"
    with open(file_name, "wb") as f:
        f.write(b)
    with open(file_name, "rb") as f:
        l = f.readlines()
    os.unlink(file_name)
    return l


def split_manually(b):  # This is the function
    elems = b.split(b"\n")  # Would have been nice if `split` was a generator :)
    last = elems.pop()
    for elem in elems:
        yield elem + b"\n"
    if last:
        yield last


def split_by_splitlines(b):
    return b.splitlines()


def test_functionality():
    print("Test functionality...")
    texts = [
        b"",
        b"\n",
        b"\r",
        b"a\n",
        b"a\n\n\n",
        b"a\n\n\nb",
        b"abc\r\r\ndef",  # Your text
        b"a\rb\nc\r\nd\r\re\n\nf\n\rg\r\r\n\n\rh",
    ]
    funcs = [
        split_manually,
        split_by_splitlines,
    ]
    for text in texts:
        r0 = split_by_file(text)
        print("\nStream {:}: {:}".format(text, r0))
        for func in funcs:
            r1 = list(func(text))
            print("    {:}: {:} - {:}".format(func.__name__, r1, r1 == r0))


def main(*argv):
    test_functionality()


if __name__ == "__main__":
    print("Python {0:s} {1:d}bit on {2:s}\n".format(" ".join(elem.strip() for elem in sys.version.split("\n")),
                                                    64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    main(*sys.argv[1:])
    print("\nDone.")
Output:
[cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q065765343]> "e:\Work\Dev\VEnvs\py_pc064_03.08.07_test0\Scripts\python.exe" code00.py
Python 3.8.7 (tags/v3.8.7:6503f05, Dec 21 2020, 17:59:51) [MSC v.1928 64 bit (AMD64)] 64bit on win32

Test functionality...

Stream b'': []
    split_manually: [] - True
    split_by_splitlines: [] - True

Stream b'\n': [b'\n']
    split_manually: [b'\n'] - True
    split_by_splitlines: [b''] - False

Stream b'\r': [b'\r']
    split_manually: [b'\r'] - True
    split_by_splitlines: [b''] - False

Stream b'a\n': [b'a\n']
    split_manually: [b'a\n'] - True
    split_by_splitlines: [b'a'] - False

Stream b'a\n\n\n': [b'a\n', b'\n', b'\n']
    split_manually: [b'a\n', b'\n', b'\n'] - True
    split_by_splitlines: [b'a', b'', b''] - False

Stream b'a\n\n\nb': [b'a\n', b'\n', b'\n', b'b']
    split_manually: [b'a\n', b'\n', b'\n', b'b'] - True
    split_by_splitlines: [b'a', b'', b'', b'b'] - False

Stream b'abc\r\r\ndef': [b'abc\r\r\n', b'def']
    split_manually: [b'abc\r\r\n', b'def'] - True
    split_by_splitlines: [b'abc', b'', b'def'] - False

Stream b'a\rb\nc\r\nd\r\re\n\nf\n\rg\r\r\n\n\rh': [b'a\rb\n', b'c\r\n', b'd\r\re\n', b'\n', b'f\n', b'\rg\r\r\n', b'\n', b'\rh']
    split_manually: [b'a\rb\n', b'c\r\n', b'd\r\re\n', b'\n', b'f\n', b'\rg\r\r\n', b'\n', b'\rh'] - True
    split_by_splitlines: [b'a', b'b', b'c', b'd', b'', b'e', b'', b'f', b'', b'g', b'', b'', b'', b'h'] - False

Done.
Notes:
- The current run is on Win, but I got the same results on Linux and OSX (various Python 3 versions)
- Performance:
    - I didn't try anything, as I don't know what you're after (of course, speed is the key, but I don't have a comparison baseline)
    - It might be data dependent
    - From the memory consumption PoV it's (a bit) bad, since split builds the whole list up front (this might negatively affect speed as well)
Source: https://stackoverflow.com/questions/65765343/splitlines-and-iterating-over-an-opened-file-give-different-results