Question
I have files that sometimes contain weird end-of-line sequences like \r\r\n. With the following, it works the way I want:
with open('test.txt', 'wb') as f:  # simulate a file with weird end-of-lines
    f.write(b'abc\r\r\ndef')

with open('test.txt', 'rb') as f:
    for l in f:
        print(l)
# b'abc\r\r\n'
# b'def'
I want to be able to get the same result from a string. I thought about splitlines, but it does not give the same result:
print(b'abc\r\r\ndef'.splitlines())
# [b'abc', b'', b'def']
Even with keepends=True, the result is not the same.
Question: how can I get the same behaviour as for l in f with splitlines()?
Linked: Changing str.splitlines to match file readlines and https://bugs.python.org/issue22232
Note: I don't want to put everything in a BytesIO or StringIO, because that roughly halves the speed (already benchmarked); I want to keep working on a plain string. So this is not a duplicate of How do I wrap a string in a file in Python?.
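For reference, a minimal sketch of the kind of benchmark meant here; the payload and repeat counts below are made up for illustration:
import io
import timeit

data = b'abc\r\r\ndef\n' * 100_000  # hypothetical payload

def via_bytesio():
    # Iterating a BytesIO splits on b'\n' and keeps the endings,
    # just like iterating a file opened in binary mode.
    return list(io.BytesIO(data))

def via_split():
    # Splitting the bytes directly (endings are dropped here).
    return data.split(b'\n')

print(timeit.timeit(via_bytesio, number=10))
print(timeit.timeit(via_split, number=10))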
Answer 1:
Why don't you just split it:
input = b'\nabc\r\r\r\nd\ref\nghi\r\njkl'
result = input.split(b'\n')
print(result)
[b'', b'abc\r\r\r', b'd\ref', b'ghi\r', b'jkl']
You will lose the trailing \n, which can be re-appended to every line later if you really need it. For the last line you have to check whether the newline is actually needed, like:
fixed = [bstr + b'\n' for bstr in result]
if not input.endswith(b'\n'):
    fixed[-1] = fixed[-1][:-1]
print(fixed)
[b'\n', b'abc\r\r\r\n', b'd\ref\n', b'ghi\r\n', b'jkl']
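A note on the check above: in Python 3, indexing a bytes object returns an int, so a comparison like input[-1] != b'\n' is always true; compare a one-byte slice or use endswith instead:
data = b'abc\n'
print(data[-1])               # 10 -- an int, never equal to b'\n'
print(data[-1:])              # b'\n' -- a one-byte slice compares as bytes
print(data.endswith(b'\n'))   # True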
Another variant uses a generator. This way it is memory-friendly on huge inputs, and the syntax stays similar to the original: for l in bin_split(input):
def bin_split(input_str):
    start = 0
    while start >= 0:
        # Position just past the next b'\n' (0 if there is none).
        found = input_str.find(b'\n', start) + 1
        if 0 < found < len(input_str):
            yield input_str[start:found]
            start = found
        else:
            # No further newline, or it is the last byte: emit the rest.
            yield input_str[start:]
            break
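For example, with the sample input from above:
data = b'\nabc\r\r\r\nd\ref\nghi\r\njkl'
for l in bin_split(data):
    print(l)
# b'\n'
# b'abc\r\r\r\n'
# b'd\ref\n'
# b'ghi\r\n'
# b'jkl'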
Answer 2:
There are a couple ways to do this, but none are especially fast.
If you want to keep the line endings, you might try the re module:
import re

lines = re.findall(r'[\r\n]+|[^\r\n]+[\r\n]*', text)
# or equivalently
line_split_regex = re.compile(r'[\r\n]+|[^\r\n]+[\r\n]*')
lines = line_split_regex.findall(text)
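On the question's sample input this gives the following (for bytes input, a bytes pattern such as rb'[\r\n]+|[^\r\n]+[\r\n]*' would be needed; also note that a run of consecutive blank lines comes back as a single element, which differs slightly from file iteration):
print(re.findall(r'[\r\n]+|[^\r\n]+[\r\n]*', 'abc\r\r\ndef'))
# ['abc\r\r\n', 'def']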
If you need the endings and the file is really big, you may want to iterate instead:
for r in re.finditer(r'[\r\n]+|[^\r\n]+[\r\n]*', text):
    line = r.group()
    # do stuff with line here
If you don't need the endings, then you can do it much more easily:
lines = list(filter(None, text.splitlines()))
You can omit the list() part if you just iterate over the results (or if using Python 2):
for line in filter(None, text.splitlines()):
    pass  # do stuff with line
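On the sample input, splitlines() yields ['abc', '', 'def'] and the filter drops the empty string:
text = 'abc\r\r\ndef'
print(list(filter(None, text.splitlines())))
# ['abc', 'def']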
Answer 3:
I would iterate through it like this:
text = b'abc\r\r\ndef'
results = text.split(b'\r\r\n')
for r in results:
    print(r)
# b'abc'
# b'def'
Answer 4:
This is a for l in f: solution.
The key to this is the newline argument on the open call. From the documentation:
[screenshot of the open() documentation describing the newline argument]
Therefore, you should use newline='' when writing to suppress newline translation, and then when reading use newline='\n', which will work if all your lines terminate with 0 or more '\r' characters followed by a '\n' character:
with open('test.txt', 'w', newline='') as f:
    f.write('abc\r\r\ndef')

with open('test.txt', 'r', newline='\n') as f:
    for line in f:
        print(repr(line))
Prints:
'abc\r\r\n'
'def'
A quasi-splitlines solution:
Strictly speaking, this is not a splitlines solution, since to handle arbitrary line endings a regular-expression version of split would have to be used, capturing the line endings and then re-assembling the lines with their endings. Instead, this solution just uses a regular expression to break up the input text, allowing line endings that consist of any number of '\r' characters followed by a '\n' character:
import re

input = '\nabc\r\r\ndef\nghi\r\njkl'
with open('test.txt', 'w', newline='') as f:
    f.write(input)

with open('test.txt', 'r', newline='') as f:
    text = f.read()

lines = re.findall(r'[^\r\n]*\r*\n|[^\r\n]+$', text)
for line in lines:
    print(repr(line))
Prints:
'\n'
'abc\r\r\n'
'def\n'
'ghi\r\n'
'jkl'
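Note that the file round-trip above only reproduces the question's setup; the findall call itself works on any in-memory string:
import re

lines = re.findall(r'[^\r\n]*\r*\n|[^\r\n]+$', '\nabc\r\r\ndef\nghi\r\njkl')
print(lines)
# ['\n', 'abc\r\r\n', 'def\n', 'ghi\r\n', 'jkl']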
Answer 5:
Clearly, the two split functionalities do two (slightly?) different things (the number of elements might even differ).
So what about a manual approach?
code00.py:
#!/usr/bin/env python

import sys
import os


def split_by_file(b):  # reference implementation: round-trip through a file
    file_name = "_temp.tmp"
    with open(file_name, "wb") as f:
        f.write(b)
    with open(file_name, "rb") as f:
        l = f.readlines()
    os.unlink(file_name)
    return l


def split_manually(b):  # This is the function
    elems = b.split(b"\n")  # Would have been nice if `split` was a generator :)
    last = elems.pop()
    for elem in elems:
        yield elem + b"\n"
    if last:
        yield last


def split_by_splitlines(b):
    return b.splitlines()


def test_functionality():
    print("Test functionality...")
    texts = [
        b"",
        b"\n",
        b"\r",
        b"a\n",
        b"a\n\n\n",
        b"a\n\n\nb",
        b"abc\r\r\ndef",  # Your text
        b"a\rb\nc\r\nd\r\re\n\nf\n\rg\r\r\n\n\rh",
    ]
    funcs = [
        split_manually,
        split_by_splitlines,
    ]
    for text in texts:
        r0 = split_by_file(text)
        print("\nStream {:}: {:}".format(text, r0))
        for func in funcs:
            r1 = list(func(text))
            print("    {:}: {:} - {:}".format(func.__name__, r1, r1 == r0))


def main(*argv):
    test_functionality()


if __name__ == "__main__":
    print("Python {0:s} {1:d}bit on {2:s}\n".format(" ".join(elem.strip() for elem in sys.version.split("\n")),
                                                    64 if sys.maxsize > 0x100000000 else 32, sys.platform))
    main(*sys.argv[1:])
    print("\nDone.")
Output:
[cfati@CFATI-5510-0:e:\Work\Dev\StackOverflow\q065765343]> "e:\Work\Dev\VEnvs\py_pc064_03.08.07_test0\Scripts\python.exe" code00.py
Python 3.8.7 (tags/v3.8.7:6503f05, Dec 21 2020, 17:59:51) [MSC v.1928 64 bit (AMD64)] 64bit on win32

Test functionality...

Stream b'': []
    split_manually: [] - True
    split_by_splitlines: [] - True

Stream b'\n': [b'\n']
    split_manually: [b'\n'] - True
    split_by_splitlines: [b''] - False

Stream b'\r': [b'\r']
    split_manually: [b'\r'] - True
    split_by_splitlines: [b''] - False

Stream b'a\n': [b'a\n']
    split_manually: [b'a\n'] - True
    split_by_splitlines: [b'a'] - False

Stream b'a\n\n\n': [b'a\n', b'\n', b'\n']
    split_manually: [b'a\n', b'\n', b'\n'] - True
    split_by_splitlines: [b'a', b'', b''] - False

Stream b'a\n\n\nb': [b'a\n', b'\n', b'\n', b'b']
    split_manually: [b'a\n', b'\n', b'\n', b'b'] - True
    split_by_splitlines: [b'a', b'', b'', b'b'] - False

Stream b'abc\r\r\ndef': [b'abc\r\r\n', b'def']
    split_manually: [b'abc\r\r\n', b'def'] - True
    split_by_splitlines: [b'abc', b'', b'def'] - False

Stream b'a\rb\nc\r\nd\r\re\n\nf\n\rg\r\r\n\n\rh': [b'a\rb\n', b'c\r\n', b'd\r\re\n', b'\n', b'f\n', b'\rg\r\r\n', b'\n', b'\rh']
    split_manually: [b'a\rb\n', b'c\r\n', b'd\r\re\n', b'\n', b'f\n', b'\rg\r\r\n', b'\n', b'\rh'] - True
    split_by_splitlines: [b'a', b'b', b'c', b'd', b'', b'e', b'', b'f', b'', b'g', b'', b'', b'', b'h'] - False

Done.
Notes:
- The current run is on Win, but I got the same results on Linux and OSX (various Python 3 versions)
- Performance:
    - I didn't try anything, as I don't know what you're after (of course, speed is the key, but I don't have a comparison baseline)
    - It might be data dependent
    - From the memory consumption PoV it's (a bit) bad, since split builds the whole list up front (this might negatively affect speed as well)
Source: https://stackoverflow.com/questions/65765343/splitlines-and-iterating-over-an-opened-file-give-different-results