Question
I am trying to split a text that uses a mix of newline characters: LF, CRLF and NEL. I need the best way to keep the NEL character from being treated as a line break.
Is there an option to instruct readlines() to exclude NEL while splitting lines? I may be able to read() the whole file and match only LF and CRLF split points in a loop.
Is there any better solution?
I use codecs.open() to open the UTF-8 text file.
When I use readlines(), it does split at NEL characters.
The file contents are:
"u'Line 1 \\x85 Line 1.1\\r\\nLine 2\\r\\nLine 3\\r\\n'"
Answer 1:
file.readlines() will only ever split on \n, \r or \r\n, depending on the OS and on whether universal newline support is enabled.
U+0085 NEXT LINE (NEL) is not recognised as a line separator in that context, and you don't need to do anything special to have file.readlines() ignore it.
Quoting the open() function documentation:
Python is usually built with universal newlines support; supplying 'U' opens the file as a text file, but lines may be terminated by any of the following: the Unix end-of-line convention '\n', the Macintosh convention '\r', or the Windows convention '\r\n'. All of these external representations are seen as '\n' by the Python program. If Python is built without universal newlines support a mode with 'U' is the same as normal text mode. Note that file objects so opened also have an attribute called newlines which has a value of None (if no newlines have yet been seen), '\n', '\r', '\r\n', or a tuple containing all the newline types seen.
and the universal newlines glossary entry:
A manner of interpreting text streams in which all of the following are recognized as ending a line: the Unix end-of-line convention '\n', the Windows convention '\r\n', and the old Macintosh convention '\r'. See PEP 278 and PEP 3116, as well as str.splitlines() for an additional use.
Unfortunately, codecs.open() breaks with this rule; the documentation only vaguely hints that line splitting is left to the specific codec in use:
Line-endings are implemented using the codec’s decoder method and are included in the list entries if keepends is true.
Instead of codecs.open(), use io.open() to open the file in the correct encoding, then process the lines one by one:
import io
with io.open(filename, encoding=correct_encoding) as f:
    lines = f.readlines()
io is the new I/O infrastructure that replaces the Python 2 system entirely in Python 3. It handles just \n, \r and \r\n:
>>> open('/tmp/test.txt', 'wb').write(u'Line 1 \x85 Line 1.1\r\nLine 2\r\nLine 3\r\n'.encode('utf8'))
>>> import codecs
>>> codecs.open('/tmp/test.txt', encoding='utf8').readlines()
[u'Line 1 \x85', u' Line 1.1\r\n', u'Line 2\r\n', u'Line 3\r\n']
>>> import io
>>> io.open('/tmp/test.txt', encoding='utf8').readlines()
[u'Line 1 \x85 Line 1.1\n', u'Line 2\n', u'Line 3\n']
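Note that the io.open() output above shows \r\n translated to \n. If the original line endings need to be preserved, passing newline='' to io.open() keeps the same three split points but returns the endings untranslated. A minimal sketch, assuming a throwaway temporary file created only for this illustration:

```python
import io
import os
import tempfile

data = u'Line 1 \x85 Line 1.1\r\nLine 2\r\nLine 3\r\n'

# Write the sample to a temporary file (illustrative path, not from the question).
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as f:
    f.write(data.encode('utf8'))

try:
    # newline='' still splits only at \n, \r and \r\n,
    # but leaves the line endings as they are in the file.
    with io.open(path, encoding='utf8', newline='') as f:
        lines = f.readlines()
finally:
    os.remove(path)

print(lines)
```

The NEL character stays inside the first line, and each entry keeps its \r\n ending.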
The codecs.open() result comes from the underlying code using str.splitlines(), which has a documentation bug: when splitting a unicode string, it splits on anything that the Unicode standard deems to be a line break (which is quite a complex issue). The documentation for this method falls short of explaining this; it claims to split only according to the universal newlines rules.
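If the text is already in memory as a unicode string, the read()-and-match approach from the question can be sketched with a regular expression that matches only LF and CRLF. The helper name split_lf_crlf is hypothetical, and dropping the line endings is a choice made here for illustration, not something the question specifies:

```python
import re

# Matches only CRLF or LF; NEL (U+0085) and other Unicode line breaks are left alone.
_LF_OR_CRLF = re.compile(r'\r\n|\n')


def split_lf_crlf(text):
    """Split text at LF and CRLF only, dropping the line endings (hypothetical helper)."""
    lines = _LF_OR_CRLF.split(text)
    # A trailing newline produces an empty final element; drop it.
    if lines and lines[-1] == '':
        lines.pop()
    return lines


sample = u'Line 1 \x85 Line 1.1\r\nLine 2\r\nLine 3\r\n'
print(sample.splitlines())    # also splits at NEL
print(split_lf_crlf(sample))  # NEL stays inside the first line
```

For large files, iterating over a file opened with io.open() is likely the better fit, since this sketch needs the whole text in memory.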
Source: https://stackoverflow.com/questions/27807519/python-restrict-newline-characters-for-readlines