Question
I'm writing a small script to run through large folders of copyright notice emails and finding relevant information (IP and timestamp). I've already found ways around a few little formatting hurdles (sometimes IP and TS are on different lines, sometimes on same, sometimes in different places, timestamps come in 4 different formats, etc.).
I ran into one weird problem where a few of the files I'm parsing through spew out weird characters in the middle of a line, ruining my parsing of readline() returns. When reading in a text editor, the line in question looks normal, but readline() reads an '=' and two '\n' characters right smack in the middle of an IP.
e.g.
Normal return from readline():
"IP Address: xxx.xxx.xxx.xxx"
Broken readline() return:
"IP Address: xxx.xxx.xxx="
The next two lines after that being:
""
".xxx"
Any idea how I could get around this? I don't really have control over what problem could be causing this, I just kind of need to deal with it without getting too crazy.
Relevant function, for reference (I know it's a mess):
def getIP(em):
    ce = codecs.open(em, encoding='latin1')
    iplabel = ""
    while not ("Torrent Hash Value: " in iplabel):
        iplabel = ce.readline()

    ipraw = ce.readline()
    if ("File Size" in ipraw):
        ipraw = ce.readline()

    ip = re.findall( r'[0-9]+(?:\.[0-9]+){3}', ipraw)
    if ip:
        return ip[0]
        ce.close()
    else:
        ipraw = ce.readline()
        ip = re.findall( r'[0-9]+(?:\.[0-9]+){3}', ipraw)
        if ip:
            return ip[0]
            ce.close()
        else:
            return ("No IP found in: " + ipraw)
            ce.close()
Answer 1:
It seems likely that at least some of the emails that you are processing have been encoded as quoted-printable.
This encoding is used to make 8-bit character data transportable over 7-bit (ASCII-only) systems, but it also limits encoded lines to a maximum of 76 characters. Longer lines are wrapped by inserting a soft line break consisting of "=" followed by the end of line marker.
Python provides the quopri module to handle encoding and decoding from quoted-printable. Decoding your data from quoted-printable will remove these soft line breaks.
As an example, let's use the first paragraph of your question.
>>> import quopri
>>> s = """I'm writing a small script to run through large folders of copyright notice emails and finding relevant information (IP and timestamp). I've already found ways around a few little formatting hurdles (sometimes IP and TS are on different lines, sometimes on same, sometimes in different places, timestamps come in 4 different formats, etc.)."""
>>> # Encode to latin-1 as quopri deals with bytes, not strings.
>>> bs = s.encode('latin-1')
>>> # Encode
>>> encoded = quopri.encodestring(bs)
>>> # Observe the "=\n" inserted into the text.
>>> encoded
b"I'm writing a small script to run through large folders of copyright notice=\n emails and finding relevant information (IP and timestamp). I've already f=\nound ways around a few little formatting hurdles (sometimes IP and TS are o=\nn different lines, sometimes on same, sometimes in different places, timest=\namps come in 4 different formats, etc.)."
>>> # Printing without decoding from quoted-printable shows the "=".
>>> print(encoded.decode('latin-1'))
I'm writing a small script to run through large folders of copyright notice=
emails and finding relevant information (IP and timestamp). I've already f=
ound ways around a few little formatting hurdles (sometimes IP and TS are o=
n different lines, sometimes on same, sometimes in different places, timest=
amps come in 4 different formats, etc.).
>>> # Decode from quoted-printable to remove soft line breaks.
>>> print(quopri.decodestring(encoded).decode('latin-1'))
I'm writing a small script to run through large folders of copyright notice emails and finding relevant information (IP and timestamp). I've already found ways around a few little formatting hurdles (sometimes IP and TS are on different lines, sometimes on same, sometimes in different places, timestamps come in 4 different formats, etc.).
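Applied to the kind of broken line shown in the question, the same decoding stitches the address back together. This is only a minimal sketch: the address 192.0.2.17 is a made-up placeholder, and the sample bytes assume the soft break is "=" followed directly by a line ending, as described above.

```python
import quopri

# Hypothetical sample: an IP address split by a quoted-printable soft line break.
broken = b"IP Address: 192.0.2=\n.17\n"

# decodestring() works on bytes and removes the "=" together with the
# line ending that follows it.
fixed = quopri.decodestring(broken).decode('latin-1')
print(repr(fixed))
```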
To decode correctly, the entire message body needs to be processed, which conflicts with your approach using readline. One way around this is to load the decoded string into a buffer:
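import io
import quopri

def getIP(em):
    # Read the raw bytes and decode the whole file from quoted-printable first.
    with open(em, 'rb') as f:
        bs = f.read()
    decoded = quopri.decodestring(bs).decode('latin-1')

    # Wrap the decoded text so the existing readline() logic still works.
    ce = io.StringIO(decoded)
    iplabel = ""
    while not ("Torrent Hash Value: " in iplabel):
        iplabel = ce.readline()
    ...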
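Once the whole body is decoded up front, the line-by-line bookkeeping may not be needed at all. The following is a rough sketch of that idea, not a drop-in replacement: `find_ip`, `IP_PATTERN`, `LABEL` and the sample bytes are hypothetical names and data made up for illustration, and it assumes the input really is quoted-printable (a file that is not, but happens to contain "=" followed by two hex digits, would be altered by the decoding).

```python
import quopri
import re

# Same IPv4-looking pattern as in the question.
IP_PATTERN = re.compile(r'[0-9]+(?:\.[0-9]+){3}')
LABEL = "Torrent Hash Value: "

def find_ip(raw_bytes):
    """Return the first IP-like string after the label line, or None."""
    # Decode the whole body at once so soft line breaks are removed
    # before any searching happens.
    text = quopri.decodestring(raw_bytes).decode('latin-1')
    start = text.find(LABEL)
    if start == -1:
        return None
    # Skip the rest of the label line, then search everything after it.
    line_end = text.find("\n", start)
    if line_end == -1:
        return None
    match = IP_PATTERN.search(text, line_end)
    return match.group(0) if match else None

# Hypothetical message body with the IP split across a soft line break.
sample = (
    b"Torrent Hash Value: 0123456789abcdef\r\n"
    b"File Size: 700 MB\r\n"
    b"IP Address: 192.0.2=\r\n"
    b".17\r\n"
)
print(find_ip(sample))
```

Unlike the original function, this searches all text after the label line rather than only the next two or three lines, so it could pick up an unrelated address further down the message; whether that is acceptable depends on how the notices are laid out.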
If your files contain complete emails - including headers - then using the tools in the email module will handle this decoding automatically.
import email
from email import policy

with open('message.eml') as f:
    s = f.read()
msg = email.message_from_string(s, policy=policy.default)
body = msg.get_content()
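To see the email module undo the soft line break, here is a small self-contained sketch. The message bytes are invented for illustration (headers, addresses and the IP are all placeholders). It parses from bytes rather than a string so no file encoding has to be guessed, and it uses get_body() to pick the plain-text part, since calling get_content() directly on a multipart message may fail.

```python
import email
from email import policy

# Hypothetical raw message: a quoted-printable text/plain body in which
# the IP address is split by a soft line break.
raw = (
    b"From: sender@example.com\r\n"
    b"Subject: Notice\r\n"
    b"MIME-Version: 1.0\r\n"
    b"Content-Type: text/plain; charset=\"iso-8859-1\"\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"IP Address: 192.0.2=\r\n"
    b".17\r\n"
)

msg = email.message_from_bytes(raw, policy=policy.default)

# get_body() returns the best matching text part, or None if there is none.
part = msg.get_body(preferencelist=('plain',))
body = part.get_content() if part is not None else ""
print(body)
```

For files on disk, the same approach should work by opening the file in 'rb' mode and passing the bytes to email.message_from_bytes.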
Answer 2:
Solved. If anyone else has a similar problem: save each line as a string, join the lines together, and strip the "=" line breaks out with re.sub(), keeping in mind that the break may contain \r as well as \n characters. My solution is a bit spaghetti, but it avoids running the extra regex on every file:
def getIP(em):
    ce = codecs.open(em, encoding='latin1')
    iplabel = ""
    while not ("Torrent Hash Value: " in iplabel):
        iplabel = ce.readline()

    ipraw = ce.readline()
    if ("File Size" in ipraw):
        ipraw = ce.readline()

    ip = re.findall( r'[0-9]+(?:\.[0-9]+){3}', ipraw)
    if ip:
        return ip[0]
        ce.close()
    else:
        ipraw2 = ce.readline() #made this a new var
        ip = re.findall( r'[0-9]+(?:\.[0-9]+){3}', ipraw2)
        if ip:
            return ip[0]
            ce.close()
        else:
            ipraw = ipraw + ipraw2 #Added this section
            ipraw = re.sub(r'(=\r*\n)', '', ipraw) #
            ip = re.findall( r'[0-9]+(?:\.[0-9]+){3}', ipraw)
            if ip:
                return ip[0]
                ce.close()
            else:
                return ("No IP found in: " + ipraw)
                ce.close()
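The join-and-substitute idea can be isolated into a small helper. This is just a sketch of the technique under the same assumption as the code above (the break is "=" followed by optional \r characters and a \n); `ip_from_lines`, `SOFT_BREAK`, `IP_PATTERN` and the sample lines are hypothetical names and data, not part of the original script.

```python
import re

# "=" followed by optional carriage returns and a newline, as in the regex above.
SOFT_BREAK = re.compile(r'=\r*\n')
# Same IPv4-looking pattern as in the question.
IP_PATTERN = re.compile(r'[0-9]+(?:\.[0-9]+){3}')

def ip_from_lines(lines):
    """Join the given lines, drop soft line breaks, and return the first IP or None."""
    joined = SOFT_BREAK.sub('', ''.join(lines))
    match = IP_PATTERN.search(joined)
    return match.group(0) if match else None

# Hypothetical readline() results for an address split across two lines.
lines = ["IP Address: 192.0.2=\r\n", ".17\r\n"]
print(ip_from_lines(lines))
```

Note that this only removes the soft line breaks; other quoted-printable escapes (for example "=3D" for a literal "=") are left untouched, which is where decoding with quopri goes further.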
Source: https://stackoverflow.com/questions/55288102/stripping-out-unwanted-characters-that-are-breaking-readline