I'm trying to extract email addresses from plain text transcripts of emails. I've cobbled together a bit of code to find the addresses themselves, but I don't know how to mak
"[stuff]@[stuff][stuff1-4 letters]" is about right, but you can also decode the regular expression with a trick I just found out about: pass the re.DEBUG flag (numeric value 128) to compile() in an interactive Python session, like this:
mailsrch = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}', re.DEBUG)
It will print out the following:
in
  category category_word
  literal 45
max_repeat 1 65535
  in
    category category_word
    literal 45
    literal 46
literal 64
in
  category category_word
  literal 45
max_repeat 1 65535
  in
    category category_word
    literal 45
    literal 46
max_repeat 1 4
  in
    range (97, 122)
    range (65, 90)
Once you get used to the notation, this shows you exactly how the RE works.
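As a minimal sketch of the same trick in script form: the flag can be spelled re.DEBUG instead of 128, and the compiled pattern still behaves like any other. The sample string is made up for illustration, and the exact layout of the debug dump varies between Python versions.

```python
import re

# re.DEBUG has the numeric value 128, so this is the same call as above.
# Compiling with it prints the parsed pattern tree to stdout.
mailsrch = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}', re.DEBUG)

# Hypothetical sample text, not taken from the original question.
sample = "write to jane.doe@example.com for details"

# The debug flag only affects compilation; matching works as usual.
match = mailsrch.search(sample)
print(match.group(0))
```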
mailsrch = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')
Expression breakdown:
[\w\-] : any word character (alphanumeric or underscore) or a dash
[\w\-\.]+ : any word character, dash, or period/dot, one or more times
@ : a literal @ symbol
[\w\-][\w\-\.]+ : any word character or dash, followed by one or more word characters, dashes, or periods
[a-zA-Z]{1,4} : any ASCII letter, 1 to 4 times
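To see the breakdown in action, here is a small sketch that runs the pattern over a made-up sentence (the sample text and addresses are assumptions for illustration only).

```python
import re

mailsrch = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')

# Hypothetical sample text containing three address-like strings.
text = "Contact john.doe@example.com or a@b.com, also x_y-z@mail.example.org today."

# With no capture groups, findall returns the full text of each match.
found = mailsrch.findall(text)
print(found)
# → ['john.doe@example.com', 'x_y-z@mail.example.org']
```

Note that a@b.com is skipped: the pattern needs at least two characters before the @, one for [\w\-] and at least one for [\w\-\.]+.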
To make this match only addresses that appear on lines starting with From: and are wrapped in < and > symbols:
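import re
# Capture only the address inside <...> on lines that start with "From:".
mailsrch = re.compile(r'^From:\s+.*<([\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4})>', re.I | re.M)
# Read the whole transcript; the with block closes the file afterwards.
with open('text.txt') as f:
    foundemail = mailsrch.findall(f.read())
print(foundemail)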
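The same pattern can be tried without a file. The sketch below uses an in-memory transcript whose names and addresses are invented for illustration. It shows re.M letting ^ match at the start of every line and re.I accepting a lowercase "from:".

```python
import re

mailsrch = re.compile(
    r'^From:\s+.*<([\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4})>',
    re.I | re.M,
)

# Hypothetical transcript; only the two From lines should contribute.
transcript = """From: Alice Example <alice@example.com>
To: Bob <bob@example.org>
Subject: hello

from: carol <carol.smith@mail.example.net>
Reply to dave@example.com
"""

# With one capture group, findall returns just the captured addresses.
senders = mailsrch.findall(transcript)
print(senders)
# → ['alice@example.com', 'carol.smith@mail.example.net']
```

The To: address and the bare address in the last line are ignored because neither sits inside < > on a line beginning with From:.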