I have a very large .txt file with hundreds of thousands of email addresses scattered throughout. They all take the format:
......
import re
reg_pat = r'\S+@\S+\.\S+'
test_text = 'xyz.byc@cfg-jj.com ir_er@cu.co.kl uiufubvcbuw bvkw ko@com m@urice'
emails = re.findall(reg_pat, test_text, re.IGNORECASE)
print(emails)
Output:
['xyz.byc@cfg-jj.com', 'ir_er@cu.co.kl']
import re
rgx = r'(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(@|[ ]?\(?[ ]?(at|AT)[ ]?\)?[ ]?)(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])'
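# `text` is the string to search, e.g. the contents of the .txt file
matches = re.findall(rgx, text)
# the pattern has several groups, so findall returns tuples; the first element is the full address
emails = [m[0] for m in matches]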
Please don't hate me for having a go at this infamous regex. The regex works for a decent portion of email addresses shown below. I mostly used this as my basis for the valid chars in an email address.
Feel free to play around with it here
I also made a variation where the regex captures emails like name at example.com:
(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(@|[ ]\(?[ ]?(at|AT)[ ]?\)?[ ])(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])
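As a quick check of the variation, here is a minimal sketch that runs it over a made-up sentence. The sample text and the names AT_VARIANT, sample and found are my own placeholders, not part of the answer above. Note that the pattern requires one non-word character after the address, so an address at the very end of the input is skipped unless the text ends with punctuation or whitespace.

```python
import re

# The variation from above, unchanged; the spaces around "at" are mandatory here.
AT_VARIANT = r'(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(@|[ ]\(?[ ]?(at|AT)[ ]?\)?[ ])(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])'

# Hypothetical sample text, for illustration only.
sample = "Contact name at example.com or bob@example.org for details."

# findall returns one tuple per match because the pattern has three groups;
# the first element of each tuple is the whole address.
found = [groups[0] for groups in re.findall(AT_VARIANT, sample)]
print(found)  # ['name at example.com', 'bob@example.org']
```

Because a plain "at" between two words is accepted, an ordinary phrase such as "look at example.com" would be reported as well, so the results probably need a manual check.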
Here's another approach for this specific problem, with a regex from emailregex.com:
import re
text = "blabla <hello@world.com>><123@123.at> <huhu@fake> bla bla <myname@some-domain.pt>"
# 1. find all potential email addresses (note: < inside <> is a problem)
matches = re.findall(r'<\S+?>', text) # ['<hello@world.com>', '<123@123.at>', '<huhu@fake>', '<myname@some-domain.pt>']
# 2. apply email regex pattern to string inside <>
emails = [ x[1:-1] for x in matches if re.match(r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)", x[1:-1]) ]
print(emails) # ['hello@world.com', '123@123.at', 'myname@some-domain.pt']
import re
with open("file_name", 'r') as f:
    s = f.read()
result = re.findall(r'\S+@\S+', s)
for r in result:
    print(r)
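Since the file in the question is very large, reading it all at once with f.read() may use a lot of memory. Here is a minimal sketch of a line-by-line alternative, assuming no address is split across two lines. The helper name iter_emails is hypothetical, and io.StringIO stands in for the real file so the snippet runs on its own.

```python
import io
import re

EMAIL_RE = re.compile(r'\S+@\S+')  # same pattern as above, compiled once

def iter_emails(lines):
    """Yield every match from an iterable of lines (e.g. an open file)."""
    for line in lines:
        # findall returns all non-overlapping matches in this line
        yield from EMAIL_RE.findall(line)

# io.StringIO stands in for an open text file here (illustration only).
fake_file = io.StringIO("first line a@b.com\nnothing here\nx@y.org and z@w.net\n")
emails = list(iter_emails(fake_file))
print(emails)  # ['a@b.com', 'x@y.org', 'z@w.net']
```

With a real file, the object returned by open("file_name") can be passed to iter_emails directly, because iterating over a text file yields one line at a time.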
You can add \b at the end of the pattern to mark where the email address ends, so that trailing punctuation is not captured.
The regex
[\w\.\-]+@[\w\-\.]+\b
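To see what the trailing \b changes, here is a small sketch that compares the pattern with and without it. The sample sentence is made up for illustration. Because . is in the character class, the version without \b also swallows the full stop that ends the sentence.

```python
import re

# Hypothetical sample: the address is followed by a sentence-ending full stop.
sample = "Write to john.doe@example.com. Thanks"

without_b = re.findall(r'[\w\.\-]+@[\w\-\.]+', sample)
with_b = re.findall(r'[\w\.\-]+@[\w\-\.]+\b', sample)

print(without_b)  # ['john.doe@example.com.']
print(with_b)     # ['john.doe@example.com']
```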
You can also use the following to find all the email addresses in a text and print them either as a list or one per line.
import re
line = "why people don't know what regex are? let me know asdfal2@als.com, Users1@gmail.de " \
       "Dariush@dasd-asasdsa.com.lo,Dariush.lastName@someDomain.com"
match = re.findall(r'[\w\.-]+@[\w\.-]+', line)
for i in match:
    print(i)
re.findall already returns a list, so if you want the addresses as a list, just print match:
# this will print the list
print(match)