Extract email sub-strings from large document

星月不相逢 2020-11-28 06:54

I have a very large .txt file with hundreds of thousands of email addresses scattered throughout. They all take the format:

......


        
11 Answers
  • 2020-11-28 07:02
    import re
    
    reg_pat = r'\S+@\S+\.\S+'
    
    test_text = 'xyz.byc@cfg-jj.com    ir_er@cu.co.kl   uiufubvcbuw bvkw  ko@com    m@urice'   
    
    emails = re.findall(reg_pat, test_text, re.IGNORECASE)
    print(emails)
    

    Output:

    ['xyz.byc@cfg-jj.com', 'ir_er@cu.co.kl']
    
  • 2020-11-28 07:05
    import re
    # `text` is assumed to hold the document contents as a single string.
    rgx = r'(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(@|[ ]?\(?[ ]?(at|AT)[ ]?\)?[ ]?)(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])'
    matches = re.findall(rgx, text)
    # findall returns one tuple of groups per match; the whole address is group 1.
    emails = [m[0] for m in matches]
    

    Please don't hate me for having a go at this infamous regex. The regex works for a decent portion of email addresses shown below. I mostly used this as my basis for the valid chars in an email address.

    Feel free to play around with it here

    I also made a variation where the regex captures emails like name at example.com

    (?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(@|[ ]\(?[ ]?(at|AT)[ ]?\)?[ ])(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])
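
    As a rough usage sketch of the first pattern in this answer (the sample sentence and the find_emails helper are made up for illustration): the trailing (?:[^\w]) needs one non-word character after the address, so an address at the very end of the text is skipped unless the input is padded. Padding with a single space is just one possible workaround.

```python
import re

# First pattern from this answer, copied unchanged.
rgx = r'(?:\.?)([\w\-_+#~!$&\'\.]+(?<!\.)(@|[ ]?\(?[ ]?(at|AT)[ ]?\)?[ ]?)(?<!\.)[\w]+[\w\-\.]*\.[a-zA-Z-]{2,3})(?:[^\w])'

def find_emails(text):
    """Return the first capture group (the whole address) of every match."""
    # Assumption: appending one space lets an address at the very end match.
    return [m[0] for m in re.findall(rgx, text + ' ')]

# Hypothetical sample text, not taken from the original question.
sample = "mail bob@example.com or jane at example.org today"
print(find_emails(sample))

# Without padding, nothing follows the address, so the pattern finds nothing.
print(re.findall(rgx, "bob@example.com"))
print(find_emails("bob@example.com"))
```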
    
  • 2020-11-28 07:06

    Here's another approach for this specific problem, with a regex from emailregex.com:

    import re

    text = "blabla <hello@world.com>><123@123.at> <huhu@fake> bla bla <myname@some-domain.pt>"
    
    # 1. find all potential email addresses (note: < inside <> is a problem)
    matches = re.findall(r'<\S+?>', text)  # ['<hello@world.com>', '<123@123.at>', '<huhu@fake>', '<myname@some-domain.pt>']
    
    # 2. apply email regex pattern to string inside <>
    emails = [x[1:-1] for x in matches if re.match(r"(^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$)", x[1:-1])]
    print(emails)   # ['hello@world.com', '123@123.at', 'myname@some-domain.pt']
    
  • 2020-11-28 07:06
    import re
    with open("file_name",'r') as f:
        s = f.read()
        result = re.findall(r'\S+@\S+',s)
        for r in result:
            print(r)
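
    Since the question mentions a very large file, a line-by-line variant avoids reading everything into memory at once. This is only a sketch: the pattern, the helper names iter_emails and extract_unique_emails, and the UTF-8 encoding are assumptions, not something the question specifies.

```python
import re

# Assumed pattern, in the spirit of the simple patterns in this thread.
EMAIL_RE = re.compile(r'[\w.-]+@[\w.-]+\.\w+')

def iter_emails(lines):
    """Yield every email-like substring from an iterable of text lines."""
    for line in lines:
        yield from EMAIL_RE.findall(line)

def extract_unique_emails(path):
    """Stream the file at `path` and return its distinct addresses, sorted."""
    # A file object iterates line by line, so only one line is held at a time.
    with open(path, encoding='utf-8', errors='replace') as f:
        return sorted(set(iter_emails(f)))

# Small in-memory demo; any iterable of lines works the same way as a file.
demo_lines = ["a@b.com and c@d.org\n", "no email here\n", "x.y@z-w.co.uk,\n"]
print(list(iter_emails(demo_lines)))
```

    This works because none of the characters the pattern accepts is whitespace, so a match can never span two lines.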
    
  • 2020-11-28 07:08

    You can add \b at the end of the pattern so that the match stops at a word boundary, which marks where the email address ends.

    The regex

    [\w\.\-]+@[\w\-\.]+\b
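
    To see what the trailing \b changes, here is a small comparison on a made-up sentence (the sample string is an assumption for illustration). Without \b, the character class happily swallows a sentence-ending period; with it, the match has to end on a word character.

```python
import re

# Hypothetical sample: the second address is followed by a full stop.
sample = "Contact alice@example.com, or bob@test.org."

with_boundary = re.findall(r'[\w\.\-]+@[\w\-\.]+\b', sample)
without_boundary = re.findall(r'[\w\.\-]+@[\w\-\.]+', sample)

print(with_boundary)
print(without_boundary)
```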
    
  • 2020-11-28 07:11

    You can also use the following to find all the email addresses in a text and print them either as a list or one per line.

    import re
    line = "why people don't know what regex are? let me know asdfal2@als.com, Users1@gmail.de " \
           "Dariush@dasd-asasdsa.com.lo,Dariush.lastName@someDomain.com"
    match = re.findall(r'[\w\.-]+@[\w\.-]+', line)
    for i in match:
        print(i)
    

    If you want the addresses as a list, match already is one, so you can print it directly:

    # this will print the list
    print(match)
    