Extract email sub-strings from large document

星月不相逢 2020-11-28 06:54

I have a very large .txt file with hundreds of thousands of email addresses scattered throughout. They all take the format:

......


        
11 Answers
  • 2020-11-28 07:12
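    import re

    mess = '''Jawadahmed@gmail.com Ahmed@gmail.com
                abc@gmail'''
    # Escape the dot so that "." matches only a literal period
    email = re.compile(r'[\w.-]+@gmail\.com')
    result = email.findall(mess)

    # findall returns a list (possibly empty), never None
    if result:
        print(result)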
    

    The code above extracts only the Gmail addresses from the text; the incomplete abc@gmail is not matched.

  • 2020-11-28 07:13

    Example: if the address uses only lowercase letters (a-z), digits (0-9), and at most one _ or . in the part before the @, the regex below matches it:

    >>> import re
    >>> str1 = "abcdef_12345@gmail.com"
    >>> regex1 = r"^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}$"
    >>> re_com = re.compile(regex1)
    >>> re_match = re_com.search(str1)
    >>> re_match
    <_sre.SRE_Match object at 0x1063c9ac0>
    >>> re_match.group(0)
    'abcdef_12345@gmail.com'
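Because that pattern is anchored with ^ and $, it validates a string that contains nothing but an address; it will not find addresses embedded in a larger document. A minimal sketch of the difference follows. The unanchored variant, which swaps the anchors for \b word boundaries, is an assumption added for illustration and is not part of the answer above.

```python
import re

# Same pattern as in the answer above, written as a raw string.
ANCHORED = re.compile(r"^[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}$")

# Hypothetical variant for scanning text: anchors replaced by word boundaries.
UNANCHORED = re.compile(r"\b[a-z0-9]+[\._]?[a-z0-9]+[@]\w+[.]\w{2,3}\b")

text = "write to abcdef_12345@gmail.com today"

# The whole string is not an address, so the anchored pattern finds nothing.
validated = ANCHORED.search(text)

# The unanchored pattern picks the address out of the surrounding words.
extracted = UNANCHORED.findall(text)

print(validated)
print(extracted)
```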
    
  • 2020-11-28 07:17
    import re
    txt = 'hello from absc@gmail.com to par1@yahoo.com about the meeting @2PM'
    # \S+ needs at least one non-space character on each side of "@", so "@2PM" is skipped
    email = re.findall(r'\S+@\S+', txt)
    print(email)
    

    Printed output:

    ['absc@gmail.com', 'par1@yahoo.com']
    
  • 2020-11-28 07:23

    If you're looking for a specific domain:

    >>> import re
    >>> text = "this is an email la@test.com, it will be matched, x@y.com will not, and test@test.com will"
    >>> match = re.findall(r'[\w.+%-]+@test\.com', text) # replace test\.com with the domain you're looking for, adding a backslash before periods; the hyphen goes last in the class so it is not read as a range
    >>> match
    ['la@test.com', 'test@test.com']
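If the domain arrives as a variable, re.escape can add the backslashes for you. The sketch below is one possible way to wrap this up; the helper name find_emails_for_domain and the trailing lookahead (which rejects longer domains such as test.com.au) are assumptions for illustration, not part of the answer above.

```python
import re

def find_emails_for_domain(text, domain):
    """Return the addresses in `text` whose domain is exactly `domain`."""
    pattern = re.compile(
        r"[\w.+%-]+@"            # local part; hyphen last so it is a literal
        + re.escape(domain)      # escapes the periods in the domain
        + r"(?![\w-]|\.\w)"      # not followed by more domain characters
    )
    return pattern.findall(text)

text = ("this is an email la@test.com, it will be matched, "
        "x@y.com will not, and test@test.com will")
print(find_emails_for_domain(text, "test.com"))
```

A sentence-ending period after the address is still accepted, because the lookahead only rejects a period that is followed by another word character.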
    
  • 2020-11-28 07:24

    This code extracts the email addresses in a string. Use it while reading the file line by line.

    >>> import re
    >>> line = "should we use regex more often? let me know at  321dsasdsa@dasdsa.com.lol"
    >>> match = re.search(r'[\w\.-]+@[\w\.-]+', line)
    >>> match.group(0)
    '321dsasdsa@dasdsa.com.lol'
    

    If you have several email addresses use findall:

    >>> line = "should we use regex more often? let me know at  321dsasdsa@dasdsa.com.lol or dadaads@dsdds.com"
    >>> match = re.findall(r'[\w\.-]+@[\w\.-]+', line)
    >>> match
    ['321dsasdsa@dasdsa.com.lol', 'dadaads@dsdds.com']
    

    The regex above probably finds the most common non-fake email addresses. If you want to be fully aligned with RFC 5322, check which address forms the specification allows, so that valid addresses are not missed or matched incorrectly.


    Edit: as suggested in a comment by @kostek: in the string "Contact us at support@example.com." my regex returns "support@example.com." (with the dot at the end). To avoid this, use [\w\.,]+@[\w\.,]+\.\w+

    Edit II: another wonderful improvement was mentioned in the comments: [\w\.-]+@[\w\.-]+\.\w+ which will capture example@do-main.com as well.
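For a very large file, the line-by-line advice can be sketched as below, using the pattern from Edit II. The function names iter_emails and extract_unique_emails, the UTF-8 default, and the de-duplication step are assumptions for illustration; this is not a full RFC 5322 matcher.

```python
import re

# Pattern from "Edit II"; it drops a trailing sentence period.
EMAIL_RE = re.compile(r"[\w\.-]+@[\w\.-]+\.\w+")

def iter_emails(lines):
    """Yield every address found in an iterable of text lines, in order."""
    for line in lines:
        yield from EMAIL_RE.findall(line)

def extract_unique_emails(path, encoding="utf-8"):
    """Read `path` one line at a time and return distinct addresses in first-seen order."""
    # errors="replace" keeps the scan going past undecodable bytes
    with open(path, encoding=encoding, errors="replace") as handle:
        # dict.fromkeys removes duplicates while preserving insertion order
        return list(dict.fromkeys(iter_emails(handle)))

sample = [
    "Contact us at support@example.com.",
    "cc example@do-main.com and support@example.com",
]
print(list(iter_emails(sample)))
```

Iterating over the file object reads one line at a time, so the document itself never has to fit in memory; only the set of distinct addresses is held.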
