Parsing “From” addresses from email text

春和景丽 2021-02-19 03:56

I'm trying to extract email addresses from plain text transcripts of emails. I've cobbled together a bit of code to find the addresses themselves, but I don't know how to mak

8 answers
  • 2021-02-19 04:50

    "[stuff]@[stuff][1-4 letters]" is about right as a reading of the pattern, but if you want to, you can have Python decode the regular expression for you, using a trick I just found out about: pass the debug flag (re.DEBUG, whose integer value is 128) as the second argument to compile(). Do the compile() in an interactive Python session like this:

    mailsrch = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}', re.DEBUG)
    

    It will print out the following:

    in 
      category category_word
      literal 45
    max_repeat 1 65535 
      in 
        category category_word
        literal 45
        literal 46
    literal 64 
    in 
      category category_word
      literal 45
    max_repeat 1 65535 
      in 
        category category_word
        literal 45
        literal 46
    max_repeat 1 4 
      in 
        range (97, 122)
        range (65, 90)
    

    Which, if you can kind of get used to it, shows you exactly how the RE works.
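For reference, here is a minimal sketch of the same trick outside an interactive session. It assumes the debug dump is written to standard output, so that it can be captured with contextlib.redirect_stdout. The names pattern, buffer and dump are illustrative only. The exact layout of the dump varies between Python versions (recent Python 3 releases print the opcode names in upper case rather than the lower case shown above), so treat the listing above as indicative.

```python
import contextlib
import io
import re

pattern = r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}'

# re.DEBUG is the named form of the integer flag 128.
assert int(re.DEBUG) == 128

# Compiling with re.DEBUG prints the parsed pattern; capture it in a string
# (assumption: the dump goes to sys.stdout).
buffer = io.StringIO()
with contextlib.redirect_stdout(buffer):
    mailsrch = re.compile(pattern, re.DEBUG)

dump = buffer.getvalue()
print(dump)

# The compiled object is an ordinary pattern and can be used as usual.
print(mailsrch.findall("write to someone@example.com"))
```

The "literal 64" entry in the dump is the @ sign (character code 64), which is a quick way to find the split between the local part and the domain.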

  • 2021-02-19 04:52
    mailsrch = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')
    

    Expression breakdown:

    [\w\-]: any word character (alphanumeric, plus underscore) or a dash

    [\w\-\.]+: any word character, a dash, or a period/dot, one or more times

    @: literal @ symbol

    [\w\-][\w\-\.]+: any word character or dash, followed by one or more word characters, dashes, or periods

    [a-zA-Z]{1,4}: any ASCII letter, one to four times
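As a quick check of the breakdown above, the following sketch runs the pattern over a made-up string. The addresses and the variable names sample and matches are placeholders, not taken from the question.

```python
import re

# Same pattern as in the answer above.
mailsrch = re.compile(r'[\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4}')

# Made-up sample text; "a@b." is too short on both sides of the @ to match.
sample = "Contact alice@example.com or bob.smith@mail.example.org, not a@b."

matches = mailsrch.findall(sample)
print(matches)  # ['alice@example.com', 'bob.smith@mail.example.org']

# The pattern does not require a dot in the domain, so this matches too.
print(mailsrch.findall("root@localhost"))  # ['root@localhost']
```

Note that the pattern needs at least two characters before the @ and at least three after it, and that it is a rough filter rather than a full address validator.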

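    To make this match only addresses that appear on lines starting with From: and are wrapped in < and > symbols:

    import re

    # Capture the address between < and > on lines that begin with "From:".
    # re.I ignores case; re.M lets ^ match at the start of every line.
    mailsrch = re.compile(r'^From:\s+.*<([\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4})>', re.I | re.M)

    # Read the whole transcript and collect every captured address.
    with open('text.txt') as f:
        foundemail = mailsrch.findall(f.read())

    print(foundemail)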
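To try the From: pattern without a text.txt file, here is a small sketch that runs it over an in-memory transcript. The transcript contents and the names transcript and senders are invented for illustration.

```python
import re

mailsrch = re.compile(
    r'^From:\s+.*<([\w\-][\w\-\.]+@[\w\-][\w\-\.]+[a-zA-Z]{1,4})>',
    re.I | re.M,
)

# Hypothetical transcript: only the From: lines should contribute addresses.
transcript = (
    "From: Alice Example <alice@example.com>\n"
    "To: Bob <bob@example.org>\n"
    "Subject: hello\n"
    "\n"
    "Reply to carol@example.net if needed.\n"
    "from: Dave <dave@example.com>\n"
)

senders = mailsrch.findall(transcript)
print(senders)  # ['alice@example.com', 'dave@example.com']
```

Because the pattern has a single capturing group, findall() returns only the text inside the angle brackets. The To: address and the bare address in the body are skipped, while the lower-case "from:" line still matches because of re.I.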
    