I'm writing some code to parse forwarded emails. What I'm not sure about is whether there is some Python library, some RFC I could stick to or some other resou
In my experience just about every email client forwards/replies differently. Typically you'll have a plain text version and an HTML-encoded version in the MIME section at the bottom of the mail. Mail headers do have an RFC (RFC 2822, http://www.faqs.org/rfcs/rfc2822.html), but unfortunately the content of the message body is outside its scope.
Not only do you have to contend with the variance between mail clients, but also with the variance in user preferences. As an example: Lotus Notes puts replies at the top and Thunderbird puts replies at the bottom. So when a Thunderbird user is replying to a Lotus Notes user's reply, they might insert their reply at the top and leave their signature at the bottom.
Another pitfall may be contending with word wrapping of reply chains.
>>>> The outer reply that goes over the limit and is word wrapped by
the middle replier's mail client
>> The message body of a middle reply
> Previous reply
Newest reply
I wouldn't parse the message; I'd leave it to the user to parse in their heads. Or I'd borrow the code from another project.
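If you do want to attempt it anyway, a minimal sketch of counting quote markers per line might look like the following. It assumes the chain uses `>` quoting, and `quote_depth` and `annotate` are hypothetical helper names of my own, not part of any library. Note that the wrapped continuation line comes out at depth 0, which is exactly the pitfall described above.

```python
import re

# Leading run of '>' markers, allowing an optional space after each one ("> > text").
_QUOTE_PREFIX = re.compile(r"^\s*((?:>\s?)+)")

def quote_depth(line):
    """Return (depth, text), where depth is the number of leading '>' markers."""
    match = _QUOTE_PREFIX.match(line)
    if not match:
        return 0, line
    return match.group(1).count(">"), line[match.end():]

def annotate(body):
    """Pair every line of a message body with its quote depth."""
    return [quote_depth(line) for line in body.splitlines()]

sample = "\n".join([
    ">>>> The outer reply that goes over the limit and is word wrapped by",
    "the middle replier's mail client",
    ">> The message body of a middle reply",
    "> Previous reply",
    "Newest reply",
])

for depth, text in annotate(sample):
    print(depth, text)
```

A real parser would need some heuristic to decide whether a depth-0 line sandwiched between quoted lines is a wrapped continuation or genuinely new text, and that is where client-specific behaviour starts to matter.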
The usual convention for a reply/forward is to prepend > to each line, once for every level of nesting. Whether to include who sent the initial e-mail is up to the client to sort out. So what you need to do in Python is simply add > to the start of each line.
imap Test <imap@gazler.com> Wrote:
>
>twice
>imap Test wrote:
>> nested
>>
>> imap@gazler.com wrote:
>>> test
>>>
>>> --
>>> Message sent via AHEM.
>>>
>>
>
Attachments simply need to be attached to the message or, as you put it, 'go wild.'
I am not familiar with Python, but I believe the code would be:
# Prefix the first line too; replace() alone only quotes lines after a newline.
string = ">" + string.replace("\n", "\n>")
Unlike what many other people said, there is a standard on forwarded emails, RFC 2046, "Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types", more than ten years old. See especially its section 5.2, "Message Media Type".
The basic idea behind RFC 2046 is to encapsulate one message into the MIME part of another, of type named (unfortunately) message/rfc822 (never forget that MIME is recursive). The MIME library of Python can handle it fine.
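As a rough illustration with the standard `email` package, the sketch below builds a small forwarded message and then pulls the encapsulated message back out. The addresses and subjects are made up, and `embedded_messages` is a hypothetical helper name; it only covers the case where the mailer really used message/rfc822.

```python
from email import message_from_string, policy
from email.message import EmailMessage

# Build a tiny forwarded message for illustration (addresses are made up).
original = EmailMessage()
original["From"] = "alice@example.com"
original["Subject"] = "Original subject"
original.set_content("Original body\n")

forward = EmailMessage()
forward["From"] = "bob@example.com"
forward["Subject"] = "Fwd: Original subject"
forward.set_content("See the attached message.\n")
forward.add_attachment(original)  # stored as a message/rfc822 part

def embedded_messages(msg):
    """Yield every message encapsulated as a message/rfc822 part."""
    for part in msg.walk():
        if part.get_content_type() == "message/rfc822":
            # The payload of a message/rfc822 part is a one-element list
            # holding the encapsulated message object.
            yield part.get_payload(0)

# Round-trip through text, as if the mail had just been read from disk.
parsed = message_from_string(forward.as_string(), policy=policy.default)
for inner in embedded_messages(parsed):
    print(inner["Subject"], "|", inner.get_content().strip())
```

Because `walk()` recurses, the same loop also finds a forward nested inside another forward.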
I did not downvote the other answers because they are right in one respect: the standard is not followed by every mailer. For instance, the mutt mailer can forward a message in RFC 2046 format but also in an ad hoc format. So, in practice, a mailer probably cannot handle only RFC 2046; it also has to parse the various other, underspecified syntaxes.
As the other answers already indicate: there is no standard, and your program is not going to be flawless.
You could have a look at the headers, in particular the User-Agent header, to see what kind of client was used, and code specifically for the most common clients.
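A minimal sketch of that header check could look like this. The `KNOWN_CLIENTS` table and the `detect_client` name are hypothetical, the substrings are only guesses at what those clients send, and falling back to X-Mailer is my own assumption (some clients use that header instead of User-Agent).

```python
from email import message_from_string

# Hypothetical substring -> client-family table; extend it for the clients you care about.
KNOWN_CLIENTS = {
    "thunderbird": "thunderbird",
    "outlook": "outlook",
    "apple mail": "apple-mail",
    "mutt": "mutt",
    "lotus notes": "lotus-notes",
}

def detect_client(raw_message):
    """Guess the sending client from User-Agent, falling back to X-Mailer."""
    msg = message_from_string(raw_message)
    agent = msg.get("User-Agent") or msg.get("X-Mailer") or ""
    lowered = agent.lower()
    for needle, family in KNOWN_CLIENTS.items():
        if needle in lowered:
            return family
    return "unknown"

# Made-up sample message.
raw = (
    "From: someone@example.com\n"
    "User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Thunderbird/102.0\n"
    "Subject: Fwd: hello\n"
    "\n"
    "body\n"
)
print(detect_client(raw))
```

The result can then be used to pick a client-specific parsing routine, with a generic fallback for "unknown".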
To find out which clients you should consider supporting, have a look at this popularity study. Various Outlooks, Yahoo!, Hotmail, Mail.app, iPhone mail, Gmail and Lotus Notes rank highly. About 11% of the mail is classified as "undetectable", but using headers from the forwarded e-mail you might be able to do better than that. Note that the statistics were gathered by placing an image inside the e-mail, so results may be skewed.
Another problem is HTML mail, which may or may not include a plain-text version. I'm not sure about clients' usual behaviour in this respect.
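One way to cope with both cases, sketched here with the standard `email` package, is to ask for the plain-text part first and fall back to HTML. `best_text_body` is a hypothetical helper name and the sample messages are made up; turning the HTML into text is left out.

```python
from email import message_from_string, policy
from email.message import EmailMessage

def best_text_body(raw_message):
    """Return (subtype, text) for the preferred body part, or (None, "") if absent."""
    msg = message_from_string(raw_message, policy=policy.default)
    # Prefer text/plain, fall back to text/html.
    body = msg.get_body(preferencelist=("plain", "html"))
    if body is None:
        return None, ""
    return body.get_content_subtype(), body.get_content()

# Made-up samples: one message with both alternatives, one HTML-only.
both = EmailMessage()
both["Subject"] = "Fwd: both"
both.set_content("plain version\n")
both.add_alternative("<p>html version</p>\n", subtype="html")

html_only = EmailMessage()
html_only["Subject"] = "Fwd: html only"
html_only.set_content("<p>only html</p>\n", subtype="html")

for sample in (both, html_only):
    print(best_text_body(sample.as_string()))
```

The returned subtype tells the caller whether quote-marker parsing applies directly or whether the HTML has to be handled first.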