I'm currently working on a system that allows users to reply to notification emails that are sent out (sigh).
I need to strip out the replies and signature.
If your system is in-house and/or you have a limited number of reply formats, it's possible to do a pretty good job. Here are the filters we have set up for email responses to trac tickets:
Drop all text after and including:
1. '-- \n' (standard email sig delimiter)
2. '--\n' (people often forget the space in the sig delimiter, and this is not that common outside sigs)
3. '-----Original Message-----' (MS Outlook default)
4. '________________________________' (32 underscores, Outlook again)
5. A line that begins with 'On ' and ends with ' wrote:\n' (OS X Mail.app default)
6. 'From: ' (failsafe for Outlook and some other reply formats)
7. 'Sent from my iPhone'
8. 'Sent from my BlackBerry'
Numbers 3 and 4 are 'begins with' matches instead of 'equals' because users sometimes squash lines together by accident.
We try to be liberal about stripping out replies, since leftover reply garbage is much more of an annoyance (to us) than having to correct missing text.
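For anyone who wants to try these filters, here is a rough Python sketch of them. It is not our actual trac code; the names `strip_reply` and `is_cutoff` are made up for illustration, and treating 'From: ' as a prefix match (rather than 'equals') is my own assumption, since a bare 'From: ' line would never occur.

```python
import re

# Markers that must equal the whole line (line ending already removed).
EQUALS = ("-- ", "--", "Sent from my iPhone", "Sent from my BlackBerry")

# Markers that only need to match the start of the line, so they still
# work when the user has squashed the next line onto them.
STARTS_WITH = (
    "-----Original Message-----",
    "_" * 32,
    "From: ",  # assumption: prefix match, see the note above
)

# Attribution line such as "On <date>, <someone> wrote:".
ON_WROTE = re.compile(r"On .* wrote:")


def is_cutoff(line: str) -> bool:
    """Return True if everything from this line onward should be dropped."""
    if line in EQUALS:
        return True
    if line.startswith(STARTS_WITH):
        return True
    return ON_WROTE.fullmatch(line) is not None


def strip_reply(body: str) -> str:
    """Keep only the text that precedes the first cutoff marker."""
    kept = []
    for line in body.splitlines():
        if is_cutoff(line):
            break
        kept.append(line)
    return "\n".join(kept).rstrip()


if __name__ == "__main__":
    sample = (
        "Thanks, that fixed it.\n"
        "\n"
        "On Mon, Jan 4, 2010 at 3:00 PM, Trac <trac@example.com> wrote:\n"
        "> Ticket updated\n"
    )
    print(strip_reply(sample))
```

Stopping at the first marker is what makes this liberal: anything the user typed below a marker is lost, which matches the trade-off described above.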
Anybody have other formats from the wild that they want to share?
An approach that can be used for the signature only (in addition to detecting __ or --) is to test whether the first name and/or family name of the sender appears on a short line (roughly 3 to 4 words at most).
The sender name is on the raw email header, most of the time next to the email address, like in:
From: John Doe <jdoe@provider.com>
This is based on the assumption that you rarely write your own name in an email, and if you do, it is probably in a long sentence.
Of course there will be some false positives, but that may not be a big problem depending on what you do (we use it to fold quoted text and the signature behind a Gmail-style "..." button, so over-detection does not lose any content, it just ends up misplaced).
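A minimal Python sketch of this heuristic, assuming you already have the value of the From header and the plain-text body. The function name `find_signature_start` and the `max_words` parameter are hypothetical; only `email.utils.parseaddr` comes from the standard library.

```python
from email.utils import parseaddr
from typing import Optional


def find_signature_start(body: str, from_header: str,
                         max_words: int = 4) -> Optional[int]:
    """Return the index of the first short line that contains the
    sender's first or family name, or None if there is no such line."""
    display_name, _address = parseaddr(from_header)
    names = {part.lower() for part in display_name.split()}
    if not names:
        return None  # header carries no display name, nothing to match

    for index, line in enumerate(body.splitlines()):
        # Strip common punctuation so "John," still matches "john".
        words = [w.strip(".,;:!-").lower() for w in line.split()]
        words = [w for w in words if w]
        if 0 < len(words) <= max_words and names & set(words):
            return index
    return None


if __name__ == "__main__":
    body = (
        "Hi team,\n"
        "\n"
        "I think the patch from last week already covers this case.\n"
        "\n"
        "Cheers,\n"
        "John\n"
    )
    print(find_signature_start(body, "John Doe <jdoe@provider.com>"))
```

Everything from the returned line index onward can then be folded rather than deleted, so a false positive only hides text instead of losing it.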
There's a really nice PHP library dedicated to email parsing:
http://williamdurand.fr/EmailReplyParser/
https://github.com/willdurand/EmailReplyParser
Check out the email_reply_parser gem - https://github.com/github/email_reply_parser . It does a nice job handling this problem.
The recommended signature delimiter is "-- \n". If people follow this recommendation, stripping signatures should be easy.
If you can assume that these emails are in plain text, just strip lines that begin with ">" as replies, and treat a "-- " line as the start of the signature. But those assumptions might not hold, as not everyone on the internet uses software that complies with the rules.
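Under exactly those two assumptions (plain text, well-behaved mail clients), the whole thing fits in a few lines of Python. This is only a sketch, and `strip_quotes_and_signature` is a made-up name:

```python
def strip_quotes_and_signature(body: str) -> str:
    """Drop quoted lines (starting with ">") and everything from the
    "-- " signature delimiter onward."""
    kept = []
    for line in body.splitlines():
        if line == "-- ":
            break  # the signature starts here, ignore the rest
        if line.startswith(">"):
            continue  # quoted reply text
        kept.append(line)
    return "\n".join(kept).strip()


if __name__ == "__main__":
    mail = "Sounds good.\n> old text\n> more old text\nSee you.\n-- \nJohn\n"
    print(strip_quotes_and_signature(mail))
```

Note that the comparison against "-- " keeps the trailing space, so a client that trims trailing whitespace will defeat it.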