One of our staff members has lost his mailbox but luckily has a dump of his email in mbox format. I need to somehow get all the messages inside the mbox file and squirt them
I'm working on a MIME & mbox parser in C# called MimeKit.
It's based on earlier MIME & mbox parsers I've written (such as GMime) which were insanely fast (could parse every message in an 1.2GB mbox file in about 1 second).
I haven't tested MimeKit for performance yet, but I am using many of the same techniques in C# that I used in C. I suspect it'll be slower than my C implementation, but since the bottleneck is I/O and MimeKit is written to do optimal (4k) reads like GMime is, they should be pretty close.
The reasons you are finding your current approach to be slow (StreamReader.ReadLine(), combining the text, then passing it off to SharpMimeTools) are because of the following reasons:
StreamReader.ReadLine() is not a very optimal way of reading data from a file. While I'm sure StreamReader() does internal buffering, it needs to do the following steps:
A) Convert the block of bytes read from the file into unicode (this requires iterating over the bytes in the byte[] read from disk to convert the bytes read from the stream into a unicode char[]).
B) Then it needs to iterate over its internal char[], copying each char into a StringBuilder until it finds a '\n'.
So right there, with just reading lines, you have at least 2 passes over your mbox input stream. Not to mention all of the memory allocations going on...
Then you combine all of the lines you've read into a single mega-string. This requires another pass over your input (copying every char from each string read from ReadLine() into a StringBuilder, presumably?).
We are now up to 3 iterations over the input text and no parsing has even happened yet.
Now you hand off your mega-string to SharpMimeTools which uses a SharpMimeMessageStream which... (/facepalm) is a ReadLine()-based parser that sits on top of another StreamReader that does charset conversion. That makes 5 iterations before anything at all is even parsed. SharpMimeMessageStream also has a way to "undo" a ReadLine() if it discovers it has read too far. So it is reasonable to assume that he is scanning over some of those lines at least twice. Not to mention all of the string allocations going on... ugh.
For each header, once SharpMimeTools has its line buffer, it splits into field & value. That's another pass. We are up to 6 passes so far.
SharpMimeTools then uses string.Split() (which is a pretty good indication that this mime parser is not standards compliant) to tokenize address headers by splitting on ',' and parameterized headers (such as Content-Type and Content-Disposition) by splitting on ';'. That's another pass. (We are now up to 7 passes.)
Once it splits those it runs a regex match on each string returned from the string.Split() and then more regex passes per rfc2047 encoded-word token before finally making another pass over the encoded-word charset and payload components. We're talking at least 9 or 10 passes over much of the input by this point.
I give up going any farther with my examination because it's already more than 2x as many passes as GMime and MimeKit need and I know my parsers could be optimized to make at least 1 less pass than they do.
Also, as a side-note, any MIME parser that parses strings instead of byte[] (or sbyte[]) is never going to be very good. The problem with email is that so many mail clients/scripts/etc in the wild will send undeclared 8bit text in headers and message bodies. How can a unicode string parser possibly handle that? Hint: it can't.
using (var stream = File.OpenRead ("Inbox.mbox")) {
var parser = new MimeParser (stream, MimeFormat.Mbox);
while (!parser.IsEndOfStream) {
var message = parser.ParseMessage ();
// At this point, you can do whatever you want with the message.
// As an example, you could save it to a separate file based on
// the message subject:
message.WriteTo (message.Subject + ".eml");
// You also have the ability to get access to the mbox marker:
var marker = parser.MboxMarker;
// You can also get the exact byte offset in the stream where the
// mbox marker was found:
var offset = parser.MboxMarkerOffset;
}
}
2013-09-18 Update: I've gotten MimeKit to the point where it is now usable for parsing mbox files and have successfully managed to work out the kinks, but it's not nearly as fast as my C library. This was tested on an iMac so I/O performance is not as good as it would be on my old Linux machine (which is where GMime is able to parse similar sized mbox files in ~1s):
[fejj@localhost MimeKit]$ mono ./mbox-parser.exe larger.mbox
Parsed 14896 messages in 6.16 seconds.
[fejj@localhost MimeKit]$ ./gmime-mbox-parser larger.mbox
Parsed 14896 messages in 3.78 seconds.
[fejj@localhost MimeKit]$ ls -l larger.mbox
-rw-r--r-- 1 fejj staff 1032555628 Sep 18 12:43 larger.mbox
As you can see, GMime is still quite a bit faster, but I have some ideas on how to improve the performance of MimeKit's parser. It turns out that C#'s fixed
statements are quite expensive, so I need to rework my usage of them. For example, a simple optimization I did yesterday shaved about 2-3s from the overall time (if I remember correctly).
Optimization Update: Just improved performance by another 20% by replacing:
while (*inptr != (byte) '\n')
inptr++;
with:
do {
mask = *dword++ ^ 0x0A0A0A0A;
mask = ((mask - 0x01010101) & (~mask & 0x80808080));
} while (mask == 0);
inptr = (byte*) (dword - 1);
while (*inptr != (byte) '\n')
inptr++;
Optimization Update: I was able to finally make MimeKit as fast as GMime by switching away from my use of Enum.HasFlag() and using direct bit masking instead.
MimeKit can now parse the same mbox stream in 3.78s.
For comparison, SharpMimeTools takes more than 20 minutes (to test this, I had to split the emails apart into separate files because SharpMimeTools can't parse mbox files).
Another Update: I've gotten it down to 3.00s flat via various other tweaks throughout the code.
I don't know any parser, but mbox is really a very simple format. A new email begins on lines starting with "From " (From+Space) and an empty line is attached to the end of each mail. Should there be any occurence of "From " at the beginning of a line in the email itself, this is quoted out (by prepending a '>').
Also see Wikipedia's entry on the topic.
If you can stretch to using Python, there is one in the standard library. I'm unable to find any for .NET sadly.
To read .mbox files, you can use a third-party library Aspose.Email. This library is a complete set of Email Processing APIs to build cross-platform applications having the ability to create, manipulate, convert, and transmit emails without using Microsoft Outlook.
Please, take a look at the example I have provided below.
using(FileStream stream = new FileStream("ExampleMbox.mbox", FileMode.Open, FileAccess.Read))
{
using(MboxrdStorageReader reader = new MboxrdStorageReader(stream, false))
{
// Start reading messages
MailMessage message = reader.ReadNextMessage();
// Read all messages in a loop
while (message != null)
{
// Manipulate message - show contents
Console.WriteLine("Subject: " + message.Subject);
// Save this message in EML or MSG format
message.Save(message.Subject + ".eml", SaveOptions.DefaultEml);
message.Save(message.Subject + ".msg", SaveOptions.DefaultMsgUnicode);
// Get the next message
message = reader.ReadNextMessage();
}
}
}
It is easy to use. I hope this approach will satisfy you and other searchers.
I am working as a Developer Evangelist at Aspose.