Email parsing and processing architecture

Question

OK, I'm doing heavy processing on each email. Let's say I'm building an AI for a system that will auto-reply to the emails it receives, but I still don't know where to start.

Here's what I'm thinking of:

Architecture 1

Problems:

Let's say we have 1000 emails/sec. How exactly does a mail server (Exim, Sendmail, Dovecot, etc.) work?
Can parseandsavetomysql.py process 1000 emails a second through piping? How does that work, too? By the way, it's currently working fine, but I need to understand this.
Is my logic about a worker or a queuing system correct? I have looked at Resque and friends, but I still don't get how we can lock a session. Say, for this problem: "hey, I'm processing this file; don't work on email1.rawemail, work on another one". How can we do that in the correct or simpler way?

Architecture 2

Problems?

Same as written above.
How can a POP/SMTP server receive 1000 emails/sec?
Can we fetch email via both IMAP and POP? Because we are just processing, is POP3 the right choice for performance? I'm currently using imap_open in PHP.

Add-on

Is there a good link or blog post that solves the same problem I have?
Please give me links to projects, apps, or third-party services that solve my problem.
If anything else comes to mind, please write it down.

Thanks for helping out, Adam Ramadhan

Edit: updated my current architecture.

Answer 1:

Like a lot of "big picture" architecture questions, the best solution is really one of those "it depends" answers. Can you control the deployment environment? That is, can you use whatever e-mail server you'd like, or are you constrained to using one that's already installed and hosted? Can you run code on the same machine as the SMTP service? These questions, and a lot of others, should be considered to come up with a (near-)optimal architecture.

Given that, I'm going to make a couple of assumptions and offer some ideas that I think are worth exploring...

You should look into a high-performance messaging system. Specifically, take a look at RabbitMQ. RabbitMQ is reliable and efficient, and the distribution of workload based on asynchronous incoming events is a pattern that they specifically discuss in their (in my opinion, very good) tutorials.

With a messaging server like this, you have one process that receives the incoming e-mail. Preferably this is done as part of the SMTP process, or at least very close to it, especially with the workload that you've mentioned. If you have no other choice, then your idea of using cron to gather messages via POP or IMAP will have to work, for now.
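As a rough sketch of that receiving step, assuming the mail server pipes each raw message to a Python script (as in your first architecture), the standard library `email` package can pull out the fields a worker would need. The function name `parse_raw_email` and the choice of fields are my own assumptions for illustration, not part of any mail server's API.

```python
from email import message_from_bytes, policy


def parse_raw_email(raw: bytes) -> dict:
    """Parse raw RFC 5322 bytes into a small dict (hypothetical helper)."""
    # policy.default gives the modern API (get_body, get_content).
    msg = message_from_bytes(raw, policy=policy.default)
    # Prefer the plain-text part; may be None for unusual messages.
    body_part = msg.get_body(preferencelist=("plain",))
    return {
        "from": str(msg["From"]),
        "subject": str(msg["Subject"]),
        "body": body_part.get_content().strip() if body_part else "",
    }


# Example with a literal message; in a pipe setup the bytes would
# instead come from sys.stdin.buffer.read().
sample = b"From: alice@example.com\r\nSubject: Hello\r\n\r\nHi there\r\n"
parsed = parse_raw_email(sample)
```

Parsing like this is cheap compared to the "heavy" per-email work, which is why it makes sense to do only this much in the receiving process and hand the rest to workers.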

The e-mail gathering process would then push messages into the RabbitMQ queue. (Perhaps not literally the e-mails themselves, although that is a possibility; I was thinking more of references to where each e-mail is efficiently stored.) You then run multiple worker processes that are subscribed to a named message queue. RabbitMQ (or whatever messaging service you decide upon) would then distribute those messages in a round-robin fashion to the individual subscribers. If already loaded, worker processes can NACK the message, or send their own control flow message back to the service.

With a VERY high workload (again, like you've proposed), I'd highly recommend some kind of management process that keeps tabs on the overall health of the distributed system. The manager would gather runtime statistics (VERY useful for future growth planning, optimization, and refactoring of the overall system), and have the ability to spin up and shut down worker processes. Before you get to that very high workload, and assuming that your worker processes are stable and can live a long time without memory fragmentation, etc., just using the message server to distribute work should suffice.
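To make the work-distribution pattern above concrete, here is a minimal in-process sketch using only Python's standard library. It is a stand-in, not RabbitMQ: `queue.Queue` plays the broker and threads play the workers, and items go to whichever worker is free rather than in strict round-robin order. The names `handle_email` and `run`, and the file paths, are hypothetical. The point it illustrates is that each queued reference is handed to exactly one worker, so no explicit "don't work on email1.rawemail" lock is needed.

```python
import queue
import threading

STOP = object()  # sentinel telling a worker to exit


def handle_email(path: str) -> str:
    """Placeholder for the expensive per-email work (hypothetical)."""
    return f"processed {path}"


def worker(jobs: queue.Queue, results: list, lock: threading.Lock) -> None:
    while True:
        path = jobs.get()  # each item is delivered to exactly one worker
        if path is STOP:
            jobs.task_done()
            break
        out = handle_email(path)
        with lock:  # protect the shared results list
            results.append(out)
        jobs.task_done()


def run(paths, n_workers=4):
    """Queue references to stored e-mails and let workers drain them."""
    jobs = queue.Queue()
    results, lock = [], threading.Lock()
    threads = [
        threading.Thread(target=worker, args=(jobs, results, lock))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for p in paths:
        jobs.put(p)  # enqueue a reference, not the e-mail itself
    for _ in threads:
        jobs.put(STOP)  # one sentinel per worker
    jobs.join()
    for t in threads:
        t.join()
    return results
```

With a real broker the workers would be separate processes (possibly on other machines), and the acknowledgement or NACK step would replace `task_done()`, but the overall shape of producer, queue, and competing workers stays the same.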

For what it's worth, I've had some experience writing e-mail processors (specifically with xmail, which I'd recommend if you're just starting out your project and have a lot of control over its early stages). Also, I'm currently using RabbitMQ to build a multi-agent result caching system for a major scientific computing grid.

Anyway...good luck with your project!

Source: https://stackoverflow.com/questions/10554482/email-parsing-and-processing-architechture

Tags

php

python

mysql

architecture