Long version:
I\'m new to erlang, and considering using it for a scalable architecture. I\'ve found many proponents of the platform touting its reliabi
I'll separate this into points I hope will make sense. I might be re-hashing a bit of what I have written in The Hitchhiker's Guide to Concurrency. You might want to read that one to get details on the rationale behind the way message passing is done in Erlang.
1. Message transmission
Message passing in Erlang is done through asynchronous messages sent into mailboxes (a kind of queue for storing data). There is absolutely no assumption as to whether a message was received or not, or even that it was sent to a valid process. This is because it is plausible to assume [at a language level] that someone might want to treat a message in maybe only 4 days and won't even acknowledge its existence until it has reached a certain state.
A random example of this could be to imagine a long-running process that crunches data for 4 hours. Should it really acknowledge it received a message if it's unable to treat it? Maybe it should, maybe not. It really depends on your application. As such, no assumption is made. You can have half your messages asynchronous and only one that isn't.
Erlang expects you to send an acknowledgement message (and wait on it with a timeout) if you ever need it. The rules having to do with timing out and the format of the reply are left to the programmer to specify -- Erlang can't assume you want the acknowledgement on message reception, when a task is completed, whether it matches or not (the message could match in 4 hours when a new version of the code is hot-loaded), etc.
To make it short, whether a message isn't read, fails to be received or is interrupted by someone pulling the plug while it is in transit doesn't matter if you don't want it to. If you want it to matter, you need to design a logic across processes.
The burden of implementing a high-level message protocol between Erlang processes is given to the programmer.
2. Message protocols
As you said, these messages are stored in transient memory: if a process dies, all the messages it hadn't read yet are lost. If you want more, there are various strategies. A few of them are:
Depending on the task at hand, you might use one or many of these. They're all possible to implement in Erlang and in many cases modules are already written to do the heavy lifting for you.
So this might answer your question. Because you implement the protocols yourself, it's your choice whether messages get sent more than once or not.
3. What is fault-tolerance
Picking one of the above strategies does depend on what fault-tolerance means to you. In some cases, people mean it to say "no data is ever lost, no task ever fails." Other people use fault-tolerance to say "the user never sees a crash." In the case of Erlang systems, the usual meaning is about keeping the system running: it's alright to maybe have a single user dropping a phone call rather than having everyone dropping it.
Here the idea is then to let stuff that fails fail, but keep the rest running. To achieve this, there are a few things the VM gives you:
With these tools and a few of the standard library's modules handling different scenarios for you, you can implement pretty much what you want on top of Erlang's asynchronous semantics, although it usually pays to be able to use Erlang's definition of fault tolerance.
4. A few notes
My personal opinion here is that it's pretty hard to have more assumptions than what exists in Erlang unless you want pure transactional semantics. One problem you'll always have trouble with is with nodes going down. You can never know if they went down because the server actually crashed or because the network failed.
In the case of a server crash, simply re-doing the tasks is easy enough. However with a net split, you have to make sure some vital operations are not done twice, but not lost either.
It usually boils down to the CAP theorem which basically gives you 3 options, of which you have to pick two:
Depending on where you position yourself, different approaches will be needed. The CAP theorem is usually used to describe databases, but I believe similar questions are to be asked whenever you need some level of fault tolerance when processing data.