I am looking for quantitative estimates on clock offsets between VMs on Windows Azure - assuming that all VMs are hosted in the same datacenter. I am gu
I've been in conversation with someone from the Azure product team regarding clock synchronisation recently, more out of interest than anything else. The most recent reply I've received is:
The VMs and services take their time directly from the underlying Hyper-V platform upon boot and from that point forward the clock is maintained by the service. In order to have true time sync across a distributed system you will need to do this at the application layer and/or with a service referencing an singular time server.
I've tried to search for an answer to this specific question - but haven't succeeded!
Some references I have found about the Windows Time Service (W32Time) state that the service is designed to target a tolerance of 2 seconds.
In practice within the Azure network I expect that the synchronisation achieved should be much better than this - but my search turned up no referenced guarantees on this.
You can never trust clock synchronization when building a distributed system, unless special hardware measures are used, as for example in Google Spanner. Even there, a special algorithm is used to resolve possible clock skew conflicts. However, there are many techniques that solve this problem in distributed systems: logical clocks, vector clocks, and Lamport timestamps, to name a few. See the classic book "Distributed Systems: Principles and Paradigms" by Andrew Tanenbaum.
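To give an idea of what a Lamport timestamp looks like in practice, here is a minimal sketch. The class name LamportClock and its methods are my own hypothetical names, not part of any Azure or .NET API. The idea is to order events by a counter rather than by wall-clock time: increment the counter on every local event, and on receipt of a message jump past the sender's timestamp.

```csharp
using System;
using System.Threading;

// Hypothetical helper (not part of any SDK): a thread-safe Lamport logical clock.
public sealed class LamportClock
{
    private long _counter;

    // Call before any local event, including sending a message.
    // The returned value is the timestamp to attach to that event or message.
    public long Tick()
    {
        return Interlocked.Increment(ref _counter);
    }

    // Call when a message arrives, passing the timestamp the sender attached.
    // The local counter moves to max(local, remote) + 1.
    public long Receive(long remoteTimestamp)
    {
        while (true)
        {
            long current = Interlocked.Read(ref _counter);
            long next = Math.Max(current, remoteTimestamp) + 1;
            if (Interlocked.CompareExchange(ref _counter, next, current) == current)
            {
                return next;
            }
        }
    }
}
```

Each VM would hold one such clock and attach Tick() to every outgoing message. This only gives you an ordering consistent with causality; it says nothing about how much real time elapsed between two events.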
This is the classic problem of both distributed systems and virtual machines - clock skew.
One possible solution would be to use the Azure scheduler to ping an endpoint on each of your VMs that would reset the clock, or at least tell you what the diff is. That way, your skew would not grow, and you may even be able to calculate an offset for the communication delay. This would get you to within milliseconds rather than seconds.
Of course, you could also go the other way, and have a service on the VM that periodically manages the clock by pinging out to some time server. I'm not sure if the hypervisor will let you mess with its clock, but all you really need is an offset for your apps to consume.
Overall: never trust the clock on a VM, and certainly not across a distributed system. Note that this clock issue is a subject of active research at many universities, e.g. https://scholar.google.com/scholar?hl=en&q=distributed+system+clock&btnG=&as_sdt=1%2C48&as_sdtp=
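As a rough sketch of the "offset for your apps to consume" idea: the snippet below estimates the offset between the local clock and a reference clock, compensating for the communication delay by assuming the request and the response take about the same time (this is essentially Cristian's algorithm, which I am swapping in here as one simple way to do it). ClockOffsetEstimator, fetchRemoteUtc and localUtcNow are hypothetical names of mine; how you actually fetch the reference time is up to you.

```csharp
using System;
using System.Diagnostics;

// Hypothetical helper: estimates (reference clock - local clock).
public static class ClockOffsetEstimator
{
    // fetchRemoteUtc: assumed callback that asks the reference server for its UTC time.
    // localUtcNow: optional override of the local clock, mostly useful for testing.
    public static TimeSpan Estimate(Func<DateTime> fetchRemoteUtc, Func<DateTime> localUtcNow = null)
    {
        localUtcNow = localUtcNow ?? (() => DateTime.UtcNow);

        // Stopwatch is monotonic, so the round-trip measurement is not
        // disturbed if the system clock is adjusted in the meantime.
        var sw = Stopwatch.StartNew();
        DateTime remote = fetchRemoteUtc();
        sw.Stop();
        DateTime localAtReceive = localUtcNow();

        // Assumption: symmetric network delay, so the remote reading
        // is about half a round trip old when it arrives.
        TimeSpan halfRoundTrip = TimeSpan.FromTicks(sw.Elapsed.Ticks / 2);
        return (remote + halfRoundTrip) - localAtReceive;
    }
}
```

An application would then use DateTime.UtcNow + offset instead of DateTime.UtcNow, and refresh the offset periodically. Keep in mind that the estimate can be no better than the resolution of the reference time; if that time comes from an HTTP Date header, for instance, it only has one-second resolution.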
I finally decided to run some experiments of my own.
A few facts concerning the experiment protocol:
The request time measured with Stopwatch was always lower than 1 ms for minimalistic unauthenticated requests (basically, the HTTP requests were coming back with 400 errors, but still with Date: available in the HTTP headers).
Results:
So technically, we are not too far from the 2s tolerance target, although for intra-data-center sync, you don't have to push the experiment far to observe close to 4s offset. If we assume a normal (aka Gaussian) distribution for the clock offsets, then I would say that relying on any clock threshold lower than 6s is bound to lead to scheduling issues.
    using System;
    using System.Globalization;
    using System.IO;
    using System.Net.Sockets;

    /// <summary>
    /// Substitute for proper NTP (Network Time Protocol)
    /// when UDP is not available, as on Windows Azure.
    /// </summary>
    public class HttpTimeChecker
    {
        public static DateTime GetUtcNetworkTime(string server)
        {
            // HACK: we can't use WebClient here, because we get a faulty HTTP response.
            // We don't care about the HTTP error; the only thing that matters is the
            // presence of the 'Date:' HTTP header.
            string response;
            using (var tc = new TcpClient())
            {
                tc.Connect(server, 80);
                using (var ns = tc.GetStream())
                {
                    var sw = new StreamWriter(ns);
                    var sr = new StreamReader(ns);

                    // HTTP requires CRLF line endings.
                    string req = "";
                    req += "GET / HTTP/1.0\r\n";
                    req += "Host: " + server + "\r\n";
                    req += "\r\n";

                    sw.Write(req);
                    sw.Flush();

                    // With HTTP/1.0 the server closes the connection after responding,
                    // so reading to the end of the stream terminates.
                    response = sr.ReadToEnd();
                }
            }

            foreach (var line in response.Split(new[] { '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries))
            {
                if (line.StartsWith("Date: ", StringComparison.OrdinalIgnoreCase))
                {
                    // Parse independently of the machine's culture settings.
                    return DateTime.Parse(line.Substring(6), CultureInfo.InvariantCulture).ToUniversalTime();
                }
            }

            throw new ArgumentException("No date to be retrieved among HTTP headers.", "server");
        }
    }
Based on my experience, I would not rely on the system clock of the Azure VMs for anything critical. I have occasionally seen differences up to several minutes, which does fly in the face of what you'd expect.