How to account for clock offsets in a distributed system?

Question

Background

I have a system consisting of several distributed services, each of which is continuously generating events and reporting these to a central service.

I need to present a unified timeline of the events, where the ordering in the timeline corresponds to the moment each event occurred. The frequency of events and the network latency are such that I cannot simply use the time of arrival at the central collector to order the events.

E.g. in the following scenario:

E1 needs to be rendered in the timeline above E2, despite arriving at the collector afterwards, which means the events need to come with timestamp metadata. This is where the problem arises.

Problem

Due to constraints on how the environment is set up, it is not possible to ensure that the local time services on each machine are reliably aware of current UTC time. I can assume that each machine can accurately gauge relative time, i.e. the clock speeds are close enough to make measurement of short timespans identical, but problems like NTP misconfiguration/partitioning make it impossible to guarantee that every machine agrees on the current UTC time.

This means that the naive approach of generating a local timestamp for each event as it occurs, then ordering events by those timestamps, will not work: every machine has its own opinion of what universal time is.

So the question is: how can I recover an ordering for events generated in a distributed system where the clocks do not agree?

Approaches I've considered

Most solutions I find online go down the path of trying to synchronize all the clocks, which is not possible for me since:

I don't control the machines in question
The clocks are out of sync in the first place because of network flakiness, which I can't fix

My own idea was to query some kind of central time service every time an event is generated, then stamp that event with the retrieved time minus network flight time. This gets hairy, because I have to add another service to the system and ensure its availability (I'm back to square one if the other services can't reach this one). I was hoping there is some clever way to do this that doesn't require me to centralize timekeeping in this way.

Answer 1:

A simple solution, somewhat inspired by your own at the end, is to periodically ping what I'll call the time-source server. In the ping, include the service's chip clock reading; the time-source echoes that back and includes its own timestamp. The service can then deduce the round-trip time and guess that the time-source's clock showed that timestamp roughly half a round-trip time ago. You can then use the difference as an offset to the local chip clock to determine a globalish time.
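As a rough illustration of the arithmetic described above, here is a minimal Python sketch. The names estimate_offset, ping, global_time and query_time_source are made up for this example, and it assumes the network path is symmetric, so the time-source stamped its reply at the midpoint of the round trip. Rather than having the time-source echo the chip clock reading, the sketch simply remembers the send time locally, which gives the same result.

```python
import time

def estimate_offset(local_send, server_time, local_recv):
    """Return (offset, rtt) from a single ping.

    local_send / local_recv: local clock readings around the ping, in seconds.
    server_time: the timestamp the time-source put in its reply, in seconds.
    """
    rtt = local_recv - local_send
    # Assumption: symmetric path, so the reply was stamped halfway through.
    midpoint = local_send + rtt / 2
    return server_time - midpoint, rtt

def ping(query_time_source, clock=time.monotonic):
    """Ping the time-source once; query_time_source is a hypothetical
    callable that performs the network request and returns its timestamp."""
    t0 = clock()
    server_time = query_time_source()
    t1 = clock()
    return estimate_offset(t0, server_time, t1)

def global_time(local_now, offset):
    """Translate a local clock reading into the time-source's timeline."""
    return local_now + offset
```

Each event would then be stamped with global_time(time.monotonic(), offset), using the most recent offset. A monotonic clock is used for the local side because only relative time is assumed to be trustworthy.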

You don't have to use a different service for this; the Collector server will do. The important part is that you don't have to call the time-source server on every request, which removes it from the critical path.

If you don't want a sawtooth function for the time, you can smooth the time difference.
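The answer doesn't say how to smooth. One possible choice, shown here purely as an assumption, is an exponential moving average over successive offset estimates. The SmoothedOffset class and the alpha value are hypothetical and would need tuning for a real system.

```python
class SmoothedOffset:
    """Exponentially weighted moving average of clock-offset samples."""

    def __init__(self, alpha=0.125):
        # alpha is an assumed smoothing factor: higher values follow new
        # samples faster, lower values damp jumps more strongly.
        self.alpha = alpha
        self.value = None

    def update(self, sample):
        """Fold in a new offset sample and return the smoothed offset."""
        if self.value is None:
            # The first sample is taken as-is.
            self.value = sample
        else:
            # Move a fraction alpha of the way towards the new sample.
            self.value += self.alpha * (sample - self.value)
        return self.value
```

With alpha=0.5, feeding in offsets of 10.0 and then 20.0 yields 10.0 and then 15.0, so a sudden change in the measured offset is applied gradually instead of all at once.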

Congratulations, you've rebuilt NTP!

Source: https://stackoverflow.com/questions/46458089/how-to-account-for-clock-offsets-in-a-distributed-system

Tags

time

synchronization

distributed-system

clock