I'm in the early stages of designing a client/server audio system which can stream audio arbitrarily over a network. One central server pumps out an audio stream and some number of clients receive it.
"...as long as it is perceived to be in sync by a human listener" - Very hard to do because the ear is less forgiving than the eye. Especially if you want to do this over a wireless network.
I would experiment first with web-based technologies: Flash audio players remote-controlled by a server via JavaScript.
If that gave bad results, I would try to get more control by using something like Python (with pygame).
If progress was being made, I would also try ChucK and some low-level programming with the ALSA audio library.
If nothing satisfactory came of that, I would revisit this post, actually read something sensible by an expert audio-programming guru and, if my livelihood depended on it, probably end up forking out the 14 English pounds for the commercial NetChorus application or something similar.
Hard problem, but possible.
Use NTP or tictoc to get yourself a synchronised clock with a known rate in terms of your system's time source.
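To make the clock-synchronisation step concrete, here is a sketch of the core offset/delay arithmetic an NTP exchange uses (per RFC 5905). The function name and the example timestamps are illustrative, not from any particular library:

```python
# Core NTP offset/delay arithmetic (RFC 5905). The four timestamps are:
# t0 = client send, t1 = server receive, t2 = server send, t3 = client receive.
# All times are in seconds on each machine's local clock.

def ntp_offset_and_delay(t0, t1, t2, t3):
    """Estimate clock offset and round-trip delay from one exchange."""
    offset = ((t1 - t0) + (t2 - t3)) / 2.0  # how far our clock lags the server
    delay = (t3 - t0) - (t2 - t1)           # network round-trip time
    return offset, delay

# Example: a server running 5 ms ahead of us, with a 10 ms round trip.
offset, delay = ntp_offset_and_delay(0.000, 0.010, 0.011, 0.011)
print(offset, delay)  # offset ≈ 5 ms, delay ≈ 10 ms
```

In practice you would repeat this over many exchanges and filter the results, which is exactly what an NTP daemon does for you.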
Also keep an estimator running for the rate of your sound clock. The usual way of doing this is to record with the same sound device that is playing, recording over a buffer preloaded with a magic number, and see where the sound card gets to in a measured time by the synchronised clock (or vice versa: see how long it takes to play a known number of samples, measured on the synchronised clock). You need to keep doing this, because the sound clock will drift relative to network time.
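A minimal sketch of such a running estimator, assuming you can periodically read (synchronised time, total samples played) pairs from the measurement trick described above; the class and parameter names are hypothetical:

```python
# Illustrative drift estimator: given periodic readings of (synchronised
# clock time, cumulative samples the soundcard has played), maintain a
# smoothed estimate of the card's true sample rate.

class RateEstimator:
    def __init__(self, nominal_rate=44100.0, smoothing=0.1):
        self.rate = nominal_rate      # current estimate, samples/sec
        self.smoothing = smoothing    # EMA weight for each new measurement
        self.last_time = None
        self.last_samples = None

    def update(self, sync_time, samples_played):
        if self.last_time is not None:
            dt = sync_time - self.last_time
            if dt > 0:
                measured = (samples_played - self.last_samples) / dt
                # Exponential moving average: the clock drifts slowly,
                # so heavily damp each individual noisy measurement.
                self.rate += self.smoothing * (measured - self.rate)
        self.last_time = sync_time
        self.last_samples = samples_played
        return self.rate

est = RateEstimator()
est.update(0.0, 0)
print(est.update(1.0, 44110))  # estimate nudged toward the measured 44110
```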
So now you know exactly how many samples per second, by your soundcard's clock, you need to output to match the rate of the synchronised clock. You then interpolate the samples received from the network at that rate, plus or minus a correction if you need to catch up or fall back a bit from where you got to on the last buffer. Be extremely careful to do this interpolation in a way that does not introduce audio artifacts; there is example code here for the algorithms you will need, but expect quite a bit of reading before you get up to speed on that.
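To show just the mechanics of rate conversion, here is a naive linear-interpolation resampler. Note this is exactly the kind of interpolation the paragraph above warns about: a real implementation should use a band-limited (windowed-sinc) resampler to avoid artifacts. The function is a sketch, not production code:

```python
# Naive linear-interpolation resampler. `step` is the ratio of the input
# (network) rate to the corrected output (soundcard) rate: a step slightly
# above 1.0 consumes input faster (catching up), slightly below 1.0 falls
# back. Audible artifacts are expected; use band-limited resampling for real.

def resample_linear(samples, step):
    out = []
    pos = 0.0
    while pos < len(samples) - 1:
        i = int(pos)
        frac = pos - i
        # Linear blend between the two neighbouring input samples.
        out.append(samples[i] * (1.0 - frac) + samples[i + 1] * frac)
        pos += step
    return out

# Halving the read rate roughly doubles the number of output samples.
print(len(resample_linear([0.0] * 100, 0.5)))  # 198
```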
If your source is a live recording, of course, you're going to have to measure the sample rate of that soundcard and interpolate into network time samples before sending it.
Check out the paper An Internet Protocol Sound System by Tom Blank of Microsoft Research. He solves the exact problem you are working on. His solution involves synchronizing the clocks across machines and using timestamps so that each machine plays at the same time. The downside of this approach is latency: to keep all of the machines together, each chunk must be stamped with a play time far enough ahead to cover the largest latency on the network.
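The timestamp-and-wait scheme can be sketched in a few lines. This is my illustration of the idea, not code from the paper; the names, the transport, and the latency constant are all placeholders:

```python
# Timestamp-and-wait playback scheduling. The server stamps each audio
# chunk with a play time far enough in the future to cover the slowest
# client; each client (whose clock is already NTP-synchronised) sleeps
# until that instant before starting playback.

import time

MAX_NETWORK_LATENCY = 0.250  # seconds; must exceed the worst client's delay

def stamp_chunk(chunk):
    """Server side: attach the intended play time to an audio chunk."""
    return {"play_at": time.time() + MAX_NETWORK_LATENCY, "audio": chunk}

def play_when_due(stamped, play_fn):
    """Client side: wait until the stamped time, then start playback."""
    delay = stamped["play_at"] - time.time()
    if delay > 0:
        time.sleep(delay)  # every client wakes at (nearly) the same instant
    play_fn(stamped["audio"])
```

This makes the latency trade-off visible: every listener hears the audio MAX_NETWORK_LATENCY late, in exchange for hearing it together.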
Ryan Barrett wrote up his findings on his blog.
His solution involved using NTP to keep all the clocks in sync:
Seriously, though, there's only one trick to p4sync, and that is how it uses NTP. One host acts as the p4sync server. The other p4sync clients synchronize their system clocks to the server's clock, using SNTP. When the server starts playing a song, it records the time, to the millisecond. The clients then retrieve that timestamp, calculate the difference between the current time and that timestamp, and seek forward that far into the song.
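The arithmetic in that quote is just a subtraction; a hedged sketch, with an illustrative function name:

```python
# The p4sync seek calculation from the quote: once the client's clock is
# SNTP-synchronised to the server, how far into the song should it seek?

def seek_position(server_start_time, client_now):
    """Seconds into the song a joining client should seek to."""
    return max(0.0, client_now - server_start_time)

# Server started the song at t = 1000.000; a client joining at t = 1012.345
# seeks about 12.345 seconds in.
print(seek_position(1000.000, 1012.345))
```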
Depending on the size and shape of the venue, getting everything in sync is the easy part; getting everything to sound correct is an art form in itself, if it is possible at all. On the technical side, the most difficult part is finding out the delay from your synchronized timeline to actual sound output. Identical hardware and a low-latency software framework (ASIO, JACK) certainly help here, as does calibration, either ahead of time or actively during playback. Beyond that it's a matter of synchronizing the timeline with NTP and using closed-loop feedback on the audio pitch to lock the output to the agreed timeline.
The larger problem is that sound takes a considerable amount of time to propagate: 10 m of difference in path length is already about 30 ms of delay, enough to break sound localization. Double that and you get into annoying-echo territory. Professional audio setups actually purposefully introduce delays, use a higher number of tweeters and play with reverberations to avoid a cacophony of echoes that wears the listener out.
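A quick check of those propagation numbers, assuming sound travels at roughly 343 m/s in room-temperature air:

```python
# Propagation delay of sound over a given extra path length.

SPEED_OF_SOUND = 343.0  # m/s in air at ~20 °C

def propagation_delay_ms(extra_metres):
    return extra_metres / SPEED_OF_SOUND * 1000.0

print(round(propagation_delay_ms(10.0)))  # 29 -> about 30 ms, as stated above
```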