Question
Looking at Jeff Dean's famous latency guide:
Latency Comparison Numbers (~2012)
----------------------------------
L1 cache reference                           0.5 ns
Branch mispredict                            5   ns
L2 cache reference                           7   ns                      14x L1 cache
Mutex lock/unlock                           25   ns
Main memory reference                      100   ns                      20x L2 cache, 200x L1 cache
Compress 1K bytes with Zippy             3,000   ns        3 us
Send 1K bytes over 1 Gbps network       10,000   ns       10 us
Read 4K randomly from SSD*             150,000   ns      150 us          ~1GB/sec SSD
Read 1 MB sequentially from memory     250,000   ns      250 us
Round trip within same datacenter      500,000   ns      500 us
Read 1 MB sequentially from SSD*     1,000,000   ns    1,000 us    1 ms  ~1GB/sec SSD, 4X memory
Disk seek                           10,000,000   ns   10,000 us   10 ms  20x datacenter roundtrip
Read 1 MB sequentially from disk    20,000,000   ns   20,000 us   20 ms  80x memory, 20X SSD
Send packet CA->Netherlands->CA    150,000,000   ns  150,000 us  150 ms
One thing that looks somewhat uncanny to me is that reading 1 MB sequentially from disk is only about 10 times faster than sending a round-trip packet across the Atlantic. Can anyone give me more intuition for why this feels right?
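For reference, the exact ratio is easy to check ( a trivial Python sketch, using nothing but the table's own numbers ):

>>> disk_1MB_ns  = 20_000_000       # Read 1 MB sequentially from disk
>>> rtt_CA_NL_ns = 150_000_000      # Send packet CA->Netherlands->CA
>>> rtt_CA_NL_ns / disk_1MB_ns
7.5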
Answer 1:
Q : A 1 MB SEQ-HDD-READ is ~ 10x faster than a CA/NL trans-Atlantic RTT - why does this feel right?
Some "old" values ( with a few cross-QPI/NUMA updates from 2017 ) to start from:
          0.5 ns - CPU L1 dCACHE reference
          1   ns - speed-of-light (a photon) travels a 1 ft (30.5 cm) distance
          5   ns - CPU L1 iCACHE Branch mispredict
          7   ns - CPU L2  CACHE reference
         71   ns - CPU cross-QPI/NUMA best  case on XEON E5-46*
        100   ns - MUTEX lock/unlock
        100   ns - CPU own DDR MEMORY reference
        135   ns - CPU cross-QPI/NUMA best  case on XEON E7-*
        202   ns - CPU cross-QPI/NUMA worst case on XEON E7-*
        325   ns - CPU cross-QPI/NUMA worst case on XEON E5-46*
     10,000   ns - Compress 1 KB with Zippy PROCESS (+GHz, +SIMD, +multicore tricks)
     20,000   ns - Send    2 KB over 1 Gbps NETWORK
    250,000   ns - Read    1 MB sequentially from MEMORY
    500,000   ns - Round trip within a same DataCenter
 10,000,000   ns - DISK seek
 10,000,000   ns - Read    1 MB sequentially from NETWORK
 30,000,000   ns - Read    1 MB sequentially from DISK
150,000,000   ns - Send a NETWORK packet CA -> Netherlands
  |   |   |   |
  |   |   | ns|
  |   | us|
  | ms|
Trans-Atlantic Network RTT :
- Global optical networks work roughly at the speed of light ( ~ 300,000,000 m/s in vacuum; in fibre the signal propagates at roughly 2/3 of that )
- an LA(CA)-AMS(NL) packet has to travel not the geodesic "distance", but over a set of continental and trans-Atlantic "submarine" cables, the length of which is way longer ( see the map )
These factors do not "improve" - only the transport capacity grows, while the add-on latencies introduced by light-amplifiers, retiming units and other L1-PHY / L2-/L3-networking technologies are kept under control, as small as possible.
So the LA(CA)-AMS(NL) RTT will remain, using this technology, the same ~ 150 ms
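A minimal back-of-envelope sketch of that physical floor ( Python; the 1.5x cable-route detour factor and the 2/3-of-c fibre speed are assumptions of this sketch, and the ~ 9,000 km point-to-point distance is the one quoted below ):

C_FIBRE_KM_S = 300_000 * 2 / 3       # light in optical fibre ~ 2/3 of vacuum c
geodesic_km  = 9_000                 # LA(CA) -> AMS(NL), point-to-point
route_factor = 1.5                   # real cable runs are far from geodesic
cable_km     = geodesic_km * route_factor

rtt_ms = 2 * cable_km / C_FIBRE_KM_S * 1E3
print( f"propagation-only RTT ~ {rtt_ms:.0f} ms" )   # ~ 135 ms, before any
                                                     # amplifier / retiming /
                                                     # routing add-ons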
Using another technology, LEO-Sat constellations as an example, the "distance" will only grow from the ~ 9,000 km P2P: a pair of additional GND/LEO segments plus a few additional LEO/LEO hops introduce a "longer" distance and add-on hop-to-hop re-processing latencies, and the capacity will not get anywhere close to the currently available optical transports, so no magic jump "back to the future" is to be expected ( we still miss the DeLorean ).
The HDD :
- HDDs can have a very fast and very short transport-path for moving the data, but the READ-ops have to wait for the physical / mechanical operations of the media-reading heads ( that is what takes most of the time here, not the actual data-transfer to the host RAM )
- HDDs are rotational devices; the disk has to "align" where to start the read, which costs about the first 10 [ms]
- HDDs store data in a static structure of heads ( 2+, reading physical signals from the magnetic platters' surfaces ) : cylinders ( concentric circular zones on the platter, onto which a cyl-aligned reading-head gets settled by the disk-head micro-controller ) : sectors ( angular sections of the cylinder, each carrying a block of same-sized data, ~ 4 KB, 8 KB, ... )
These factors do not "improve" - all commodity-produced drives remain at industry-selected angular speeds of about { 5k4 | 7k2 | 10k | 15k | 18k } spins/min (RPM). This means that if a well-compacted data-layout is maintained on such a disk, one continuous head:cylinder-aligned read around the whole cylinder will take:
>>> for RPM in ( 5400, 7200, 10000, 15000, 18000 ):
...     print( f"{1E3 / ( RPM / 60. ):4.1f} ms per CYL @ {RPM:5d} RPM disk" )
...
11.1 ms per CYL @  5400 RPM disk
 8.3 ms per CYL @  7200 RPM disk
 6.0 ms per CYL @ 10000 RPM disk
 4.0 ms per CYL @ 15000 RPM disk
 3.3 ms per CYL @ 18000 RPM disk
Data-density is also limited by the magnetic media properties. Spintronics R&D will bring some more densely stored data, yet the last 30 years have stayed well inside the limits of reliable magnetic storage.
More could be expected from a trick of co-parallel reads from several heads at once, yet this goes against the design of the embedded microcontrollers, so most of the reading goes on sequentially, from one head after another, into the HDD-controller's onboard buffers, best if no cyl-to-cyl mechanical re-alignment of the heads has to take place. ( Technically this depends on the prior data-to-disk layout, maintained by the O/S and possibly helped by disk-optimiser tools ( originally called disk "defragmenters" ), which simply tried to re-align the known sequences of FAT-described data-blocks so as to follow the most optimal trajectory of head:cyl:sector transitions, depending mostly on the actual device's head:head and cyl:cyl latencies. ) So even the most optimistic data-layout takes ~ 13..21 [ms] to seek-and-read just one head:cyl-path.
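Putting the mechanical pieces together reproduces the intuition asked for ( Python; the 7k2 RPM drive parameters below are illustrative assumptions for a commodity disk, not vendor specs ):

avg_seek_ms     = 10.0                          # mechanical head positioning
avg_rot_lat_ms  = 0.5 * 1E3 / ( 7200 / 60. )    # on average, wait half a spin
media_rate_MBps = 100.0                         # assumed sustained media rate
transfer_ms     = 1E3 / media_rate_MBps         # shifting 1 MB off the platters

read_1MB_ms = avg_seek_ms + avg_rot_lat_ms + transfer_ms
print( f"1 MB sequential HDD read ~ {read_1MB_ms:.0f} ms" )      # ~ 24 ms
print( f"CA/NL RTT / HDD read     ~ {150 / read_1MB_ms:.1f}x" )  # ~ 6x

Mechanics-bound disk latency and physics-bound network latency really do end up within one order of magnitude of each other.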
Laws of Physics decide
Answer 2:
Some numbers from 2020.
Load from L1 is 4 cycles on Intel Coffee Lake and Ryzen ( 0.8 ns on a 5 GHz CPU ).
Load from memory is ~ 215 cycles on Intel Coffee Lake ( 43 ns on a 5 GHz CPU ) and ~ 280 cycles on Ryzen.
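The cycle counts translate to wall-clock time as follows ( a trivial sketch; the 5 GHz clock is the same assumption as above ):

>>> GHz = 5.0                        # 5 GHz = 5 cycles per nanosecond
>>> for name, cycles in ( ( "L1 load, Coffee Lake/Ryzen", 4 ),
...                       ( "DRAM load, Coffee Lake", 215 ),
...                       ( "DRAM load, Ryzen", 280 ) ):
...     print( f"{name:28s}: {cycles:3d} cycles = {cycles / GHz:5.1f} ns" )
...
L1 load, Coffee Lake/Ryzen  :   4 cycles =   0.8 ns
DRAM load, Coffee Lake      : 215 cycles =  43.0 ns
DRAM load, Ryzen            : 280 cycles =  56.0 ns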
Source: https://stackoverflow.com/questions/59505783/latency-of-accessing-main-memory-is-almost-the-same-order-of-sending-a-packet