Are disk sector writes atomic?

名媛妹妹 2020-12-22 20:31

Clarified Question:

When the OS sends the command to write a sector to disk, is it atomic? i.e., does the write of the new data succeed fully, or is the old data left intact, should the power fail in the middle of the write?

9 Answers
  • 2020-12-22 20:32

    People don't seem to agree on what happens during a sector write if the power fails. Maybe because it depends on the hardware being used, and even the filesystem.

    From Wikipedia (http://en.wikipedia.org/wiki/Journaling_file_system):

    Some disk drives guarantee write atomicity during a power failure. Others, however, may stop writing midway through a sector after power is lost, leaving it mismatched against its error-correcting code. The sector is thus corrupt and its contents lost. A physical journal guards against such corruption because it holds a complete copy of the sector, which it can replay over the corruption upon next mount.

    This seems to suggest that some hard drives will not finish writing the sector, but that a journaling filesystem can protect you from data loss the same way an xlog protects a database.

    From the Linux kernel mailing list, in a discussion on the ext3 journaling filesystem:

    In any case bad sector checksum is hardware bug. Sector write is supposed to be atomic, it either happens or not.

    I'd tend to believe that over the wiki comment. Actually, the very existence of a database (Firebird) with no xlog implies that a sector write is atomic, that it cannot clobber data you did not mean to change.

    There's quite a bit of discussion here about the atomicity of sector writes, and again no agreement. But the people who are disagreeing seem to be talking about multiple-sector writes, which are not atomic on many modern hard drives. Those who are saying sector writes are atomic do seem to know more about what they're talking about. The sketch below shows how a physical journal exploits this.
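
    To make the journal idea concrete, here is a minimal sketch of the write path a physical journal uses. This is illustrative C with invented names and an invented on-disk layout, not any real filesystem's format: the full new sector image goes to the journal first and is made durable, and only then is the sector written in place. If power fails during the in-place write, replay copies the journaled image over the corrupt sector on the next mount.

        /* Physical-journal sketch (illustrative; layout and names invented).
         * Write path: 1) append the full sector image plus a commit mark to
         * the journal, 2) fsync the journal, 3) write the sector in place.
         * A sector torn during step 3 is repaired at mount by replay. */
        #include <stdint.h>
        #include <string.h>
        #include <unistd.h>

        #define SECTOR_SIZE  512
        #define COMMIT_MAGIC 0x4A524E4Cu            /* "JRNL" */

        struct journal_entry {
            uint64_t target_sector;                 /* home location of the data */
            uint8_t  data[SECTOR_SIZE];             /* complete copy of the new sector */
            uint32_t commit_magic;                  /* valid only once fully written */
        };

        int journaled_write(int dev_fd, int jrnl_fd,
                            uint64_t sector, const uint8_t *buf)
        {
            struct journal_entry e;
            e.target_sector = sector;
            memcpy(e.data, buf, SECTOR_SIZE);
            e.commit_magic = COMMIT_MAGIC;

            if (write(jrnl_fd, &e, sizeof e) != sizeof e) return -1;
            if (fsync(jrnl_fd) != 0) return -1;     /* journal copy durable first */

            /* The in-place write; a power failure here is now recoverable. */
            if (pwrite(dev_fd, buf, SECTOR_SIZE,
                       (off_t)sector * SECTOR_SIZE) != SECTOR_SIZE) return -1;
            return fsync(dev_fd);
        }

        /* Recovery on mount: replay every committed entry over its home sector. */
        void replay_journal(int dev_fd, int jrnl_fd)
        {
            struct journal_entry e;
            while (read(jrnl_fd, &e, sizeof e) == sizeof e) {
                if (e.commit_magic != COMMIT_MAGIC)
                    break;                          /* uncommitted tail: stop */
                pwrite(dev_fd, e.data, SECTOR_SIZE,
                       (off_t)e.target_sector * SECTOR_SIZE);
            }
            fsync(dev_fd);
        }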

  • 2020-12-22 20:35

    Nobody seems to agree on this question. So I spent a lot of time trying different Google queries until I finally found an answer.

    From Dr. Stephen Tweedie, Red Hat employee and Linux kernel filesystem and virtual memory developer, in a talk on ext3 (which he developed); transcript here. If anyone knows, it'd be him.

    "It's not sufficient just to write the thing to the journal, because there's got to be some mark in the journal which says: well, (has this journal record actually) does this journal record actually represent a complete consistency to the disk? And the way you do that is by having some atomic operation which marks that transaction as being complete on disk" [23m, 14s]

    "Now, disks these days actually make these guarantees. If you start a write operation to a disk, then even if the power fails in the middle of that sector write, the disk has enough power available, and it can actually steal power from the rotational energy of the spindle; it has enough power to complete the write of the sector that's being written right now. In all cases, the disks make that guarantee." [23m, 41s]

  • 2020-12-22 20:35

    I suspect this assumption is wrong.

    Modern HDDs encode the data in sectors and additionally protect it with ECC. An interrupted write can therefore garble the entire sector's contents: the half-written bits simply no longer make sense under the encoding used, and the ECC check fails on the next read.

    As for the increasingly popular SSDs, the situation is even more gruesome: a block must be erased before it can be overwritten, so, depending on the firmware in use and the amount of free space, entirely unrelated sectors can be damaged.

    By the way, an OS crash will not lead to data being damaged within a single sector.
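
    A practical consequence, sketched below under the assumption that the drive reports an ECC mismatch as a read error (on Linux this typically surfaces as EIO): software sees a torn sector as an unreadable one, not as plausible-looking garbage, and can fall back to a journal or mirror copy.

        /* Sketch: a sector garbled mid-write fails its ECC check on the next
         * read, and the drive reports an error instead of returning garbage.
         * Assumes the error surfaces as EIO; the -2 "torn sector" convention
         * is invented for illustration. */
        #include <errno.h>
        #include <stdint.h>
        #include <unistd.h>

        #define SECTOR_SIZE 512

        ssize_t read_sector(int fd, uint64_t sector, uint8_t *buf)
        {
            ssize_t n = pread(fd, buf, SECTOR_SIZE, (off_t)sector * SECTOR_SIZE);
            if (n < 0 && errno == EIO) {
                /* Likely a torn or corrupt sector: recover from a redundant
                 * copy (journal, mirror) rather than trusting the data. */
                return -2;
            }
            return n;
        }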

  • 2020-12-22 20:42

    I would expect one torn page to consist of part X, part Y, and part unreadable sector. If a head is in the middle of writing a sector when the power fails, the drive should park the heads immediately, so that the rest of the drive (aside from that one sector) will remain undamaged.

    In some cases I would expect several torn pages consisting of part X and part Y, but only one torn page would include an unreadable sector. The reason for several torn pages is that the drive can buffer lots of writes internally, and the order of writing might interleave various sectors from various pages.

    I've read conflicting stories about whether a new write to the unreadable sector will make it readable again. Even if the answer is yes, that will be new data Z, neither X nor Y.
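
    Databases detect exactly this mixed part-X/part-Y state. One common scheme (SQL Server's torn-page detection works along these lines) stamps every sector-sized chunk of a page with the same value at write time; mismatched stamps on read reveal a torn page. A rough sketch, with an invented layout:

        /* Torn-page detection sketch; the layout is invented, but the idea
         * matches e.g. SQL Server's torn-page protection. Every 512-byte
         * chunk of a logical page carries the same write-sequence stamp; if
         * a crash leaves the page part old and part new, the stamps differ. */
        #include <stdbool.h>
        #include <stdint.h>
        #include <string.h>

        #define SECTOR_SIZE 512
        #define PAGE_SIZE   8192
        #define CHUNKS      (PAGE_SIZE / SECTOR_SIZE)

        /* Before writing the page out, stamp each chunk with this write's
         * sequence number (stored in the first 4 bytes of each chunk here,
         * purely for brevity). */
        void stamp_page(uint8_t page[PAGE_SIZE], uint32_t write_seq)
        {
            for (int i = 0; i < CHUNKS; i++)
                memcpy(page + i * SECTOR_SIZE, &write_seq, sizeof write_seq);
        }

        /* On read: equal stamps mean the page was written in one piece; any
         * mismatch means some sectors are from write X and some from write Y. */
        bool page_is_torn(const uint8_t page[PAGE_SIZE])
        {
            uint32_t first, s;
            memcpy(&first, page, sizeof first);
            for (int i = 1; i < CHUNKS; i++) {
                memcpy(&s, page + i * SECTOR_SIZE, sizeof s);
                if (s != first)
                    return true;
            }
            return false;
        }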

  • 2020-12-22 20:44

    The answer to your first question depends on the hardware involved. At least with some older hardware, the answer was yes -- a power failure could result in garbage being written to the disk. Most current disks, however, have a bit of a "UPS" built into the disk itself -- a capacitor that's large enough to power the disk long enough to write the data in the on-disk cache out to the disk platter. They also have circuitry to detect whether the power supply is still good, so when the power gets flaky, they write the data in the cache to the platter and ignore garbage they might receive.

    As far as a "torn page" goes, a typical disk only accepts commands to write an entire sector at a time, so what you'll get will normally be an integral number of sectors written correctly, and others remaining unchanged. If, however, you're using a logical page size that's larger than a single sector, you can certainly end up with a page that's partially written.

    That, however, mostly applies to a direct connection to a normal moving-platter type hard drive. With almost anything else, the rules can and often will be different. Just for an obvious example, if you're writing over the network, you're mostly at the mercy of the network protocol in use. If you transmit data over TCP, data that fails the checksum will be rejected and retransmitted, but the same data transmitted over UDP, with the same corruption, might be accepted.

  • 2020-12-22 20:46

    I think torn pages are not the problem. As far as I know, all drives have enough power stored to finish writing the current sector when the power fails.

    The problem is that everybody lies.

    At least when it comes to the database knowing when a transaction has been committed to disk, everybody lies. The database issues an fsync, and the operating system only returns when all outstanding writes have been committed to disk, right? Maybe not. It's common, especially with RAID cards and/or SATA drives, for your program to be told everything has committed (that is, fsync returns) and yet there is data not yet on the drive.

    You can try using Brad's diskchecker to find out whether the platform you are going to use for your database can survive pulling the plug without losing data. The bottom line: if diskchecker fails, the platform is not safe for running a database. Databases with ACID semantics rely on knowing when a transaction has been committed to backing store and when it has not. This is true whether or not the database uses write-ahead logging (and if the database returns to the user without having done an fsync, then transactions can be lost in the event of a failure, so it should not claim to provide ACID semantics).
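
    For reference, the write path whose honesty diskchecker tests looks roughly like the sketch below (error handling trimmed; the function name is invented). The durability contract is simple: once fsync returns, the data must survive a power cut, and any caching layer that acknowledges early breaks it.

        /* The durability contract a database depends on: after fsync()
         * returns, the data must be on stable storage. diskchecker exists
         * precisely to test whether the whole stack (OS, RAID card, drive
         * cache) honors this. Illustrative sketch only. */
        #include <fcntl.h>
        #include <unistd.h>

        int durable_append(const char *path, const void *buf, size_t len)
        {
            int fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0644);
            if (fd < 0) return -1;

            if (write(fd, buf, len) != (ssize_t)len) { close(fd); return -1; }

            /* The commit point: fsync must not return until the data is on
             * stable storage. If anything lies here, "committed" transactions
             * can vanish when the plug is pulled. */
            if (fsync(fd) != 0) { close(fd); return -1; }

            return close(fd);
        }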

    There's a long thread on the Postgresql mailing list discussing durability. It starts out talking about SSDs, but then it gets into SATA drives, SCSI drives, and file systems. You may be surprised to learn how exposed your data can be to loss. It's a good thread for anyone with a database that needs durability, not just those running Postgresql.
