How can I simulate a failed disk during testing?

情书的邮戳 2020-12-08 01:11

In a Linux VM (VMware Workstation or similar), how can I simulate a failure on a previously working disk?

I have a situation happening in production where a disc fai

7 Answers
  • 2020-12-08 01:41

    There are several layers at which a disk error can be simulated. If you are testing a single user-space program, probably the simplest approach is to interpose the appropriate calls (e.g. write()) and have them sometimes return an error. The libfiu fault-injection library can do this using its fiu-run tool.
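
    As a rough sketch of the libfiu approach, the wrapper below runs a command with random failures injected into POSIX I/O calls. The function name run_with_io_faults and the 5% default probability are illustrative assumptions, and the fiu-run options follow the example in the libfiu documentation as best I understand it, so check man fiu-run before relying on them.

```shell
#!/bin/sh
# Sketch: run a command under libfiu with random POSIX I/O failures.
# PROBABILITY and the function name are assumptions made for illustration.
PROBABILITY="${PROBABILITY:-0.05}"

run_with_io_faults() {
    if ! command -v fiu-run >/dev/null 2>&1; then
        echo "fiu-run not found; install libfiu first" >&2
        return 127
    fi
    # -x preloads the POSIX wrapper library; -c passes a libfiu control command.
    # The failure point name posix/io/* covers calls such as read() and write().
    fiu-run -x -c "enable_random name=posix/io/*,probability=${PROBABILITY}" "$@"
}

# Example (not run here): run_with_io_faults cp bigfile /mnt/target/
```

    Only the wrapped process sees the errors, which is why this suits testing a single user-space program rather than a whole system.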

    Another approach is to use a kernel driver that passes data through to and from another device but injects faults along the way. You can then mount the device and use it from any application as if it were a faulty disk. The fsdisk driver is an example of this.

    There is also a fault injection infrastructure that has been merged into the Linux kernel, although you will probably need to reconfigure your kernel to enable it. It is documented in Documentation/fault-injection/fault-injection.txt. This is useful for testing kernel code.

    It is also possible to use SystemTap to inject faults at the kernel level. See "The SCSI fault injection test" and "Kernel Fault injection using SystemTap".

  • 2020-12-08 01:41

    You can use the scsi_debug kernel module to create a RAM-backed SCSI disk. Its opts and every_nth options let it inject a range of SCSI errors.

    See http://sg.danny.cz/sg/sdebug26.html for details.

    Example of a medium error at sector 4656:

    [fge@Gris-Laptop ~]$ sudo modprobe scsi_debug opts=2 every_nth=1
    [fge@Gris-Laptop ~]$ sudo dd if=/dev/sdb of=/dev/null
    dd: error reading ‘/dev/sdb’: Input/output error
    4656+0 records in
    4656+0 records out
    2383872 bytes (2.4 MB) copied, 0.021299 s, 112 MB/s
    [fge@Gris-Laptop ~]$ dmesg|tail
    [11201.454332] blk_update_request: critical medium error, dev sdb, sector 4656
    [11201.456292] sd 5:0:0:0: [sdb] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    [11201.456299] sd 5:0:0:0: [sdb] Sense Key : Medium Error [current] 
    [11201.456303] sd 5:0:0:0: [sdb] Add. Sense: Unrecovered read error
    [11201.456308] sd 5:0:0:0: [sdb] CDB: Read(10) 28 00 00 00 12 30 00 00 08 00
    [11201.456312] blk_update_request: critical medium error, dev sdb, sector 4656
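
    The sector number in the dmesg output can be cross-checked against the CDB line: in a READ(10) command, bytes 2-5 hold the big-endian LBA and bytes 7-8 the transfer length. A small sketch (decode_read10 is a hypothetical helper name, not part of any tool):

```shell
#!/bin/sh
# Decode the LBA and transfer length from a READ(10) CDB given as hex bytes.
decode_read10() {
    # Word-split the CDB string into positional parameters $1..$10.
    set -- $1
    # $1 = opcode (28), $2 = flags, $3-$6 = LBA, $7 = group, $8-$9 = length.
    lba=$(( 0x$3$4$5$6 ))
    blocks=$(( 0x$8$9 ))
    echo "lba=$lba blocks=$blocks"
}

decode_read10 "28 00 00 00 12 30 00 00 08 00"
# prints: lba=4656 blocks=8
```

    Here 0x00001230 is 4656, which matches both the reported sector and the 4656 records dd read before failing.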
    

    You can change the opts and every_nth options at runtime via sysfs:

    echo 2 | sudo tee /sys/bus/pseudo/drivers/scsi_debug/opts
    echo 1 | sudo tee /sys/bus/pseudo/drivers/scsi_debug/every_nth
    
  • 2020-12-08 01:49

    A simple way to make a SCSI disk disappear with a 2.6 kernel is:

    echo 1 > /sys/bus/scsi/devices/H:B:T:L/delete
    

    (H:B:T:L is host, bus, target, LUN). To simulate the read-only case you'll have to use the fault injection methods that mark4o mentioned, though.
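
    If you only know the device name, the H:B:T:L address can usually be read from sysfs, since /sys/block/<dev>/device is a symlink whose last path component is that address. A minimal sketch, assuming that layout; scsi_address and the SYSFS_ROOT override are hypothetical names, and the override exists only so the function can be tried without real hardware:

```shell
#!/bin/sh
# Resolve the H:B:T:L address of a SCSI disk (e.g. sdb) from sysfs.
scsi_address() {
    dev="$1"
    root="${SYSFS_ROOT:-/sys}"
    # Follow the "device" symlink and keep its final path component.
    basename "$(readlink -f "$root/block/$dev/device")"
}

# Example (not run here, needs root):
#   addr=$(scsi_address sdb)
#   echo 1 | sudo tee "/sys/bus/scsi/devices/$addr/delete"
# To bring the disk back, rescanning the host is one option (an assumption,
# not something the answer above mentions):
#   echo "- - -" | sudo tee "/sys/class/scsi_host/host${addr%%:*}/scan"
```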

  • 2020-12-08 01:53

    One can also use methods that are provided by the disks to do media error testing. SCSI has a WRITE LONG command that can be used to corrupt a block by writing data with invalid ECC. SATA and NVMe also have similar commands.

    For the most common case (SATA) you can use hdparm with --make-bad-sector to issue that command. For SCSI you can use sg_write_long, and for NVMe you can use nvme-cli with its write-uncor command.

    The big advantage these commands have over other injection methods is that the drive behaves exactly as it would with a real media error: the full latency impact, the drive's own error counters going up, and recovery by reallocation when that sector is next written.

    The disadvantage is that if you do this too often on the same drive, its error counters will climb and SMART may flag the disk as bad, or you may exhaust its reallocation tables. So use it for manual testing, but in automated testing don't run it too often.
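
    To make the three variants concrete, here is a sketch that only prints the command for each drive type instead of running it, because all of them deliberately destroy data in the target sector. bad_sector_cmd is a hypothetical helper, /dev/sdX is a placeholder, and the option spellings are as I recall them from the hdparm, sg3_utils and nvme-cli man pages, so verify them against your installed versions.

```shell
#!/bin/sh
# Print (do not run) the command that marks one LBA unreadable.
# Usage: bad_sector_cmd sata|scsi|nvme DEVICE LBA
bad_sector_cmd() {
    kind="$1" dev="$2" lba="$3"
    case "$kind" in
        # hdparm refuses to do this without the explicit confirmation flag.
        sata) echo "hdparm --yes-i-know-what-i-am-doing --make-bad-sector $lba $dev" ;;
        # WRITE LONG with the WR_UNCOR bit marks the block as uncorrectable.
        scsi) echo "sg_write_long --wr_uncor --lba=$lba $dev" ;;
        # --block-count is zero-based in nvme-cli, so 0 means one block.
        nvme) echo "nvme write-uncor $dev --start-block=$lba --block-count=0" ;;
        *) echo "unknown drive type: $kind" >&2; return 1 ;;
    esac
}

bad_sector_cmd sata /dev/sdX 4656
```

    Writing to the sector afterwards should clear the error; for SATA, hdparm's --write-sector (also spelled --repair-sector) is one way to do that.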

  • 2020-12-08 01:55

    To add to mark4o's answer, you can also use Linux's Device Mapper to generate failing devices.

    Device Mapper's delay device can be used to send read and write I/O of the same block to different underlying devices (it can also delay that I/O as its name suggests). Device Mapper's error device can be used to generate permanent errors when a particular block is accessed. By combining the two you can create a device where writes always fail but reads always succeed for a given area.

    The above is a more complicated example of what is described in the question "Simulate a faulty block device with read errors?" (see https://stackoverflow.com/a/1871029 for a simple Device Mapper example).

    There is also a list of Linux disk fault injection mechanisms on the Unix & Linux question "Special File that causes I/O error".
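
    For the simple case (the error target alone, without the delay target), a table generator might look like the sketch below. It maps a device one-to-one except for a run of sectors that always fail. dm_error_table, the mapping name faulty and /dev/sdX are placeholders chosen for illustration.

```shell
#!/bin/sh
# Emit a Device Mapper table: DEV mapped 1:1, except COUNT sectors at START
# which are routed to the "error" target. Sizes are in 512-byte sectors.
# Usage: dm_error_table DEV TOTAL_SECTORS START COUNT
dm_error_table() {
    dev="$1" total="$2" start="$3" count="$4"
    end=$(( start + count ))
    # Healthy region before the bad range, if any.
    if [ "$start" -gt 0 ]; then
        echo "0 $start linear $dev 0"
    fi
    # Every access to this range returns an I/O error.
    echo "$start $count error"
    # Healthy region after the bad range, if any.
    if [ "$end" -lt "$total" ]; then
        echo "$end $(( total - end )) linear $dev $end"
    fi
}

# Example (not run here, needs root):
#   dm_error_table /dev/sdX "$(blockdev --getsz /dev/sdX)" 4656 8 \
#       | sudo dmsetup create faulty
#   # then use /dev/mapper/faulty; remove it with: sudo dmsetup remove faulty
```

    With the error target alone, both reads and writes to the bad range fail; getting writes to fail while reads succeed is where the delay target comes in.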

  • You can also use a low-level SCSI utility (sg3-utils) to stop the drive. It will still respond to Inquiry, so its state will still be "running" but reads and writes will fail until it is started again. I've tested RAID drive removal and recovery using mdadm this way.

    sg_start --stop /dev/sdb
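
    When testing md RAID this way, it helps to confirm that the array actually noticed the failure. Failed members are marked with (F) in /proc/mdstat, so a check could be sketched as follows; failed_md_members is a hypothetical helper, and the device and array names in the comments are examples only.

```shell
#!/bin/sh
# List md member devices marked failed "(F)" in an mdstat-format file.
# Defaults to /proc/mdstat; a path can be passed for testing.
failed_md_members() {
    # Match entries such as "sdb1[1](F)" and strip everything from "[" on.
    grep -o '[[:alnum:]]*\[[0-9]*](F)' "${1:-/proc/mdstat}" | sed 's/\[.*//'
}

# Example cycle (not run here, needs root and sg3-utils):
#   sg_start --stop /dev/sdb        # spin the drive down
#   ...generate some I/O on the array so md sees the errors...
#   failed_md_members               # should now list the sdb member
#   sg_start --start /dev/sdb       # spin it back up
#   mdadm /dev/md0 --re-add /dev/sdb1
```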
    