TL;DR: If the Linux kernel loses a buffered I/O write, is there any way for the application to find out?
I know you have to fsync(). fsync() returns -EIO if the kernel lost a write. (Note: the early part references older kernels; updated below to reflect modern kernels.)
It looks like failures of async buffer write-out in end_buffer_async_write(...) set an -EIO flag on the failed dirty buffer page for the file:
set_bit(AS_EIO, &page->mapping->flags);
set_buffer_write_io_error(bh);
clear_buffer_uptodate(bh);
SetPageError(page);
which is then detected by wait_on_page_writeback_range(...), as called by do_sync_mapping_range(...), as called by sys_sync_file_range(...), as called by sys_sync_file_range2(...), to implement the C library call fsync().
This comment on sys_sync_file_range
 * SYNC_FILE_RANGE_WAIT_BEFORE and SYNC_FILE_RANGE_WAIT_AFTER will detect any
 * I/O errors or ENOSPC conditions and will return those to the caller, after
 * clearing the EIO and ENOSPC flags in the address_space.
suggests that when fsync() returns -EIO or (undocumented in the manpage) -ENOSPC, it will clear the error state, so a subsequent fsync() will report success even though the pages never got written.
Sure enough, wait_on_page_writeback_range(...) clears the error bits when it tests them:
/* Check for outstanding write errors */
if (test_and_clear_bit(AS_ENOSPC, &mapping->flags))
    ret = -ENOSPC;
if (test_and_clear_bit(AS_EIO, &mapping->flags))
    ret = -EIO;
So if the application expects it can retry fsync() until it succeeds and then trust that the data is on disk, it is terribly wrong.
I'm pretty sure this is the source of the data corruption I found in the DBMS. It retries fsync() and thinks all will be well when it succeeds.
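For contrast with the retry loop described above, here is a minimal sketch of a caller that treats the first fsync() failure as final. The helper name write_and_sync_once is hypothetical, and the policy (give up and report the error rather than retry) is an assumption about what a cautious application might do, not something the kernel source prescribes.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: write buf to path and fsync() exactly once.
 * Returns 0 on success or -errno on the first failure.
 * A failed fsync() is deliberately NOT retried, because a later
 * fsync() may report success after the error flag was cleared. */
int write_and_sync_once(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -errno;

    const char *p = buf;
    size_t left = len;
    while (left > 0) {
        ssize_t n = write(fd, p, left);
        if (n < 0) {
            if (errno == EINTR)
                continue;           /* interrupted before writing: try again */
            int err = errno;
            close(fd);
            return -err;
        }
        p += n;
        left -= (size_t)n;
    }

    if (fsync(fd) != 0) {
        int err = errno;            /* e.g. EIO or ENOSPC */
        close(fd);
        return -err;                /* caller must assume the data is lost */
    }

    if (close(fd) != 0)
        return -errno;
    return 0;
}
```

A caller that gets a negative return would have to rewrite the data from its own copy (or enter whatever recovery it has), since the kernel's view of the file can no longer be trusted.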
The POSIX/SuS docs on fsync() don't really specify this either way:
If the fsync() function fails, outstanding I/O operations are not guaranteed to have been completed.
Linux's man-page for fsync() just doesn't say anything about what happens on failure.
So it seems that the meaning of fsync() errors is "dunno what happened to your writes, might've worked or not, better try again to be sure".
On 4.9, end_buffer_async_write sets -EIO on the page, just via mapping_set_error.
buffer_io_error(bh, ", lost async page write");
mapping_set_error(page->mapping, -EIO);
set_buffer_write_io_error(bh);
clear_buffer_uptodate(bh);
SetPageError(page);
On the sync side I think it's similar, though the structure is now pretty complex to follow. filemap_check_errors in mm/filemap.c now does:
if (test_bit(AS_EIO, &mapping->flags) &&
test_and_clear_bit(AS_EIO, &mapping->flags))
ret = -EIO;
which has much the same effect. Error checks seem to all go through filemap_check_errors which does a test-and-clear:
if (test_bit(AS_EIO, &mapping->flags) &&
test_and_clear_bit(AS_EIO, &mapping->flags))
ret = -EIO;
return ret;
I'm using btrfs on my laptop, but when I create an ext4 loopback for testing on /mnt/tmp and set up a perf probe on it:
sudo dd if=/dev/zero of=/tmp/ext bs=1M count=100
sudo mke2fs -j -T ext4 /tmp/ext
sudo mount -o loop /tmp/ext /mnt/tmp
sudo perf probe end_buffer_async_write
sudo perf probe filemap_check_errors
sudo perf record -g -e probe:end_buffer_async_write -e probe:filemap_check_errors dd if=/dev/zero of=/mnt/tmp/test bs=4k count=1 conv=fsync
I find the following call stack in perf report -T:
---__GI___libc_fsync
entry_SYSCALL_64_fastpath
sys_fsync
do_fsync
vfs_fsync_range
ext4_sync_file
filemap_write_and_wait_range
filemap_check_errors
A read-through suggests that yeah, modern kernels behave the same.
This seems to mean that if fsync() (or presumably write() or close()) returns -EIO, the file is in some undefined state between when you last successfully fsync()d or close()d it and its most recently write()ten state.
I've implemented a test case to demonstrate this behaviour.
A DBMS can cope with this by entering crash recovery. How on earth is a normal user application supposed to cope with this? The fsync() man page gives no warning that it means "fsync-if-you-feel-like-it", and I expect a lot of apps won't cope well with this behaviour.
lwn.net touched on this in the article "Improved block-layer error handling".
postgresql.org mailing list thread.
write(2) provides less than you expect. The man page is very open about the semantics of a successful write() call:
A successful return from write() does not make any guarantee that data has been committed to disk. In fact, on some buggy implementations, it does not even guarantee that space has successfully been reserved for the data. The only way to be sure is to call fsync(2) after you are done writing all your data.
We can conclude that a successful write() merely means that the data has reached the kernel's buffering facilities. If persisting the buffer fails, a subsequent access to the file descriptor will return the error code. As a last resort, that may be close(). The man page of the close(2) system call contains the following sentence:
It is quite possible that errors on a previous write(2) operation are first reported at the final close().
If your application needs to persist data right away, it has to use fsync/fdatasync on a regular basis:
fsync() transfers ("flushes") all modified in-core data of (i.e., modified buffer cache pages for) the file referred to by the file descriptor fd to the disk device (or other permanent storage device) so that all changed information can be retrieved even after the system crashed or was rebooted. This includes writing through or flushing a disk cache if present. The call blocks until the device reports that the transfer has completed.
Since the application's write() will have already returned without error, there seems to be no way to report an error back to the application.
I do not agree. write can return without error if the write is simply queued, but the error will be reported on the next operation that requires actually writing to disk: on the next fsync, possibly on a following write if the system decides to flush the cache, and at the latest on the final close.
That is why it is essential for applications to test the return value of close to detect possible write errors.
If you really need to be able to do clever error processing, you must assume that everything written since the last successful fsync may have failed, and that at least some of it has failed.
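One way to act on that assumption is to keep track of how much data a successful fsync has confirmed. The following is only a sketch: struct tracked_file and the tf_* functions are hypothetical names introduced for illustration, and it assumes an append-only file.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical bookkeeping: bytes handed to the kernel versus bytes
 * covered by the last successful fsync(). */
struct tracked_file {
    int   fd;
    off_t written;   /* bytes accepted by write() so far */
    off_t durable;   /* bytes confirmed by the last successful fsync() */
};

int tf_open(struct tracked_file *tf, const char *path)
{
    assert(tf != NULL && path != NULL);
    tf->written = 0;
    tf->durable = 0;
    tf->fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (tf->fd < 0)
        return -errno;
    return 0;
}

/* Append len bytes; returns 0 or -errno. */
int tf_append(struct tracked_file *tf, const void *buf, size_t len)
{
    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(tf->fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -errno;
        }
        p += n;
        len -= (size_t)n;
        tf->written += n;
    }
    return 0;
}

/* Returns 0 or -errno. On failure, durable is left unchanged, so
 * everything in [durable, written) must be treated as possibly lost. */
int tf_sync(struct tracked_file *tf)
{
    if (fsync(tf->fd) != 0)
        return -errno;
    tf->durable = tf->written;
    return 0;
}

/* Number of bytes not yet confirmed by a successful fsync(). */
off_t tf_suspect_bytes(const struct tracked_file *tf)
{
    return tf->written - tf->durable;
}
```

If tf_sync fails, the application still holds the byte range it cannot trust and can rewrite it from its own copy, for example into a fresh file, rather than calling fsync again and hoping.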
Use the O_SYNC flag when you open the file. It ensures the data is written to the disk.
If that does not satisfy you, nothing will.
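A minimal sketch of the O_SYNC approach, assuming the filesystem and device honour the flag. The helper name write_osync is hypothetical. With O_SYNC each write() should return only after the data has been handed to the storage device, so a write-out error is expected to surface in the return value of write() itself.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: write buf to path through an O_SYNC descriptor.
 * Returns 0 on success or -errno on the first failure. */
int write_osync(const char *path, const void *buf, size_t len)
{
    assert(path != NULL && buf != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_SYNC, 0644);
    if (fd < 0)
        return -errno;

    const char *p = buf;
    while (len > 0) {
        ssize_t n = write(fd, p, len);   /* synchronous because of O_SYNC */
        if (n < 0) {
            if (errno == EINTR)
                continue;
            int err = errno;
            close(fd);
            return -err;
        }
        p += n;
        len -= (size_t)n;
    }

    if (close(fd) != 0)
        return -errno;
    return 0;
}
```

The trade-off is throughput: every write() waits for the device, so this suits small, critical writes better than bulk output.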
Check the return value of close. close can fail whilst buffered writes appear to succeed.
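A short sketch of checking close, with hypothetical helper names close_checked and save_message. Note that on Linux the descriptor is released even when close() fails, so the sketch does not retry the call.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical wrapper: close fd and report failure as -errno.
 * The call is not retried: on Linux the descriptor is gone either way. */
int close_checked(int fd)
{
    if (close(fd) != 0)
        return -errno;
    return 0;
}

/* Hypothetical usage: write a short message and surface any error,
 * including one that only shows up at close(). */
int save_message(const char *path, const char *msg, size_t len)
{
    assert(path != NULL && msg != NULL);

    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -errno;

    ssize_t n = write(fd, msg, len);
    if (n < 0 || (size_t)n != len) {
        /* A short write is treated as a failure in this sketch. */
        int err = (n < 0) ? errno : EIO;
        close(fd);
        return -err;
    }
    return close_checked(fd);
}
```

A zero return from close() still does not prove the data reached the disk; it only means no error had been recorded against the file by that point.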