Large file not flushed to disk immediately after calling close()?

Posted by 余生长醉 on 2019-12-21 03:36:05

Question


I'm creating large files with my Python script (each more than 1 GB; there are actually 8 of them). Right after creating them, I have to start a process that will use those files.

The script looks like:

import os
import subprocess
import threading
import time

# This is a more complex function, but it basically does this:
def use_file():
    subprocess.call(['C:\\use_file', 'C:\\foo.txt'])


f = open('C:\\foo.txt', 'wb')
for i in range(10000):
    f.write(one_MB_chunk)  # one_MB_chunk: a 1 MB bytes object (definition not shown)
f.flush()
os.fsync(f.fileno())
f.close()

time.sleep(5)  # With this line added it just works fine

t = threading.Thread(target=use_file)
t.start()

But the use_file application acts as if foo.txt were empty. There are some weird things going on:

  • if I execute C:\use_file C:\foo.txt in a console (after the script has finished), I get correct results
  • if I manually execute use_file() in another Python console, I get correct results
  • C:\foo.txt is visible on disk right after open() is called, but its size remains 0 B until the end of the script
  • if I add time.sleep(5), it starts working as expected (or rather, as required)

I've already found:

  • os.fsync() but it doesn't seem to work (result from use_file is as if C:\foo.txt was empty)
  • Using buffering=(1<<20) (when opening file) doesn't seem to work either

I'm more and more curious about this behaviour.

Questions:

  • Does Python fork the close() operation into the background? Where is this documented?
  • How can I work around this?
  • Am I missing something?
  • Given that adding sleep helps: is this a Windows/Python bug?

Notes (in case something is wrong on the other side): the use_file application uses:

handle = CreateFile("foo.txt", GENERIC_READ, FILE_SHARE_READ, NULL,
                               OPEN_EXISTING, 0, NULL);
size = GetFileSize(handle, NULL);

And then processes size bytes from foo.txt.


Answer 1:


f.close() calls f.flush(), which sends the data to the OS. That doesn't necessarily write the data to disk, because the OS buffers it. As you rightly worked out, if you want to force the OS to write it to disk, you need to call os.fsync().
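
As a rough sketch of those two layers of buffering (the helper name write_durably and its parameters are made up for illustration and are not from the question):

```python
import os

def write_durably(path, chunks):
    """Write chunks to path, then push them through both buffer layers."""
    with open(path, 'wb') as f:
        for chunk in chunks:
            f.write(chunk)
        f.flush()              # Python's userspace buffer -> operating system
        os.fsync(f.fileno())   # operating system cache -> storage device
    # Leaving the with block closes the file.
    return os.path.getsize(path)
```

Note that this only covers durability. It would not, by itself, change which processes hold a handle to the file.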

Have you considered just piping the data directly into use_file?
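
A minimal sketch of that idea, assuming use_file could be changed to read its input from standard input (the question does not say whether it can; pipe_to_child and its arguments are hypothetical names):

```python
import subprocess

def pipe_to_child(argv, chunks):
    """Stream chunks to the child's stdin and return its exit code."""
    child = subprocess.Popen(argv, stdin=subprocess.PIPE)
    try:
        for chunk in chunks:
            child.stdin.write(chunk)
    finally:
        child.stdin.close()    # signals end-of-input to the child
    return child.wait()
```

With this approach no intermediate file is created at all, so there is nothing on disk for the child to find empty or locked.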


EDIT: you say that os.fsync() 'doesn't work'. To clarify, if you do

f = open(...)
# write data to f
f.flush()
os.fsync(f.fileno())
f.close()

import pdb; pdb.set_trace()

and then look at the file on disk, does it have data?




Answer 2:


Edit: updated with information specific to Python 3.x

There is a super old bug report discussing a suspiciously similar problem at https://bugs.python.org/issue4944. I made a small test that shows the bug: https://gist.github.com/estyrke/c2f5d88156dcffadbf38

After getting a wonderful explanation from user eryksun at the bug link above, I now understand why this happens, and it is not a bug per se. When a child process is created on Windows, by default it inherits all open file handles from the parent process. So what you're seeing is probably actually a sharing violation because the file you're trying to read in the child process is open for writing through an inherited handle in another child process. A possible sequence of events that causes this (using the reproduction example at the Gist above):

Thread 1 opens file 1 for writing
  Thread 2 opens file 2 for writing
  Thread 2 closes file 2
  Thread 2 launches child 2
  -> Inherits the file handle from file 1, still open with write access
Thread 1 closes file 1
Thread 1 launches child 1
-> Now it can't open file 1, because the handle is still open in child 2
Child 2 exits
-> Last handle to file 1 closed
Child 1 exits

When I compile the simple C child program and run the script on my machine, it fails in at least one of the threads most of the time with Python 2.7.8. With Python 3.2 and 3.3 the test script without redirection does not fail, because the default value of the close_fds argument to subprocess.call is now True when redirection is not used. The other test script using redirection still fails in those versions. In Python 3.4 both tests succeed, because of PEP 446 which makes all file handles non-inheritable by default.
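
To check which behaviour a given interpreter has, something like the following sketch can be used. It relies on os.get_inheritable, which was added by PEP 446 and therefore needs Python 3.4 or later; the function name and the scratch file name are illustrative only.

```python
import os
import tempfile

def inheritable_by_default():
    """Open a scratch file with open() and report its inheritable flag."""
    with tempfile.TemporaryDirectory() as scratch_dir:
        path = os.path.join(scratch_dir, 'probe.bin')
        with open(path, 'wb') as f:
            # False means a child process will not inherit this descriptor.
            return os.get_inheritable(f.fileno())
```

On Python 3.4 and later this should report that the descriptor is not inheritable, which is why the sequence of events above no longer occurs there by default.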

Conclusion

Spawning a child process from a thread in Python means the child inherits all open file handles, even from other threads than the one where the child is spawned. This is, at least for me, not particularly intuitive.

Possible solutions:

  • Upgrade to Python 3.4, where file handles are non-inheritable by default.
  • Pass close_fds=True to subprocess.call to disable inheriting altogether (this is the default in Python 3.x). Note though that this prevents redirection of the child process' standard input/output/error.
  • Make sure all files are closed before spawning new processes.
  • Use os.open to open files with the os.O_NOINHERIT flag on Windows.
    • tempfile.mkstemp also uses this flag.
  • Use the win32api instead. Passing a NULL pointer for the lpSecurityAttributes parameter also prevents inheriting the descriptor:

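    from contextlib import contextmanager
    import win32file

    @contextmanager
    def winfile(filename):
        # Create the handle before entering try, so that a failed
        # CreateFile does not reach CloseHandle with h undefined.
        # None for the security attributes makes the handle non-inheritable.
        h = win32file.CreateFile(filename, win32file.GENERIC_WRITE, 0, None,
                                 win32file.CREATE_ALWAYS, 0, 0)
        try:
            yield h
        finally:
            win32file.CloseHandle(h)

    with winfile(tempfilename) as infile:
        win32file.WriteFile(infile, data)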
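
The os.open option from the list above could look roughly like this. It is a sketch, not a tested fix for the question's setup: write_noinherit is a hypothetical helper, and because os.O_NOINHERIT and os.O_BINARY exist only on Windows, the code falls back to 0 so that it still runs elsewhere.

```python
import os

# Windows-only flags; fall back to 0 on other platforms.
NOINHERIT = getattr(os, 'O_NOINHERIT', 0)
BINARY = getattr(os, 'O_BINARY', 0)

def write_noinherit(path, data):
    """Create path via os.open so child processes do not inherit the handle."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC | BINARY | NOINHERIT
    fd = os.open(path, flags)
    with os.fdopen(fd, 'wb') as f:   # wrap the raw descriptor in a file object
        f.write(data)
    return os.path.getsize(path)
```

Unlike the win32file version, this needs no third-party package, and the resulting file object can be used like any other.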
    


Source: https://stackoverflow.com/questions/13761961/large-file-not-flushed-to-disk-immediately-after-calling-close
