subprocess.Popen stdin read file

一整个雨季 2020-12-03 11:41

I'm trying to call a process on a file after part of it has been read. For example:

with open('in.txt', 'r') as a, open('out.txt', 'w') as b:
  hea


        
3 Answers
  • 2020-12-03 12:05

    It happens because the subprocess module extracts the file descriptor from the file object.

    http://hg.python.org/releasing/2.7.6/file/ba31940588b6/Lib/subprocess.py

    See line 1126, which is reached from line 701.

    The file object is buffered, so by the time subprocess extracts the descriptor it has already read well past the first line.
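
A minimal sketch of that mismatch, assuming Python 3 and a default buffered binary file. The helper name `offsets_after_readline` and the temporary file are illustrative and not part of the original answer. It compares the position the file object reports with the offset of the underlying descriptor.

```python
import os
import tempfile


def offsets_after_readline(path):
    """Return (logical position, OS-level descriptor offset) after one readline()."""
    with open(path, 'rb') as f:  # default buffered binary file
        f.readline()
        logical = f.tell()  # where the file object says it is
        raw = os.lseek(f.fileno(), 0, os.SEEK_CUR)  # where the descriptor is
    return logical, raw


# Build a small throwaway input file: a 7-byte header line plus some data.
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    tmp.write(b'header\n' + b'data\n' * 10)
    demo_path = tmp.name

logical, raw = offsets_after_readline(demo_path)
os.unlink(demo_path)
print(logical, raw)
```

The descriptor offset is normally well past the logical position, because the first read pulls a whole buffer's worth of data from the file. A child process that inherits the descriptor starts reading from that later offset.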

  • 2020-12-03 12:22

    As mentioned by @jfs, Popen passes the file descriptor to the child process, while Python reads the file in chunks (e.g. 4096 bytes). As a result, the position at the fd level differs from what you would expect.

    I solved it in python 2.7 by aligning the file descriptor position.

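    import codecs
    import os
    import subprocess

    _file = open(some_path)
    _file.read(len(codecs.BOM_UTF8))  # skip the BOM; read() takes a byte count
    # move the fd offset to the position the buffered file object reports
    os.lseek(_file.fileno(), _file.tell(), os.SEEK_SET)
    truncate_null_cmd = ['tr','-d', '\\000']
    subprocess.Popen(truncate_null_cmd, stdin=_file, stdout=subprocess.PIPE)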
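
The same alignment idea can be sketched as a self-contained example, assuming Python 3 and a file opened in binary mode. The helper name `run_on_remainder` is hypothetical, and a child Python interpreter stands in for an external tool such as `sort` so the example does not depend on what is installed.

```python
import os
import subprocess
import sys
import tempfile


def run_on_remainder(path, argv):
    """Read one header line, then run argv with stdin positioned right after it."""
    with open(path, 'rb') as f:
        header = f.readline()
        # Point the shared descriptor at the logical position, so the child
        # starts after the header rather than after the read-ahead buffer.
        os.lseek(f.fileno(), f.tell(), os.SEEK_SET)
        out = subprocess.check_output(argv, stdin=f)
    return header, out


with tempfile.NamedTemporaryFile('wb', delete=False) as tmp:
    tmp.write(b'header\nb\na\n')
    demo_path = tmp.name

# Stand-in for an external sorter: a child Python that sorts its stdin lines.
child = [sys.executable, '-c',
         'import sys; sys.stdout.buffer.write(b"".join(sorted(sys.stdin.buffer)))']
header, out = run_on_remainder(demo_path, child)
os.unlink(demo_path)
```

The file object should not be read again after the `os.lseek()` call, since its internal buffer no longer matches the descriptor offset.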
    
  • 2020-12-03 12:25

    If you open the file unbuffered then it works:

    import subprocess
    
    with open('in.txt', 'rb', 0) as a, open('out.txt', 'w') as b:
        header = a.readline()
        rc = subprocess.call(['sort'], stdin=a, stdout=b)
    

    The subprocess module works at the file descriptor level (the low-level unbuffered I/O of the operating system). It may work with os.pipe(), socket.socket(), pty.openpty(), or anything else with a valid .fileno() method, if the OS supports it.
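
As a rough illustration of the descriptor-level interface, the sketch below hands the read end of an `os.pipe()` to a child as its stdin. It assumes Python 3. The child command is a stand-in written for this example, not something from the original answer.

```python
import os
import subprocess
import sys

read_fd, write_fd = os.pipe()
os.write(write_fd, b'b\na\n')
os.close(write_fd)  # close the write end so the child sees EOF

# The child sorts the lines it receives on stdin and prints them.
child_code = 'import sys; sys.stdout.write("".join(sorted(sys.stdin)))'
out = subprocess.check_output([sys.executable, '-c', child_code], stdin=read_fd)
os.close(read_fd)
```

No Python-level buffer sits in front of the pipe here, so the child receives every byte that was written.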

    It is not recommended to mix the buffered and unbuffered I/O on the same file.

    On Python 2, calling file.flush() causes the output to appear, e.g.:

    import subprocess
    # 2nd
    with open(__file__) as file:
        header = file.readline()
        file.seek(file.tell()) # synchronize (for io.open and Python 3)
        file.flush()           # synchronize (for C stdio-based file on Python 2)
        rc = subprocess.call(['cat'], stdin=file)
    

    The issue can be reproduced without subprocess module with os.read():

    #!/usr/bin/env python
    # 2nd
    import os
    
    with open(__file__) as file: #XXX fully buffered text file EATS INPUT
        file.readline() # ignore header line
        os.write(1, os.read(file.fileno(), 1<<20))
    

    If the buffer size is small then the rest of the file is printed:

    #!/usr/bin/env python
    # 2nd
    import os
    
    bufsize = 2 #XXX MAY EAT INPUT
    with open(__file__, 'rb', bufsize) as file:
        file.readline() # ignore header line
        os.write(2, os.read(file.fileno(), 1<<20))
    

    It eats more input if the first line size is not evenly divisible by bufsize.

    The default bufsize and bufsize=1 (line-buffered) behave similarly on my machine: the beginning of the file vanishes, around 4KB of it.

    For all buffer sizes, file.tell() reports the position at the beginning of the 2nd line. Using next(file) instead of file.readline() leads to file.tell() returning around 5K on my machine on Python 2 due to the read-ahead buffer bug (io.open() gives the expected 2nd line position).

    Trying file.seek(file.tell()) before the subprocess call doesn't help on Python 2 with default stdio-based file objects. It works with open() functions from io, _pyio modules on Python 2 and with the default open (also io-based) on Python 3.

    Trying io, _pyio modules on Python 2 and Python 3 with and without file.flush() produces various results. It confirms that mixing buffered and unbuffered I/O on the same file descriptor is not a good idea.
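
One way to sidestep the mixing altogether, sketched here as an alternative rather than as anything the answer itself proposes, is to do every read through the buffered file object and feed the remainder to the child over a pipe. This assumes Python 3.5+ for `subprocess.run()`. The helper name `feed_remainder` is hypothetical, and the whole remainder is held in memory, so it only suits files that fit comfortably in RAM.

```python
import os
import subprocess
import sys
import tempfile


def feed_remainder(path, argv):
    """Read the header in Python, then pipe the rest of the file to argv."""
    with open(path, 'rb') as f:
        header = f.readline()
        rest = f.read()  # every read goes through the same buffered object
    result = subprocess.run(argv, input=rest, stdout=subprocess.PIPE, check=True)
    return header, result.stdout


with tempfile.NamedTemporaryFile('wb', delete=False) as tmp:
    tmp.write(b'header\nb\na\n')
    demo_path = tmp.name

# Stand-in for an external sorter: a child Python that sorts its stdin lines.
child = [sys.executable, '-c',
         'import sys; sys.stdout.buffer.write(b"".join(sorted(sys.stdin.buffer)))']
header, out = feed_remainder(demo_path, child)
os.unlink(demo_path)
```

Because the child never touches the file's descriptor, the offset mismatch described above cannot occur.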
