Python slow read performance issue

生来不讨喜 2020-12-12 18:49

Following an earlier thread I boiled down my problem to its bare bones: in migrating from a Perl script to a Python one I found a huge performance issue with slurping files.
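
The slurp pattern is essentially one full read per file, along these lines (a minimal sketch; the directory name is only a placeholder):

    import pathlib

    # Slurp: read each file completely with a single call.
    data = {}
    for path in sorted(pathlib.Path("xmldir").glob("*.xml")):  # "xmldir" is a placeholder
        data[path.name] = path.read_bytes()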

1 Answer
  • 2020-12-12 19:42

    I will focus on only one of your examples, because the rest should be analogous:

    What I think may matter in this situation is the read-ahead feature (or maybe another related technique):

    Let's consider the following example:

    I created 1000 XML files in dir "1" (named 1.xml to 1000.xml) with the dd command, as you did, and then copied the original dir 1 to dir 2:

    $ mkdir 1
    $ cd 1
    $ for i in {1..1000}; do dd if=/dev/urandom of=$i.xml bs=1K count=10; done
    $ cd ..
    $ cp -r 1 2
    # drop the page cache so the following reads really hit the disk
    $ sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
    $ time strace -f -c -o trace.copy2c cp -r 2 2copy
    $ sync; sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'
    $ time strace -f -c -o trace.copy1c cp -r 1 1copy
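
    To run the same comparison from Python instead of cp, you can drop the caches as above and then time slurping each directory; a minimal sketch (the directory names come from the setup above, the timing harness itself is only an illustration):

    import pathlib
    import time

    def slurp_dir(dirname):
        # Read every file in dirname completely; return total bytes read.
        total = 0
        for path in sorted(pathlib.Path(dirname).iterdir()):
            total += len(path.read_bytes())
        return total

    # Drop caches first (sync; echo 3 > /proc/sys/vm/drop_caches), then:
    for d in ("1", "2"):
        start = time.perf_counter()
        nbytes = slurp_dir(d)
        print(f"{d}: {nbytes} bytes in {time.perf_counter() - start:.3f} s")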
    

    In the next step I traced the cp command (with strace) to find out in what order the data is copied:

    So cp does it in the following order (only the first 4 files are shown, because I saw that the second read from the original directory is more time-consuming than the second read from the copied directory):

    100.xml 150.xml 58.xml 64.xml ... (in my example)

    Now, take a look at the filesystem blocks used by these files (debugfs output, ext3 filesystem):

    Original directory:

    BLOCKS:
    (0-9):63038-63047 100.xml
    (0-9):64091-64100 150.xml
    (0-9):57926-57935 58.xml
    (0-9):60959-60968 64.xml
    ....
    
    
    Copied directory:
    BLOCKS:
    (0-9):65791-65800 100.xml
    (0-9):65801-65810 150.xml
    (0-9):65811-65820 58.xml
    (0-9):65821-65830 64.xml
    ....

    As you can see, in the copied directory the blocks are adjacent, which means that while reading the first file 100.xml the read-ahead technique (controller or system settings) can increase performance.
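
    (From Python you can also ask the kernel for read-ahead explicitly before reading a file; this is a minimal sketch of that idea, assuming Linux and Python 3.3+ where os.posix_fadvise is available, and it is my illustration rather than anything from the original code:)

    import os

    def read_with_readahead(path):
        fd = os.open(path, os.O_RDONLY)
        try:
            # POSIX_FADV_WILLNEED asks the kernel to start read-ahead for the
            # given range; offset=0, length=0 means "the whole file".
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_WILLNEED)
            chunks = []
            while True:
                chunk = os.read(fd, 1 << 20)  # read in 1 MiB pieces
                if not chunk:
                    break
                chunks.append(chunk)
            return b"".join(chunks)
        finally:
            os.close(fd)

    Whether this helps still depends on the block layout described above: read-ahead is only cheap when the blocks are adjacent on disk.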

    dd creates the files in order 1.xml to 1000.xml, but the cp command copies them in another order (100.xml, 150.xml, 58.xml, 64.xml). So when you execute:

    cp -r 1 1copy
    

    to copy this dir to another, the blocks of the files you are copying are not adjacent, so reading such files takes more time.

    When you copy a dir that was itself created by the cp command (so its files were not created by dd), the files' blocks are adjacent, so creating:

    cp -r 2 2copy 
    

    a copy of the copy is faster.

    Summary: so to compare Python/Perl performance you should use the same dir (or two dirs copied by the cp command), and you can also use the O_DIRECT flag to bypass all kernel buffers and read the data directly from disk.
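
    For the O_DIRECT part, here is a minimal Python sketch (Linux only; this is my illustration of the flag, not code from the question, and O_DIRECT needs block-aligned buffers, which an anonymous mmap provides):

    import mmap
    import os

    def read_o_direct(path, block_size=4096):
        # O_DIRECT bypasses the page cache, so every call reads from the disk.
        fd = os.open(path, os.O_RDONLY | os.O_DIRECT)
        # O_DIRECT requires an aligned buffer; an anonymous mmap is page-aligned.
        buf = mmap.mmap(-1, block_size)
        chunks = []
        try:
            while True:
                n = os.readv(fd, [buf])
                if n <= 0:        # 0 means end of file
                    break
                chunks.append(buf[:n])
        finally:
            buf.close()
            os.close(fd)
        return b"".join(chunks)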

    Please remember that the results can differ depending on the kernel, system, disk controller, system settings, filesystem and so on.

    Additions:

     [debugfs] 
    [root@dhcppc3 test]# debugfs /dev/sda1 
    debugfs 1.39 (29-May-2006)
    debugfs:  cd test
    debugfs:  stat test.xml
    Inode: 24102   Type: regular    Mode:  0644   Flags: 0x0   Generation: 3385884179
    User:     0   Group:     0   Size: 4
    File ACL: 0    Directory ACL: 0
    Links: 1   Blockcount: 2
    Fragment:  Address: 0    Number: 0    Size: 0
    ctime: 0x543274bf -- Mon Oct  6 06:53:51 2014
    atime: 0x543274be -- Mon Oct  6 06:53:50 2014
    mtime: 0x543274bf -- Mon Oct  6 06:53:51 2014
    BLOCKS:
    (0):29935
    TOTAL: 1
    
    debugfs:  
    