GIL for IO bounded thread in C extension (HDF5)


Question


I have a sampling application that acquires 250,000 samples per second, buffers them in memory, and eventually appends them to an HDFStore provided by pandas. In general, this works great. However, I have a thread that runs continually to empty the data acquisition device (DAQ), and it needs to run on a somewhat regular basis; a deviation of about a second tends to break things. Below is an extreme case of the timings observed. Start indicates a DAQ read starting, Finish is when it finishes, and IO indicates an HDF write (both DAQ and IO occur in separate threads).

Start        : 2016-04-07 12:28:22.241303
IO (1)       : 2016-04-07 12:28:22.241303
Finish       : 2016-04-07 12:28:46.573440 (0.16 Hz, 24331.26 ms)
IO Done (1)  : 2016-04-07 12:28:46.573440 (24332.39 ms)

As you can see, it takes 24 seconds to perform this write (a typical write is about 40 ms). The HDD that I'm writing to is not under load, so this delay shouldn't be caused by contention (it sits at roughly 7% utilisation while running). I have disabled indexing on my HDFStore writes. My application runs numerous other threads, all of which print status strings, and those threads stall during the write, so it seems like the IO task is blocking all other threads. I've spent quite a bit of time stepping through code to figure out where things are slowing down, and it's always within a method provided by a C extension (see the stack-dumping sketch after the questions below), which leads to my questions:

  1. Can Python (I'm using 3.5) preempt execution in a C extension? Concurrency: Are Python extensions written in C/C++ affected by the Global Interpreter Lock? seems to indicate that it can't, unless the extension explicitly releases the GIL.
  2. Does pandas' HDF5 C code implement any yielding for I/O? If so, does this mean that the delay is due to a CPU-bound task? I have disabled indexing.
  3. Any suggestions for how I can get somewhat consistent timings? I'm thinking of moving the HDF5 code into another process. This only helps to a certain extent, though, as I can't really tolerate ~20-second writes anyway, especially when they're unpredictable.
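
One way to see where each thread is stuck is to dump every thread's current stack. Here is a minimal sketch (the dump_all_stacks helper is illustrative, not part of the application above):

import sys
import traceback

def dump_all_stacks():
    # sys._current_frames() maps each thread id to its topmost frame;
    # printing these stacks shows which call (often a C-extension entry
    # point) each thread is currently sitting in.
    for thread_id, frame in sys._current_frames().items():
        print("Thread %d:" % thread_id)
        traceback.print_stack(frame)

Calling this from a watchdog thread when an iteration overruns its deadline pinpoints the blocking call. Note that while a C extension holds the GIL, even the watchdog cannot run, so the dump only appears once the blocking call returns.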

Here's an example you can run to see the issue:

import pandas as pd
import numpy as np
from timeit import default_timer as timer
import datetime
import random
import threading
import time

def write_samples(store, samples, overwrite):
    frame = pd.DataFrame(samples, dtype='float64')

    if not overwrite:
        store.append("df", frame, format='table', index=False)
    else:
        store.put("df", frame, format='table', index=False)

def begin_io():
    store = pd.HDFStore("D:\\slow\\test" + str(random.randint(0,100)) + ".h5", mode='w', complevel=0)

    counter = 0
    while True:
        data = np.random.rand(50000, 1)
        start_time = timer()
        write_samples(store, data, counter == 0)
        end_time = timer()

        print("IO Done      : %s (%.2f ms, %d)" % (datetime.datetime.now(), (end_time - start_time) * 1000, counter))

        counter += 1

    store.close()

def dummy_thread():
    previous = timer()
    while True:
        now = timer()
        print("Dummy Thread  : %s (%d ms)" % (datetime.datetime.now(), (now - previous) * 1000))
        previous = now
        time.sleep(0.01)


if __name__ == '__main__':
    threading.Thread(target=dummy_thread).start()
    begin_io()

You will get output similar to:

IO Done      : 2016-04-08 10:51:14.100479 (3.63 ms, 470)
Dummy Thread  : 2016-04-08 10:51:14.101484 (12 ms)
IO Done      : 2016-04-08 10:51:14.104475 (3.01 ms, 471)
Dummy Thread  : 2016-04-08 10:51:14.576640 (475 ms)
IO Done      : 2016-04-08 10:51:14.576640 (472.00 ms, 472)
Dummy Thread  : 2016-04-08 10:51:14.897756 (321 ms)
IO Done      : 2016-04-08 10:51:14.898782 (320.79 ms, 473)
IO Done      : 2016-04-08 10:51:14.901772 (3.29 ms, 474)
IO Done      : 2016-04-08 10:51:14.905773 (2.84 ms, 475)
IO Done      : 2016-04-08 10:51:14.908775 (2.96 ms, 476)
Dummy Thread  : 2016-04-08 10:51:14.909777 (11 ms)

Answer 1:


The answer is no, these writers do not release the GIL. See the documentation here. I know you are not actually trying to write with multiple threads, but this should clue you in: strong locks are held while writes happen, precisely to prevent multiple writers. Both PyTables and h5py do this, as it is part of the HDF5 standard.

You can look at SWMR, though it is not directly supported by pandas. The PyTables docs here and here point to solutions. These generally involve having a separate process pull data off a queue and write it, as sketched below.

In any event, this is generally a much more scalable pattern.
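
As a rough illustration of that pattern, here is a minimal sketch using multiprocessing.Queue (the writer_process helper and the file path are assumptions for the example, not pandas or PyTables API):

import multiprocessing

import numpy as np
import pandas as pd

def writer_process(queue, path):
    # Runs in its own process, so slow HDF5 writes cannot stall the producer.
    with pd.HDFStore(path, mode='w', complevel=0) as store:
        while True:
            samples = queue.get()
            if samples is None:  # sentinel: the producer is done
                break
            frame = pd.DataFrame(samples, dtype='float64')
            store.append("df", frame, format='table', index=False)

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    writer = multiprocessing.Process(target=writer_process,
                                     args=(queue, "data/queued.h5"))
    writer.start()
    for _ in range(10):
        queue.put(np.random.rand(50000, 1))  # producer stays responsive
    queue.put(None)  # tell the writer to finish
    writer.join()

The producer only pays the cost of pickling each array onto the queue; the potentially slow HDF5 write happens in the other process.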




Answer 2:


Thanks for providing working code. I have modified it to get some insight, and later created a modified version using multiprocessing.

Modified threading version

All the modifications are just to get more information out; there are no conceptual changes. Everything goes into one file, mthread.py, and is commented part by part.

Imports as usual:

import pandas as pd
import numpy as np
from timeit import default_timer as timer
import datetime
import random
import threading
import logging

write_samples got some logging:

def write_samples(store, samples, overwrite):
    wslog = logging.getLogger("write_samples")
    wslog.info("starting")
    frame = pd.DataFrame(samples, dtype='float64')

    if overwrite:
        store.put("df", frame, format='table', index=False)
    else:
        store.append("df", frame, format='table', index=False)
    wslog.info("finished")

begin_io got a maximum duration; exceeding that time results in a WARNING log entry:

def begin_io(maxduration=500):
    iolog = logging.getLogger("begin_io")
    iolog.info("starting")
    try:
        fname = "data/tab" + str(random.randint(0, 100)) + ".h5"
        iolog.debug("opening store %s", fname)
        with pd.HDFStore(fname, mode='w', complevel=0) as store:
            iolog.debug("store %s open", fname)

            counter = 0
            while True:
                data = np.random.rand(50000, 1)
                start_time = timer()
                write_samples(store, data, counter == 0)
                end_time = timer()
                duration = (end_time - start_time) * 1000
                iolog.debug("IO Done      : %s (%.2f ms, %d)",
                            datetime.datetime.now(),
                            duration,
                            counter)
                if duration > maxduration:
                    iolog.warning("Long duration %s", duration)
                counter += 1
    except Exception:
        iolog.exception("oops")
    finally:
        iolog.info("finished")

dummy_thread was modified to stop properly, and it also emits a WARNING if an iteration takes too long:

def dummy_thread(pill2kill, maxduration=500):
    dtlog = logging.getLogger("dummy_thread")
    dtlog.info("starting")
    try:
        previous = timer()
        while not pill2kill.wait(0.01):
            now = timer()
            duration = (now - previous) * 1000
            dtlog.info("Dummy Thread  : %s (%d ms)",
                       datetime.datetime.now(),
                       duration)
            if duration > maxduration:
                dtlog.warning("Long duration %s", duration)
            previous = now
        dtlog.debug("stopped looping.")
    except Exception:
        dtlog.exception("oops")
    finally:
        dtlog.info("finished")

And finally we wire it all together. Feel free to modify the log levels: WARNING shows just the excessive times, while INFO and DEBUG tell much more.

if __name__ == '__main__':
    logformat = '%(asctime)-15s [%(levelname)s] - %(name)s: %(message)s'
    logging.basicConfig(format=logformat,
                        level=logging.WARNING)

    pill2kill = threading.Event()
    t = threading.Thread(target=dummy_thread, args=(pill2kill, 500))
    t.start()
    try:
        begin_io(500)
    finally:
        pill2kill.set()
        t.join()

Running the code, I get results like those you described:

2016-04-08 15:29:11,428 [WARNING] - begin_io: Long duration 5169.03591156
2016-04-08 15:29:11,429 [WARNING] - dummy_thread: Long duration 5161.45706177
2016-04-08 15:29:27,305 [WARNING] - begin_io: Long duration 1447.40581512
2016-04-08 15:29:27,306 [WARNING] - dummy_thread: Long duration 1450.75201988
2016-04-08 15:29:32,893 [WARNING] - begin_io: Long duration 1610.98194122
2016-04-08 15:29:32,894 [WARNING] - dummy_thread: Long duration 1612.98394203
2016-04-08 15:29:34,930 [WARNING] - begin_io: Long duration 823.182821274
2016-04-08 15:29:34,930 [WARNING] - dummy_thread: Long duration 815.275907516
2016-04-08 15:29:43,640 [WARNING] - begin_io: Long duration 510.369062424
2016-04-08 15:29:43,640 [WARNING] - dummy_thread: Long duration 511.776924133

From the values it is clear that while begin_io is very busy and delayed (probably while the data is being written to disk), dummy_thread is delayed by almost the same amount of time.

Version with multiprocessing - works well

I have modified the code to run in multiple processes, and now the writes really do not block dummy_thread.

2016-04-08 15:38:12,487 [WARNING] - begin_io: Long duration 755.397796631
2016-04-08 15:38:14,127 [WARNING] - begin_io: Long duration 1434.60512161
2016-04-08 15:38:15,725 [WARNING] - begin_io: Long duration 848.396062851
2016-04-08 15:38:24,290 [WARNING] - begin_io: Long duration 1129.17089462
2016-04-08 15:38:25,609 [WARNING] - begin_io: Long duration 1059.08918381
2016-04-08 15:38:31,165 [WARNING] - begin_io: Long duration 646.969079971
2016-04-08 15:38:37,273 [WARNING] - begin_io: Long duration 1699.17201996
2016-04-08 15:38:43,788 [WARNING] - begin_io: Long duration 1555.341959
2016-04-08 15:38:47,765 [WARNING] - begin_io: Long duration 639.196872711
2016-04-08 15:38:54,269 [WARNING] - begin_io: Long duration 1690.57011604
2016-04-08 15:39:06,397 [WARNING] - begin_io: Long duration 1998.33416939
2016-04-08 15:39:16,980 [WARNING] - begin_io: Long duration 2558.51006508
2016-04-08 15:39:21,688 [WARNING] - begin_io: Long duration 1132.73501396
2016-04-08 15:39:26,450 [WARNING] - begin_io: Long duration 876.784801483
2016-04-08 15:39:29,809 [WARNING] - begin_io: Long duration 709.135055542
2016-04-08 15:39:31,748 [WARNING] - begin_io: Long duration 677.506923676
2016-04-08 15:39:41,854 [WARNING] - begin_io: Long duration 770.184993744

The code with multiprocessing is here:

import pandas as pd
import numpy as np
from timeit import default_timer as timer
import datetime
import random
import multiprocessing
import time
import logging


def write_samples(store, samples, overwrite):
    wslog = logging.getLogger("write_samples")
    wslog.info("starting")
    frame = pd.DataFrame(samples, dtype='float64')

    if overwrite:
        store.put("df", frame, format='table', index=False)
    else:
        store.append("df", frame, format='table', index=False)
    wslog.info("finished")


def begin_io(pill2kill, maxduration=500):
    iolog = logging.getLogger("begin_io")
    iolog.info("starting")
    try:
        fname = "data/tab" + str(random.randint(0, 100)) + ".h5"
        iolog.debug("opening store %s", fname)
        with pd.HDFStore(fname, mode='w', complevel=0) as store:
            iolog.debug("store %s open", fname)

            counter = 0
            while not pill2kill.wait(0):
                data = np.random.rand(50000, 1)
                start_time = timer()
                write_samples(store, data, counter == 0)
                end_time = timer()
                duration = (end_time - start_time) * 1000
                iolog.debug( "IO Done      : %s (%.2f ms, %d)",
                            datetime.datetime.now(),
                            duration,
                            counter)
                if duration > maxduration:
                    iolog.warning("Long duration %s", duration)
                counter += 1
    except Exception:
        iolog.exception("oops")
    finally:
        iolog.info("finished")


def dummy_thread(pill2kill, maxduration=500):
    dtlog = logging.getLogger("dummy_thread")
    dtlog.info("starting")
    try:
        previous = timer()
        while not pill2kill.wait(0.01):
            now = timer()
            duration = (now - previous) * 1000
            dtlog.info( "Dummy Thread  : %s (%d ms)",
                       datetime.datetime.now(),
                       duration)
            if duration > maxduration:
                dtlog.warning("Long duration %s", duration)
            previous = now
        dtlog.debug("stopped looping.")
    except Exception:
        dtlog.exception("oops")
    finally:
        dtlog.info("finished")


if __name__ == '__main__':
    logformat = '%(asctime)-15s [%(levelname)s] - %(name)s: %(message)s'
    logging.basicConfig(format=logformat,
                        level=logging.WARNING)
    pill2kill = multiprocessing.Event()
    dp = multiprocessing.Process(target=dummy_thread, args=(pill2kill, 500,))
    dp.start()
    try:
        p = multiprocessing.Process(target=begin_io, args=(pill2kill, 500,))
        p.start()
        time.sleep(100)
    finally:
        pill2kill.set()
        dp.join()
        p.join()

Conclusions

Writing data to an HDF5 file really does block other threads, so a multiprocessing version is required.

If you expect dummy_thread to do some real work (like collecting data to store) and you want to send data from there to the HDF5 serializer, you will have to use some sort of messaging: either a multiprocessing.Queue, a Pipe, or possibly ZeroMQ (e.g. a PUSH - PULL socket pair, sketched below). With ZeroMQ you could even do the saving of data on another computer.
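
For the ZeroMQ option, here is a minimal, illustrative sketch of a PUSH - PULL pair using pyzmq (the endpoint address, file path, and hdf5_sink helper are assumptions for the example, not from the measured code above):

import multiprocessing

import numpy as np
import pandas as pd
import zmq

ADDR = "tcp://127.0.0.1:5555"  # illustrative endpoint; assumes data/ exists

def hdf5_sink(path):
    # PULL side: receives arrays and serializes them to HDF5 in its own process.
    ctx = zmq.Context()
    pull = ctx.socket(zmq.PULL)
    pull.bind(ADDR)
    with pd.HDFStore(path, mode='w', complevel=0) as store:
        while True:
            samples = pull.recv_pyobj()  # pickled numpy array
            if samples is None:  # sentinel ends the sink
                break
            frame = pd.DataFrame(samples, dtype='float64')
            store.append("df", frame, format='table', index=False)

if __name__ == '__main__':
    sink = multiprocessing.Process(target=hdf5_sink, args=("data/zmq.h5",))
    sink.start()
    ctx = zmq.Context()
    push = ctx.socket(zmq.PUSH)
    push.connect(ADDR)
    for _ in range(10):
        # The sender stays responsive, up to the socket's buffering limits.
        push.send_pyobj(np.random.rand(50000, 1))
    push.send_pyobj(None)
    sink.join()

Pointing the tcp:// address at another machine is what makes the remote-computer variant possible.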

EDIT/WARNING: The provided code can sometimes fail to save the data; I wrote it to measure performance and did not make it waterproof. When I hit Ctrl-C during processing, I sometimes get a corrupted file. I consider this problem out of scope for this question (it should be resolved by stopping the running process carefully).



Source: https://stackoverflow.com/questions/36488214/gil-for-io-bounded-thread-in-c-extension-hdf5
