How to eagerly commit allocated memory in C++?

梦谈多话 2021-02-02 12:03

The General Situation

An application that is extremely intensive on bandwidth, CPU usage, and GPU usage needs to transfer about 10-15 GB per second

1 answer
  •  情歌与酒
    2021-02-02 12:14

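    Current workaround, as simplified pseudocode:

    // During startup: raise the minimum working set so VirtualLock() can succeed.
    {
        // Use SIZE_T arithmetic; 2*1024*1024*1024 would overflow a 32-bit int.
        SetProcessWorkingSetSize(GetCurrentProcess(), SIZE_T(2) * 1024 * 1024 * 1024, SIZE_T(-1));
    }
    // In the DX11 render loop thread
    {
        DX11context->Map(..., &resource);
        // Fault in and pin the whole mapped range on this one thread.
        // (resource.size stands for the size of the mapped range; pseudocode.)
        VirtualLock(resource.pData, resource.size);
        notify();                  // wake the processing threads
        wait();                    // block until they have finished copying
        DX11context->Unmap(...);   // implicitly unlocks the range
    }
    // In the processing threads
    {
        wait();                    // until the buffer has been mapped and locked
        std::memcpy(buffer, source, size);
        signal();                  // report completion to the render thread
    }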
    

    VirtualLock() forces the kernel to back the specified address range with RAM immediately. Calling the complementary VirtualUnlock() function is optional: unlocking happens implicitly (and at no extra cost) when the address range is unmapped from the process. (If called explicitly, it costs about a third of the locking cost.)

    For VirtualLock() to work at all, SetProcessWorkingSetSize() must be called first, because the sum of all memory regions locked by VirtualLock() cannot exceed the minimum working set size configured for the process. Setting the "minimum" working set size higher than the baseline memory footprint of your process has no side effects unless your system is actually at risk of swapping; your process will still not consume more RAM than its actual working set size.
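The two calls described above could be combined roughly as follows. This is a minimal sketch, not the answer's actual code: `round_up_to_page` and `lock_in_ram` are hypothetical helper names, the one page of slack is an assumption to cover a start address that is not page-aligned, and the non-Windows branch is only a stub so the snippet compiles elsewhere.

```cpp
#include <cstddef>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: VirtualLock() pins whole pages, so the working-set
// budget has to be computed in multiples of the page size.
std::size_t round_up_to_page(std::size_t bytes, std::size_t page_size) {
    return (bytes + page_size - 1) / page_size * page_size;
}

// Hypothetical helper: grow the working-set limits by the size of the range,
// then lock the range. Returns false if any step fails.
bool lock_in_ram(void* address, std::size_t bytes) {
#ifdef _WIN32
    SYSTEM_INFO info;
    GetSystemInfo(&info);
    // Assumption: one extra page in case `address` is not page-aligned.
    const std::size_t needed =
        round_up_to_page(bytes, info.dwPageSize) + info.dwPageSize;

    const HANDLE self = GetCurrentProcess();
    SIZE_T min_ws = 0, max_ws = 0;
    if (!GetProcessWorkingSetSize(self, &min_ws, &max_ws)) return false;
    // This grows the limits on every call; an application would normally
    // size them once at startup, as the answer does.
    if (!SetProcessWorkingSetSize(self, min_ws + needed, max_ws + needed)) return false;

    return VirtualLock(address, bytes) != FALSE;
#else
    (void)address;
    (void)bytes;
    return false;  // stub: the technique is Windows-specific
#endif
}
```

Checking the return values matters here: if the minimum working set is too small, VirtualLock() fails instead of locking, and the soft faults silently move back into the copying threads.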


    Using VirtualLock() alone, albeit still in individual threads and with deferred DX11 contexts for the Map / Unmap calls, immediately reduced the performance penalty from 40-50% to a slightly more acceptable 15%.

    Dropping the deferred context, and triggering all soft faults (as well as the corresponding de-allocation on unmap) exclusively on a single thread, gave the necessary performance boost. The total cost of that spin-lock is now down to <1% of the total CPU usage.


    Summary?

    When you expect soft faults on Windows, do what you can to keep them all on the same thread. Performing a parallel memcpy is in itself unproblematic, and in some situations even necessary to fully utilize the memory bandwidth. However, that only holds if the memory has already been committed to RAM. VirtualLock() is the most efficient way to ensure that.

    (Unless you are working with an API like DirectX, which maps memory into your process, you are unlikely to encounter uncommitted memory frequently. If you are just working with standard C++ new or malloc, your memory is pooled and recycled inside your process anyway, so soft faults are rare.)

    Just make sure to avoid any form of concurrent page faults when working with Windows.
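The rule from the summary (take all soft faults on one thread, then copy in parallel) can be illustrated with a portable sketch. It is a stand-in for the technique, not the answer's code: it touches one byte per page instead of calling VirtualLock(), so it faults pages in but does not pin them. The names `prefault` and `parallel_copy` are hypothetical, and the 4 KiB page size is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstring>
#include <thread>
#include <vector>

// Write one byte per page on the calling thread, so that any first-touch
// page faults for the destination happen here and not in the workers.
// This overwrites data, which is fine for a buffer about to be filled.
void prefault(volatile unsigned char* dst, std::size_t size,
              std::size_t page_size = 4096) {  // assumed page size
    for (std::size_t off = 0; off < size; off += page_size) dst[off] = 0;
    if (size != 0) dst[size - 1] = 0;  // last page, if dst is not page-aligned
}

// Pre-fault the destination on this thread, then split the copy into
// contiguous chunks, one per worker thread.
void parallel_copy(unsigned char* dst, const unsigned char* src,
                   std::size_t size, unsigned workers) {
    if (workers == 0) workers = 1;
    prefault(dst, size);

    const std::size_t chunk = (size + workers - 1) / workers;
    std::vector<std::thread> pool;
    for (unsigned i = 0; i < workers; ++i) {
        const std::size_t begin = std::min(size, i * chunk);
        const std::size_t end = std::min(size, begin + chunk);
        if (begin == end) break;  // nothing left for the remaining workers
        pool.emplace_back([=] {
            std::memcpy(dst + begin, src + begin, end - begin);
        });
    }
    for (auto& t : pool) t.join();
}
```

In the DirectX case from the answer, the pre-fault step would be replaced by the VirtualLock() call on the render thread, and the worker threads would be long-lived rather than spawned per copy.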
