How can I use 100% of VRAM on a secondary GPU from a single process on Windows 10?

Question

This is on a Windows 10 computer with no monitor attached to the Nvidia card. I've included output from nvidia-smi showing that more than 5.04G was available.

Here is the TensorFlow code asking it to allocate just slightly more than I had seen it allocate previously (I want this to be as close as possible to a memory fraction of 1.0):

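import tensorflow as tf  # TensorFlow 1.x API (ConfigProto / Session)

config = tf.ConfigProto()
# config.gpu_options.allow_growth = True
# Fraction of GPU memory this process asks TensorFlow to reserve
config.gpu_options.per_process_gpu_memory_fraction = 0.84
config.log_device_placement = True
sess = tf.Session(config=config)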

Just before running the code above in a Jupyter notebook, I ran nvidia-smi:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 376.51                 Driver Version: 376.51                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 106... WDDM  | 0000:01:00.0     Off |                  N/A |
|  0%   27C    P8     5W / 120W |     43MiB /  6144MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

Output from TF, which can successfully allocate 5.01GB but here reports "failed to allocate 5.04G (5411658752 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY" (the error is near the end of the third log entry below):

2017-12-17 03:53:13.959871: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1030] Found device 0 with properties:
name: GeForce GTX 1060 6GB major: 6 minor: 1 memoryClockRate(GHz): 1.7845
pciBusID: 0000:01:00.0
totalMemory: 6.00GiB freeMemory: 5.01GiB
2017-12-17 03:53:13.960006: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\gpu\gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1)
2017-12-17 03:53:13.961152: E C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\stream_executor\cuda\cuda_driver.cc:936] failed to allocate 5.04G (5411658752 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1
2017-12-17 03:53:14.151073: I C:\tf_jenkins\home\workspace\rel-win\M\windows-gpu\PY\35\tensorflow\core\common_runtime\direct_session.cc:299] Device mapping:
/job:localhost/replica:0/task:0/device:GPU:0 -> device: 0, name: GeForce GTX 1060 6GB, pci bus id: 0000:01:00.0, compute capability: 6.1
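
The numbers in the log above can be sanity-checked with a little arithmetic. The sketch below only uses values printed in the log; it assumes the memory fraction is applied to total memory, which the log suggests (0.84 of 6.00GiB is about 5.04GiB) but does not prove. The variable names are illustrative.

```python
GIB = 1024 ** 3

# Values copied from the log above.
requested_bytes = 5411658752  # "failed to allocate 5.04G (5411658752 bytes)"
free_gib = 5.01               # "freeMemory: 5.01GiB"
total_gib = 6.00              # "totalMemory: 6.00GiB"

# Size of the failed request, expressed in GiB.
requested_gib = requested_bytes / GIB

# Largest fraction of total memory that still fits in the reported free memory
# (assumption: the fraction is taken relative to total memory).
largest_fraction = free_gib / total_gib

print(f"requested: {requested_gib:.2f} GiB")
print(f"free / total: {largest_fraction:.3f}")
print("request exceeds reported free memory:", requested_gib > free_gib)
```

Under that assumption, any fraction above roughly 0.835 asks for more than the 5.01GiB the driver reports as free, which matches the failure at 0.84.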

My best guess is that some policy in an Nvidia user-level DLL is preventing use of all of the memory (perhaps to allow for attaching a monitor?).

If that theory is correct, I'm looking for any user-accessible knob to turn that off on Windows 10. If I'm on the wrong track, any help pointing me in the right direction is appreciated.

Edit #1:

I realized I did not include this bit of research: The following code in tensorflow indicates stream_exec is 'telling' TensorFlow that only 5.01GB is free. This is the primary reason for my current theory that some Nvidia component is preventing the allocation. (However I could be misunderstanding what component implements the instantiated stream_exec.)

auto stream_exec = executor.ValueOrDie();
int64 free_bytes;
int64 total_bytes;
if (!stream_exec->DeviceMemoryUsage(&free_bytes, &total_bytes)) {
  // Logs internally on failure.
  free_bytes = 0;
  total_bytes = 0;
}
const auto& description = stream_exec->GetDeviceDescription();
int cc_major;
int cc_minor;
if (!description.cuda_compute_capability(&cc_major, &cc_minor)) {
  // Logs internally on failure.
  cc_major = 0;
  cc_minor = 0;
}
LOG(INFO) << "Found device " << i << " with properties: "
          << "\nname: " << description.name() << " major: " << cc_major
          << " minor: " << cc_minor
          << " memoryClockRate(GHz): " << description.clock_rate_ghz()
          << "\npciBusID: " << description.pci_bus_id() << "\ntotalMemory: "
          << strings::HumanReadableNumBytes(total_bytes)
          << " freeMemory: " << strings::HumanReadableNumBytes(free_bytes);
}

Edit #2:

The thread below indicates Windows 10 is preventing full use of VRAM pervasively across secondary video cards used for compute by grabbing a % of the VRAM: https://social.technet.microsoft.com/Forums/windows/en-US/15b9654e-5da7-45b7-93de-e8b63faef064/windows-10-does-not-let-cuda-applications-to-use-all-vram-on-especially-secondary-graphics-cards?forum=win10itprohardware

The claim in this thread seems implausible, since it would mean every Windows 10 machine is inherently worse than Windows 7 for any workload where VRAM on a compute-dedicated graphics card could plausibly be the bottleneck.

Edit #3:

Updated the title to read more clearly as a question. Feedback indicates this may be better filed as a bug with Microsoft or Nvidia. I am pursuing other avenues to get this addressed. However, I don't want to assume this cannot be resolved directly.
Further experiments do indicate that the issue I am hitting is for the case of a large allocation from a single process. All of the VRAM can be used when another process comes into play.

Edit #4:

The failure here is an allocation failure. According to the nvidia-smi output above, 43MiB is in use (perhaps by the system?), but not by any identifiable process. The failure I'm seeing is of a single monolithic allocation, which under a typical allocation model requires a contiguous address space. So the pertinent question may be: what is causing that 43MiB to be used, and is it placed in the address space such that 5.01GB is the largest contiguous region available?
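
To make the contiguity question concrete, here is a toy model of the idea. It is only a sketch: the helper `largest_free_run` and both block placements are hypothetical, the real location of the 43MiB is unknown, and a driver that uses virtual addressing may not need physically contiguous memory at all.

```python
def largest_free_run(total, used):
    """Return the largest contiguous free span in the range [0, total).

    `used` is a list of (start, size) blocks that are already allocated.
    """
    largest = 0
    cursor = 0  # end of the last allocated block seen so far
    for start, size in sorted(used):
        largest = max(largest, start - cursor)
        cursor = max(cursor, start + size)
    return max(largest, total - cursor)


MIB = 1024 ** 2
total = 6144 * MIB  # total memory as shown by nvidia-smi

# Two hypothetical placements of a single 43 MiB block.
at_start = largest_free_run(total, [(0, 43 * MIB)])
in_middle = largest_free_run(total, [(1000 * MIB, 43 * MIB)])

print(at_start // MIB, in_middle // MIB)
```

In this model the same 43MiB costs only 43MiB of contiguous space when it sits at the start of the range, but about 1043MiB when it sits 1000MiB in, so a small reservation could in principle cap a single large allocation well below the free total.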

Answer 1:

It is not possible for now, because Windows Display Driver Model (WDDM) 2.x defines a limit that no process can legitimately override.

Assuming you have already experimented with the "Prefer Maximum Performance" setting, that can push usage to at most about 92% when running on a power supply.

For more detail on WDDM 2.x, see:

https://docs.microsoft.com/en-us/windows-hardware/drivers/display/what-s-new-for-windows-threshold-display-drivers--wddm-2-0-

Answer 2:

I believe this is a solvable problem for cards that support the TCC driver. Sadly, my GTX 1060 does not appear to support it.

I would need such a card to verify. Absent someone producing a solution that works on a GTX 1060, I'd definitely release the bounty to someone capable of demonstrating a single process using 100% of VRAM on Windows 10 with the TCC driver.
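
For anyone checking which driver model a card is currently running under, a minimal sketch follows. It assumes nvidia-smi is on the PATH and accepts the `driver_model.current` query field on this platform; the helper names and the sample string are illustrative, and the sample is not captured output.

```python
import subprocess

# Assumption: this nvidia-smi build supports the driver_model.current field.
QUERY = [
    "nvidia-smi",
    "--query-gpu=name,driver_model.current",
    "--format=csv,noheader",
]


def parse_driver_models(csv_text):
    """Turn 'name, model' lines into a list of (name, model) tuples."""
    rows = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma so a comma inside the name is preserved.
        name, model = (field.strip() for field in line.rsplit(",", 1))
        rows.append((name, model))
    return rows


def current_driver_models():
    """Run nvidia-smi and parse its output (requires an Nvidia driver)."""
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_driver_models(result.stdout)


# Illustrative input in the expected format, so the parser runs without a GPU.
sample = "GeForce GTX 1060 6GB, WDDM\n"
print(parse_driver_models(sample))
```

On a machine with the driver installed, calling `current_driver_models()` would report WDDM or TCC per GPU, which is the same information shown in the TCC/WDDM column of the nvidia-smi table in the question.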

Source: https://stackoverflow.com/questions/47855185/how-can-i-use-100-of-vram-on-a-secondary-gpu-from-a-single-process-on-windows-1

Tags

tensorflow

cuda

windows-10

nvidia