My questions are:
1) Did I understand correct, that when you declare a variable in the global kernel, there will be different copies of this variable for each threa
1) Yes. Each thread has a private copy of non-shared variables declared in the function. These usually go into GPU register
memory, though can spill into local
memory.
2), 3) and 4) While it's true that you need many copies of that private memory, that doesn't mean your GPU has to have enough private memory for every thread at once. This is because in hardware, not all threads need to execute simultaneously. For example, if you launch N threads it may be that half are active at a given time and the other half won't start until there are free resources to run them.
The more resources your threads use the fewer can be run simultaneously by the hardware, but that doesn't limit how many you can ask to be run, as any threads the GPU doesn't have resource for will be run once some resources free up.
This doesn't mean you should go crazy and declare massive amounts of local resources. A GPU is fast because it is able to run threads in parallel. To run these threads in parallel it needs to fit a lot of threads at any given time. In a very general sense, the more resources you use per thread, the fewer threads will be active at a given moment, and the less parallelism the hardware can exploit.