I'd like to thank Stephen for the very quick reply to a previous post. This is a follow-up question to that post: Why very simple Renderscript runs 3 times slower in GPU than i
There are really three answers to this question.
1: Don't believe the hype regarding GPUs. For some workloads they are faster. However, for many workloads the difference is small or even negative. You have at least two different processor types. Don't worry about which one gets used; only worry about whether the performance is what you want.
2: For performance tuning I would really focus on the algorithm and avoiding slow operations. Examples:
Prefer float to double when float provides adequate precision.
Use RS_FP_RELAXED (#pragma rs_fp_relaxed) when you don't need IEEE-754 compliance.
Prefer multiplication to division
Use the native_* functions (e.g. native_powr) in place of the full-precision routines where their precision is adequate.
Use rsGetElementAt_* over rsSample or rsGetElementAt. The typed versions of get are faster than the general get, and in many cases much faster than rsSample.
Loads from script globals are typically faster than loads from an rs_allocation. Prefer globals for kernel constants.
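As a rough illustration of the "prefer float to double" and "prefer multiplication to division" points, here is a minimal sketch in plain C rather than Renderscript, so it can be compiled and checked anywhere. The function names normalize_div and normalize_mul are made up for this example, and how much the change helps on a given CPU or GPU is something you would have to measure.

```c
#include <stddef.h>

/* Baseline: one floating-point division per element.
 * (Hypothetical helper, not part of any Renderscript API.) */
void normalize_div(float *out, const float *in, size_t n, float range) {
    for (size_t i = 0; i < n; ++i) {
        out[i] = in[i] / range;
    }
}

/* Same computation with the division hoisted out of the loop:
 * divide once, then multiply per element. The 1.0f literal keeps the
 * arithmetic in float; writing 1.0 would promote it to double.
 * Results can differ from the division version by a rounding error
 * when 1/range is not exactly representable. */
void normalize_mul(float *out, const float *in, size_t n, float range) {
    const float inv_range = 1.0f / range;
    for (size_t i = 0; i < n; ++i) {
        out[i] = in[i] * inv_range;
    }
}
```

In a Renderscript kernel the same idea would apply to the per-element body: compute the reciprocal once (for example on the Java side, stored in a script global) and multiply inside the kernel.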
3: There are some performance issues with global loads today on the Nexus (4,5,7v2) GPU path. These will be improved with updates.