Replacing arrays access variables with the right integer type

问题

I've had a habit of using int to access arrays (especially in for loops); however I recently discovered that I may have been "doing-it-all-wrong" and my x86 system kept hiding the truth from me. It turns out that int is fine when sizeof(size_t) == sizeof(int) but when used on a system where sizeof(size_t) > sizeof(int), it causes an additional mov instruction. size_t and ptrdiff_t seem to be the optimal way on the systems I've tested, requiring no additional mov.

Here is a shortened example

int vector_get(int *v,int i){ return v[i]; }

    > movslq    %esi, %rsi
    > movl  (%rdi,%rsi,4), %eax
    > ret

int vector_get(int *v,size_t i){ return v[i]; }

    > movl  (%rdi,%rsi,4), %eax
    > ret

OK, I've fixed myself (using size_t and ptrdiff_t now), now how do I (hopefully not manually) find these instances in my code so I can fix them?

Recently I've noticed several patches including changes from int to size_t coming across the wire mentioning Clang.

I put together a table of the extra instructions that get inserted on each instance to show the results of "doing-it-all-wrong".

char short int unsigned char unsigned short unsigned int movsbq %sil, %rsi movswq %si, %rsi movslq %esi, %rsi movzbl %sil, %esi movzwl %si, %esi movl %esi, %esi Table of unwanted move operations when accessing vectors with "wrong" type.

Note: long, long long, unsigned long, unsigned long long, size_t and ptrdiff_t require no additional mov* operation (basically anything >= largest object size, or 8 bytes on the 64 bit reference system )

Edit:

I think I may have a workable stub for patching gcc, but I don't know my way around its source to complete the stub and add proper -Wflag bits, and as usual the hardest part of programming is naming stuff. -Wunalinged-index?

gcc/c/c-typeck.c _______________________________________________

if (!swapped)
    warn_array_subscript_with_type_char (index);
> 
> if ( sizeof(index) < sizeof(size_t) ) 
>   warning_at (loc, OPT_Wunaligned_index,
>       "array index is smaller than size_t");

/* Apply default promotions *after* noticing character types.  */
index = default_conversion (index);

gcc/c-family/c.opt _____________________________________________

trigraphs
C ObjC C++ ObjC++
-trigraphs  Support ISO C trigraphs
> 
> Wunaligned-index
> C ObjC C++ ObjC++
> Warn about array indices smaller than size_t

undef
C ObjC C++ ObjC++ Var(flag_undef)
Do not predefine system-specific and GCC-specific macros

gcc/c-family/c-opts.c __________________________________________

case OPT_Wtrigraphs:
  cpp_opts->warn_trigraphs = value;
  break;
>
> case OPT_Wunaligned_index:
>   cpp_opts->warn_unaligned_index = value;
>

case OPT_Wundef:
  cpp_opts->warn_undef = value;
  break;

回答1:

clang and gcc have -Wchar-subscripts, but that'll only help detect char subscript types.

You might consider modifying clang or gcc (whichever is easier to build on your infrastructure) to broaden the types detected by the -Wchar-subscripts warning. If this is a one-pass fix effort, this might be the most straightforward way to go about it.

Otherwise you'll need to find a linter that complains about non-size_t/ptrdiff_t subscripting; I'm not aware of any that have that option.

回答2:

The movslq instruction sign-extends a long (aka 4-byte quantity) to a quad (aka 8-byte quantity). This is because int is signed, so an offset of i.e. -1 is 0xffffffff as a long. If you were to just zero-extend that (i.e. not have movslq), this would be 0x00000000ffffffff, aka 4294967295, which is probably not what you want. So, the compiler instead sign-extends the index to yield 0xffff..., aka -1.

The reason the other types don't require the additional operation is because, despite some of them being signed, they're still the same size of 8 bytes. And, thanks to two's complement, 0xffff... can be interpreted as either -1 or 18446744073709551615, and the 64-bit sum will still be the same.

Now, normally, if you were to instead use unsigned int, the compiler would normally have to insert a zero-extend instead, just to make sure the upper-half of the register doesn't contain garbage. However, on the x64 platform, this is done implicitly; an instruction such as mov %eax,%esi will move whatever 4-byte quantity is in eax into the lower 4 bytes of rsi and clear the upper 4, effectively zero-extending the quantity. But, given your postings, the compiler seems to insert mov %esi,%esi instruction anyway, "just to be sure".

Note, however, that this "automatic zero-extending" is not the case for 1- and 2-byte quantities - those must be manually zero-extended.

来源：https://stackoverflow.com/questions/24864103/replacing-arrays-access-variables-with-the-right-integer-type

标签

arrays

int

static-analysis

compiler-optimization