Introduction: This question is part of my collection of C and C++ (and C/C++ common subset) questions regarding the cases where pointers object with strictly ide
The question, as I understand it, is:
Is memcpy of a pointer the same as assignment?
And my answer would be, yes.
memcpy is basically an optimized assignment for variable-length data that has no memory alignment requirements. It's pretty much the same as:
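void slow_memcpy(void *target, const void *src, size_t len) {
    unsigned char *t = target;
    const unsigned char *s = src;
    /* copy one byte at a time; size_t (from <stddef.h>) matches memcpy's length type */
    for (size_t i = 0; i < len; ++i)
    {
        t[i] = s[i];
    }
}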
is a pointer's semantic "value" (its behavior according to the specification) determined only by its numerical value (the numerical address it contains), for a pointer of a given type?
Yes. There are no hidden data fields in C, so the pointer's behavior is totally dependent on its numerical data content.
However, pointer arithmetic is resolved by the compiler and depends on the pointer's type.
A char * str pointer's arithmetic uses char units (i.e., str[1] is one char away from str[0]), while an int * p_num pointer's arithmetic uses int units (i.e., p_num[1] is one int away from p_num[0]).
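As a minimal sketch of that stride difference (the function names char_stride and int_stride are made up for illustration), the byte distance between neighbouring elements can be measured by viewing both addresses through char *:

```c
#include <stddef.h>

/* Distance in bytes between element 0 and element 1 of a char array. */
size_t char_stride(void)
{
    char str[2] = { 0 };
    return (size_t)((char *)&str[1] - (char *)&str[0]);
}

/* Same measurement for an int array: the step is sizeof(int) bytes. */
size_t int_stride(void)
{
    int nums[2] = { 0 };
    return (size_t)((char *)&nums[1] - (char *)&nums[0]);
}
```

char_stride() yields 1, while int_stride() yields sizeof(int), whatever that is on the platform in question.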
Are two pointers with identical bit patterns allowed to have different behavior? (edit)
Yes and no.
They point to the same location in the memory and in this sense they are identical.
However, pointer resolution might depend on the pointer's type.
For example, by dereferencing a uint8_t *, only 8 bits are read from the memory (usually). However, when dereferencing a uint64_t *, 64 bits are read from the memory address.
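A rough sketch of that difference, assuming a hypothetical sample object named word: both functions start from the same address, but the pointer type decides how many bytes one dereference reads. The narrow read uses unsigned char * (which uint8_t usually is a typedef for) so that the access is valid under the aliasing rules. Which byte it sees depends on the platform's byte order.

```c
#include <stdint.h>

/* Hypothetical sample object; every byte has a different value. */
static uint64_t word = UINT64_C(0x1122334455667788);

/* Reads all 64 bits stored at the address of word. */
uint64_t read_wide(void)
{
    uint64_t *p64 = &word;
    return *p64;
}

/* Reads only the first byte stored at the same address. */
unsigned read_narrow(void)
{
    unsigned char *p8 = (unsigned char *)&word;
    return *p8;
}
```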
Another difference is pointer arithmetic, as described above.
However, when passed to functions such as memcpy or memcmp, the pointers behave the same.
Well, that's because the code in your example doesn't reflect the question in the title. The code’s behavior is undefined, as clearly explained by the many answers.
(edit):
The issues with the code have little to do with the actual question.
Consider, for example, the following line:
int a[1] = { 0 }, *pa1 = &a[0] + 1, b = 1, *pb = &b;
In this case, pa1 points to a[1], which is out of bounds.
This pretty much throws the code into undefined behavior territory, which distracted many answers away from the actual question.
The question was:
Is this program valid in all cases?
The answer is "no, it is not".
The only interesting part of the program is what happens within the block guarded by the if statement. It is somewhat difficult to guarantee the truth of the controlling expression, so I've modified it somewhat by moving the variables to global scope. The same question remains: is this program always valid:
#include <stdio.h>
#include <string.h>
static int a[1] = { 2 };
static int b = 1;
static int *pa1 = &a[0] + 1;
static int *pb = &b;
int main(void) {
    if (memcmp (&pa1, &pb, sizeof pa1) == 0) {
        int *p;
        printf ("pa1 == pb\n"); // interesting part
        memcpy (&p, &pa1, sizeof p); // make a copy of the representation
        memcpy (&pa1, &p, sizeof p); // pa1 is a copy of the bytes of pa1 now
        // and the bytes of pa1 happens to be the bytes of pb
        *pa1 = 2; // does pa1 legally point to b?
    }
}
Now the guarding expression is true on my compiler (of course, by having these have static storage duration, a compiler cannot really prove that they're not modified by something else in the interim...)
The pointer pa1 points to just past the end of the array a, and is a valid pointer, but must not be dereferenced, i.e. *pa1 has undefined behaviour given that value. The case is now made that copying this value to p and back again would make the pointer valid.
The answer is no, this is still not valid, but it is not spelt out very explicitly in the standard itself. The committee response to C standard defect report DR 260 says this:
If two objects have identical bit-pattern representations and their types are the same they may still compare as unequal (for example if one object has an indeterminate value) and if one is an indeterminate value attempting to read such an object invokes undefined behavior. Implementations are permitted to track the origins of a bit-pattern and treat those representing an indeterminate value as distinct from those representing a determined value. They may also treat pointers based on different origins as distinct even though they are bitwise identical.
I.e. you cannot even draw the conclusion that if pa1 and pb are pointers of the same type and memcmp (&pa1, &pb, sizeof pa1) == 0 is true, then necessarily pa1 == pb, let alone that copying the bit pattern of the undereferenceable pointer pa1 to another object and back again would make pa1 valid.
The response continues:
Note that using assignment or bitwise copying via memcpy or memmove of a determinate value makes the destination acquire the same determinate value.
i.e. it confirms that memcpy (&p, &pa1, sizeof p); will cause p to acquire the same value as pa1, which it didn't have before.
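For contrast, here is a small sketch of the uncontroversial case that the committee response describes: a pointer with a determinate value that genuinely points to an object. The names target and roundtrip_demo are hypothetical. The copy acquires the same value, so writing through it is fine.

```c
#include <string.h>

/* Hypothetical object that the pointer legitimately points to. */
static int target = 1;

/* Returns 1 if the memcpy'd pointer equals the original and can be used
   to modify the object it points to. */
int roundtrip_demo(void)
{
    int *orig = &target;
    int *copy;
    memcpy(&copy, &orig, sizeof copy);  /* bitwise copy of a determinate value */
    *copy = 42;                         /* well-defined: copy points to target */
    return copy == orig && target == 42;
}
```

This does not rescue the program in the question, because there the value being copied is a one-past-the-end pointer that may not be dereferenced to begin with.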
This is not just a theoretical problem - compilers are known to track pointer provenance. For example the GCC manual states that
When casting from pointer to integer and back again, the resulting pointer must reference the same object as the original pointer, otherwise the behavior is undefined. That is, one may not use integer arithmetic to avoid the undefined behavior of pointer arithmetic as proscribed in C99 and C11 6.5.6/8.
i.e. were the program written as:
int a[1] = { 0 }, *pa1 = &a[0] + 1, b = 1, *pb = &b;
if (memcmp (&pa1, &pb, sizeof pa1) == 0) {
    uintptr_t tmp = (uintptr_t)&a[0]; // pointer to a[0]
    tmp += sizeof (a[0]); // value of address to a[1]
    pa1 = (int *)tmp;
    *pa1 = 2; // pa1 still would have the bit pattern of pb,
              // hold a valid pointer just past the end of array a,
              // but not legally point to pb
}
the GCC manual points out that this is explicitly not legal.
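By comparison, a round trip through uintptr_t that ends up referencing the same object it started from is the supported case. The sketch below assumes the implementation provides the optional uintptr_t type; value and uintptr_roundtrip_demo are names invented for illustration.

```c
#include <stdint.h>

/* Hypothetical object used for the round trip. */
static int value = 7;

/* Converts a pointer to an integer and back without doing any integer
   arithmetic on it, so the result still refers to the same object. */
int uintptr_roundtrip_demo(void)
{
    void *p = &value;
    uintptr_t bits = (uintptr_t)p;   /* pointer -> integer */
    void *q = (void *)bits;          /* integer -> pointer, same object */
    return q == p && *(int *)q == 7;
}
```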
Undefined behaviour: A play in n parts.
Compiler1 and Compiler2 enter, stage right.
int a[1] = { 0 }, *pa1 = &a[0] + 1, b = 1, *pb = &b;
[Compiler1] Hello, a, pa1, b, pb. How very nice to make your acquaintance. Now you just sit right there, we're going to look through the rest of the code to see if we can allocate you some nice stack space.
Compiler1 looks through the rest of the code, frowning occasionally and making some markings on the paper. Compiler2 picks his nose and stares out the window.
[Compiler1] Well, I'm afraid, b, that I have decided to optimize you out. I simply couldn't find anywhere that modified your memory. Maybe your programmer did some tricks with Undefined Behaviour to work around this, but I'm allowed to assume that there is no such UB present. I'm sorry.
Exit b, pursued by a bear.
[Compiler2] Wait! Hold on a second there, b. I couldn't be bothered optimizing this code, so I've decided to give you a nice cosy space over there on the stack.
b jumps in glee, but is murdered by nasal demons as soon as he is modified through undefined behaviour.
[Narrator] Thus ends the sad, sad tale of variable b. The moral of this story is that one can never rely on undefined behaviour.
No. We cannot even infer that either branch of this code works given any particular result of memcmp(). The object representations that you compare with memcmp() might be different even if the pointers would be equivalent, and the pointers might be different even if the object representations match. (I’ve changed my mind about this since I originally posted.)
You try to compare an address one-past-the-end of an array with the address of an object outside the array. The Standard (§6.5.8.5 of draft n1548, emphasis added) has this to say:
When two pointers are compared, the result depends on the relative locations in the address space of the objects pointed to. If two pointers to object types both point to the same object, or both point one past the last element of the same array object, they compare equal. If the objects pointed to are members of the same aggregate object, pointers to structure members declared later compare greater than pointers to members declared earlier in the structure, and pointers to array elements with larger subscript values compare greater than pointers to elements of the same array with lower subscript values. All pointers to members of the same union object compare equal. If the expression P points to an element of an array object and the expression Q points to the last element of the same array object, the pointer expression Q+1 compares greater than P. In all other cases, the behavior is undefined.
It repeats this warning that the result of comparing the pointers is undefined, in appendix J.
Also undefined behavior:
An object which has been modified is accessed through a restrict qualified pointer to a const-qualified type, or through a restrict-qualified pointer and another pointer that are not both based on the same object
However, none of the pointers in your program are restrict-qualified. Neither do you do illegal pointer arithmetic.
You try to get around this undefined behavior by using memcmp() instead. The relevant part of the specification (§7.23.4.1) says:
The memcmp function compares the first n characters of the object pointed to by s1 to the first n characters of the object pointed to by s2.
So, memcmp() compares the bits of the object representations. Already, the bits of pa1 and pb will be the same on some implementations, but not others.
§6.2.6.1 of the Standard makes the following guarantee:
Two values (other than NaNs) with the same object representation compare equal, but values that compare equal may have different object representations.
What does it mean for pointer values to compare equal? §6.5.9.6 tells us:
Two pointers compare equal if and only if both are null pointers, both are pointers to the same object (including a pointer to an object and a subobject at its beginning) or function, both are pointers to one past the last element of the same array object, or one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space.
That last clause, I think, is the clincher. Not only can two pointers that compare equal have different object representations, but two pointers with the same object representation might not be equivalent if one of them is a one-past-the-end pointer like &a[0]+1 and another is a pointer to an object outside the array, like &b. Which is exactly the case here.
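To see how a given implementation treats the two kinds of comparison, something like the following sketch could be used. The function name compare_both_ways and the bit encoding of its result are invented here. It makes no assumption about the outcome, since the layout of a and b in memory is up to the implementation, and it never dereferences pa1.

```c
#include <string.h>

static int a[1] = { 0 };
static int b = 1;

/* Bit 1 is set if the object representations match (memcmp);
   bit 0 is set if the pointer values compare equal (==). */
int compare_both_ways(void)
{
    int *pa1 = &a[0] + 1;   /* one past the end of a */
    int *pb = &b;           /* pointer to a different object */
    int same_bytes = (memcmp(&pa1, &pb, sizeof pa1) == 0);
    int same_value = (pa1 == pb);
    return (same_bytes << 1) | same_value;
}
```

Whatever the function returns on one compiler or optimization level says nothing about another, which is the point of this answer.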
Prior to C99, implementations were expected to behave as though the value of every variable of any type was stored as a sequence of unsigned char values; if the underlying representations of two variables of the same type were examined and found to be equal, that would imply that unless Undefined Behavior had already occurred, their values would generally be equal and interchangeable. There was a little bit of ambiguity in a couple of places, e.g. given
char *p,*q;
p = malloc(1);
free(p);
q = malloc(1);
if (!memcmp(&p, &q, sizeof p))
    p[0] = 1;
every version of C has made abundantly clear that q may or may not be equal to p, and if q isn't equal to p, code should expect that anything might happen when p[0] is written. While the C89 Standard does not explicitly say that an implementation may only have p compare bitwise equal to q if a write to p would be equivalent to a write to q, such behavior would generally be implied by the model of variables being fully encapsulated in sequences of unsigned char values.
C99 added a number of situations where variables may compare bitwise equal but not be equivalent. Consider, for example:
extern int doSomething(char *p1, char *p2);
int act1(char * restrict p1, char * restrict p2)
  { return doSomething(p1,p2); }
int act2(char * restrict p)
  { return doSomething(p,p); }
char x[4];   /* char, so that x converts to the char * parameters */
int act3a(void) { return act1(x,x); }
int act3b(void) { return act2(x); }
int act3c(void) { return doSomething(x,x); }
Calling act3a, act3b, or act3c will cause doSomething() to be invoked with two pointers that compare equal to x, but if invoked through act3a, any element of x which is written within doSomething must be accessed exclusively using x, exclusively using p1, or exclusively using p2. If invoked through act3b, the method would gain the freedom to write elements using p1 and access them via p2 or vice versa. If accessed through act3c, the method could use p1, p2, and x interchangeably. Nothing in the binary representations of p1 or p2 would indicate whether they could be used interchangeably with x, but a compiler would be allowed to in-line expand doSomething within act1 and act2 and have the behavior of those expansions vary according to what pointer accesses were allowed and forbidden.
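For reference, this is roughly what a call that honours the restrict contract looks like. It is a sketch with made-up names (scale_into, restrict_demo), not part of the example above: the two restrict-qualified parameters refer to distinct arrays, so every element written is accessed through one pointer only.

```c
#include <stddef.h>

/* dst and src are promised not to alias, so the compiler is free to
   reorder the loads from src and the stores to dst. */
void scale_into(int *restrict dst, const int *restrict src, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        dst[i] = 2 * src[i];
}

int restrict_demo(void)
{
    int in[3] = { 1, 2, 3 };
    int out[3];
    scale_into(out, in, 3);   /* distinct arrays: the restrict contract holds */
    return out[0] + out[1] + out[2];
}
```

Passing the same array for both parameters would hand the function two pointers that compare equal, yet the writes through dst would then break the promise made by restrict.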
A pointer is simply an unsigned integer whose value is the address of some location in memory. Overwriting the contents of a pointer variable is no different from overwriting the contents of a normal int variable.
So yes, doing e.g. memcpy (&p, &pa1, sizeof p) is equivalent to the assignment p = pa1, but might be less efficient.
Let's try it a bit differently instead:
You have pa1, which points to some object (or rather, one beyond some object), then you have the pointer &pa1, which points to the variable pa1 (i.e. where the variable pa1 is located in memory).
Graphically it would look something like this:
+------+     +-----+     +-------+
| &pa1 | --> | pa1 | --> | &a[1] |
+------+     +-----+     +-------+
[Note: &a[0] + 1 is the same as &a[1]]
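The same two levels of indirection can be written out as a short sketch (indirection_demo and ppa1 are hypothetical names). Note that it only compares pointer values and never dereferences the one-past-the-end pointer itself.

```c
static int a[1] = { 0 };

/* Returns 1 if following &pa1 leads back to pa1, and pa1 holds the
   one-past-the-end address of a. */
int indirection_demo(void)
{
    int *pa1 = &a[0] + 1;   /* pa1  --> one past a[0] */
    int **ppa1 = &pa1;      /* &pa1 --> pa1 */
    return *ppa1 == pa1 && *ppa1 == a + 1;
}
```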