Question
The object_id of 0 is 1, of 1 is 3, of 2 is 5.
Why does the pattern look like this? What is it about Fixnums that creates this pattern of object_ids? I would expect that if 0 has id 1, then 1 has id 2, 2 has id 3, and so on.
What am I missing?
Answer 1:
First things first: the only thing that the Ruby Language Specification guarantees about object_ids is that they are unique in space. That's it. They aren't even unique in time. So, at any given time there can only be one object with a specific object_id; at different times, however, object_ids may be reused for different objects.
To be fully precise, what Ruby guarantees is that:
- object_id will be an Integer
- no two objects will have the same object_id at the same time
- an object will have the same object_id over its entire lifetime
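The three guarantees above can be checked directly. This is a minimal sketch, run against CRuby, but it must hold on any conforming implementation:

```ruby
a = Object.new
b = Object.new

# object_id is an Integer
raise unless a.object_id.is_a?(Integer)

# no two live objects share an object_id at the same time
raise unless a.object_id != b.object_id

# an object keeps the same object_id over its entire lifetime
id_before = a.object_id
100.times { Object.new }   # allocate some more objects in between
raise unless a.object_id == id_before

puts "all three guarantees hold"
```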
What you are seeing is a side effect of how object_id and Fixnums are implemented in YARV. This is a private internal implementation detail of YARV that is not guaranteed in any way. Other Ruby implementations may (and do) implement them differently, so this is not guaranteed to be true across Ruby implementations. It is not even guaranteed to be true across different versions of YARV, or even for the same version on different platforms.
And in fact, it actually did change quite recently, and it is different between 32-bit and 64-bit platforms.
In YARV, object_id is simply implemented as returning the memory address of the object. That's one piece of the puzzle.
But why are the memory addresses of Fixnums so regular? Well, actually, in this case, they aren't memory addresses! YARV uses a special trick to encode some objects into pointers. There are some pointer values which aren't actually being used, and so you can use them to encode certain things.
This is called a tagged pointer representation, and is a pretty common optimization trick used in many different interpreters, VMs and runtime systems for decades. Pretty much every Lisp implementation uses them, many Smalltalk VMs, many Ruby interpreters, and so on.
Usually, in those languages, you always pass around pointers to objects. An object itself consists of an object header, which contains object metadata (like the type of an object, its class(es), maybe access control restrictions or security annotations and so on), and then the actual object data itself. So, a simple integer would be represented as a pointer plus an object consisting of metadata and the actual integer. Even with a very compact representation, that's something like 6 bytes for a simple integer.
Also, you cannot pass such an integer object to the CPU to perform fast integer arithmetic. If you want to add two integers, you really only have two pointers, which point to the beginning of the object headers of the two integer objects you want to add. So, you first need to perform integer arithmetic on the first pointer to add the offset into the object to it where the integer data is stored. Then you have to dereference that address. Do the same again with the second integer. Now you have two integers you can actually ask the CPU to add. Of course, you need to now construct a new integer object to hold the result.
So, in order to perform one integer addition, you actually need to perform three integer additions plus two pointer dereferences plus one object construction. And you take up almost 20 bytes.
However, the trick is that with so-called immutable value types like integers, you usually don't need all the metadata in the object header: you can just leave all that stuff out, and simply synthesize it (which is VM-nerd-speak for "fake it") when anyone cares to look. A fixnum will always have class Fixnum; there's no need to separately store that information. If someone uses reflection to figure out the class of a fixnum, you simply reply Fixnum, and nobody will ever know that you didn't actually store that information in the object header and that, in fact, there isn't even an object header (or an object).
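You can observe this synthesized metadata from Ruby itself: reflection on an immediate integer behaves exactly as if there were a real object header (note that Fixnum was folded into Integer in Ruby 2.4):

```ruby
# The VM answers these queries without ever consulting an object
# header, because immediate integers don't have one.
puts 1.class     # Integer (Fixnum in Ruby < 2.4)
puts 1.frozen?   # true -- immediates cannot be mutated
```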
So, the trick is to store the value of the object within the pointer to the object, effectively collapsing the two into one.
There are CPUs which actually have additional space within a pointer (so-called tag bits) that allow you to store extra information about the pointer within the pointer itself. Extra information like "this isn't actually a pointer, this is an integer". Examples include the Burroughs B5000, the various Lisp Machines or the AS/400. Unfortunately, most of the current mainstream CPUs don't have that feature.
However, there is a way out: most current mainstream CPUs access memory significantly more slowly when addresses aren't aligned on word boundaries. Some don't support unaligned access at all.
What this means is that in practice, all pointers will be divisible by 4 (on a 32-bit system; by 8 on a 64-bit system), which means they will always end with two (three on a 64-bit system) 0 bits. This allows us to distinguish between real pointers (those that end in 00) and pointers which are actually integers in disguise (those that end in 1). And it still leaves all pointers that end in 10 free for other uses. Also, most modern operating systems reserve the very low addresses for themselves, which gives us another area to mess around with (pointers that start with, say, 24 0s and end with 00).
So, you can encode a 31-bit (or 63-bit) integer into a pointer by simply shifting it 1 bit to the left and adding 1 to it. And you can perform very fast integer arithmetic with those, by simply shifting them appropriately (sometimes not even that is necessary).
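The encoding and the arithmetic shortcut can be sketched in Ruby itself. The method names (tag, untag, tagged_add) are invented for illustration; YARV does this in C on raw machine words:

```ruby
# Hypothetical simulation of YARV's Fixnum tagging.

def tag(i)
  (i << 1) | 1    # shift left one bit, set the low tag bit
end

def untag(p)
  p >> 1          # drop the tag bit to recover the integer
end

# (2a + 1) + (2b + 1) - 1 == 2(a + b) + 1, so a single subtraction
# keeps the sum correctly tagged: no untagging and retagging needed.
def tagged_add(p, q)
  p + q - 1
end

puts tag(3)                              # => 7, which is 3.object_id in CRuby
puts untag(tagged_add(tag(3), tag(4)))   # => 7, i.e. 3 + 4
```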
What do we do with those other address spaces? Well, typical examples include encoding floats in the other large address space, and a number of special objects like true, false, nil, the 127 ASCII characters, some commonly used short strings, the empty list, the empty object, the empty array and so on near the 0 address.
In YARV, integers are encoded the way I described above, false is encoded as address 0 (which just so happens also to be the representation of false in C), true as address 2 (which just so happens to be the C representation of true shifted by one bit) and nil as 4.
In YARV, the following bit patterns are used to encode certain special objects:
xxxx xxxx … xxxx xxx1 Fixnum
xxxx xxxx … xxxx xx10 flonum
0000 0000 … 0000 1100 Symbol
0000 0000 … 0000 0000 false
0000 0000 … 0000 1000 nil
0000 0000 … 0001 0100 true
0000 0000 … 0011 0100 undefined
Fixnums are 63-bit integers that fit into a single machine word, flonums are 62-bit Floats that fit into a single machine word. false, nil and true are what you would expect; undefined is a value that is only used inside the implementation but never exposed to the programmer.
Note that on 32-bit platforms, flonums aren't used (there's no point in 30-bit Floats), and so the bit patterns are different. nil.object_id is 4 on 32-bit platforms, not 8 as on 64-bit platforms, for example.
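These values are easy to check from Ruby. This sketch assumes a 64-bit CRuby with flonums enabled; the numbers differ on 32-bit builds and on other implementations:

```ruby
puts false.object_id   # 0 on every CRuby version and platform
puts nil.object_id     # 8 on 64-bit CRuby (4 on 32-bit)
puts true.object_id    # 20 on 64-bit CRuby (2 on 32-bit)
puts 1.object_id       # 3: the integer shifted left with the tag bit set
```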
So, there you have it:
- certain small integers are encoded as pointers
- pointers are used for object_ids
Therefore:
- certain small integers have predictable object_ids
Answer 2:
For a Fixnum i, the object_id is i * 2 + 1.
And the object_ids 0, 2 and 4? What are they? They are false, true and nil in Ruby.
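A quick sanity check of that formula (CRuby only; other implementations don't guarantee it):

```ruby
(0..9).each do |i|
  raise "unexpected id for #{i}" unless i.object_id == i * 2 + 1
end
puts "object_id == i * 2 + 1 holds for 0..9"
```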
Source: https://stackoverflow.com/questions/25735923/patterns-on-fixnum-object-ids-in-ruby