Resolving the difference between C and C++ with regard to the object representation of integers.

Doc. no.:N2631=08-0141
Date:2008-05-14
Author:James Kanze
email:james.kanze@gmail.com


Introduction

In recent discussions in comp.lang.c++, it became clear that C and C++ have different requirements concerning the object representation of integers, and that at least one real implementation of C does not meet the C++ requirements. The purpose of this paper is to suggest wording to align the C++ standard with C.

It should be noted that the issue only concerns some fairly “exotic” hardware. In this regard, it raises a somewhat larger issue: how far do we want to go in supporting exotic hardware? (In the discussions in the newsgroup, one noted expert expressed the opinion that we could even go so far as to require two's complement.) My personal opinion is that, at least with regard to integer types, we should remain 100% C compatible and simply follow C. That is, however, only a personal opinion. (Perhaps some sort of vote would be in order, to express the direction we want to take.)

The Problem

The requirements concerning the object representation of integers are not the same in the current draft (and in all previous versions) of the C++ standard and in the C99 standard. In [basic.types]/4, the C++ standard says:

[...]The value representation of an object is the set of bits that hold the value of type T.[...]
and in [basic.fundamental]/3:
For each of the standard signed integer types, there exists a corresponding (but different) standard unsigned integer type: [...]the value representation of each corresponding signed/unsigned type shall be the same.[...]
This implies, indirectly, that the maximum value of an unsigned integral type must be greater than that of a signed integral type (or more precisely, that the sign bit in a signed integral type must participate in the value representation of the corresponding unsigned type—other constraints mean that it must, in fact, be the most significant bit). C90 didn't have this restriction, and C99 explicitly says that (§6.2.6.2):
[...] For signed integer types, the bits of the object representation shall be divided into three groups: value bits, padding bits, and the sign bit. [In other words, unlike C++, the sign bit is *not* part of the value representation.] [...](if there are M value bits in the signed type and N in the unsigned type, then M ≤ N).
In other words, in C, given an architecture which doesn't support unsigned arithmetic, the implementation can fake it by simply masking out the sign bit of a signed int. Concretely, the Unisys MCP processors make use of this. From their C manual:
Range of Data Types
    Type            Bits   sizeof   Range
    char              8      1      0 to 255
    [...]
    int              48      6      1−2**39 to 2**39−1
    signed int       48      6      1−2**39 to 2**39−1
    unsigned int     48      6      0 to 2**39−1
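
To make the consequence concrete, here is a minimal sketch (assuming nothing beyond a hosted implementation and <climits>) that simply prints the two limits; the comments spell out what a typical two's complement implementation gives and what the MCP compiler quoted above gives.

    #include <climits>
    #include <iostream>

    int main()
    {
        // On a typical two's complement implementation the sign bit of int
        // is a value bit of unsigned int, so UINT_MAX == 2 * INT_MAX + 1
        // (for 32-bit int: 2147483647 and 4294967295).
        //
        // On the Unisys MCP compiler quoted above, unsigned arithmetic is
        // faked by masking out the sign bit, so both limits are
        // 2**39 - 1 == 549755813887: legal under C99, but not under the
        // current C++ wording.
        std::cout << "INT_MAX  = " << INT_MAX << '\n';
        std::cout << "UINT_MAX = " << UINT_MAX << '\n';
    }
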
I suspect that this difference is unintentional, and that it was never the intent of the C++ committee to be incompatible with C here, but as it stands, there is an incompatibility, and it affects at least one architecture currently being sold.

Proposed solution

If C compatibility is desired, it seems to me that the simplest and surest way of attaining it is to incorporate the exact words from the C standard in place of the current wording. I thus propose that we adopt the wording from the C standard, as follows (text taken verbatim from the C standard is in blue; text that has been modified or added is dark cyan and underlined; and inline comments concerning the text—which aren't meant to be incorporated into the text of the standard—are black and in italics):

In [basic.types], after paragraph 4, add the following paragraph:

Certain object representations need not represent a value of the object type. If the stored value of an object has such a representation and is read by an lvalue expression that does not have character type, the behavior is undefined. If such a representation is produced by a side effect that modifies all or any part of the object by an lvalue expression that does not have character type, the behavior is undefined. Such a representation is called a trap representation.

In [basic.fundamental], replace paragraphs 1–4 with:

An object declared as type char is large enough to store any member of the basic execution character set. If a member of the basic execution character set is stored in a char object, its value is guaranteed to be positive; in addition, the integral value of that character object is equal to the value of the single character literal form of that character. If any other character is stored in a char object, the resulting value is implementation-defined but shall be within the range of values that can be represented in that type.

There are five standard signed integer types, designated as signed char, short int, int, long int, and long long int. (These and other types may be designated in several additional ways, as described in [cstdint].) There may also be implementation-defined extended signed integer types [Note: Implementation-defined keywords shall have the form of an identifier reserved for any use as described in [global.names] —end note]. The standard and extended signed integer types are collectively called signed integer types. [Note: Therefore, any statement in this Standard about signed integer types also applies to the extended signed integer types. —end note]

An object declared as type signed char occupies the same amount of storage as a “plain” char object. A “plain” int object has the natural size suggested by the architecture of the execution environment (large enough to contain any value in the range INT_MIN to INT_MAX as defined in the header <climits>).

For each of the signed integer types, there is a corresponding (but different) unsigned integer type (designated with the keyword unsigned) that uses the same amount of storage (including sign information) and has the same alignment requirements. The unsigned integer types that correspond to the standard signed integer types are the standard unsigned integer types. The unsigned integer types that correspond to the extended signed integer types are the extended unsigned integer types. The standard and extended unsigned integer types are collectively called unsigned integer types. [Note: Therefore, any statement in this Standard about unsigned integer types also applies to the extended unsigned integer types. —end note]

The standard signed integer types and standard unsigned integer types are collectively called the standard integer types; the extended signed integer types and extended unsigned integer types are collectively called the extended integer types.

The standard types char, unsigned char and signed char are collectively called standard character types. They shall not contain padding bits; all bits must participate in the value representation.

For any two integer types with the same signedness and different integer conversion rank (see [conv.rank]), the range of values of the type with smaller integer conversion rank is a subrange of the values of the other type.

The range of nonnegative values of a signed integer type is a subrange of the corresponding unsigned integer type, and the representation of the same value in each type is the same. [Note: The same representation and alignment requirements are meant to imply interchangeability as arguments to functions, return values from functions, and members of unions. —end note] A computation involving unsigned operands can never overflow, because a result that cannot be represented by the resulting unsigned integer type is reduced modulo the number that is one greater than the largest value that can be represented by the resulting type.
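
Purely as an illustration (this is not part of the proposed wording), the modulo reduction described in the previous paragraph is what makes the familiar unsigned "wraparound" well defined:

    #include <climits>
    #include <iostream>

    int main()
    {
        // Unsigned arithmetic is reduced modulo one more than the largest
        // representable value, so it can never overflow. With a 32-bit
        // unsigned int this prints 0 and then 4294967295 (UINT_MAX).
        unsigned int u = UINT_MAX;
        std::cout << u + 1u << '\n';   // wraps to 0
        std::cout << 0u - 1u << '\n';  // wraps to UINT_MAX
    }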

For unsigned integer types (and plain char, if it takes on the same values as an unsigned char), the bits of the object representation shall be divided into two groups: value bits and padding bits (there need not be any of the latter, and shall not be in the case of unsigned char and char). If there are N value bits, each bit shall represent a different power of 2 between 1 and 2^(N−1), so that objects of that type shall be capable of representing values from 0 to 2^N − 1 using a pure binary representation; this shall be known as the value representation. The values of any padding bits are unspecified. [Note: Some combinations of padding bits might generate trap representations, for example, if one padding bit is a parity bit. Regardless, no arithmetic operation on valid values can generate a trap representation other than as part of an exceptional condition such as an overflow, and this cannot occur with unsigned types. All other combinations of padding bits are alternative object representations of the value specified by the value bits. —end note]

For signed integer types (and plain char, if it takes on the same values as a signed char), the bits of the object representation shall be divided into three groups: value bits, padding bits, and the sign bit. There need not be any padding bits (and in the case of char, if it is signed, there shall not be); there shall be exactly one sign bit. Each bit that is a value bit shall have the same value as the same bit in the object representation of the corresponding unsigned type (if there are M value bits in the signed type and N in the unsigned type, then M ≤ N). If the sign bit is zero, it shall not affect the resulting value. If the sign bit is one, the value shall be modified in one of the following ways:

- the corresponding value with sign bit 0 is negated (sign and magnitude);
- the sign bit has the value −(2^M) (two's complement);
- the sign bit has the value −(2^M − 1) (one's complement).

Which of these applies is implementation-defined, as is whether the value with sign bit 1 and all value bits zero (for the first two), or with sign bit and all value bits 1 (for one's complement), is a trap representation or a normal value. In the case of sign and magnitude and one's complement, if this representation is a normal value it is called a negative zero.
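
Again purely as an illustration (the toy four-bit type here is mine and is not part of any wording): given M = 3 value bits, the object representation with sign bit 1 and value bits 101 denotes a different value under each of the three permitted schemes.

    #include <iostream>

    int main()
    {
        int const M = 3;           // value bits in the toy example
        int const value_bits = 5;  // the value bits read in binary: 101

        // The same object representation, three different values:
        std::cout << "sign and magnitude: " << -value_bits                 << '\n'; // -5
        std::cout << "two's complement:   " << value_bits - (1 << M)       << '\n'; // -3
        std::cout << "one's complement:   " << value_bits - ((1 << M) - 1) << '\n'; // -2
    }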

If the implementation supports negative zeros, they shall be generated only by:

- the &, |, ^, ~, <<, and >> operators with arguments that produce such a value;
- the +, −, *, /, and % operators where one argument is a negative zero and the result is zero;
- compound assignment operators based on the above cases.

It is unspecified whether these cases actually generate a negative zero or a normal zero, and whether a negative zero becomes a normal zero when stored in an object.

If the implementation does not support negative zeros, the behavior of the &, |, ^, ~, <<, and >> operators with arguments that would produce such a value is undefined.
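
A concrete (and here hypothetical) instance of this: the all-ones bit pattern produced by ~0 is a negative zero on a one's complement machine.

    #include <iostream>

    int main()
    {
        // On a two's complement implementation this is simply -1. On a
        // one's complement implementation the result is a negative zero,
        // and if the implementation does not support negative zeros the
        // behavior is undefined under the proposed (and C99) wording.
        int i = ~0;
        std::cout << i << '\n';
    }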

The values of any padding bits are unspecified. [Note: Some combinations of padding bits might generate trap representations, for example, if one padding bit is a parity bit. Regardless, no arithmetic operation on valid values can generate a trap representation other than as part of an exceptional condition such as an overflow. All other combinations of padding bits are alternative object representations of the value specified by the value bits. —end note] A valid (non-trap) object representation of a signed integer type where the sign bit is zero is a valid object representation of the corresponding unsigned type, and shall represent the same value.

The types unsigned char and char may be used for “bitwise” copies. [Note: This means that if signed char has a negative zero which is either a trap representation or is forced to a normal zero on assignment, plain char must be unsigned. —end note]

I'm not sure that this was ever really clearly specified, but it does seem to be common practice in the C++ community to do bitwise copies through char*. Both of the architectures I'm aware of that use something other than two's complement make plain char unsigned, presumably to allow this to work. (Of course, logically, plain char should always be unsigned, but historical reasons mean that implementors cannot always be logical.)
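
For what it's worth, the sort of code in question looks something like the following sketch (the type Data and the function copy_bytes are mine, for illustration only):

    #include <cstddef>

    struct Data { int i; double d; };   // hypothetical POD type

    // Copy the object representation of d1 into d2 byte by byte through
    // unsigned char: the "bitwise copy" referred to above. The proposed
    // wording guarantees that no byte of a valid object, read as unsigned
    // char (or as plain char, if plain char is unsigned), is a trap
    // representation.
    void copy_bytes(Data const& d1, Data& d2)
    {
        unsigned char const* src = reinterpret_cast<unsigned char const*>(&d1);
        unsigned char*       dst = reinterpret_cast<unsigned char*>(&d2);
        for (std::size_t n = 0; n != sizeof(Data); ++n)
            dst[n] = src[n];
    }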

The precision of an integer type is the number of bits it uses to represent values, excluding any sign and padding bits. The width of an integer type is the same but including any sign bit; thus for unsigned integer types the two values are the same, while for signed integer types the width is one greater than the precision.
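
As a rough illustration (using only the existing <limits> interface), the proposed terms map onto std::numeric_limits as follows: for integer types, digits is the precision, and adding one for the sign bit of a signed type gives the width.

    #include <iostream>
    #include <limits>

    template <typename T>
    void report(char const* name)
    {
        // digits == number of value bits == precision; any remaining bits
        // of sizeof(T) * CHAR_BIT beyond the width are padding bits.
        int const precision = std::numeric_limits<T>::digits;
        int const width     = precision + (std::numeric_limits<T>::is_signed ? 1 : 0);
        std::cout << name << ": precision " << precision
                  << ", width " << width << '\n';
    }

    int main()
    {
        report<int>("int");                   // e.g. precision 31, width 32
        report<unsigned int>("unsigned int"); // e.g. precision 32, width 32
    }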

Open points

The C standard has various text (e.g. §6.2.6.1/6 and /7) which basically states that you cannot count on the contents of padding bytes in a struct or a union, but that they will never cause a trap representation. I don't think this is necessary per se in C++, since assignment and initialization are member-wise in C++, rather than byte-wise as in C. On the other hand, perhaps we need wording somewhere saying that the compiler-generated copy operations may change the values of these bytes, but not in such a way as to create a trap representation if there wasn't one there previously, and that after initialization, it is guaranteed that there is no trap representation.
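
A small sketch of the situation I have in mind (the type S is hypothetical, for illustration only):

    struct S
    {
        char c;   // on most implementations, padding bytes follow here
        int  i;
    };

    void f(S const& a, S& b)
    {
        b = a;   // the compiler-generated copy assignment is member-wise:
                 // it copies c and i, and may or may not copy the padding
                 // bytes, but it should not be able to turn b into a trap
                 // representation if a wasn't one.
    }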