ISO/ IEC JTC1/SC22/WG21 p0431r0

Document: P0431R0
Date: 2016-09-14
Reply-to: hrosen4@bloomberg.net
Audience: Core

tl;dr:
    a() << b(), a() = b(), and a() + b() are all different.
    a() = b() and a().operator=(b()) are different.
    Must a() << b() and operator<<(a(), b()) the same?

Title:
    Correcting Evaluation Order for C++

Author:
    Hyman Rosen

Abstract:
    The adoption of P0145 adds more order-of-evaluation rules to C++ that are
    inconsistent, difficult to teach, and break the equivalence of infix and
    functional forms of some operators.  Before it's too late, we should adopt
    a strict left-to-right order of evaluation for C++ expressions, including
    completion of side effects.  The result would be a starkly simple single
    rule that is easy to learn and remember, and the prevention of code that
    "accidentally works" but is actually unspecified.

Prolog:
    This paper was written with proposal P0145 (Refining Expression Evaluation
    Order for Idiomatic C++ by Gabriel Dos Reis, Herb Sutter, and Jonathan
    Caves) in view, and may copy some of its wording and examples.

Introduction:
    Order of evaluation of C++ expressions is a perennial source of confusion
    for programmers.  The language has a variety of rules and exceptions for
    the order of evaluation of subexpressions and the completion of side
    effects, and is inconsistent with itself (with respect to similar-appearing
    code in different contexts) and with C.  The adoption of P0145 at Oulu has
    only made the situation worse, and in such a way that it can never be
    repaired, due to its requirement that assigments evaluate the right-hand
    side before the left.

    Consider the following program:

        #include <stdio.h>
        int main() {
            int i = 0; int a[] = { ++i, ++i, ++i };
            printf("%d %d %d\n", a[0], a[1], a[2]);
        }

    C++ (from 2011) requires that the initializers of a[] be evaluated left-to-
    right, with side effects for each initializer complete before evaluation of
    the next begins ([dcl.init.list]/4) so this program prints "1 2 3".  C
    leaves the evaluation order of these initializers and their side effects
    unspecified (Initialization 6.7.9/23); experimentally, on Linux x86,
    gcc-4.3.2 prints "3 3 3" while gcc-4.9.2 prints "1 2 3".

    Now consider this variant:

        #include <stdio.h>
        void f(int a, int b, int c) { printf("%d %d %d\n", a, b, c); }
        int main() { int i = 0; f(++i, ++i, ++i); }

    In C and C++ prior to the adoption of P0145, this has undefined behavior.
    In C++ with P0145, this program may print the digits 1, 2, and 3 in any of
    the six possible orders.

    Then consider this code:

        int &a(); int &b();
        int main() {
            a() << b();
            a() += b();
            a() == b();
        }

    In C++ before P0145, for each full expression above a() and b() are called
    in unspecified order.  After P0145, the first expression requires that a()
    be called before b(), the second that a() be called after b(), and the last
    one remains unspecified.

    Finally, the adoption of P0145 breaks the equivalence of functional and
    infix forms of operators.  Formerly, the mechanical translation of code
    between a()@b() and operator@(a(), b()) or a().operator@(b()) did not
    change the meaning of the code, since evaluation order of a() and b() was
    unspecified for all three forms.  Now that this is no longer so, the rules
    require that in a()@=b(), b() is evaluated before a(), while in
    a().operator@=(b()), a() is evaluated before b().  It is also not clear
    from the wording of the standard whether an explicitly written
    operator@(a(),b()) must evaluate a() and b() in the same order required for
    a()@b().

    This situation should be intolerable; the designers of Java, for example,
    corrected it by fixing the order of evaluation for all expressions to be
    strictly left-to-right.  Other languages such as C# also have fixed order
    of evaluation, but with more complex rules.  (The defensive response of
    "then use those languages" does nothing to address the problem within C++.)

Motivation:
    Computer programs are written in order to effect some behavior from a
    computer.  When a computer program has unspecified or undefined behavior,
    it is generally the case that the programming system will choose one of the
    set of permitted behaviors to implement, and it often does so without any
    notice to the programmer as to which has been chosen, or indeed that such a
    choice was made.  If the programmer did not realize that more than one
    choice was possible, and the programming system happened to pick the
    behavior that the programmer intended, the program now has a latent error
    that might not manifest for years, until the program is built by a system
    that makes a different choice.

    Specifying order of evaluation reduces the possibility of such latent
    errors by reducing the possible interpretations of a program.  Thus, when
    the program is built by the programming system, if its behavior matches the
    intent of the programmer, future builds will behave the same way.

    Leaving some orders unspecified, as done by P0145, does nothing to address
    this problem.

    Recall how many years it took before anyone realized that calling
    f(new T, new T) could leak resources even when the function parameters are
    smart pointers.  Over and over, we see both laypeople and experts make the
    same mistakes when it comes to unspecified and undefined order.

Compatibility:
    The changes proposed by this document are compatible with Standard C++ as
    constituted before the acceptance of P0145, in that a conforming compiler
    is already permitted to translate a program in the way this document would
    require.

    The changes proposed by this document are additionally compatible with
    Standard C, in that a conforming C compiler is also permitted to translate
    a program in the way this document would require.

    Programs which depend on a different order of evaluation may break, but
    they are already depending on behavior not sepcified by the Standards;
    vendors may choose to provide options for specifying non-Standard orders.

Performance:
    It is more important for programs to be correct and to have specified
    behavior, and to be designed for the writer and the reader of the code,
    than to be underspecified in the hope that this will enable programming
    systems to produce more performant translations.  Performance should be a
    consideration when a change would have significant impact.  This is not the
    case for evaluation order, even for function call arguments, given that
    modern systems pass arguments in registers rather than on the stack.

Assignments and Associativity:
    Given the premise that expression evaluation order should be specified,
    such specifications sometimes propose that in assignments, whether simple
    or compound, the right-hand side should be evaluated before the left.  This
    is sometimes claimed to be a consequence of associativity rules (assignment
    associates right-to-left), or because "we have to determine the value
    before we can do the assignment."  P0145 as well as C# have this order for
    assignments, while Java does not.

    The claim for associativity is wrong.  Associativity (and precedence)
    determine how an expression should be fully parenthesized, not the order in
    which subexpressions should be evaluated.  That is, given

        int &a(); int &b(); int &c();
        a() = b() = c();

    associativity tells us that the expression is parenthesized as

        a() = (b() = c());

    But given this fully parenthesized expression, there is no reason to prefer
    a regime in which c() is called first, b() second, and a() last.  Indeed,
    offered the expression parenthesized as

        (a() = b()) = c();

    this regime still requires that c() is called first, b() second, and a()
    last, demonstrating that parenthesization order is irrelevant to evaluation
    order.  In fact, assignment has "right-to-left associativity" simply to
    have the meaning of

        x = y = z;

    be "assign z to y then assign y to x" rather than "assign y to x then
    assign z to x".  It is to have such an expression assign to multiple
    variables rather than to repeatedly assign to one variable.

    The claim of "we have to determine the value before we can do the
    assignment, so we should evaluate that first" is also wrong, as
    demonstrated by the expression

        a() + b() * c();

    It is equally true that the results of b() and c() must be determined and
    multiplied before the product can be added to a(), but few would advocate
    for requiring a calling order of b(), c(), a().

Infix vs. Function Incompatibility:
    Requiring assignment to be right-to-left causes additional incompatibility.
    It is no longer possible to replace assignments of class objects using
    infix notation with their function call equivalents.  That is, given

        Klass &a(); Klass &b();
        a() = b();

    we can no longer rewrite the assignment as

        a().operator=(b());

    because the infix notation requires that b() be called before a() while
    the function style requires that a() be called before b().

Left to Right:
    Given the premise that expression evaluation order should be specified, and
    seeing that arguments from associativity are incorrect, it is clear that
    expressions should be evaluated strictly left-to-right.  This is the order
    in which C++ code is written and read, and it is a starkly simple rule,
    trivial to learn and impossible to forget.  Even those who believe that the
    natural order for assignments is right-hand-side before left-hand-side will
    only need to be disabused of this notion once, and will never forget.

Changes to the Standard:
    The following changes are with respect to N4606.

[intro.execution] 1.9/16
    Except where noted, evaluations of operands of individual operators and of
    subexpressions of individual expressions are
    --unsequenced.
    ++sequenced left-to-right.

[expr.call] 5.2.2/5
    --The initialization of a parameter, including every associated value
    --computation and side effect, is indeterminately sequenced with respect
    --to that of any other parameter.
    ++The initialization of function parameters, including every associated
    ++value computation and side effect, is sequenced left-to-right.

[expr.mptr.oper] 5.5/4
    --Otherwise, the expression E1 is sequenced before the expression E2.

[expr.shift] 5.8/4
    --The expression E1 is sequenced before the expression E2.

[expr.ass] 5.18/1
    --The right operand is sequenced before the left operand.

[ovr.match.oper] 13.3.1.2/2
    --However, the operands are sequenced in the order prescribed for the
    --built-in operator (Clause 5).