Document number: N3158=10-0148
Date: 2010-10-13
Author: Daniel Krügler
Project: Programming Language C++, Library Working Group
Reply-to: Daniel Krügler

Missing preconditions for default-constructed match_result objects

Addresses: GB 126

Discussion

NB comment GB 126 asserts:

"It's unclear how match_results should behave if it has been default-constructed. The sub_match objects returned by operator[], prefix and suffix cannot point to the end of the sequence that was searched if no search was done. The iterators held by unmatched sub_match objects might be singular."

In contrast to the pre-evaluation during the Rapperswil meeting the author of this proposal believes that this comments describes a real issue. There are several problems involved with the current state of sub_match and match_results:

One problem is that default-constructed sub_match objects have an uninitialized value for the matched member as of the semantics of the compiler-generated default constructor. This restricts strongly the usability of such objects, they can only be assigned to a new value, invoking any other member operations will be undefined behaviour, because basically all of them query the matched member. Furthermore, since these objects are elements of match_results objects, these restrictions propagate as further implied restrictions to several member functions of the of match_results container in a subtle way. E.g. the function length() has the same effects as [sub].length(), but we may not satisfy the (implied) requirements to call this sub_match function.

Additional to that, match_results can have two different observable states, when being empty. They can be empty, because they have not participated in any matching attempt yet, or they can be empty because of an unsuccessful matching attempt. The reason for the difference is, that in the first case queried sub_match elements have a range defined by a pair of two value-initialized iterators, in the second case they have a range defined by the past-the-end values of the target sequence. Value-initialized iterator provide only little guaranteed usages. The only guarantee provided is that they can be used as a source of a copy or move operation. This means that we don't need to impose restrictions to move or copy operations of the matched_results object itself, but this still doesn't guarantee that such an iterator value can be used as an argument of operator==. According to 24.2.5/2:

The domain of == for forward iterators is that of iterators over the same underlying sequence.

which was a response to the LWG issue 446. So, while this works for a pair of null pointers, it is not guaranteed to work for a pair of value-initialized user-defined iterators, because we have no well-defined underlying sequence. Without extra wording it is not valid to conclude that every empty range is equivalent to any other empty sequence, plus we don't know whether operator== is defined for this specific object state at all. For example, such iterator values could have a NaN-like state that might raise an error when trying to compare them. The current iterator requirements allow this interpretation so unless the library explicitly defines further operations of value-initialized (forward) iterators, we have to accept this state. To enforce this effect, the library could ad hoc require that forward iterators are near to a NullablePointer (except for conversions from std::nullptr_t), which seems to me somewhat too ambitious at this stage of the standard because this might easily break valid code that relies on the current freedom for such restrictions. Another option to solve this problem could be to add restrictions that only non-empty match_results can be used for such functions, but that would have the effect that results from unsuccessful searches are much harder to handle than those from successful, we would basically be required to forbid all operations - including match_results::operator== - for empty match_results objects - this would indeed make unmatched results second class citizens!

The real problem seems to me, that user code has no portable means to verify in which empty state such an object exists, especially if the corresponding iterator type does not support equality comparison of value-initialized iterators (we cannot tell them to check whether these values are value-initialized). But without a way to assert this state its hard to ensure that preconditions are satisfied for several functions. While such a situation is ok for a weak-referencing iterator-wrapper like reverse_iterator or move_iterator, it seems inappropriate for a self-managing container like match_results that already must be aware of these different states.

During preparation of this proposal a new issue was found: Even if we ignore the problem of match results that have not yet participated in a match, we have another one related to operator==. The current specification is as follows:

template <class BidirectionalIterator, class Allocator>
bool operator==(const match_results<BidirectionalIterator, Allocator>& m1,
                const match_results<BidirectionalIterator, Allocator>& m2);

1 Returns: true only if the two objects refer to the same match.

It is not really clear what this means: The current specification would allow for an implementation to return true, only if the address values of m1 and m2 are the same. While this approach is unproblematic in terms of used operations this is also a bit unsatisfactory. With identity equality alone there seems to be no convincing reason to provide this operator at all. It could for example also refer to an comparison based on iterator values. In this case a user should better know that this will be done, because - as quoted before - there is no guarantee at all that inter-container comparison of iterators is a feasible operation. This was a clear outcome of the resolution provided in N3066 for LWG issue 446. It could also mean that a character-based comparison of the individual sub_match elements should be done - this would be equivalent to applying operator== to the subexpressions, prefix and suffix.

As a guidance to solve this issue I searched for comparable functionality in the Library. The nearest one seemed to be the bucket API in unordered containers. The problem with this type as a comparison is, that match_results are intended to return sub_match elements, even, if the corresponding index value is out of range. Another family of library components that could be used as models are future and friends, which provide a valid observer that can be inspected. Most member function require that this attribute is true. It was hard to fine a proper name for such a state: The following names had been considered:

In the end the author decided for the name ready as a weak winner based on the criteria brevity and conciseness.

This paper suggest to perform the following changes to solve the problems raised by NB comment GB 126:

Proposed resolution

The following wording changes are against N3126.

  1. Change 28.9 [re.submatch]/1, class template sub_match synopsis, as indicated. The intent is to provide a user-defined default-constructor. We recommend to make it constexpr to support participation in static initialization (The base class pair has a constexpr default constructor, if the value-initialization of the iterator pair allows this):
    namespace std {
      template <class BidirectionalIterator>
      class sub_match : public std::pair<BidirectionalIterator, BidirectionalIterator> {
      public:
        typedef typename iterator_traits<BidirectionalIterator>::
          value_type                              value_type;
        typedef typename iterator_traits<BidirectionalIterator>::
          difference_type                         difference_type;
        typedef BidirectionalIterator             iterator;
        typedef basic_string<value_type>          string_type;
    
        bool matched;
    
        constexpr sub_match();
    
        difference_type length() const;
        operator string_type() const;
        string_type str() const;
    
        int compare(const sub_match& s) const;
        int compare(const string_type& s) const;
        int compare(const value_type* s) const;
      };
    }
    
  2. Insert a new member prototype description at the very beginning 28.9.1 sub_match members [re.submatch.members]:

    constexpr sub_match();
    

    Effects: Value-initializes the pair base class subobject and the member matched.

  3. Insert a new (numbered) paragraph between and [re.results]/2 and [re.results]/3 as indicated. The intent is to clarify, when a match_results object is a fully valid result:

    1 Class template match_results denotes a collection of character sequences representing the result of a regular expression match. Storage for the collection is allocated and freed as necessary by the member functions of class template match_results.

    2 The class template match_results shall satisfy the requirements of an allocator-aware container and of a sequence container, as specified in 23.2.3, except that only operations defined for const-qualified sequence containers are supported.

    ? A default-constructed match_results object has no fully established result state. As a consequence of a completed regular expression match modifying such an object its result state becomes fully established, the match result is ready. For most member functions, the effects of calling them from a match_results object that is not ready are undefined.

    3 The sub_match object stored at index 0 represents sub-expression 0, i.e., the whole match. In this case the sub_match member matched is always true. The sub_match object stored at index n denotes what matched the marked sub-expression n within the matched expression. If the sub-expression n participated in a regular expression match then the sub_match member matched evaluates to true, and members first and second denote the range of characters [first,second) which formed that match. Otherwise matched is false, and members first and second point to the end of the sequence that was searched. [ Note: The sub_match objects representing different sub-expressions that did not participate in a regular expression match need not be distinct. — end note ]

  4. Add a new observer function to the match_results class synopsis. The intent it to provide a test function to verify whether such an object has been used in any match attempt:
    namespace std {
      template <class BidirectionalIterator,
                class Allocator = allocator<sub_match<BidirectionalIterator> >
      class match_results {
      public:
        [..]
    
        // [re.results.state] state:
        bool ready() const;
    
        // 28.10.2 size:
        size_type size() const;
        size_type max_size() const;
        bool empty() const;
        
        [..]
      };
    
  5. Change [re.results.const]/3 as indicated. The post-conditions need to specify that the result is not yet ready. We also remove the previous post-condition regarding str(), because we exclude this function from being in the domain of a not-ready match result:

    match_results(const Allocator& a = Allocator());
    

    2 Effects: Constructs an object of class match_results.

    3 Postconditions: ready() returns false. size() returns 0. str() returns basic_string<char_type>().

  6. Change Table 138 as indicated, we need to say something about the ready state:

    Table 138 — match_results assignment operator effects
    Element Value
    ready() m.ready()
    size() m.size()
    str(n) m.str(n) for all integers n < m.size()
    ... ...

  7. Add a new sub-clause [re.results.state] just before sub-clause [re.results.size]:

    28.10.? match_results state [re.results.state]

    bool ready() const;
    

    ? Returns: true, if *this corresponds to a match result with fully established result state, otherwise false.

  8. Change the following paragraphs in 28.10.3 [re.results.acc] as indicated. The intent is add pre-conditions as appropriate [Note: The (c)begin()/(c)end() functions are intentionally unconstrained]:

    difference_type length(size_type sub = 0) const;
    

    ? Requires: ready() == true.

    1 Returns: (*this)[sub].length().

    difference_type position(size_type sub = 0) const;
    

    ? Requires: ready() == true.

    2 Returns: The distance from the start of the target sequence to (*this)[sub].first.

    string_type str(size_type sub = 0) const;
    

    ? Requires: ready() == true.

    3 Returns: string_type((*this)[sub]).

    const_reference operator[](size_type n) const;
    

    ? Requires: ready() == true.

    4 Returns: A reference to the sub_match object representing the character sequence that matched marked sub-expression n. If n == 0 then returns a reference to a sub_match object representing the character sequence that matched the whole regular expression. If n >= size() then returns a sub_match object representing an unmatched sub-expression.

    const_reference prefix() const;
    

    ? Requires: ready() == true.

    5 Returns: A reference to the sub_match object representing the character sequence from the start of the string being matched/searched to the start of the match found.

    const_reference suffix() const;
    

    ? Requires: ready() == true.

    6 Returns: A reference to the sub_match object representing the character sequence from the end of the match found to the end of the string being matched/searched.

  9. Change the following paragraphs in 28.10.4 [re.results.form] as indicated. The intent is add pre-conditions as appropriate:
    template <class OutputIter>
      OutputIter format(OutputIter out,
        const char_type* fmt_first, const char_type* fmt_last,
          regex_constants::match_flag_type flags =
            regex_constants::format_default) const;
    

    1 Requires: ready() == true and OutputIter shall satisfy the requirements for an Output Iterator (24.2.4).

    [...]

    template <class ST, class SA>
      basic_string<char_type, ST, SA>
        format(const basic_string<char_type, ST, SA>& fmt,
          regex_constants::match_flag_type flags =
            regex_constants::format_default) const;
    

    ? Requires: ready() == true.

    5 Effects: Constructs an empty string result of type basic_string<char_type, ST, SA> and calls format(back_inserter(result), fmt, flags).

    6 Returns: result.

    string_type
      format(const char_type* fmt,
        regex_constants::match_flag_type flags =
          regex_constants::format_default) const;
    

    ? Requires: ready() == true.

    7 Effects: Constructs an empty string result of type string_type and calls format(back_inserter(result), fmt, fmt + char_traits<char_type>::length(fmt), flags).

    8 Returns: result.

  10. Change 28.10.7 [re.results.nonmember]/1 as indicated. The intent is to specify the semantics of equality. This takes into account a potential not-ready nature and clarifies that for complete match results the sequences of sub_match objects (including prefix and suffix) will be compared "by value". The wording is carefully crafted to ensure that for empty results only the guaranteed attribute values are queried:

    template <class BidirectionalIterator, class Allocator>
      bool operator==(const match_results<BidirectionalIterator, Allocator>& m1,
                      const match_results<BidirectionalIterator, Allocator>& m2);
    

    1 Returns: true only if the two objects refer to the same match.true if and only if both match results are not ready, or both are ready and if

    • m1.empty() && m2.empty(), or
    • !m1.empty() && !m2.empty(), and the following conditions are satisfied:
      • m1.prefix() == m2.prefix(),
      • m1.size() == m2.size() && equal(m1.begin(), m1.end(), m2.begin()), and
      • m1.suffix() == m2.suffix()

    [Note: the algorithm equal() is defined in Clause 25. — end note]

  11. Change the [re.alg.match]/3 as indicated. The intend is to ensure that the match result is ready:

    template <class BidirectionalIterator, class Allocator, class charT, class traits>
      bool regex_match(BidirectionalIterator first, BidirectionalIterator last,
        match_results<BidirectionalIterator, Allocator>& m,
          const basic_regex<charT, traits>& e,
            regex_constants::match_flag_type flags =
              regex_constants::match_default);
    

    [...]

    3 Postconditions: m.ready() == true in all cases. If the function returns false, then the effect on parameter m is unspecified except that m.size() returns 0 and m.empty() returns true. Otherwise the effects on parameter m are given in Table 139.

  12. Change the [re.alg.search]/3 as indicated. The intend is to ensure that the match result is ready:

    template <class BidirectionalIterator, class Allocator, class charT, class traits>
      bool regex_search(BidirectionalIterator first, BidirectionalIterator last,
        match_results<BidirectionalIterator, Allocator>& m,
          const basic_regex<charT, traits>& e,
            regex_constants::match_flag_type flags =
              regex_constants::match_default);
    

    [...]

    3 Postconditions: m.ready() == true in all cases. If the function returns false, then the effect on parameter m is unspecified except that m.size() returns 0 and m.empty() returns true. Otherwise the effects on parameter m are given in Table 140.