Document Number | P1844R1 |
---|---|
Date | 2019-11-22 |
Audience | LEWGI |
Intended Ship Vehicle | C++23 |
Reply-To | Nozomu Katō |
Revises | P1844R0 |
char8_t, char16_t, or char32_t through the template specialization.
std::regex, or std::wregex. Modifying them is likely to cause ABI breaking.
ECMAScript2019 syntax is selected, it requires not only recognizing the grammar of ECMAScript in ECMA-262 2019 or later, but also behaving the same as it, and the u flag is assumed to be set.
\xHH with the clarification that it represents a character whose code unit value in UTF-16 is 0xHH. Since code unit values 0x00-0xFF in UTF-16 represent U+0000-U+00FF respectively, this is not a technical change from R0, but a change of interpretation.
regex_match. In R0 they were proposed only for consistency with and by analogy to regex_search.
regex_iterator
.
The default engine of C++'s regex is based on RegExp, the regular expression object of the ECMAScript specification third edition. Although RegExp itself had not been modified for many years, in recent years it has been enhanced as follows:
Unfortunately, although C++'s regex supports six regular expression grammars, none of them is as expressive as the regular expression facilities of other languages today. In addition, the question of how to support Unicode remains unsettled. To resolve these problems, or at least to improve the current situation, this paper proposes adding a new syntax option, ECMAScript2019, to regex.
Short answer: The ECMAScript-based engine of C++ has been modified to depend deeply on the locale. The author of this proposal would like to free regex from the locale, reverting it to being locale independent like the original RegExp, and then to add new features to it. But doing so involves deprecating some features of the default engine, which seems difficult.
Long answer: RegExp of ECMAScript is locale independent and treats an input sequence as UTF-16. For example, /[a-z]/ is always interpreted as "any character in the range from U+0061 to U+007A inclusive". This has the benefit that the set of characters a character class expression matches can be determined even at compile time. This holds even if the icase flag is set, by preparing a reversed case-folding table. (While case-folding means converting each of "S", "s", and U"ſ" to "s", reversed case-folding here means returning the set of characters that are converted into the same character when case-folding is done, such as converting each of "S", "s", and U"ſ" to U"Ssſ".)
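As an illustration of the reversed case-folding table described above, the following is a minimal sketch and is not taken from the sample implementation. The names `fold` and `build_reverse_fold` are hypothetical, and the folding function covers only the three characters used in the example; a real table would be generated from the Unicode case-folding data.

```cpp
#include <map>
#include <string>

// Toy case-folding function (an assumption for illustration): it knows only
// 'S', 's', and U+017F LATIN SMALL LETTER LONG S, which all fold to 's'.
inline char32_t fold(char32_t c) {
    switch (c) {
        case U'S':
        case U's':
        case U'\u017F':
            return U's';
        default:
            return c;
    }
}

// Builds the reversed table: folded character -> every character of the
// given alphabet that folds to it.
inline std::map<char32_t, std::u32string>
build_reverse_fold(const std::u32string& alphabet) {
    std::map<char32_t, std::u32string> reversed;
    for (char32_t c : alphabet) {
        reversed[fold(c)] += c;
    }
    return reversed;
}
```

With such a table, a pattern compiler could expand a case-insensitive character class once, when the pattern is compiled, without consulting any locale.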
However, the ECMAScript engine of regex can be made locale sensitive by setting the collate flag. If this flag is used with the icase flag, the pattern compiler is required to call tolower and toupper for each character in a character class [] to gather all characters that the character class can match. This clumsiness was filed as Defect Report 523 in 2005, and its status is still Open.
Furthermore, the ECMAScript engine of C++ was modified to support the POSIX character classes and to make the expressions \d, \D, \s, \S, \w, and \W equivalent to some of the POSIX character classes. This also made the ECMAScript engine depend on the locale.
Although this paper does not propose making regex constexpr, an ECMAScript-based engine that inherits locale independence from the original RegExp might make constexpr support easier than it is now.
The modifications defined in [re.grammar] also caused DR2986 and DR2987, which are still unresolved. To fix the issues originating in these modifications, I considered proposing to deprecate all of them and to revert the ECMAScript engine to the original RegExp specification before adding anything new to it. But the ECMAScript engine is the default engine of regex, and trying to deprecate some of its features seems likely to make the proposal harder to accept.
Moreover, while ECMAScript supports only UTF-16 (or the now-deprecated UCS-2), C++'s regex needs to support various character sets. How Unicode property escapes should be processed in legacy (non-Unicode) character sets is a difficult question.
Thus, this paper does not touch any part of the current ECMAScript engine and leaves it as is for legacy character sets that use char and wchar_t; instead, it proposes introducing a new syntax option that complies with a recent version of ECMAScript while remaining locale independent, for use with char8_t, char16_t, or char32_t through the template specialization.
Because the ECMAScript specification explains the behavior of RegExp in detail by defining closures, there seems to be less room for undefined or undocumented behaviour. The author considers this an important property for a standard.
The new syntax is implemented through template specializations for char8_t, char16_t, and char32_t. As of C++20, <regex> supports only char and wchar_t, and compiling basic_regex with the other types leads to a compile-time error for the reasons explained in P0169. So, existing implementations would not be affected by this proposal.
When the first template parameter charT of the basic_regex class is char8_t, char16_t, or char32_t, its constructors and assign functions are required to interpret an input sequence as UTF-8, UTF-16, or UTF-32 respectively. In addition, syntax_option_type values other than icase, nosubs, optimize, and multiline are disabled. An input sequence is always interpreted as if ECMAScript2019 were set.
These three specializations are not necessarily required to be implemented separately. It is expected that typical implementations use an internal iterator class template that translates a sequence of UTF-8, UTF-16, or UTF-32 to a sequence of Unicode code points, and construct a finite state machine by parsing that translated sequence in their common base class.
One obstacle to implementing in such a way is the expression \xHH
where H
is a hexadecimal digit. In the spec of ECMAScript this expression is defined to specify a character by its code unit value. However, appearance of an isolated code unit in a UTF-8 sequence requires special treatment, because unlike in UTF-16 and UTF-32, an isolated code unit in UTF-8 cannot be converted to any code point when its value is in 0x80-0xFF inclusive. (This does not matter in ECMAScript, since it does not support UTF-8.)
Considering that ECMAScript assumes that a string is a sequence of UTF-16 or the now-deprecated UCS-2, in the proposed ECMAScript2019 syntax the expression \xHH is treated as representing a character whose code unit value in UTF-16 is 0xHH, even if it appears in a sequence of UTF-8. Since code unit values 0x00-0xFF in UTF-16 represent U+0000-U+00FF respectively, \xHH in this case effectively represents the code point U+00HH.
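To make the consequence concrete: under this rule, \xE9 in a pattern given to basic_regex<char8_t> denotes U+00E9, which occupies the two code units 0xC3 0xA9 in a UTF-8 subject sequence, rather than the single code unit 0xE9. The sketch below shows that mapping; the helper names `code_point_of_hex_escape` and `encode_utf8` are hypothetical and appear nowhere in the proposed wording.

```cpp
#include <string>

// Under the proposed interpretation, \xHH denotes the code point U+00HH.
constexpr char32_t code_point_of_hex_escape(unsigned char hh) {
    return static_cast<char32_t>(hh);
}

// Encodes one code point as UTF-8. The argument is assumed to be a valid
// Unicode scalar value; no validation is done in this sketch.
inline std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For values below 0x80 the escape and the UTF-8 code unit coincide, so the difference from C++'s own \xHH only shows up for 0x80-0xFF.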
(If the committee does not like the inconsistency with the meaning of C++'s own \xHH
, the previous paragraph will be changed to "... use of the expression \xHH
is disabled". ECMAScript's RegExp has \uHHHH
and \u{H...}
for specifying a character by its code point. Removing support for \xHH
is not likely to cause inconvenience.)
basic_regex<char8_t>, basic_regex<char16_t>, and basic_regex<char32_t> must be locale independent. This means that these specializations have to construct an internal finite state machine based only on the Unicode code point values translated from the input sequence, and must not use regex_traits.
Thus, regex_traits
does not need to be specialized for char8_t
, char16_t
, and char32_t
.
In particular, when basic_regex is used with char8_t, char16_t, or char32_t, case-folding must not be performed using regex_traits<charT>::translate_nocase(), but must be performed as defined in the ECMAScript specification.
As these take an instance of the basic_regex
class as a parameter, they get charT
as a template parameter. So, they also can be implemented in a way similar to basic_regex
using the template specialization.
The proposed ECMAScript2019 syntax supports lookbehind assertions, which none of the six existing engines of C++'s regex supports. When performing a lookbehind assertion, the algorithm function reads the input sequence backwards. This raises a new issue.
When a user wants to find all the portions that some regular expression matches against some input sequence, the user will call regex_search
against the subrange [the end of the previous matched portion of the entire sequence, the end of the entire sequence) while the previous call succeeds. In this case, the subrange [the beginning of the entire sequence, the end of the previous matched portion of the entire sequence) is also a valid range and it is safe for an algorithm function to read a character in that subrange.
However, there is currently no way to tell an algorithm function how far back it may read for lookbehind.
For example, when a user calls regex_search with the regular expression /(?<!\d{2,})ABC123/ ("ABC123" not preceded by two or more digit characters) against "ABC123ABC1234ABC12345", only the first six characters should be matched, because the second and third occurrences of "ABC123" are both preceded by two or more digits. But if there is no way to tell regex_search how far back it may look, the second and later calls to regex_search against [end of previous matched subsequence, end of entire sequence) also return "ABC123".
The match_prev_avail flag is not suitable for this purpose. It only indicates that --first is a valid iterator position.
As of C++20, regex_search takes as an input sequence 1) two bidirectional iterators that specify [begin, end), 2) a const reference to an instance of std::basic_string, or 3) a pointer to a null-terminated string. To fix the problem mentioned above, this paper proposes adding to regex_search new overloads that take three bidirectional iterators, which specify [begin, end) and the limit of lookbehind.
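The effect of the third iterator can be sketched without a regex engine. The function below is a hand-written stand-in for /(?<!\d{2,})ABC123/; `search_abc123` and its parameter names are hypothetical and only mimic the shape of the proposed three-iterator overloads. Passing `first` itself as the limit reproduces the behaviour of the existing two-iterator form.

```cpp
#include <algorithm>
#include <string>

// Finds the first "ABC123" in [first, last) that is not preceded by two or
// more digits. Characters before `first` may be inspected, but only back to
// `lookbehind_limit`. Returns `last` when there is no match.
template<class BidirIt>
BidirIt search_abc123(BidirIt first, BidirIt last, BidirIt lookbehind_limit) {
    static const std::string needle = "ABC123";
    BidirIt pos = first;
    for (;;) {
        pos = std::search(pos, last, needle.begin(), needle.end());
        if (pos == last) {
            return last;
        }
        // Count up to two digits immediately before the candidate,
        // never stepping past the lookbehind limit.
        int digits = 0;
        for (BidirIt p = pos; p != lookbehind_limit && digits < 2; ) {
            --p;
            if (*p < '0' || *p > '9') {
                break;
            }
            ++digits;
        }
        if (digits < 2) {
            return pos;  // negative lookbehind succeeded
        }
        ++pos;           // rejected; keep searching after this candidate
    }
}
```

Against "ABC123ABC1234ABC12345", a second search starting at offset 6 with the limit also at offset 6 reports a match at offset 6, because the digits before it are invisible. With the limit set to the beginning of the whole sequence, the same search finds no further match, which is the result the pattern calls for.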
This addition is useful only when an algorithm function is used with an instance of basic_regex
constructed with char8_t
, char16_t
, or char32_t
. If any variant of algorithm functions that takes three bidirectional iterators is called when its charT
is char
or wchar_t
, the third iterator for specifying the limit of lookbehind is simply ignored.
No specialization is proposed for regex_replace
. It does not do matching by itself but uses regex_iterator
that calls regex_search
internally.
It is preferable that 1) the new function group()
and its overload functions be added to match_results
for access to captured sequences by group name, and 2) the member function format()
be modified to support the replacement text symbol $<GROUPNAME>
that was introduced in ES2018.
However, as match_results
does not take charT
as a template parameter, it is not easy to implement something specific to the proposed ECMAScript2019
syntax option through the template specialization.
Thus, in this proposal, the new member function gname_to_gnumber() and its overloads, which convert a group name to the group number assigned to it, are added instead to the basic_regex specializations for char8_t, char16_t, and char32_t.
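One way a pattern compiler might collect the data that gname_to_gnumber() returns is sketched below. This is a simplified assumption, not the method of the sample implementation: `collect_group_names` is a hypothetical helper that scans a pattern held in a `std::string`, skips backslash escapes and character classes, and numbers capturing groups by the order of their opening parentheses. It does not validate group names or handle \u escapes inside them.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Maps each group name in `pattern` to its capturing-group number.
inline std::map<std::string, unsigned>
collect_group_names(const std::string& pattern) {
    std::map<std::string, unsigned> names;
    unsigned group_number = 0;
    bool in_class = false;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        const char c = pattern[i];
        if (c == '\\') { ++i; continue; }  // skip the escaped character
        if (in_class) {
            if (c == ']') in_class = false;
            continue;
        }
        if (c == '[') { in_class = true; continue; }
        if (c != '(') continue;

        // "(" not followed by "?" opens a plain capturing group.
        if (i + 1 >= pattern.size() || pattern[i + 1] != '?') {
            ++group_number;
            continue;
        }
        // "(?<name>" opens a named capturing group, whereas "(?<=" and
        // "(?<!" are lookbehind assertions and capture nothing.
        if (i + 3 < pattern.size() && pattern[i + 2] == '<' &&
            pattern[i + 3] != '=' && pattern[i + 3] != '!') {
            const std::size_t close = pattern.find('>', i + 3);
            if (close == std::string::npos) break;  // malformed pattern
            names[pattern.substr(i + 3, close - (i + 3))] = ++group_number;
            i = close;
        }
    }
    return names;
}
```

A caller would then use the returned number with the existing match_results interface, for example as the index passed to operator[].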
regex_iterator
is changed to use one variant of regex_search
that takes three bidirectional iterators mentioned above, instead of the current one that takes two bidirectional iterators. In this proposal this is the only proposed change that requires modifications to the existing implementations.
This change is not expected to break ABI, because it does not modify the members of regex_iterator or the parameters of its member functions at all.
The link to a sample implementation based on the method above is shown in the Appendix section of this document.
The following changes are proposed:
2
The following subclauses describe a basic regular expression class template and its traits that can handle char-like (21.1) template arguments, five specializations of this class template that handle sequences of char, wchar_t, char8_t, char16_t, and char32_t, a class template that holds the result of a regular expression match, a series of algorithms that allow a character sequence to be operated upon by a regular expression, three specializations of this series that handle sequences of char8_t, char16_t, and char32_t, and two iterator types for enumerating regular expression matches, as summarized in Table 122.
basic_regex
template<class charT, class traits = regex_traits<charT>> class basic_regex;
using regex = basic_regex<char>;
using wregex = basic_regex<wchar_t>;
using u8regex = basic_regex<char8_t>;
using u16regex = basic_regex<char16_t>;
using u32regex = basic_regex<char32_t>;
sub_match
template<class BidirectionalIterator>
class sub_match;
using csub_match = sub_match<const char*>;
using wcsub_match = sub_match<const wchar_t*>;
using u8csub_match = sub_match<const char8_t*>;
using u16csub_match = sub_match<const char16_t*>;
using u32csub_match = sub_match<const char32_t*>;
using ssub_match = sub_match<string::const_iterator>;
using wssub_match = sub_match<wstring::const_iterator>;
using u8ssub_match = sub_match<u8string::const_iterator>;
using u16ssub_match = sub_match<u16string::const_iterator>;
using u32ssub_match = sub_match<u32string::const_iterator>;
match_results
template<class BidirectionalIterator,
class Allocator = allocator<sub_match<BidirectionalIterator>>>
class match_results;
using cmatch = match_results<const char*>;
using wcmatch = match_results<const wchar_t*>;
using u8cmatch = match_results<const char8_t*>;
using u16cmatch = match_results<const char16_t*>;
using u32cmatch = match_results<const char32_t*>;
using smatch = match_results<string::const_iterator>;
using wsmatch = match_results<wstring::const_iterator>;
using u8smatch = match_results<u8string::const_iterator>;
using u16smatch = match_results<u16string::const_iterator>;
using u32smatch = match_results<u32string::const_iterator>;
regex_search
template<class BidirectionalIterator, class Allocator, class charT, class traits>
bool regex_search(BidirectionalIterator first, BidirectionalIterator last,
match_results<BidirectionalIterator, Allocator>& m,
const basic_regex<charT, traits>& e,
regex_constants::match_flag_type flags = regex_constants::match_default);
template<class BidirectionalIterator, class charT, class traits>
bool regex_search(BidirectionalIterator first, BidirectionalIterator last,
const basic_regex<charT, traits>& e,
regex_constants::match_flag_type flags = regex_constants::match_default);
template<class charT, class Allocator, class traits>
bool regex_search(const charT* str,
match_results<const charT*, Allocator>& m,
const basic_regex<charT, traits>& e,
regex_constants::match_flag_type flags = regex_constants::match_default);
template<class charT, class traits>
bool regex_search(const charT* str,
const basic_regex<charT, traits>& e,
regex_constants::match_flag_type flags = regex_constants::match_default);
template<class ST, class SA, class charT, class traits>
bool regex_search(const basic_string<charT, ST, SA>& s,
const basic_regex<charT, traits>& e,
regex_constants::match_flag_type flags = regex_constants::match_default);
template<class ST, class SA, class Allocator, class charT, class traits>
bool regex_search(const basic_string<charT, ST, SA>& s,
match_results<typename basic_string<charT, ST, SA>::const_iterator,
Allocator>& m,
const basic_regex<charT, traits>& e,
regex_constants::match_flag_type flags = regex_constants::match_default);
template<class ST, class SA, class Allocator, class charT, class traits>
bool regex_search(const basic_string<charT, ST, SA>&&,
match_results<typename basic_string<charT, ST, SA>::const_iterator,
Allocator>&,
const basic_regex<charT, traits>&,
regex_constants::match_flag_type
= regex_constants::match_default) = delete;
template<class BidirectionalIterator, class Allocator, class charT, class traits>
bool regex_search(BidirectionalIterator first, BidirectionalIterator last,
BidirectionalIterator lookbehindlimit,
match_results<BidirectionalIterator, Allocator>& m,
const basic_regex<charT, traits>& e,
regex_constants::match_flag_type flags = regex_constants::match_default);
template<class BidirectionalIterator, class charT, class traits>
bool regex_search(BidirectionalIterator first, BidirectionalIterator last,
BidirectionalIterator lookbehindlimit,
const basic_regex<charT, traits>& e,
regex_constants::match_flag_type flags = regex_constants::match_default);
template<class BidirectionalIterator, class Allocator>
bool regex_search(BidirectionalIterator first, BidirectionalIterator last,
BidirectionalIterator lookbehindlimit,
match_results<BidirectionalIterator, Allocator>& m,
const basic_regex<char8_t>& e,
regex_constants::match_flag_type flags = regex_constants::match_default);
template<class BidirectionalIterator, class Allocator>
bool regex_search(BidirectionalIterator first, BidirectionalIterator last,
match_results<BidirectionalIterator, Allocator>& m,
const basic_regex<char8_t>& e,
regex_constants::match_flag_type flags = regex_constants::match_default);
template<class BidirectionalIterator, class Allocator>
bool regex_search(BidirectionalIterator first, BidirectionalIterator last,
BidirectionalIterator lookbehindlimit,
match_results<BidirectionalIterator, Allocator>& m,
const basic_regex<char16_t>& e,
regex_constants::match_flag_type flags = regex_constants::match_default);
template<class BidirectionalIterator, class Allocator>
bool regex_search(BidirectionalIterator first, BidirectionalIterator last,
match_results<BidirectionalIterator, Allocator>& m,
const basic_regex<char16_t>& e,
regex_constants::match_flag_type flags = regex_constants::match_default);
template<class BidirectionalIterator, class Allocator>
bool regex_search(BidirectionalIterator first, BidirectionalIterator last,
BidirectionalIterator lookbehindlimit,
match_results<BidirectionalIterator, Allocator>& m,
const basic_regex<char32_t>& e,
regex_constants::match_flag_type flags = regex_constants::match_default);
template<class BidirectionalIterator, class Allocator>
bool regex_search(BidirectionalIterator first, BidirectionalIterator last,
match_results<BidirectionalIterator, Allocator>& m,
const basic_regex<char32_t>& e,
regex_constants::match_flag_type flags = regex_constants::match_default);
regex_iterator
template<class BidirectionalIterator,
class charT = typename iterator_traits<BidirectionalIterator>::value_type,
class traits = regex_traits<charT>>
class regex_iterator;
using cregex_iterator = regex_iterator<const char*>;
using wcregex_iterator = regex_iterator<const wchar_t*>;
using u8cregex_iterator = regex_iterator<const char8_t*>;
using u16cregex_iterator = regex_iterator<const char16_t*>;
using u32cregex_iterator = regex_iterator<const char32_t*>;
using sregex_iterator = regex_iterator<string::const_iterator>;
using wsregex_iterator = regex_iterator<wstring::const_iterator>;
using u8sregex_iterator = regex_iterator<u8string::const_iterator>;
using u16sregex_iterator = regex_iterator<u16string::const_iterator>;
using u32sregex_iterator = regex_iterator<u32string::const_iterator>;
regex_token_iterator
template<class BidirectionalIterator,
class charT = typename iterator_traits<BidirectionalIterator>::value_type,
class traits = regex_traits<charT>>
class regex_token_iterator;
using cregex_token_iterator = regex_token_iterator<const char*>;
using wcregex_token_iterator = regex_token_iterator<const wchar_t*>;
using u8cregex_token_iterator = regex_token_iterator<const char8_t*>;
using u16cregex_token_iterator = regex_token_iterator<const char16_t*>;
using u32cregex_token_iterator = regex_token_iterator<const char32_t*>;
using sregex_token_iterator = regex_token_iterator<string::const_iterator>;
using wsregex_token_iterator = regex_token_iterator<wstring::const_iterator>;
using u8sregex_token_iterator = regex_token_iterator<u8string::const_iterator>;
using u16sregex_token_iterator = regex_token_iterator<u16string::const_iterator>;
using u32sregex_token_iterator = regex_token_iterator<u32string::const_iterator>;
namespace pmr {
template<class BidirectionalIterator>
using match_results =
std::match_results<BidirectionalIterator,
polymorphic_allocator<sub_match<BidirectionalIterator>>>;
using cmatch = match_results<const char*>;
using wcmatch = match_results<const wchar_t*>;
using u8cmatch = match_results<const char8_t*>;
using u16cmatch = match_results<const char16_t*>;
using u32cmatch = match_results<const char32_t*>;
using smatch = match_results<string::const_iterator>;
using wsmatch = match_results<wstring::const_iterator>;
using u8smatch = match_results<u8string::const_iterator>;
using u16smatch = match_results<u16string::const_iterator>;
using u32smatch = match_results<u32string::const_iterator>;
}
namespace std::regex_constants {
using syntax_option_type = T1;
inline constexpr syntax_option_type icase = unspecified ;
inline constexpr syntax_option_type nosubs = unspecified ;
inline constexpr syntax_option_type optimize = unspecified ;
inline constexpr syntax_option_type collate = unspecified ;
inline constexpr syntax_option_type ECMAScript = unspecified ;
inline constexpr syntax_option_type basic = unspecified ;
inline constexpr syntax_option_type extended = unspecified ;
inline constexpr syntax_option_type awk = unspecified ;
inline constexpr syntax_option_type grep = unspecified ;
inline constexpr syntax_option_type egrep = unspecified ;
inline constexpr syntax_option_type multiline = unspecified ;
inline constexpr syntax_option_type ECMAScript2019 = unspecified ;
inline constexpr syntax_option_type dotall = unspecified ;
}
1
The type syntax_option_type
is an implementation-defined bitmask type (16.4.2.2.4). Setting its elements has the effects listed in Table 124. A valid value of type syntax_option_type
shall have at most one of the grammar elements ECMAScript
, basic
, extended
, awk
, grep
, egrep
, ECMAScript2019
, set. If no grammar element is set, the default grammar is ECMAScript2019
when a value of type syntax_option_type
is passed to an instance of one of the specializations basic_regex<char8_t>
, basic_regex<char16_t>
, and basic_regex<char32_t>
; otherwise ECMAScript
.
...
Element | Effect(s) if set |
---|---|
icase | Specifies that matching of regular expressions against a character container sequence shall be performed without regard to case. |
nosubs | Specifies that no sub-expressions shall be considered to be marked, so that when a regular expression is matched against a character container sequence, no sub-expression matches shall be stored in the supplied match_results object. |
optimize | Specifies that the regular expression engine should pay more attention to the speed with which regular expressions are matched, and less to the speed with which regular expression objects are constructed. Otherwise it has no detectable effect on the program output. |
collate | Specifies that character ranges of the form "[a-b]" shall be locale sensitive. This flag has no effect when the ECMAScript2019 engine is selected. |
ECMAScript | Specifies that the grammar recognized by the regular expression engine shall be that used by ECMAScript in ECMA-262 third edition, as modified in 30.13. See also: ECMA-262 third edition 15.10. If this flag is passed to an instance of basic_regex<char8_t>, basic_regex<char16_t>, or basic_regex<char32_t>, it shall be interpreted as if no grammar element is set. |
basic | Specifies that the grammar recognized by the regular expression engine shall be that used by basic regular expressions in POSIX. See also: POSIX, Base Definitions and Headers, Section 9.3. If this flag is passed to an instance of basic_regex<char8_t>, basic_regex<char16_t>, or basic_regex<char32_t>, it shall be interpreted as if no grammar element is set. |
extended | Specifies that the grammar recognized by the regular expression engine shall be that used by extended regular expressions in POSIX. See also: POSIX, Base Definitions and Headers, Section 9.4. If this flag is passed to an instance of basic_regex<char8_t>, basic_regex<char16_t>, or basic_regex<char32_t>, it shall be interpreted as if no grammar element is set. |
awk | Specifies that the grammar recognized by the regular expression engine shall be that used by the utility awk in POSIX. If this flag is passed to an instance of basic_regex<char8_t>, basic_regex<char16_t>, or basic_regex<char32_t>, it shall be interpreted as if no grammar element is set. |
grep | Specifies that the grammar recognized by the regular expression engine shall be that used by the utility grep in POSIX. If this flag is passed to an instance of basic_regex<char8_t>, basic_regex<char16_t>, or basic_regex<char32_t>, it shall be interpreted as if no grammar element is set. |
egrep | Specifies that the grammar recognized by the regular expression engine shall be that used by the utility grep when given the -E option in POSIX. If this flag is passed to an instance of basic_regex<char8_t>, basic_regex<char16_t>, or basic_regex<char32_t>, it shall be interpreted as if no grammar element is set. |
multiline | Specifies that ^ shall match the beginning of a line and $ shall match the end of a line, if the ECMAScript or ECMAScript2019 engine is selected. |
ECMAScript2019 | Specifies that the grammar recognized by the regular expression engine and the behavior of an algorithm that uses an instance of basic_regex constructed with this flag shall be those used and performed by ECMAScript in ECMA-262 2019 or later with the u flag being set, as modified in 30.14. See also: ECMA-262 2019 21.2. If this flag is passed to an instance of basic_regex other than basic_regex<char8_t>, basic_regex<char16_t>, and basic_regex<char32_t>, it shall be interpreted as if no grammar element is set. |
dotall | Specifies that . shall match any code point including new-line characters, if the ECMAScript2019 engine is selected. |
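Taken together, the default-grammar rule and the "interpreted as if no grammar element is set" clauses in the table amount to a small normalization step. The sketch below models it with a stand-alone enumeration; `grammar` and `effective_grammar` are hypothetical names used only for illustration, since the real syntax_option_type is an implementation-defined bitmask type.

```cpp
// Stand-in for the grammar elements of syntax_option_type.
enum class grammar {
    none, ecmascript, basic, extended, awk, grep, egrep, ecmascript2019
};

// Returns the grammar that is actually in effect.
// unicode_specialization is true for basic_regex<char8_t>,
// basic_regex<char16_t>, and basic_regex<char32_t>.
constexpr grammar effective_grammar(grammar requested,
                                    bool unicode_specialization) {
    return unicode_specialization
        // Every other grammar element is treated as "none set", and the
        // default for these specializations is ECMAScript2019.
        ? grammar::ecmascript2019
        : (requested == grammar::none ||
           requested == grammar::ecmascript2019)
            // ECMAScript2019 is treated as "none set" elsewhere, and the
            // default for char and wchar_t remains ECMAScript.
            ? grammar::ecmascript
            : requested;
}
```

Under this reading, the three Unicode specializations always end up with ECMAScript2019, while char and wchar_t instances never do.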
match_flag_type
[re.matchflag]
Element | Effect(s) if set |
---|---|
... | ... |
format_default | When a regular expression match is to be replaced by a new string, the new string shall be constructed using the rules used by the ECMAScript replace function in ECMA-262 third edition, part 15.5.4.11 String.prototype.replace. In addition, during search and replace operations all non-overlapping occurrences of the regular expression shall be located and replaced, and sections of the input that did not match the expression shall be copied unchanged to the output string. |
basic_regex
[re.regex]
basic_regex
specializations [re.regex.special]
1
The header <regex> defines three specializations of the class template basic_regex
: basic_regex<char8_t>
, basic_regex<char16_t>
, and basic_regex<char32_t>
.
2
[Note:
These specializations are not necessarily required to be implemented separately; typical implementations will use an internal iterator class template that has specializations for char8_t
, char16_t
, and char32_t
to translate an input sequence of UTF-8, UTF-16, and UTF-32 respectively to a sequence of Unicode code points, and construct a finite state machine by parsing that translated sequence in a base class shared by these three specializations.
—end note]
3
These specializations shall not use regex_traits
to construct an internal finite state machine. [Note: Particularly, case folding, translating a character prior to comparison without regard to case, shall be performed as defined in ECMA-262 2019 or later, and shall not be performed as defined in traits::translate_nocase(c)
. —end note]
basic_regex<char8_t>
specializations [re.regex.special.char8_t]
namespace std {
template<>
class basic_regex<char8_t> {
public:
// types
using value_type = char8_t;
using traits_type = void;
using string_type = basic_string<char8_t>;
using flag_type = regex_constants::syntax_option_type;
using locale_type = locale;
// 30.5.1, constants
static constexpr flag_type icase = regex_constants::icase;
static constexpr flag_type nosubs = regex_constants::nosubs;
static constexpr flag_type optimize = regex_constants::optimize;
static constexpr flag_type multiline = regex_constants::multiline;
static constexpr flag_type ECMAScript2019 = regex_constants::ECMAScript2019;
static constexpr flag_type dotall = regex_constants::dotall;
// 30.8.7.1.1, construct/copy/destroy
basic_regex();
explicit basic_regex(const char8_t* p, flag_type f = regex_constants::ECMAScript2019);
basic_regex(const char8_t* p, size_t len, flag_type f = regex_constants::ECMAScript2019);
basic_regex(const basic_regex&);
basic_regex(basic_regex&&) noexcept;
template<class ST, class SA>
explicit basic_regex(const basic_string<char8_t, ST, SA>& p,
flag_type f = regex_constants::ECMAScript2019);
template<class ForwardIterator>
basic_regex(ForwardIterator first, ForwardIterator last,
flag_type f = regex_constants::ECMAScript2019);
basic_regex(initializer_list<char8_t>, flag_type = regex_constants::ECMAScript2019);
~basic_regex();
basic_regex& operator=(const basic_regex&);
basic_regex& operator=(basic_regex&&) noexcept;
basic_regex& operator=(const char8_t* ptr);
basic_regex& operator=(initializer_list<char8_t> il);
template<class ST, class SA>
basic_regex& operator=(const basic_string<char8_t, ST, SA>& p);
// 30.8.7.1.2, assign
basic_regex& assign(const basic_regex& that);
basic_regex& assign(basic_regex&& that) noexcept;
basic_regex& assign(const char8_t* ptr, flag_type f = regex_constants::ECMAScript2019);
basic_regex& assign(const char8_t* p, size_t len, flag_type f);
template<class string_traits, class A>
basic_regex& assign(const basic_string<char8_t, string_traits, A>& s,
flag_type f = regex_constants::ECMAScript2019);
template<class InputIterator>
basic_regex& assign(InputIterator first, InputIterator last,
flag_type f = regex_constants::ECMAScript2019);
basic_regex& assign(initializer_list<char8_t>,
flag_type = regex_constants::ECMAScript2019);
// 30.8.7.1.3, const operations
unsigned mark_count() const;
unsigned gname_to_gnumber(const char8_t* p) const;
unsigned gname_to_gnumber(const char8_t* p, size_t len) const;
template<class string_traits, class A>
unsigned gname_to_gnumber(const basic_string<char8_t, string_traits, A>& s) const;
template<class InputIterator>
unsigned gname_to_gnumber(InputIterator first, InputIterator last) const;
flag_type flags() const;
// 30.8.7.1.4, locale
locale_type imbue(locale_type loc);
locale_type getloc() const;
// 30.8.7.1.5, swap
void swap(basic_regex&);
};
basic_regex();
1
Effects: Constructs an object of class basic_regex
that does not match any character sequence.
explicit basic_regex(const char8_t* p, flag_type f = regex_constants::ECMAScript2019);
2
Requires: p
shall not be a null pointer.
3
Throws: regex_error
if p
is not a valid regular expression.
4
Effects: Constructs an object of class basic_regex
; the object’s internal finite state machine is constructed from the regular expression contained in the array of char8_t
of length char_traits<char8_t>::length(p)
whose first element is designated by p
and whose elements represent a UTF-8 sequence, and interpreted according to the flags f
.
5
Ensures: flags()
returns f
. mark_count()
returns the number of marked sub-expressions within the expression.
basic_regex(const char8_t* p, size_t len, flag_type f = regex_constants::ECMAScript2019);
6
Requires: p
shall not be a null pointer.
7
Throws: regex_error
if p
is not a valid regular expression.
8
Effects: Constructs an object of class basic_regex
; the object’s internal finite state machine is constructed from the regular expression contained in the sequence of UTF-8 code units [p, p+len)
, and interpreted according to the flags specified in f
.
9
Ensures: flags()
returns f
. mark_count()
returns the number of marked sub-expressions within the expression.
basic_regex(const basic_regex& e);
10
Effects: Constructs an object of class basic_regex
as a copy of the object e
.
11
Ensures: flags()
and mark_count()
return e.flags()
and e.mark_count()
, respectively.
basic_regex(basic_regex&& e) noexcept;
12
Effects: Move constructs an object of class basic_regex
from e
.
13
Ensures: flags()
and mark_count()
return the values that e.flags()
and e.mark_count()
, respectively, had before construction. e
is in a valid state with unspecified value.
template<class ST, class SA>
explicit basic_regex(const basic_string<char8_t, ST, SA>& s,
flag_type f = regex_constants::ECMAScript2019);
14
Throws: regex_error
if s
is not a valid regular expression.
15
Effects: Constructs an object of class basic_regex
; the object’s internal finite state machine is constructed from the regular expression contained in the string s
whose elements represent a UTF-8 sequence, and interpreted according to the flags specified in f
.
16
Ensures: flags()
returns f
. mark_count()
returns the number of marked sub-expressions within the expression.
template<class ForwardIterator>
basic_regex(ForwardIterator first, ForwardIterator last,
flag_type f = regex_constants::ECMAScript2019);
17
Throws: regex_error
if the sequence [first, last)
is not a valid regular expression.
18
Effects: Constructs an object of class basic_regex
; the object’s internal finite state machine is constructed from the regular expression contained in the sequence of UTF-8 code units [first, last)
, and interpreted according to the flags specified in f
.
19
Ensures: flags()
returns f
. mark_count()
returns the number of marked sub-expressions within the expression.
basic_regex(initializer_list<char8_t> il, flag_type f = regex_constants::ECMAScript2019);
20
Effects: Same as basic_regex(il.begin(), il.end(), f)
.
basic_regex& operator=(const basic_regex& e);
1
Effects: Copies e
into *this
and returns *this
.
2
Ensures: flags()
and mark_count()
return e.flags()
and e.mark_count()
, respectively.
basic_regex& operator=(basic_regex&& e) noexcept;
3
Effects: Move assigns from e
into *this
and returns *this
.
4
Ensures: flags()
and mark_count()
return the values that e.flags()
and e.mark_count()
, respectively, had before assignment. e
is in a valid state with unspecified value.
basic_regex& operator=(const char8_t* ptr);
5
Requires: ptr
shall not be a null pointer.
6
Effects: Returns assign(ptr)
.
basic_regex& operator=(initializer_list<char8_t> il);
7
Effects: Returns assign(il.begin(), il.end())
.
template<class ST, class SA>
basic_regex& operator=(const basic_string<char8_t, ST, SA>& p);
8
Effects: Returns assign(p)
.
basic_regex& assign(const basic_regex& that);
9
Effects: Equivalent to: return *this = that;
basic_regex& assign(basic_regex&& that) noexcept;
10
Effects: Equivalent to: return *this = std::move(that);
basic_regex& assign(const char8_t* ptr, flag_type f = regex_constants::ECMAScript2019);
11
Returns: assign(string_type(ptr), f)
.
basic_regex& assign(const char8_t* ptr, size_t len, flag_type f = regex_constants::ECMAScript2019);
12
Returns: assign(string_type(ptr, len), f)
.
template<class string_traits, class A>
basic_regex& assign(const basic_string<char8_t, string_traits, A>& s,
flag_type f = regex_constants::ECMAScript2019);
13
Throws: regex_error
if s
is not a valid regular expression.
14
Returns: *this
.
15
Effects: Assigns the regular expression contained in the string s
whose elements represent a UTF-8 sequence, interpreted according to the flags specified in f
. If an exception is thrown, *this
is unchanged.
16
Ensures: If no exception is thrown, flags()
returns f
and mark_count()
returns the number of marked sub-expressions within the expression.
template<class InputIterator>
basic_regex& assign(InputIterator first, InputIterator last,
flag_type f = regex_constants::ECMAScript2019);
17
Requires: InputIterator
shall meet the Cpp17InputIterator requirements (23.3.5.2).
18
Returns: assign(string_type(first, last), f)
.
basic_regex& assign(initializer_list<char8_t> il,
flag_type f = regex_constants::ECMAScript2019);
19
Effects: Same as assign(il.begin(), il.end(), f)
.
20
Returns: *this
.
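The guarantee that *this is unchanged when assign throws is carried over from the existing basic_regex wording, so it can be illustrated with std::regex as a stand-in for the proposed specialization. The sketch below is illustrative only; the helper names try_assign and check_failed_assign_keeps_old_expression are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: tries to assign `pattern` to `re`.
// Returns false if the pattern is rejected with std::regex_error.
bool try_assign(std::regex& re, const std::string& pattern) {
    try {
        re.assign(pattern);
        return true;
    } catch (const std::regex_error&) {
        return false;
    }
}

// Usage: a rejected pattern leaves the previously assigned expression in place.
void check_failed_assign_keeps_old_expression() {
    std::regex re("(a)b");
    assert(!try_assign(re, "("));        // unbalanced parenthesis is rejected
    assert(std::regex_match("ab", re));  // the old expression still matches
    assert(re.mark_count() == 1u);
    assert(try_assign(re, "(x)(y)"));    // a valid pattern replaces it
    assert(re.mark_count() == 2u);
}
```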
unsigned mark_count() const;
1
Effects: Returns the number of marked sub-expressions within the regular expression.
unsigned gname_to_gnumber(const char8_t* p) const;
2
Returns: gname_to_gnumber(string_type(p))
.
unsigned gname_to_gnumber(const char8_t* p, size_t len) const;
3
Returns: gname_to_gnumber(string_type(p, len))
.
template<class string_traits, class A>
unsigned gname_to_gnumber(const basic_string<char8_t, string_traits, A>& s) const;
4
Throws: error_backref
if s
is an empty string or if the regular expression contains no marked sub-expression whose group name is identical to the UTF-8 string s
.
5
Effects: Returns the group number of the marked sub-expression within the regular expression whose group name is identical to the UTF-8 string s
.
template<class InputIterator>
unsigned gname_to_gnumber(InputIterator first, InputIterator last) const;
6
Requires: InputIterator
shall meet the Cpp17InputIterator requirements (23.3.5.2).
7
Returns: gname_to_gnumber(string_type(first, last))
.
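Existing std::regex implementations have neither named groups nor gname_to_gnumber, so the mapping from group name to group number can only be sketched outside the library. The following is a rough illustration, not the proposed implementation: the hypothetical helper collect_group_names scans a well-formed ECMAScript pattern for "(?<name>" while counting capturing groups, and a free function named after the proposed member reports error_backref for an empty or unknown name, as in the Throws paragraph above. Plain std::string stands in for a UTF-8 string.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <regex>
#include <string>

// Hypothetical helper: maps each group name in `pattern` to its group number.
// Escapes and character classes are skipped; the pattern is assumed well-formed.
std::map<std::string, unsigned> collect_group_names(const std::string& pattern) {
    std::map<std::string, unsigned> names;
    unsigned group_count = 0;
    bool in_class = false;
    for (std::size_t i = 0; i < pattern.size(); ++i) {
        const char c = pattern[i];
        if (c == '\\') { ++i; continue; }  // skip the escaped character
        if (in_class) { in_class = (c != ']'); continue; }
        if (c == '[') { in_class = true; continue; }
        if (c != '(') continue;
        if (i + 1 < pattern.size() && pattern[i + 1] == '?') {
            // Only "(?<name>" captures; "(?:", "(?=", "(?!", "(?<=", "(?<!" do not.
            const bool named = i + 3 < pattern.size() && pattern[i + 2] == '<' &&
                               pattern[i + 3] != '=' && pattern[i + 3] != '!';
            const std::size_t close =
                named ? pattern.find('>', i + 3) : std::string::npos;
            if (close == std::string::npos) continue;
            names[pattern.substr(i + 3, close - (i + 3))] = ++group_count;
            i = close;
        } else {
            ++group_count;  // unnamed capturing group
        }
    }
    return names;
}

// Free-function stand-in for the proposed member: throws error_backref for an
// empty or unknown name, otherwise returns the group number.
unsigned gname_to_gnumber(const std::map<std::string, unsigned>& names,
                          const std::string& s) {
    const auto it = names.find(s);
    if (s.empty() || it == names.end())
        throw std::regex_error(std::regex_constants::error_backref);
    return it->second;
}

// Usage: the unnamed middle group still consumes group number 2.
void check_group_numbers() {
    const auto names = collect_group_names("(?<year>\\d{4})-(\\d{2})-(?<day>\\d{2})");
    assert(gname_to_gnumber(names, "year") == 1u);
    assert(gname_to_gnumber(names, "day") == 3u);
}
```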
flag_type flags() const;
8
Effects: Returns a copy of the regular expression syntax flags that were passed to the object’s constructor or to the last call to assign
.
locale_type imbue(locale_type loc);
1
Returns: locale_type()
.
locale_type getloc() const;
2
Returns: locale_type()
.
void swap(basic_regex& e);
1
Effects: Swaps the contents of the two regular expressions.
2
Ensures: *this
contains the regular expression that was in e
, e
contains the regular expression that was in *this
.
3
Complexity: Constant time.
basic_regex<char16_t>
specializations [re.regex.special.char16_t]
namespace std {
template<>
class basic_regex<char16_t> {
public:
// types
using value_type = char16_t;
using traits_type = void;
using string_type = basic_string<char16_t>;
using flag_type = regex_constants::syntax_option_type;
using locale_type = locale;
// 30.5.1, constants
static constexpr flag_type icase = regex_constants::icase;
static constexpr flag_type nosubs = regex_constants::nosubs;
static constexpr flag_type optimize = regex_constants::optimize;
static constexpr flag_type multiline = regex_constants::multiline;
static constexpr flag_type ECMAScript2019 = regex_constants::ECMAScript2019;
static constexpr flag_type dotall = regex_constants::dotall;
// construct/copy/destroy
basic_regex();
explicit basic_regex(const char16_t* p, flag_type f = regex_constants::ECMAScript2019);
basic_regex(const char16_t* p, size_t len, flag_type f = regex_constants::ECMAScript2019);
basic_regex(const basic_regex&);
basic_regex(basic_regex&&) noexcept;
template<class ST, class SA>
explicit basic_regex(const basic_string<char16_t, ST, SA>& p,
flag_type f = regex_constants::ECMAScript2019);
template<class ForwardIterator>
basic_regex(ForwardIterator first, ForwardIterator last,
flag_type f = regex_constants::ECMAScript2019);
basic_regex(initializer_list<char16_t>, flag_type = regex_constants::ECMAScript2019);
~basic_regex();
basic_regex& operator=(const basic_regex&);
basic_regex& operator=(basic_regex&&) noexcept;
basic_regex& operator=(const char16_t* ptr);
basic_regex& operator=(initializer_list<char16_t> il);
template<class ST, class SA>
basic_regex& operator=(const basic_string<char16_t, ST, SA>& p);
// assign
basic_regex& assign(const basic_regex& that);
basic_regex& assign(basic_regex&& that) noexcept;
basic_regex& assign(const char16_t* ptr, flag_type f = regex_constants::ECMAScript2019);
basic_regex& assign(const char16_t* p, size_t len, flag_type f);
template<class string_traits, class A>
basic_regex& assign(const basic_string<char16_t, string_traits, A>& s,
flag_type f = regex_constants::ECMAScript2019);
template<class InputIterator>
basic_regex& assign(InputIterator first, InputIterator last,
flag_type f = regex_constants::ECMAScript2019);
basic_regex& assign(initializer_list<char16_t>,
flag_type = regex_constants::ECMAScript2019);
// const operations
unsigned mark_count() const;
unsigned gname_to_gnumber(const char16_t* p) const;
unsigned gname_to_gnumber(const char16_t* p, size_t len) const;
template<class string_traits, class A>
unsigned gname_to_gnumber(const basic_string<char16_t, string_traits, A>& s) const;
template<class InputIterator>
unsigned gname_to_gnumber(InputIterator first, InputIterator last) const;
flag_type flags() const;
// locale
locale_type imbue(locale_type loc);
locale_type getloc() const;
// swap
void swap(basic_regex&);
};
1
Same as the specification of class basic_regex<char8_t>
specialization, except that the words char8_t
and UTF-8 that appear in the text are replaced with char16_t
and UTF-16, respectively.
If saying "Same as the specification of ..." is not appropriate, this subclause will be rewritten in the same form as [re.regex.special.char8_t].
basic_regex<char32_t>
specializations [re.regex.special.char32_t]
namespace std {
template<>
class basic_regex<char32_t> {
public:
// types
using value_type = char32_t;
using traits_type = void;
using string_type = basic_string<char32_t>;
using flag_type = regex_constants::syntax_option_type;
using locale_type = locale;
// 30.5.1, constants
static constexpr flag_type icase = regex_constants::icase;
static constexpr flag_type nosubs = regex_constants::nosubs;
static constexpr flag_type optimize = regex_constants::optimize;
static constexpr flag_type multiline = regex_constants::multiline;
static constexpr flag_type ECMAScript2019 = regex_constants::ECMAScript2019;
static constexpr flag_type dotall = regex_constants::dotall;
// construct/copy/destroy
basic_regex();
explicit basic_regex(const char32_t* p, flag_type f = regex_constants::ECMAScript2019);
basic_regex(const char32_t* p, size_t len, flag_type f = regex_constants::ECMAScript2019);
basic_regex(const basic_regex&);
basic_regex(basic_regex&&) noexcept;
template<class ST, class SA>
explicit basic_regex(const basic_string<char32_t, ST, SA>& p,
flag_type f = regex_constants::ECMAScript2019);
template<class ForwardIterator>
basic_regex(ForwardIterator first, ForwardIterator last,
flag_type f = regex_constants::ECMAScript2019);
basic_regex(initializer_list<char32_t>, flag_type = regex_constants::ECMAScript2019);
~basic_regex();
basic_regex& operator=(const basic_regex&);
basic_regex& operator=(basic_regex&&) noexcept;
basic_regex& operator=(const char32_t* ptr);
basic_regex& operator=(initializer_list<char32_t> il);
template<class ST, class SA>
basic_regex& operator=(const basic_string<char32_t, ST, SA>& p);
// assign
basic_regex& assign(const basic_regex& that);
basic_regex& assign(basic_regex&& that) noexcept;
basic_regex& assign(const char32_t* ptr, flag_type f = regex_constants::ECMAScript2019);
basic_regex& assign(const char32_t* p, size_t len, flag_type f);
template<class string_traits, class A>
basic_regex& assign(const basic_string<char32_t, string_traits, A>& s,
flag_type f = regex_constants::ECMAScript2019);
template<class InputIterator>
basic_regex& assign(InputIterator first, InputIterator last,
flag_type f = regex_constants::ECMAScript2019);
basic_regex& assign(initializer_list<char32_t>,
flag_type = regex_constants::ECMAScript2019);
// const operations
unsigned mark_count() const;
unsigned gname_to_gnumber(const char32_t* p) const;
unsigned gname_to_gnumber(const char32_t* p, size_t len) const;
template<class string_traits, class A>
unsigned gname_to_gnumber(const basic_string<char32_t, string_traits, A>& s) const;
template<class InputIterator>
unsigned gname_to_gnumber(InputIterator first, InputIterator last) const;
flag_type flags() const;
// locale
locale_type imbue(locale_type loc);
locale_type getloc() const;
// swap
void swap(basic_regex&);
};
1
Same as the specification of class basic_regex<char8_t>
specialization, except that the words char8_t
and UTF-8 that appear in the text are replaced with char32_t
and UTF-32, respectively.
If saying "Same as the specification of ..." is not appropriate, this subclause will be rewritten in the same form as [re.regex.special.char8_t].
regex_search
[re.alg.search]
Variants that take three bidirectional iterators are also added to the non-specialized regex_search
, both for use by regex_iterator
and for consistency.
template<class BidirectionalIterator, class Allocator, class charT, class traits>
bool regex_search(BidirectionalIterator first, BidirectionalIterator last,
BidirectionalIterator lookbehindlimit,
match_results<BidirectionalIterator, Allocator>& m,
const basic_regex<charT, traits>& e,
regex_constants::match_flag_type flags = regex_constants::match_default);
9
Returns: regex_search(first, last, m, e, flags)
.
template<class BidirectionalIterator, class Allocator, class charT, class traits>
bool regex_search(BidirectionalIterator first, BidirectionalIterator last,
BidirectionalIterator lookbehindlimit,
const basic_regex<charT, traits>& e,
regex_constants::match_flag_type flags = regex_constants::match_default);
10
Returns: regex_search(first, last, e, flags)
.
regex_search
specializations [re.alg.search.special]
1
The header <regex> defines three specializations of the function template regex_search
that take as one of their parameters an instance of basic_regex<char8_t>
, basic_regex<char16_t>
, and basic_regex<char32_t>
.
2
[Note:
These specializations are not necessarily required to be implemented separately; typical implementations will use an internal iterator class template that has specializations for char8_t
, char16_t
, and char32_t
to translate an input sequence of UTF-8, UTF-16, and UTF-32 respectively to a sequence of Unicode code points, and compare that translated sequence with the passed finite state machine in a base function shared by these three specializations.
—end note]
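The translation step described in the note can be sketched as follows. This is a minimal illustration under stated assumptions, not the layout of any real implementation: the function names next_from_utf8, next_from_utf16 and code_points are hypothetical, the input is assumed to be well-formed, and std::string stands in for a sequence of char8_t so that the sketch also compiles before C++20.

```cpp
#include <cassert>
#include <string>

// Hypothetical decoder: reads one code point from the UTF-8 code units in
// [it, last) and advances `it` past it. Well-formed input is assumed.
template <class ForwardIterator>
char32_t next_from_utf8(ForwardIterator& it, ForwardIterator last) {
    assert(it != last);
    const unsigned char lead = static_cast<unsigned char>(*it++);
    int trailing = 0;
    char32_t cp = lead;  // single-byte (ASCII) case
    if (lead >= 0xF0)      { trailing = 3; cp = lead & 0x07; }
    else if (lead >= 0xE0) { trailing = 2; cp = lead & 0x0F; }
    else if (lead >= 0xC0) { trailing = 1; cp = lead & 0x1F; }
    for (; trailing > 0 && it != last; --trailing)
        cp = (cp << 6) | (static_cast<unsigned char>(*it++) & 0x3F);
    return cp;
}

// Hypothetical decoder: same contract for UTF-16, combining surrogate pairs.
template <class ForwardIterator>
char32_t next_from_utf16(ForwardIterator& it, ForwardIterator last) {
    assert(it != last);
    const char32_t unit = static_cast<char16_t>(*it++);
    if (unit >= 0xD800 && unit <= 0xDBFF && it != last) {
        const char32_t low = static_cast<char16_t>(*it);
        if (low >= 0xDC00 && low <= 0xDFFF) {
            ++it;
            return 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
        }
    }
    return unit;
}

// Both encodings reduce to the same sequence of code points, which is what a
// shared matching routine would consume.
std::u32string code_points(const std::string& utf8) {
    std::u32string out;
    for (auto it = utf8.begin(); it != utf8.end();)
        out.push_back(next_from_utf8(it, utf8.end()));
    return out;
}

std::u32string code_points(const std::u16string& utf16) {
    std::u32string out;
    for (auto it = utf16.begin(); it != utf16.end();)
        out.push_back(next_from_utf16(it, utf16.end()));
    return out;
}
```

Because both overloads of code_points produce the same std::u32string for the same text, a single matching routine written against code points can serve all three specializations.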
template<class BidirectionalIterator, class Allocator>
bool regex_search(BidirectionalIterator first, BidirectionalIterator last,
BidirectionalIterator lookbehindlimit,
match_results<BidirectionalIterator, Allocator>& m,
const basic_regex<char8_t>& e,
regex_constants::match_flag_type flags = regex_constants::match_default);
3
Requires: Type BidirectionalIterator
shall meet the Cpp17BidirectionalIterator requirements (23.3.5.5).
4
Effects: Determines whether there is some sub-sequence within the UTF-8 sequence [first, last)
that matches the regular expression e
. The iterator lookbehindlimit
specifies the limit up to which the UTF-8 sequence may be read backwards. If first != lookbehindlimit
then ^
shall match lookbehindlimit
instead of first
. The parameter flags
is used to control how the expression is matched against the UTF-8 sequence. Returns true
if such a sequence exists, false
otherwise.
5
Ensures: m.ready() == true
in all cases. If the function returns false
, then the effect on parameter m
is unspecified except that m.size()
returns 0
and m.empty()
returns true
. Otherwise the effects on parameter m
are given in Table 130.
template<class BidirectionalIterator, class Allocator>
bool regex_search(BidirectionalIterator first, BidirectionalIterator last,
match_results<BidirectionalIterator, Allocator>& m,
const basic_regex<char8_t>& e,
regex_constants::match_flag_type flags = regex_constants::match_default);
6
Returns: regex_search(first, last, first, m, e, flags)
.
template<class BidirectionalIterator, class Allocator>
bool regex_search(BidirectionalIterator first, BidirectionalIterator last,
BidirectionalIterator lookbehindlimit,
match_results<BidirectionalIterator, Allocator>& m,
const basic_regex<char16_t>& e,
regex_constants::match_flag_type flags = regex_constants::match_default);
7
Requires: Type BidirectionalIterator
shall meet the Cpp17BidirectionalIterator requirements (23.3.5.5).
8
Effects: Determines whether there is some sub-sequence within the UTF-16 sequence [first, last)
that matches the regular expression e
. The iterator lookbehindlimit
specifies the limit up to which the UTF-16 sequence may be read backwards. If first != lookbehindlimit
then ^
shall match lookbehindlimit
instead of first
. The parameter flags
is used to control how the expression is matched against the UTF-16 sequence. Returns true
if such a sequence exists, false
otherwise.
9
Ensures: m.ready() == true
in all cases. If the function returns false
, then the effect on parameter m
is unspecified except that m.size()
returns 0
and m.empty()
returns true
. Otherwise the effects on parameter m
are given in Table 130.
template<class BidirectionalIterator, class Allocator>
bool regex_search(BidirectionalIterator first, BidirectionalIterator last,
match_results<BidirectionalIterator, Allocator>& m,
const basic_regex<char16_t>& e,
regex_constants::match_flag_type flags = regex_constants::match_default);
10
Returns: regex_search(first, last, first, m, e, flags)
.
template<class BidirectionalIterator, class Allocator>
bool regex_search(BidirectionalIterator first, BidirectionalIterator last,
BidirectionalIterator lookbehindlimit,
match_results<BidirectionalIterator, Allocator>& m,
const basic_regex<char32_t>& e,
regex_constants::match_flag_type flags = regex_constants::match_default);
11
Requires: Type BidirectionalIterator
shall meet the Cpp17BidirectionalIterator requirements (23.3.5.5).
12
Effects: Determines whether there is some sub-sequence within the UTF-32 sequence [first, last)
that matches the regular expression e
. The iterator lookbehindlimit
specifies the limit up to which the UTF-32 sequence may be read backwards. If first != lookbehindlimit
then ^
shall match lookbehindlimit
instead of first
. The parameter flags
is used to control how the expression is matched against the UTF-32 sequence. Returns true
if such a sequence exists, false
otherwise.
13
Ensures: m.ready() == true
in all cases. If the function returns false
, then the effect on parameter m
is unspecified except that m.size()
returns 0
and m.empty()
returns true
. Otherwise the effects on parameter m
are given in Table 130.
template<class BidirectionalIterator, class Allocator>
bool regex_search(BidirectionalIterator first, BidirectionalIterator last,
match_results<BidirectionalIterator, Allocator>& m,
const basic_regex<char32_t>& e,
regex_constants::match_flag_type flags = regex_constants::match_default);
14
Returns: regex_search(first, last, first, m, e, flags)
.
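The three-iterator overloads are not available in existing implementations. The closest existing facility is regex_constants::match_prev_avail, which only tells the matcher that the character before first exists and may be inspected. The sketch below uses it with std::regex to show why the matcher needs to know what precedes first; it does not implement lookbehindlimit itself, and the helper name search_from is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>

// Stand-in sketch with std::regex. Searches [s.begin() + offset, s.end()).
// When `prev_avail` is true, match_prev_avail tells the matcher that the
// character before the start position exists and may be inspected.
bool search_from(const std::string& s, std::size_t offset,
                 const std::regex& re, bool prev_avail) {
    assert(offset <= s.size());
    const auto flags = prev_avail ? std::regex_constants::match_prev_avail
                                  : std::regex_constants::match_default;
    const auto first = s.begin() + static_cast<std::string::difference_type>(offset);
    return std::regex_search(first, s.end(), re, flags);
}
```

With the subject "xab", the pattern \bab and offset 1, the search is expected to succeed when prev_avail is false, because the start of the range is treated as a word boundary, and to fail when it is true, because the preceding x is then taken into account.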
2
Effects: Initializes begin
and end
to a
and b
, respectively, sets pregex
to addressof(re)
, sets flags
to m
, then calls regex_search(begin, end, begin, match, *pregex, flags)
. If this call returns false
the constructor sets *this
to the end-of-sequence iterator.
3
Otherwise, if the iterator holds a zero-length match, the operator calls:
regex_search(start, end, begin, match, *pregex,
flags | regex_constants::match_not_null | regex_constants::match_continuous)
If the call returns true
the operator returns *this
. Otherwise the operator increments start
and continues as if the most recent match was not a zero-length match.
4
If the most recent match was not a zero-length match, the operator sets flags
to flags | regex_constants::match_prev_avail
and calls regex_search(start, end, begin, match, *pregex, flags)
. If the call returns false
the iterator sets *this
to the end-of-sequence iterator. The iterator then returns *this
.
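The handling of zero-length matches described above follows the existing regex_iterator algorithm. The only difference in the wording here is the additional begin argument passed to regex_search. As a sketch of the observable behavior, the following uses std::sregex_iterator as a stand-in and collects every match of a pattern that can match the empty string; the helper names all_matches and check_zero_length_matches are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Collects (offset, text) for every match that regex_iterator produces.
std::vector<std::pair<std::size_t, std::string>>
all_matches(const std::string& s, const std::regex& re) {
    std::vector<std::pair<std::size_t, std::string>> out;
    for (std::sregex_iterator it(s.begin(), s.end(), re), end; it != end; ++it) {
        const std::ssub_match& whole = (*it)[0];
        out.emplace_back(static_cast<std::size_t>(whole.first - s.begin()),
                         whole.str());
    }
    return out;
}

// Usage: after an empty match the iterator retries with match_not_null and
// match_continuous, then steps forward by one position.
void check_zero_length_matches() {
    const std::regex re("a*");
    const auto m = all_matches("baab", re);
    assert(m.size() == 4u);
    assert(m[0].first == 0u && m[0].second.empty());
    assert(m[1].first == 1u && m[1].second == "aa");
    assert(m[2].first == 3u && m[2].second.empty());
    assert(m[3].first == 4u && m[3].second.empty());
}
```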
1
The regular expression grammar recognized by basic_regex
objects constructed with the ECMAScript flag is that specified by ECMA-262 third edition, except as specified below.
14
The behavior of the internal finite state machine representation when used to match a sequence of characters is as described in ECMA-262 third edition. The behavior is modified according to any match_flag_type
flags (30.5.2) specified when using the regular expression object in one of the regular expression algorithms (30.11). The behavior is also localized by interaction with the traits class template parameter as follows:
See also: ECMA-262 third edition 15.10
1 The following production within the ECMAScript2019 grammar is clarified as follows:
CharacterEscape::HexEscapeSequence
Return the numeric value of the code unit in UTF-16 that is the SV of HexEscapeSequence.
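One consequence of this clarification, as the author reads it, is that \xHH denotes the code point U+00HH regardless of the encoding of the subject sequence. In a UTF-8 subject, a code point at or above U+0080 then occupies two code units. The sketch below illustrates only that arithmetic; both function names are hypothetical and are not part of the proposal.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: the code point denoted by \xHH, given the value 0xHH.
// UTF-16 code units 0x00-0xFF stand for U+0000-U+00FF, so the value maps to itself.
char32_t hex_escape_code_point(unsigned char hh) { return hh; }

// Hypothetical helper: encodes a code point in U+0000-U+07FF as UTF-8,
// which covers every value that \xHH can denote.
std::string encode_utf8_up_to_7ff(char32_t cp) {
    assert(cp <= 0x7FF);
    std::string out;
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));                  // one code unit
    } else {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));    // lead byte
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));  // trailing byte
    }
    return out;
}
```

For instance, \x41 corresponds to the single UTF-8 code unit 0x41, whereas \xE9 corresponds to the two code units 0xC3 0xA9.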
The undated version of the ECMAScript Specification is added to references.
The author of this document is aware of P1433 Compile Time Regular Expressions. Even if a proposal based on CTRE becomes part of the C++ standard, the need for <regex> is expected to remain for situations where the regular expressions are determined only at run time.
P1844R0 was reviewed and discussed by SG16 in a telecon and at the C++ committee meeting in Belfast. The author of this document received the following feedback:
In general, it was well received, but there is considerable reluctance to investing in std::regex. Here is some of the specific feedback from the discussions:
Since the feedback recommends abandoning <regex> and, in effect, rebuilding the proposal from scratch, it is difficult to revise P1844R0, which proposed an enhancement of regex, on the basis of this feedback.
Regarding the second-to-last opinion, this was intentional: I think that touching regex for char and wchar_t could lead to an ABI break.
Regarding the last one, there does not seem to be any technical obstacle to changing it to throw an error. If required, I will change it.
This example can be used only with char8_t
, char16_t
, and char32_t
. char
and wchar_t
versions are not implemented. If basic_regex
or an algorithm function is used with a type other than char8_t
, char16_t
, and char32_t
, assert(0)
is called.
It demonstrates that adding a new syntax option for char8_t
, char16_t
, and char32_t
through template specialization is a feasible approach.
All the classes and algorithms are declared in namespace regex_proposal
, instead of std
.