Document number: | N1500 = 03-0083 |
Date: | September 22, 2003 |
Project: | Programming Language C++ |
Reference: | ISO/IEC IS 14882:1998(E) |
Reply to: | Pete Becker |
Dinkumware, Ltd. | |
petebecker@acm.org |
Give me but one firm spot on which to stand, and I will move the earth.
-- Archimedes on the lever, quoted in The Oxford Dictionary of Quotations, Second Edition, Oxford University Press, London, 1953, p. 14
Regular expressions are a powerful tool that can be used to search text for occurrences of patterns of characters. If the formal language used to define the search pattern is too hard to understand or too hard to apply, however, the user has no firm spot on which to stand, and it will be the fulcrum, not the earth, that moves. Designing this formal language requires recognizing that flexibility and understandability can easily become antagonists.
The regular expression proposal (N1429) is based on the regular expression support in ECMAScript and in POSIX. It offers richer internationalization and customization capabilities than either of these standards. This paper reviews the internationalization support in the two underlying standards as well as the support in the regular expression proposal, and recommends that we be cautious in enhancing the capabilities of regular expressions beyond those provided by these two standards.
When using regular expressions, character sequences occur in two distinct contexts: the pattern text which defines the regular expression itself and the target text which is the text to be searched for occurrences of the pattern. A regular expression implementation typically compiles the pattern text into an intermediate representation which is used by the search algorithms to find matches in the target text.
A regular expression pattern uses a set of
special characters (e.g. the familiar
".*"
) to invoke special
matching rules. All characters that are not special characters are
ordinary characters. Ordinary characters
in the pattern text match characters in the target text only if the two characters
have exactly the same bit pattern. Other than in a POSIX bracket expression there are no
special considerations for variable width characters in the pattern text. Consequently,
a multi-character sequence in the pattern text matches the target text only if the same
sequence of characters occurs in the target text.
POSIX regular expressions support more flexible matching through the use of bracket expressions that refer to named classes. A named class can be one of three types: a collating symbol, an equivalence class, or a character class.
A collating symbol, of the form
[.ce.]
, matches a sequence of characters in the
target text only if that sequence of characters is the same collating element.
Thus, in a locale which defines the sequence "ch"
as a collating element,
the bracket expression [[.ch.]]
matches the entire text sequence
"ch"
, while the ordinary bracket expression [ch]
matches the first character of that text sequence.
An equivalence class, of the form
[=cl=]
, matches a sequence of characters in the
target text only if that sequence of characters is a collating element
in the same equivalence class as cl
according to the current
locale. Thus, the equivalence class [=a=]
will typically match various
accented forms of the letter 'a'
in a locale that recognizes
accented letters.
A character class, of the form
[:name:]
, matches a character in the target text
only if that character is in the set of characters named by name
.
Character class names include, but are not limited to,
d
,
w
,
s
,
alnum
,
alpha
,
blank
,
cntrl
,
digit
,
graph
,
lower
,
print
,
punct
,
space
,
upper
,
and xdigit
.
The contents of each of these named classes depends on the current locale.
ECMAScript does not have explicit internationalization support. For the most part it's not needed, because ECMAScript traffics only in Unicode characters. It also does not have POSIX-style bracket expressions. It does, however, have escape sequences that define several character classes.
The digit class, of the form "\d"
,
matches any character in the set of ASCII characters [0-9]
.
The word class, of the form "\w"
,
matches any character in the set of ASCII characters [a-zA-Z0-9_]
.
The space class, of the form "\s"
,
matches any of the characters
<TAB>
,
<VT>
,
<FF>
,
<SP>
,
<NBSP>
,
<USP>
,
<LF>
,
<CR>
,
<LS>
,
<PS>
.
A word boundary assert, of the form
"\b"
, tests whether the current
position in the text is at a word boundary. A word boundary is a transition from
a character that is not in the word class to a character that is in the
word class or vice versa.
Each of these four escape sequences has a corresponding negation, represented by
a backslash followed by the uppercase letter corresponding to the desired escape.
Thus not digit is "\D"
, not word is "\W"
,
not space is "\S"
, and not word boundary is
"\B"
.
When the user compiles a regular expression, one of the arguments specifies whether to treat the regular expression as POSIX or ECMAScript.
When the user specifies ECMAScript the regular expression can
use both POSIX named classes and
ECMAScript escape sequences. The three classes named
[:d:]
,
[:w:]
,
and [:s:]
are used to define the contents of the character sets named by the
ECMAScript escape sequences:
ECMAScript | POSIX |
escape | named class |
\d | [[:d:]] |
\w | [[:w:]] |
\s | [[:s:]] |
\D | [^[:d:]] |
\W | [^[:w:]] |
\S | [^[:s:]] |
The ECMAScript escape sequences \b
and \B
are
defined, as in ECMAScript, in terms of transitions between characters in the set
defined by \w
and by \W
.
This support for internationalization is provided through
member functions of a regex_traits
object.
Collating symbols are supported by the
member function lookup_collatename
, which determines whether a
character sequence is a valid collation element and, if so, converts that
sequence of characters into a canonical representation. During a text
search, the canonical representation for the collation element in the pattern
will be compared to the canonical representation for the current text element
to determine whether the match succeeds.
Equivalence classes are supported by the
member function transform_primary
, which, like strxfrm
,
converts a sequence of characters into a canonical representation. During a text
search, the canonical representation for the collation element in the pattern
will be compared to the canonical representation for the current text element
to determine whether the match succeeds.
Character classes are supported by the
member function lookup_classname
, which converts the name of
a class to a numeric value that can be passed, along with the current character,
to is_class
, which returns true if the character is in the class.
The regular expressions proposal also provides for customizing the rule
for matching an ordinary character in
the pattern text and a character in the target text, as well as for remapping the
special characters used in writing
regular expressions. Just as for internationalization, these customizations
are provided through member functions of a regex_traits
object.
Character matching is handled by the member function
translate
. It takes two arguments, a character and a boolean flag, and
returns a character. The boolean flag indicates whether the translation should be
case sensitive. Two characters match if the characters returned by calling
translate
with each of the two characters are equal.
Remapping special characters is handled by the member functions
syntax_type
and escape_syntax_type
. They are called only
when the pattern text is being compiled, and they return
enumerated types that tell the compiler what a character means. So, for example,
to recognize that any character can match the pattern at the current position,
the compiler code is if (traits_inst.syntax_type(ch) == syntax_dot)
.
[:d:]
, [:w:]
, [:s:]
added to POSIXThis enhancement is a conforming extension to the POSIX regular expression specification.
ECMAScript does not support locales. That's a reasonable (although a bit limiting) choice for a regular expression specification that uses only Unicode. Since the C++ regular expression proposal supports byte-sized character types, locales are necessary.
This is technically a non-conforming change to ECMAScript. For example,
it quietly changes the meaning of the regular expression "[[:d:]]
.
Under ECMAScript, that expression matches the full text "[]"
--
the bracket expression (which ends at the first ']') matches any of the
characters '['
, ':'
, 'd'
, and the final
']'
in the pattern matches the final ']'
in the
text. With this enhancement, the match fails. Of course, that regular expression
is a rather peculiar way to try to match those three characters
The usual way to write that match would be "[[:d]"
,
which becomes invalid with the enhancement. In practice this shouldn't
be a significant problem.
However, arbitrary named character classes for large character sets can
be expensive to use. For example, testing whether a Unicode character is a
letter requires large tables and several lookups. This is probably why ECMAScript's
escape sequences are defined only for specific sets of ASCII code points. This
keeps the lookup tables small. Java's regex
package also has
the usual POSIX named character classes, but they, too, only hold
ASCII code points, not all of the possible Unicode characters.
The Java package also provides named classes for the
various Unicode character classifications, which do need the large tables
and multiple lookups.
Since the regular expression proposal deals with arbitrary character types, it isn't feasible to restrict named character classes to ASCII values (which might not be the same characters under some locales). So the provision for unrestricted named classes in this proposal is reasonable.
Case insensitive matches need to use a locale, so calling
translate
is a reasonable requirement for such matches. The
broader requirement that every character match be done with translate
,
however, isn't so clearly useful, and can impose a significant
performance cost. For case sensitive comparisons, both ECMAScript and POSIX require
that the character values be the same. This, in turn, means that comparing long
sequences that contain only text can be done with memcmp
:
if (memcmp(cur, pat, nchrs))
// match failed
Using translate
in the comparison requires an explicit loop:
for (int i = 0; i < nchrs; ++i)
if (traits_inst.translate(cur[i], false) != traits_inst.translate(pat[i], false))
// match failed
I've suggested elsewhere that we change the interface to provide two versions
of translate
, one for case sensitive matches and one for case insensitive
matches. With that change, the default implementation for case sensitive matches becomes
simple return ch;
, which the compiler can easily inline. In effect, the
previous loop becomes:
for (int i = 0; i < nchrs; ++i)
if (cur[i] != pat[i])
// match failed
While that's a significant improvement over the previous version, it's not as good
as calling memcmp
. A compiler might recognize this pattern and generate
fast code, but that's not something users can rely on. And, of course, a non-trivial
version of translate
would be far slower.
The proposal says that translate
is needed to support various locale-specific
forms of canonical or compatibility equivalence. That's what
equivalence classes do. Equivalence classes are a
clear signal in the pattern text that a potentially expensive comparison will be made.
With a user-defined translate
there is no such signal.
The proposal says that remapping special characters allows users to
"use Han-ideographs rather that Latin punctuation
symbols in place of the usual ? * + regular expression operators."
That's certainly true, but it's not an obvious benefit. The problem is that
users can't look at a regular expression and know what it means. As an extreme example,
what is the result of matching the pattern
"[abcd](efg)"
against the target text
"abcdefg"
? With the usual regular expression operators,
the match is "defg"
, with a subexpression matching
"efg"
. But with a simple change to the function
syntax_type
, the match would be "abcde"
,
with a subexpression matching "abcd"
.
A regular expression is a statement in a programming language. Just as we do not support remapping C++ operators, we should not support remapping regular expression operators.
Under the principle that you don't pay for it if you don't use it, character equality should be based on code points, without an intervening function call, unless the user explicitly asks for slower but more flexible comparisons at the time the regular expression is compiled.
Under the princpile that code should say what it means, special characters should not be remapped.