Author: | Pavol Droba <droba@topmail.sk> |
---|---|
Co-Author: | Thorsten Ottosen <nesotto@cs.aau.dk> |
Number: | N1872=05-0132 |
Date: | 2005-08-26 |
This proposal introduces several string manipulating algorithms for addition into the C++ standard library. Additionaly the paper also proposes a set of predicate classes that might be used together with the algorithms mentioned here.
String processing is a part of almost every non-trivial program. There is always a need to process some input, extract some information from a text or format a textual output. Although the range of operations that need to be performed with strings is almost unlimited, some of them are very common. Many of these tend to be reinvented continuously by C++ programmers.
The C++ standard library supports string processing mainly by the basic_string class. In addition the <algorithm> header contains several algorithms that might by be applied in the string processing tasks, i.e. search, replace, etc. Compared to other programming languages like perl, the options are very limited.
This paper tries to fill this deficiency by introducing some of the often used algorithms into the standard library.
The proposal is based on the Boost String Algorithms Library. The library is in Boost since the version 1.32 and it is well accepted (See references).
Algorithms mentioned here are also present in some form in many general-purpose C++ libraries. Among others CString from MFC or QString from Qt. They are also standard operations in scripting languages like Perl, Python or Ruby.
The proposal is directly dependant on other two proposed extensions. Range Library (N1871) and rvalue reference proposal (N1770). No other changes in the core language or in the existing parts of the standard library are required.
While a 'string' if often represented by basic_string class, this is not the only representation. There are several alternatives that are widely used. Examples are FlexiString from the Loki library or rope from the STLPort implementation of the C++ standard library. Algorithms presented here are therefore designed to work with generic structures that satisfy algorithm requirements, rather than with a concrete string class.
To accommodate this design, the Range Library (N1871) is used as an interface for accessing the input strings. The Range Library provides a uniform access to a container structures. This allows virtually any container to be used with the algorithms presented here. Yet the algorithms are designed to work efficiently with string-like structures. An implication is that the elements of a string (a.k.a characters) are values that can be copied and moved around without additional overhead.
As mentioned earlier, this proposal is based on Boost String Algorithms Library. However it includes only a subset of the algorithms contained in the Boost library. Part of the library that deals with find and replace functionality is intentionally omitted due to several reasons:
Therefore we have decided to exclude these algorithms from this proposal. Yet it is possible that they might be a part of another proposal in the future.
There are few design differences between the proposal and the Boost library as well.
In the following n is an unsigned integer, s[n] denotes n-th element of the sequence s, v1 and v2 are variables of type range_value<Range>::type or elements of range r1 or r2 respectively. size(s) evaluates to the number of elements in s and finally result denotes the value returned by the algorithm.
These algorithms perform case-conversion on all characters in the string.
Synopsis:
template<typename CopyableRange> CopyableRange to_lower(const CopyableRange& s, const locale& loc=locale()); template<typename CopyableRange> CopyableRange&& to_lower(CopyableRange&& s, const locale& loc=locale());
Requires: CopyableRange satisfies the requirements of the copyable range (N1871) [1].
Effects: Converts elements in the input sequence to lower-case using the specified locales.
Returns: Modified copy of the input
Postcondition: for each n in [0, size(s)) result[n] == tolower(s[n], loc)
Synopsis:
template<typename CopyableRange> CopyableRange to_upper(const CopyableRange& s, const locale& loc=locale()); template<typename Sequence> CopyableRange&& to_upper(CopyableRange&& s, const locale& loc=locale());
Requires: CopyableRange satisfies the requirements of the copyable range (N1871) [1].
Effects: Converts elements in the input sequence to upper-case.
Returns: Modified copy of the input.
Postcondition: for each n in CopyableRange [0, size(s)) result[n] == toupper(s[n], loc)
[1] | (1, 2) An implicit requirement is implied from using tolower. ctype must be specialized for range_value<CopyableRange>::type and it must be imbued in the specified locales. |
These algorithms perform containment checks of various types and string comparison.
Synopsis:
template<typename Range1, typename Range2, typename Pred> bool starts_with(const Range1& r1, const Range2& r2, Pred pred); template<typename Range1, typename Range2> bool starts_with(const Range1& r1, const Range2& r2); template<typename Range1, typename Range2, typename Pred> bool istarts_with(const Range1& r1, const Range2& r2, const locale& loc=locale());
Effects: Checks whether r2 is a prefix of r1 (Form 3 ignoring the case)
Example: Function IsURL() checks if the path denotes an url.
bool IsURL(const string& strPath) { return istarts_with(strPath, "http://") || istarts_with(strPath, "https://") || istarts_with(strPath, "ftp://"); }
Synopsis:
template<typename Range1, typename Range2, typename Pred> bool ends_with(const Range1& r1, const Range2& r2, Pred pred); template<typename Range1, typename Range2> bool ends_with(const Range1& r1, const Range2& r2); template<typename Range1, typename Range2, typename Pred> bool iends_with(const Range1& r1, const Range2& r2, const locale& loc=locale());
Effects: Checks whether r2 is a suffix of r1 (Form 3 ignoring the case)
Example: Function IsExecutable() checks if the extension of the file name denotes an executable.
bool IsExecutable(const string& strPath) { return iends_with(strPath, ".com") || iends_with(strPath, ".exe") || iends_with(strPath, ".bat"); }
Synopsis:
template<typename Range1, typename Range2, typename Pred> bool contains(const Range1& r1, const Range2& r2, Pred pred); template<typename Range1, typename Range2> bool contains(const Range1& r1, const Range2& r2); template<typename Range1, typename Range2, typename Pred> bool icontains(const Range1& r1, const Range2& r2, const locale& loc=locale());
Effects: Checks whether r2 is a substring of r1 (Form 3 ignoring the case)
Equals algorithm is actually a two-way version of equal available in <algorithm> header. equal does not check if both operands have the same length. Moreover, it requires that the second argument is at least as long as the first one. Algorithm presented here fills this gap. In addition, case-insensitive comparison is provided.
Synopsis:
template<typename Range1, typename Range2, typename Pred> bool equals(const Range1& r1, const Range2& r2, Pred pred); template<typename Range1, typename Range2> bool equals(const Range1& r1, const Range2& r2); template<typename Range1, typename Range2, typename Pred> bool iequals(const Range1& r1, const Range2& r2, const locale& loc=locale);
Effects: Checks whether r1 and r2 are element-wise equal (Form 3 ignoring the case)
This algorithm has functionality equivalent to lexicographical_compare algorithm already present in the standard library. It provides simpler, range-oriented interface and case-insensitive version.
Synopsis:
template<typename Range1, typename Range2, typename Pred> bool lexicographical_compare(const Range1& r1, const Range2& r2, Pred pred); template<typename Range1, typename Range2> bool lexicographical_compare(const Range1& r1, const Range2& r2); template<typename Range1, typename Range2, typename Pred> bool ilexicographical_compare(const Range1& r1, const Range2& r2, const locale& loc=locale());
Effects: Performs the lexicographical comparison.
Returns: true if r1 is lexicographically less then r2, false otherwise.
Synopsis:
template<typename Range, typename Pred> bool all(const Range& r1, Pred pred);
Requires: Range satisfies the requirements of the range concept (N1871).
Effects: Check if all the elements of the range satisfy the condition of the predicate pred.
Returns: true if for each n<size(r1) pred(r1[n]), false otherwise
Example: Check if all characters in the string are lower case. [4]
if(all(str, is_lower())) { // All characters are lower case }
Synopsis:
template<typename Range, typename Pred> bool none(const Range& r1, Pred pred);
Requires: Range satisfies the requirements of the range concept (N1871).
Effects: Check if none of the elements in the range satisfy the condition of the predicate pred.
Returns: true if for each n<size(r1) !pred(r1[n]), false otherwise
[2] | (1, 2, 3, 4, 5) An implicit requirement is implied from using tolower. ctype must be specialized for range_value<Range>::type and it must be imbued in the specified locales. |
[3] | (1, 2) The names all and none might be considered too general. Yet they describe the algorithms very well and we haven't found a better variant so far. If this will be considered a problem, alternative names might be used. |
[4] | is_lower predicate is presented later in this proposal. |
Trimming algorithms remove unnecessary spaces from the beginning and the end of a string.
Synopsis:
template<typename Sequence> Sequence trim_left(const Sequence& s, const locale& loc=locale()); template<typename Sequence> Sequence&& trim_left(Sequence&& s, const locale& loc=locale());
template<typename Sequence, typename Pred> Sequence trim_left_if(const Sequence& s, Pred pred); template<typename Sequence> Sequence&& trim_left_if(Sequence&& s, Pred pred);
Effects: Removes all leading spaces (or characters satisfying pred) from the beginning of the sequence
Returns: Modified copy of the input
Synopsis:
template<typename Sequence> Sequence trim_right(const Sequence& s, const locale& loc=locale); template<typename Sequence> Sequence&& trim_right(Sequence&& s, const locale& loc=locale);
template<typename Sequence, typename Pred> Sequence trim_right_if(const Sequence& s, Pred pred); template<typename Sequence> Sequence&& trim_right_if(Sequence&& s, Pred pred);
Effects: Removes all trailing spaces (or characters satisfying pred) from the end of the sequence
Returns: Modified copy of the input
Synopsis:
template<typename Sequence> Sequence trim(const Sequence& s, const locale& loc=locale()); template<typename Sequence> Sequence&& trim(Sequence&& s, const locale& loc=locale());
template<typename Sequence, typename Pred> Sequence trim_if(const Sequence& s, Pred pred); template<typename Sequence> Sequence&& trim_if(Sequence&& s, Pred pred);
Returns: Modified copy of the input
[5] | (1, 2) Alternative names that might be considered are trim_start and trim_end. This would be more consistent with starts_with and end_withs, however 'TrimLeft/Right' is more often used term in currently existing libraries. |
[6] | (1, 2, 3) An implicit requirement is implied from using tolower. ctype must be specialized for range_value<CopyableRange>::type and it must be imbued in the specified locales. |
Two categories of predicates are given.
It is noteworthy that the predicates are not staticaly bound to a character type that is to be processed, rather the operation it templated. This approach simplifies the usage of predicates, since the types are automaticaly infered from the input without the need to be explicitly specified by a user.
template<typename Derived> struct predicate_facade {};
The purpose of the predicate_facade class is to tag classification predicates, so that they can be used together with the predicate combinators. Without such a facility it would not be possible to avoid ambiguities with other logical operators.
Every predicate should be therefore derived from predicate_facade.
Synopsis:
class is_classified_pred : public predicate_facade<is_classified_pred> { public: // Constructor is_classified_pred(ctype_base::mask Type, locale const & Loc = locale()); // Operation template<typename Char> bool operator()(Char Ch) const; }; is_classified_pred is_classified(ctype_base::mask Type, const locale& Loc=locale()); is_classified_pred is_space(const locale& Loc=locale()); is_classified_pred is_alnum(const locale& Loc=locale()); is_classified_pred is_alpha(const locale& Loc=locale()); is_classified_pred is_cntrl(const locale& Loc=locale()); is_classified_pred is_digit(const locale& Loc=locale()); is_classified_pred is_graph(const locale& Loc=locale()); is_classified_pred is_lower(const locale& Loc=locale()); is_classified_pred is_print(const locale& Loc=locale()); is_classified_pred is_punct(const locale& Loc=locale()); is_classified_pred is_upper(const locale& Loc=locale()) is_classified_pred is_xdigit(const locale& Loc=locale());
Class is_classified_pred is a predicate that checks whether a character belongs to a character class specified in constructor. Character classes are defined in the ctype_base class. The given locales are used to perform the actual classification.
The functions is_* are the preferred form of construction for the is_classified_pred class.
Synopsis:
template<typename Char> class is_any_of_pred : public predicate_facade<is_any_of_pred<Char> > { public: // Constructor template<typename Range> is_any_of_pred(const Range& r); // Operation template<typename Char2> bool operator()(Char2 c) const; }; template<typename Range> is_any_of_pred<typename range_value<Range>::type > is_any_of(const Range& r);
is_any_of_pred predicate checks whether a character is member of a given set. A range specified at the construction is used to initialize the control set.
is_any_of constructs the is_from_range_pred predicate.
Synopsis:
template<typename Char> class is_from_range_pred : public predicate_facade<is_from_range_pred<Char> > { public: // Constructor is_from_range_pred(Char from, Char to); // Operation template<typename Char2> bool operator()(Char2 c) const; }; template<typename Char> is_from_range_pred<Char> is_from_range(Char from, Char to);
is_from_range_pred checks whether a character belongs to a specified range of characters. In other words, it checks whether from <= c <= to holds.
Precondition: from <= to
Combinators allow to create an expression consisting of several classification predicates.
Example:
is_space() || is_any_of(as_literal("\"_"));
The result of this expression is a predicate that recognizes spaces, quotes and underscores.
Synopsis:
template<typename Pred1, typename Pred2> class and_pred : public predicate_facade< and_pred<Pred1,Pred2> > { public: // Constructor and_pred(Pred1 p1, Pred2 p2); // Operation template<typename Char> bool operator()(Char c) const; }; template<typename Pred1, typename Pred2> and_pred<Pred1, Pred2> operator&&(Pred1 p1, Pred2 p2);
Constructs a predicate that is a logical and of predicates p1 and p2.
Synopsis:
template<typename Pred1, typename Pred2> class or_pred : public predicate_facade<or_pred<Pred1,Pred2> > { public: // Constructor or_pred(Pred1 p1, Pred2 p2); // Operation template<typename Char> bool operator()(Char c) const; }; template<typename Pred1, typename Pred2> or_pred<Pred1, Pred2> operator||(Pred1 p1, Pred2 p2);
Constructs a predicate that is a logical or of predicates p1 and p2.
Synopsis:
template<typename Pred> class not_pred : public predicate_facade<not_pred<Pred> > { public: // Constructor not_pred(Pred p); // Operation template<typename Char> bool operator()(Char c) const; }; template<typename Pred> not_pred<Pred> operator!(Pred p);
Constructs a predicate that is a logical negation of predicate p.
The comparison predicates are primarily designed to be used by algorithms that perform character-level comparison during their computation [7].
Unlike the functional objects that are already present in the standard library, these predicates have a templated function operator. Therefore they are easier to use since it is not required to explicitly specify the argument type in the construction. Rather, C++ type matching is used.
So instead of writing:
string s1, s2; ... equals(s1, s2, equal_to<char>());
a user can write
equals(s1, s2, is_equal());
The difference becomes even more apparent when the input types are more complex.
In addition, the old function objects do not directly support comparison between different argument types (unless they are implicitly convertible to each other).
The proposed predicates do not support type identification facilities provided by unary_function and binary_function. This is primarily because it is not possible to support it with a dynamic operation type; secondarily, these facilities (unary_function, etc.) becomes obsolete when auto and decltype will become a part of standard.
Synopsis:
class is_equal { public: // Operator template<typename T1, typename T2> bool operator ()(const T1& arg1, const T2& arg2) const };
Effects: operator() returns arg1==arg2
Synopsis:
class is_iequal { public: // Constructor is_iequal(const locale& loc=locale()) // Operator template<typename T1, typename T2> bool operator ()(const T1& Arg1, const T2& Arg2) const };
Effects: operator() returns tolower(arg1,loc)==tolower(arg2,loc)
Synopsis:
class is_less { public: // Operator template<typename T1, typename T2> bool operator ()(const T1& arg1, const T2& arg2) const };
Effects: operator() returns arg1<arg2
Synopsis:
class is_iless { public: // Constructor is_iless(const locale& loc=locale()) // Operator template<typename T1, typename T2> bool operator ()(const T1& Arg1, const T2& Arg2) const };
Effects: operator() returns tolower(arg1,loc)<tolower(arg2,loc)
Synopsis:
class is_not_greater { public: // Operator template<typename T1, typename T2> bool operator ()(const T1& Arg1, const T2& Arg2) const };
Effects: operator() returns arg1<=arg2
Synopsis:
class is_not_igreater { public: // Constructor is_not_igreater(const locale& loc=locale()) // Operator template<typename T1, typename T2> bool operator ()(const T1& Arg1, const T2& Arg2) const };
Effects: operator() returns tolower(arg1,loc)<=tolower(arg2,loc)
[7] | In this proposal we suggested to add only operators for =, < and <=. These operators should be sufficient for the purposes they are targeted. Reverse operations can be achieved by simply revering the arguments in the algorithm call. Yet it is easy to add the negation operator or simply complete the set if such a requirement arises. |
The Range library does not have support for character literals and c-style strings. Two utilities are provided to allow these to be used with string algorithms.
The difference between the two utilities is in the way they handle Char[] literals. as_literal treats it as an ordinary null-terminated string, while as_array uses the array dimensions to determine the length.
In the following example:
char str[]="Hello"; range<char*> lit=as_literal(str); range<char*> arr=as_array(str);
variable lit will delimit elements 'h','e','l','l','o'. arr on the other end will point to elements 'h','e','l','l','o',0 including the terminating zero.
To pass a literal to an algorithm, one of the adapters must be used
if( starts_with(str, as_literal("start") ) { // Do something }
as_literal constructs a range that delimits the range specified as parameter. Char* and Char[] are considered to be null-terminate strings. strlen is used to determine the length of the string. Other range type are passed unmodified.
Synopsis:
template<typename Range> range<typename range_value<Range>::type> as_literal(const Range& r); template<typename Char> range<Char*> as_literal(Char* pch); template<typename Char, size_t sz> range<Char*> as_literal(Char (&arr)[sz])
Effect: Construct a range that delimits the input string.
as_array constructs a range that delimits the range specified as parameter. Char[] literals are considered to be C-arrays. The array dimension is used to determine the length of the string. Other range types are passed unmodified.
Synopsis:
template<typename Range> range<typename range_value<Range>::type> as_array(const Range& r); template<typename Char, size_t sz> range<Char*> as_array(Char (&arr)[sz])
Requires: Range must satisfy the range concept requirements (Range library proposal)
Effect: Constructs a range that delimits the input string.
This issue is rather philosophical, but it might have an impact on several issues discussed later. The proposal declares that it is targeted mainly for string processing. Algorithms are designed with this as a primary goal in mind. However some of them can be also used outside of this scope.
So the question is if the string affinity should be strongly declared or if on the other hand, the usage outside of this scope should be encouraged.
This proposal does not explicitly state the header files where the algorithms should be located. There are several possible options.
Distributed approach
No new headers will be created. Rather, the entities from this proposal will be distributed among the existing headers. Algorithms will go into <algorithm> header, predicates to <functional>.
This approach is a good candidate if string affinity (discussed in as the first issue) is not strongly declared.
Monolithic approach
New header will be created, named <stringalgo> that will encompass all the entities presented here. Optionally the entities can be added to <string> header.
Again, this approach is the natural choice if the string affinity of the algorithms is strengthened.
During the writing of this proposal, the Range Library proposal had not been finished. There are some issues that arise due to the interaction with this library. These issues will have to be solved before adopting these two proposals together.
Probably the most problematic is the support for character literals. There are two different views on the literal:
The compiler treats literals as C-arrays with the length that includes the terminating zero. In contrast, when using 'null-terminated string' view, zero at the end is only a terminator and is not considered to be a part of the string. Unfortunately, there is no facility in the language that can be used to specify this intention. Therefor the user must know what the algorithm expects and provide the input in the required form.
Prior versions of Boost Range Library contain a special handling for character literal. char[] and wchar_t[] were treated as null-terminated strings. There was also support for char* and wchar_t*. Later this special handling was considered dangerous, since the other C-array types were handled differently (using the array size for determining the boundaries). Current proposal supports only C-arrays and it does so uniformly.
However, this is not natural for string processing. For example contains(str, "hello") will look for 'h','e','l','l','o',0 rather then for "hello".
We suggest two possible solutions: explicit and implicit.
As the name suggests, all literal handling should be explicit. When using a literal as_literal of as_array will have to be used. To avoid problems arising from unexpected behaviour, range library should be adapted to disallow implicit usage of C-arrays.
For example:
starts_with(str, as_literal("hello")); char arr[]={'b','y','e'}; starts_with(str, as_array(arr));
The advantage of this approach is clarity and correctness. It is not possible to confuse a C-array and a literal, because the true intention must be specified explicitly. Passing an argument directly will generate a compiler error.
Disadvantages are mainly of cosmetic nature. Code becomes more verbose. Using a manipulator at every place might become tedious.
In this solution, Range Library would support C-arrays directly. However, algorithms will be able to provide support for literals if appropriate. as_literal is designed in such a way that it processes only null-terminated strings and C-arrays. For all other range types it is transparent, acting as an identity function. Therefore it is possible to write an algorithm as follows:
template<typename Range> bool AlgoImpl(const Range& r); template<typename Range> bool Algo(const Range& r) { return AlgoImpl(as_literal(r)); }
This simple construct extends the set of accepted types by character literals, even when there is no direct support for them in the Range Library. In addition null-terminated strings (char* and wchar_t* and character based C-arrays will be treated as C-strings.
When this handling is inappropriate, user can explicitly cast the input to array using as_array construct:
char arr[]={'b','y','e'}; Algo(as_array(arr));
as_array will convert the input to range. Because that is not a type that as_literal directly processes, it will be passed through unchanged.
This way, the algorithms will need to be clearly marked, as to what types of arguments they accept and how they process them in the documentation. String affinity of algorithms will be strengthened.
The advantage of this approach lies mainly in the simplification of code from the user perspective. In majority of cases the user will not be forced to make explicit specification, implicit handling will work just fine.
On the other hand, there remains the possibility for errors. Different algorithms will use the same arguments in different ways. Such behaviour might be confusing.
The choice of algorithms is not based on some rigorous study, rather it comes from programming experience and various discussions. Even if we believe that the selection covers most essential ones, there is always place for improvements and possible additions.
This topic is therefore open for discussion.