Author: | Pavol Droba <droba@topmail.sk> |
---|---|
Number: | N2059=06-0129 |
Date: | 2006-09-08 |
This proposal introduces several string manipulating algorithms for addition into the C++ standard library.
String processing is a part of almost every non-trivial program. There is always a need to process some input, extract some information from a text or format a textual output. Although the range of operations that need to be performed with strings is almost unlimited, some of them are very common and they tend to be reinvented continuously by C++ programmers.
The C++ standard library supports string processing mainly by the basic_string class. In addition, the <algorithm> header contains several algorithms that might be applied in string processing tasks, i.e. search, replace, etc. Compared to other programming languages like perl, these options are very limited.
This paper tries to address this deficiency by introducing some of the often used algorithms into the standard library.
This proposal is based on the Boost String Algorithms Library. The library is in Boost since the version 1.32 and it is well accepted (See references).
The algorithms mentioned here are also present in some form in many general-purpose C++ libraries. Among others CString from MFC or QString from Qt. They are also standard operations in scripting languages like Perl, Python or Ruby.
The proposal is directly dependent on other two proposed extensions: Range Library (N2068) and rvalue reference proposal (N1770). No other changes in the core language or in existing parts of the standard library are required.
Both requirements can be refined/relaxed to be compatible with C++2003 version of the standard. Range library can be, with some restrictions, redefined to use the current standard. The Boost Range Library implementation is a direct proof the of this claim. Algorithms that use rvalue reference to perform move semantics can also be altered to do simple in-place mutation. The main functionality of algorithms will be still maintained [1].
[1] | This is actually how the Boost String Algorithm Library is implemented. Instead of version that takes a && parameter, there is a version that takes a & parameter and mutates the argument directly. |
String processing is a wide application area. Therefore there is also a large quantity of various related algorithms. Obviously, not all of them are suitable for a general purpose library like the C++ Standard Library. Our goal is to provide the essential algorithms that are not available in the standard by other or require nontrivial effort to implemenent.
In the search for the 'essential ones' we have considered some general purpose C++ libraries and standard libraries for mainstream languages comparable to the C++ in the area of application. Each of these representatives have one or more string classes, each with a set of operations bound to them.
Here is the list of representatives:
Scripting languages have naturally been strong in the string processing. We include two representatives in our comparison. We have chosen these two because they are object oriented and widely used.
Algorithms presented in these libraries can be divided into the following categories:
This category consists of algorithms that directly modify a string. Operations include construction, concatenation, insertion and erasure.
Representatives:
All of the mentioned libraries including STL basic_string implement this to a reasonable extend.
Representatives:
Strongly-typed languages offer methods that extract a substring between a specified indexes. Some libraries have also methods for simple extraction from the start or the end of a string. Scripting languages like ruby combine all this functionality into the operator[].
basic_string is limited to one extraction method. It can perform all of the required operations at the cost of slightly more verbose code.
Searching for a substring in a string and replacing it with something else is one of the fundamental string operations.
Representatives:
Libraries offer various levels of flexibility for these operations. The basic level allows to search for the exact string. An improvement is the ability to ignore the case. Most powerful is searching using an expression that describes the substring. Regular expressions are the de facto standard in the last category.
Current C++ standard is limited to the basic level of functionality. Fortunately, this state was radically changed by accepting the Regex Library into TR1.
Note: split and its variant can be also considered to be a part of this category.
Probably the most frequently used operation from this category is the case conversion.
Representatives:
It is present in all the mentioned libraries, except for STL. Being one of the most basic operation, its absence in the Standard Library is problematic at least. Therefore it is part of this proposal.
Other transformations found in the libraries are usually domain-specific and not universally accepted.
We will omit these two categories from the scope of this proposal. Both are rather complex and require more elaboration to give some reasonable results. Separate proposals should be considered for this domain.
Strings are often printed with spaces on either side to improve the readability on the screen. These spaces have usually no semantic value and must be removed prior to processing. This removal is called trimming. The converse operation, when spaces are added to the front or the end of string, is called padding.
Both operations can be found in most of the mentioned libraries.
Representatives:
This functionality is completely missing in the Standard Library, therefore it is part of this proposal.
Even if this category is quite broad there is a relatively small number of algorithms that are used frequently. We have focused on predicates that examine relationship between the two strings. Relation operators (<,=,>) that fall in this category are present in all libraries. Additionally, we have containment checks and advanced forms of comparison.
Representatives:
This area is not sufficiently covered in the Standard Library. The predicates presented here are therefore a useful addition.
Algorithms that do not fall to any of the above categories are usually domain-specific or too narrow-defined. Therefore they are not suitable for the context of this proposal.
While a 'string' is often represented by basic_string class, this is not the only representation. There are several alternatives that are widely used. Examples are FlexString from the Loki library or rope from the STLPort implementation of the C++ standard library. Algorithms presented here are therefore designed to work with generic structures that satisfy algorithm requirements, rather than with a concrete string class.
To accommodate this design, the Range Library (N2068) is used as the interface for accessing the input strings. The Range Library provides uniform access to a container structures. This allows virtually any container to be used with the algorithms presented here. Yet, the algorithms are designed to work efficiently with string-like structures. An implication is that the elements of a string (i.e. characters) are values that can be copied and moved around without additional overhead.
As mentioned earlier, this proposal is based on Boost String Algorithms Library. However, it includes only a subset of the algorithms contained in the Boost library. The part of the library that deals with find and replace functionality is intentionally omitted due to several reasons:
Therefore we have decided to exclude these algorithms from this proposal. Yet, it is possible that they might be a part of another proposal in the future.
There are a few design differences between the proposal and the Boost library as well;
The Range library proposal (N2068) does not contain direct support for handling of the character literals. This support was present in Boost Range Library, but it was deprecated due to the inconsistencies in usage of the built-in arrays. Character literals are typed as C-arrays of char (s) or wchar_t (s). The problematic isssue is the zero terminator at the end. For string processing, this terminator should not be a part of the range, on the other hand, generic array usage requires that the entire array is considered to be a part of range.
A hybrid solution that uses literal helpers is therefore suggested. The default behaviour of range is defined in the range proposal. Algorithms, specifically in the domain of string processing, can override this default by applying as_literal to the input arguments.
Please note, that there is no support planned for null terminated C-strings. Unlike literals, these have type char* or wchar_t*. Usage of plain C-strings is discouraged due to their inherently unsafe nature. A user can still convert C-string to a range prior to usage.
Both helpers convert a character literal to a range. The difference between the two utilities is in the way they handle Char[] literals. as_literal treats it as an ordinary null-terminated string, while as_array uses the array dimensions to determine the length.
In addition, they behave neutrally to the other ranges. That means they return the same range as the input (modulo range wrapping).
In the following example:
char str[]="Hello"; range<char*> lit=as_literal(str); range<char*> arr=as_array(str);
variable lit will delimit elements 'H','e','l','l','o'. arr, on the other end will point to elements 'H','e','l','l','o',0 including the terminating zero.
as_literal constructs a range that delimits the range specified as the parameter. Char[] (s) are considered to be null-terminated strings. strlen is used to determine the length of the string. Other range type are passed unmodified.
Synopsis:
template<typename Range> range<typename range_iterator<Range>::type> as_literal(const Range& r); template<typename Char, size_t sz> range<Char*> as_literal(Char (&arr)[sz])
Effect: Construct a range that delimits the input string.
as_array constructs a range that delimits the range specified as parameter. Char[] literals are considered to be C-arrays. The array dimension is used to determine the length of the string. Other range types are passed unmodified.
Note, this is actually the default behaviour for the range library.
Synopsis:
template<typename Range> range<typename range_iterator<Range>::type> as_array(const Range& r); template<typename Char, size_t sz> range<Char*> as_array(Char (&arr)[sz])
Requires: Range must satisfy the range concept requirements (Range library proposal)
Effect: Constructs a range that delimits the input string.
An algorithm that wants to allow a user to pass a character literal in its parameter needs to use as_literal to convert the input parameter.
Example:
template<typename Range> bool AlgoImpl(const Range& r); template<typename Range> bool Algo(const Range& r) { return AlgoImpl(as_literal(r)); }
Alternatively, if a user requires to pass an argument as an array rather then literal, she can override the as_literal conversion in the algorithm by using as_array.
Example:
char arr[6]={ 'h', 'e', 'l', 'l', 'o' }; Algo(as_array(arr));
Unless stated otherwise, types of arguments passed to the template functions must satisfy the requirements defined by the concept of the same name as the corresponding template argument. The used concepts are either defined in the Standard Library Reference or they are part of the Range Library Proposal (N2068).
In addition, all parameters of Range type shall be converted using as_literal helper as described above.
These algorithms perform case-conversion on all characters in the string.
Synopsis:
template<typename CopyableRange> CopyableRange to_lower(const CopyableRange& s, const locale& loc=locale());
Returns: A modification of s such that each element x == tolower(y, loc) where y is a corresponding element in the original s. [2]
template<typename CopyableRange> CopyableRange&& to_lower(CopyableRange&& s, const locale& loc=locale());
Returns: to_lower( CopyableRange(s), loc)
Synopsis:
template<typename CopyableRange> CopyableRange to_upper(const CopyableRange& s, const locale& loc=locale());
Returns: A modification of s such that each element x == toupper(y, loc) where y is a corresponding element in the original s. [2]
template<typename Sequence> CopyableRange&& to_upper(CopyableRange&& s, const locale& loc=locale());
Returns: to_upper( CopyableRange(s), loc)
[2] | (1, 2) An implicit requirement is implied from using toupper and tolower. ctype must be specialised for range_value<CopyableRange>::type and it must be imbued in the specified locales. |
These algorithms perform containment checks of various types and the string comparison.
Synopsis:
template<typename Range1, typename Range2> bool starts_with(const Range1& r1, const Range2& r2); template<typename Range1, typename Range2, typename Pred> bool starts_with(const Range1& r1, const Range2& r2, Pred pred); template<typename Range1, typename Range2> bool istarts_with(const Range1& r1, const Range2& r2, const locale& loc=locale());
The predicate checks if r2 is a prefix of r1. In other words, it holds when size(r1)>=size(r2) and each element from r2 is equal to corresponding element in r1, beginning from the start of r1.
The first form uses operator== for comparison, second one uses the given predicate and the last one is comparing using the expression tolower(x, loc)==tolower(y, loc)
Example: Function IsURL() checks if the path denotes an url.
bool IsURL(const string& strPath) { return istarts_with(strPath, "http://") || istarts_with(strPath, "https://") || istarts_with(strPath, "ftp://"); }
Synopsis:
template<typename Range1, typename Range2> bool ends_with(const Range1& r1, const Range2& r2); template<typename Range1, typename Range2, typename Pred> bool ends_with(const Range1& r1, const Range2& r2, Pred pred); template<typename Range1, typename Range2> bool iends_with(const Range1& r1, const Range2& r2, const locale& loc=locale());
Example: Function IsExecutable() checks if the extension of the file name denotes an executable.
bool IsExecutable(const string& strPath) { return iends_with(strPath, ".com") || iends_with(strPath, ".exe") || iends_with(strPath, ".bat"); }
Synopsis:
template<typename Range1, typename Range2> bool contains(const Range1& r1, const Range2& r2); template<typename Range1, typename Range2, typename Pred> bool contains(const Range1& r1, const Range2& r2, Pred pred); template<typename Range1, typename Range2> bool icontains(const Range1& r1, const Range2& r2, const locale& loc=locale());
The predicate checks if r2 is contained within r1. It returns true if size(r1)>=size(r2) and there exists an index n such that r2[i]==r1[i+n] for all indexes i in r1,
The first form uses operator== for comparison, second one uses the given predicate and the last one is comparing using expression tolower(x, loc)==tolower(y, loc)
The equals algorithm is actually a two-way version of equal available in the <algorithm> header. equal does not check if both operands have the same length. Moreover, it requires that the second argument is at least as long as the first one. Algorithm presented here fills this gap. In addition, case-insensitive comparison is provided.
Synopsis:
template<typename Range1, typename Range2> bool equals(const Range1& r1, const Range2& r2); template<typename Range1, typename Range2, typename Pred> bool equals(const Range1& r1, const Range2& r2, Pred pred); template<typename Range1, typename Range2> bool iequals(const Range1& r1, const Range2& r2, const locale& loc=locale);
The predicate checks if r1 and r2 are equal element wise. It returns true size(r1)==size(r2) and each element r1[n] from r1 is equal to corresponding elements r2[n] from r2.
First form uses operator== for comparison, second one uses the given predicate and the last one is comparing using expression tolower(x, loc)==tolower(y, loc)
This algorithm has functionality equivalent to lexicographical_compare algorithm already present in the standard library. It provides a simpler, range-oriented interface and a case-insensitive version.
Synopsis:
template<typename Range1, typename Range2> bool lexicographical_compare(const Range1& r1, const Range2& r2); template<typename Range1, typename Range2, typename Pred> bool lexicographical_compare(const Range1& r1, const Range2& r2, Pred pred); template<typename Range1, typename Range2> bool ilexicographical_compare(const Range1& r1, const Range2& r2, const locale& loc=locale());
Returns: true if r1 is lexicographically less than r2, false otherwise.
Synopsis:
template<typename Range, typename Pred> bool all(const Range& r1, Pred pred);
Returns: Returns true if for each element x, pred(x)==true, false otherwise.
Synopsis:
template<typename Range, typename Pred> bool none(const Range& r1, Pred pred);
Returns: Returns true if for each element x, pred(x)==false, false otherwise.
[3] | (1, 2) The names all and none might be considered too general. Yet they describe the algorithms very well and we haven't found a better variant so far. If this is considered a problem, alternative names might be used. |
Trimming algorithms remove unnecessary spaces from the beginning and the end of a string. [5]
Synopsis:
template<typename CopyableRange> CopyableRange trim_left(const CopyableRange& s, const locale& loc=locale()); template<typename CopyableRange, typename Pred> CopyableRange trim_left_if(const Sequence& s, Pred pred);
Return: A copy of the input range where all leading spaces (characters satisfying pred) have been removed.
template<typename Sequence> Sequence&& trim_left(Sequence&& s, const locale& loc=locale()); template<typename Sequence> Sequence&& trim_left_if(Sequence&& s, Pred pred);
Return: A modified version of s where all leading spaces (characters satisfying pred) have been removed.
Synopsis:
template<typename CopyableRange> CopyableRange trim_right(const CopyableRange& s, const locale& loc=locale); template<typename CopyableRange, typename Pred> CopyableRange trim_right_if(const Sequence& s, Pred pred);
Returns: A copy of the input range where all trailing spaces (characters satisfying pred) have been removed.
template<typename Sequence> Sequence&& trim_right(Sequence&& s, const locale& loc=locale); template<typename Sequence> Sequence&& trim_right_if(Sequence&& s, Pred pred);
Returns: A modified version of s where all trailing spaces (characters satisfying pred) have been removed.
Synopsis:
template<typename CopyableRange> CopyableRange trim(const CopyableRange& s, const locale& loc=locale()); template<typename CopyableRange, typename Pred> CopyableRange trim_if(const Sequence& s, Pred pred);
template<typename Sequence> Sequence&& trim(Sequence&& s, const locale& loc=locale()); template<typename Sequence> Sequence&& trim_if(Sequence&& s, Pred pred);
[5] | The _if suffix has to be added to the variants that have the predicate argument, to avoid name collisions. Since the predicate is a generic template argument, it cannot be distinguished from loc argument during the template instantiation stage. |
[6] | (1, 2) Alternative names that might be considered are trim_start and trim_end. This would be more consistent with starts_with and ends_with, however 'TrimLeft/Right' is more often used term in currently existing libraries. |
As the opposite to trimming, padding algorithms add a certain number of spaces to the one or both ends of the string, so that the string has the specified length.
All algorithms presented here can fill the gap with spaces or using the specified pattern. The pattern is used as follows: To fill the gap of n elements, pattern is repeated so that a sequence with at least n elements is created. After that, beginning of this sequence is used to fill the gap.
template<typename Sequence> Sequence pad_left(const Sequence& s, unsigned int len); template<typename Sequence, typename Range> Sequence pad_left(const Sequence& s, unsigned int len, const Range& pattern);
template<typename Sequence> Sequence&& pad_left(Sequence&& s, unsigned int len); template<typename Sequence> Sequence&& pad_left(Sequence&& s, unsigned int len, const Range& pattern);
template<typename Sequence> Sequence pad_right(const Sequence& s, unsigned int len); template<typename Sequence, typename Range> Sequence pad_right(const Sequence& s, unsigned int len, const Range& pattern);
template<typename Sequence> Sequence&& pad_right(Sequence&& s, unsigned int len); template<typename Sequence> Sequence&& pad_right(Sequence&& s, unsigned int len, const Range& pattern);
template<typename Sequence> Sequence pad(const Sequence& s, unsigned int len); template<typename Sequence, typename Range> Sequence pad(const Sequence& s, unsigned int len, const Range& pattern);
template<typename Sequence> Sequence&& pad(Sequence&& s, unsigned int len); template<typename Sequence> Sequence&& pad(Sequence&& s, unsigned int len, const Range& pattern);
[7] | (1, 2) The filling characters are distributed evenly to the front and the back. If the difference is of odd length, the front gap is larger by one character. Both gaps are filled independently as if pad_left and pad_right were applied in sequence. |
The algorithms should be placed into the <stringalgorithm> header. Although <algorithm> header might be also considered as an option, we have chosen to use a separate header due to dependency on the locales.
Since the scope of literal helpers is larger than this proposal, they shall be placed together with the Range Library.
Even if this issue is abandoned in the proposal, it could be considered. To support char* and wchar_t* null-terminated strings, the following extensions must be incorporated: