Proposal for new string algorithms in C++0x

Author:	Pavol Droba <droba@topmail.sk>
Co-Author:	Thorsten Ottosen <nesotto@cs.aau.dk>
Number:	N1872=05-0132
Date:	2005-08-26

Table of Contents

1 Abstract
2 Motivation
- 2.1 Existing practice
3 Impact on the standard
- 3.1 Design issues
  - 3.1.1 String representation
  - 3.1.2 Differences from the Boost String Algorithms Library
4 Algorithms
- 4.1 Case Conversion
  - 4.1.1 to_lower
  - 4.1.2 to_upper
- 4.2 Predicates
  - 4.2.1 starts_with
  - 4.2.2 ends_with
  - 4.2.3 contains
  - 4.2.4 equals
  - 4.2.5 lexicographical_compare
  - 4.2.6 all
  - 4.2.7 none
- 4.3 Trimming
  - 4.3.1 trim_left
  - 4.3.2 trim_right
  - 4.3.3 trim
5 Predicates
- 5.1 Classification predicates
  - 5.1.1 Predicate facade
  - 5.1.2 is_classified
  - 5.1.3 is_any_of
  - 5.1.4 is_from_range
- 5.2 Combinators
  - 5.2.1 and
  - 5.2.2 or
  - 5.2.3 not
- 5.3 Comparison predicates
  - 5.3.1 is_equal
  - 5.3.2 is_iequal
  - 5.3.3 is_less
  - 5.3.4 is_iless
  - 5.3.5 is_not_greater
  - 5.3.6 is_not_igreater
6 Literal helpers
- 6.1 as_literal
- 6.2 as_array
7 Discussion and open issues
8 References

1 Abstract

This proposal introduces several string manipulating algorithms for addition into the C++ standard library. Additionaly the paper also proposes a set of predicate classes that might be used together with the algorithms mentioned here.

2 Motivation

String processing is a part of almost every non-trivial program. There is always a need to process some input, extract some information from a text or format a textual output. Although the range of operations that need to be performed with strings is almost unlimited, some of them are very common. Many of these tend to be reinvented continuously by C++ programmers.

The C++ standard library supports string processing mainly by the basic_string class. In addition the <algorithm> header contains several algorithms that might by be applied in the string processing tasks, i.e. search, replace, etc. Compared to other programming languages like perl, the options are very limited.

This paper tries to fill this deficiency by introducing some of the often used algorithms into the standard library.

2.1 Existing practice

The proposal is based on the Boost String Algorithms Library. The library is in Boost since the version 1.32 and it is well accepted (See references).

Algorithms mentioned here are also present in some form in many general-purpose C++ libraries. Among others CString from MFC or QString from Qt. They are also standard operations in scripting languages like Perl, Python or Ruby.

3 Impact on the standard

The proposal is directly dependant on other two proposed extensions. Range Library (N1871) and rvalue reference proposal (N1770). No other changes in the core language or in the existing parts of the standard library are required.

3.1 Design issues

3.1.1 String representation

While a 'string' if often represented by basic_string class, this is not the only representation. There are several alternatives that are widely used. Examples are FlexiString from the Loki library or rope from the STLPort implementation of the C++ standard library. Algorithms presented here are therefore designed to work with generic structures that satisfy algorithm requirements, rather than with a concrete string class.

To accommodate this design, the Range Library (N1871) is used as an interface for accessing the input strings. The Range Library provides a uniform access to a container structures. This allows virtually any container to be used with the algorithms presented here. Yet the algorithms are designed to work efficiently with string-like structures. An implication is that the elements of a string (a.k.a characters) are values that can be copied and moved around without additional overhead.

3.1.2 Differences from the Boost String Algorithms Library

As mentioned earlier, this proposal is based on Boost String Algorithms Library. However it includes only a subset of the algorithms contained in the Boost library. Part of the library that deals with find and replace functionality is intentionally omitted due to several reasons:

This functionality is quite complex and requires several facilities that should be considered separately.
There are several volatile topics that have not been satisfactory solved, therefore they are not ready for standardization.
There is a possible overlap with Regex proposal that is part of TR1.

Therefore we have decided to exclude these algorithms from this proposal. Yet it is possible that they might be a part of another proposal in the future.

There are few design differences between the proposal and the Boost library as well.

Mutable variants of algorithms were removed in favor of move semantics (using constructs form the proposal N1770). Now the algorithms usually come in two flavors, a copy (using const &) and a move (using &&) variant. We consider this to be another evolutionary step toward functional-oriented programming in C++. During the review of Boost String Algorithms Library this topic was intensively debated. A strong opinion was expressed in favor of functional approach, however without the move semantics support, it is not possible to implement some of the algorithms in an efficient manner. Therefore mutable version were retained as a primary interface, accompanied by copy variants.
The definitions are tied to the Range Library proposal (N1871) rather than to the Boost Range Library.

4 Algorithms

In the following n is an unsigned integer, s[n] denotes n-th element of the sequence s, v1 and v2 are variables of type range_value<Range>::type or elements of range r1 or r2 respectively. size(s) evaluates to the number of elements in s and finally result denotes the value returned by the algorithm.

4.1 Case Conversion

These algorithms perform case-conversion on all characters in the string.

4.1.1 to_lower

Synopsis:

template<typename CopyableRange>
CopyableRange to_lower(const CopyableRange& s, const locale& loc=locale());

template<typename CopyableRange>
CopyableRange&& to_lower(CopyableRange&& s, const locale& loc=locale());

Requires: CopyableRange satisfies the requirements of the copyable range (N1871) [1].

Effects: Converts elements in the input sequence to lower-case using the specified locales.

Returns: Modified copy of the input

Postcondition: for each n in [0, size(s)) result[n] == tolower(s[n], loc)

4.1.2 to_upper

Synopsis:

template<typename CopyableRange>
CopyableRange to_upper(const CopyableRange& s, const locale& loc=locale());

template<typename Sequence>
CopyableRange&& to_upper(CopyableRange&& s, const locale& loc=locale());

Requires: CopyableRange satisfies the requirements of the copyable range (N1871) [1].

Effects: Converts elements in the input sequence to upper-case.

Returns: Modified copy of the input.

Postcondition: for each n in CopyableRange [0, size(s)) result[n] == toupper(s[n], loc)

[1]	(1, 2) An implicit requirement is implied from using `tolower`. `ctype` must be specialized for `range_value<CopyableRange>::type` and it must be imbued in the specified locales.

4.2 Predicates

These algorithms perform containment checks of various types and string comparison.

4.2.1 starts_with

Synopsis:

template<typename Range1, typename Range2, typename Pred>
bool starts_with(const Range1& r1, const Range2& r2, Pred pred);

template<typename Range1, typename Range2>
bool starts_with(const Range1& r1, const Range2& r2);

template<typename Range1, typename Range2, typename Pred>
bool istarts_with(const Range1& r1, const Range2& r2, const locale& loc=locale());

Requires:

Range1, Range2 satisfies the requirements of the range concept (N1871).
(Form 1) the expression pred(v1, v2) is valid.
(Form 2,3) the expression v1==v2 is valid [2].

Effects: Checks whether r2 is a prefix of r1 (Form 3 ignoring the case)

Returns:

true if size(r1)>=size(r2) && for each n<size(r2)
- (Form 1) pred(r1[n],r2[n])
- (Form 2) r1[n]==r2[n]
- (Form 3) tolower(r1[n], loc)==tolower(r2[n], loc)
false otherwise

Example: Function IsURL() checks if the path denotes an url.

bool IsURL(const string& strPath)
{
      return 
              istarts_with(strPath, "http://") ||
              istarts_with(strPath, "https://") ||
              istarts_with(strPath, "ftp://");
}

4.2.2 ends_with

Synopsis:

template<typename Range1, typename Range2, typename Pred>
bool ends_with(const Range1& r1, const Range2& r2, Pred pred);

template<typename Range1, typename Range2>
bool ends_with(const Range1& r1, const Range2& r2);

template<typename Range1, typename Range2, typename Pred>
bool iends_with(const Range1& r1, const Range2& r2, const locale& loc=locale());

Requires:

Range1, Range2 satisfies the requirements of the range concept (N1871).
(Form 1) the expression pred(v1, v2) is valid
(Form 2,3) the expression v1==v2 is valid [2].

Effects: Checks whether r2 is a suffix of r1 (Form 3 ignoring the case)

Returns:

true if size(r1)>=size(r2) && for each 0<n<=size(r2)
- (Form 1) pred(r1[size(r1)-n],r2[size(r2)-n])
- (Form 2) r1[size(r1)-n]==r2[size(r2)-n]
- (Form 3) tolower(r1[size(r1)-n], loc)==tolower(r2[size(r2)-n], loc)
false otherwise

Example: Function IsExecutable() checks if the extension of the file name denotes an executable.

bool IsExecutable(const string& strPath)
{
  return
    iends_with(strPath, ".com") ||
    iends_with(strPath, ".exe") ||
    iends_with(strPath, ".bat");
}

4.2.3 contains

Synopsis:

template<typename Range1, typename Range2, typename Pred>
bool contains(const Range1& r1, const Range2& r2, Pred pred);

template<typename Range1, typename Range2>
bool contains(const Range1& r1, const Range2& r2);

template<typename Range1, typename Range2, typename Pred>
bool icontains(const Range1& r1, const Range2& r2, const locale& loc=locale());

Requires:

Range1, Range2 satisfies the requirements of the range concept (N1871).
(Form 1) the expression pred(v1, v2) is valid.
(Form 2,3) the expression v1==v2 is valid [2].

Effects: Checks whether r2 is a substring of r1 (Form 3 ignoring the case)

Returns:

true if size(r1)>=size(r2) && there exists i such that for each n<size(r2)
- (Form 1) pred(r1[i+n],r2[i+n])
- (Form 2) r1[i+n]==r2[i+n]
- (Form 3) tolower(r1[i+n], loc)==tolower(r2[i+n], loc)
false otherwise

4.2.4 equals

Equals algorithm is actually a two-way version of equal available in <algorithm> header. equal does not check if both operands have the same length. Moreover, it requires that the second argument is at least as long as the first one. Algorithm presented here fills this gap. In addition, case-insensitive comparison is provided.

Synopsis:

template<typename Range1, typename Range2, typename Pred>
bool equals(const Range1& r1, const Range2& r2, Pred pred);

template<typename Range1, typename Range2>
bool equals(const Range1& r1, const Range2& r2);

template<typename Range1, typename Range2, typename Pred>
bool iequals(const Range1& r1, const Range2& r2, const locale& loc=locale);

Requires:

Range1, Range2 satisfies the requirements of the range concept (N1871).
(Form 1) the expression pred(v1, v2) is valid.
(Form 2,3) the expression v1==v2 is valid [2].

Effects: Checks whether r1 and r2 are element-wise equal (Form 3 ignoring the case)

Returns:

true if size(r1)==size(r2) && for each n<size(r2)
- (Form 1) pred(n],r2[n])
- (Form 2) r1[n]==r2[n]
- (Form 3) tolower(r1[n], loc)==tolower(r2[n], loc)
false otherwise

4.2.5 lexicographical_compare

This algorithm has functionality equivalent to lexicographical_compare algorithm already present in the standard library. It provides simpler, range-oriented interface and case-insensitive version.

Synopsis:

template<typename Range1, typename Range2, typename Pred>
bool lexicographical_compare(const Range1& r1, const Range2& r2, Pred pred);

template<typename Range1, typename Range2>
bool lexicographical_compare(const Range1& r1, const Range2& r2);

template<typename Range1, typename Range2, typename Pred>
bool ilexicographical_compare(const Range1& r1, const Range2& r2, const locale& loc=locale());

Requires:

Range1, Range2 satisfies the requirements of the range concept (N1871).
(Form 1) the expression pred(v1, v2) is valid.
(Form 2,3) expression v1<v2 is valid [2].

Effects: Performs the lexicographical comparison.

Returns: true if r1 is lexicographically less then r2, false otherwise.

Notes:: If the two sequences have the same number of elements and their corresponding elements are equivalent, then neither sequence is lexicographically less than the other. If one sequence is a prefix of the other, then the shorter sequence is lexicographically less than the longer sequence. Otherwise, the lexicographical comparison of the sequences yields the same result as the comparison of the first corresponding pair of elements that are not equivalent.

4.2.6 all [3]

Synopsis:

template<typename Range, typename Pred>
bool all(const Range& r1, Pred pred);

Requires: Range satisfies the requirements of the range concept (N1871).

Effects: Check if all the elements of the range satisfy the condition of the predicate pred.

Returns: true if for each n<size(r1) pred(r1[n]), false otherwise

Example: Check if all characters in the string are lower case. [4]

if(all(str, is_lower()))
{ 
  // All characters are lower case
}

4.2.7 none [3]

Synopsis:

template<typename Range, typename Pred>
bool none(const Range& r1, Pred pred);

Requires: Range satisfies the requirements of the range concept (N1871).

Effects: Check if none of the elements in the range satisfy the condition of the predicate pred.

Returns: true if for each n<size(r1) !pred(r1[n]), false otherwise

[2]	(1, 2, 3, 4, 5) An implicit requirement is implied from using `tolower`. `ctype` must be specialized for `range_value<Range>::type` and it must be imbued in the specified locales.

[3]	(1, 2) The names `all` and `none` might be considered too general. Yet they describe the algorithms very well and we haven't found a better variant so far. If this will be considered a problem, alternative names might be used.

[4]	`is_lower` predicate is presented later in this proposal.

4.3 Trimming

Trimming algorithms remove unnecessary spaces from the beginning and the end of a string.

4.3.1 trim_left [5]

Synopsis:

template<typename Sequence>
Sequence trim_left(const Sequence& s, const locale& loc=locale());

template<typename Sequence>
Sequence&& trim_left(Sequence&& s, const locale& loc=locale());

template<typename Sequence, typename Pred>
Sequence trim_left_if(const Sequence& s, Pred pred);

template<typename Sequence>
Sequence&& trim_left_if(Sequence&& s, Pred pred);

Requires:

Sequence satisfies the requirement of sequence (par. 23.1.1) [6].
The expression pred(v1) is valid.

Effects: Removes all leading spaces (or characters satisfying pred) from the beginning of the sequence

Returns: Modified copy of the input

Postcondition:

(Form 1) !isspace(result[0], loc)
(Form 2) !pred(result[0])

4.3.2 trim_right [5]

Synopsis:

template<typename Sequence>
Sequence trim_right(const Sequence& s, const locale& loc=locale);

template<typename Sequence>
Sequence&& trim_right(Sequence&& s, const locale& loc=locale);

template<typename Sequence, typename Pred>
Sequence trim_right_if(const Sequence& s, Pred pred);

template<typename Sequence>
Sequence&& trim_right_if(Sequence&& s, Pred pred);

Requires:

Sequence satisfies the requirement of sequence (par. 23.1.1) [6].
The expression pred(v1) is valid.

Effects: Removes all trailing spaces (or characters satisfying pred) from the end of the sequence

Returns: Modified copy of the input

Postcondition:

(Form 1) !isspace(result[size(result)-1], loc)
(Form 2) !pred(result[size(result)-1])

4.3.3 trim

Synopsis:

template<typename Sequence>
Sequence trim(const Sequence& s, const locale& loc=locale());

template<typename Sequence>
Sequence&& trim(Sequence&& s, const locale& loc=locale());

template<typename Sequence, typename Pred>
Sequence trim_if(const Sequence& s, Pred pred);

template<typename Sequence>
Sequence&& trim_if(Sequence&& s, Pred pred);

Requires:

Sequence satisfies the requirement of sequence (par. 23.1.1) [6].
The expression pred(v1) is valid.

Effects: Equivalent functionality to

(Form 1) trim_left(trim_right(s))
(Form 2) trim_left_if(trim_right_if(s, pred), pred)

Returns: Modified copy of the input

Postcondition:

(Form 1) !isspace(result[0], loc) && !isspace(result[size(result)-1], loc)
(Form 2) !pred(result[0]) && !pred(result[size(result)-1])

[5]	(1, 2) Alternative names that might be considered are `trim_start` and `trim_end`. This would be more consistent with `starts_with` and `end_withs`, however 'TrimLeft/Right' is more often used term in currently existing libraries.

[6]	(1, 2, 3) An implicit requirement is implied from using `tolower`. `ctype` must be specialized for `range_value<CopyableRange>::type` and it must be imbued in the specified locales.

5 Predicates

Two categories of predicates are given.

Classification predicates are used to recognize if a character belongs to a particular character class. These predicated can be used in algorithms like all or trim_if.
Comparison predicates perform comparison of two characters. They are targeted to be used with algorithm that require character-level comparison like starts_with, equals and etc.

It is noteworthy that the predicates are not staticaly bound to a character type that is to be processed, rather the operation it templated. This approach simplifies the usage of predicates, since the types are automaticaly infered from the input without the need to be explicitly specified by a user.

5.1 Classification predicates

5.1.1 Predicate facade

template<typename Derived>
struct predicate_facade {};

The purpose of the predicate_facade class is to tag classification predicates, so that they can be used together with the predicate combinators. Without such a facility it would not be possible to avoid ambiguities with other logical operators.

Every predicate should be therefore derived from predicate_facade.

5.1.2 is_classified

Synopsis:

class is_classified_pred : public predicate_facade<is_classified_pred>
{
public:
  // Constructor
  is_classified_pred(ctype_base::mask Type, locale const & Loc = locale());

  // Operation
  template<typename Char> bool operator()(Char Ch) const;
};

is_classified_pred is_classified(ctype_base::mask Type, const locale& Loc=locale());
is_classified_pred is_space(const locale& Loc=locale());
is_classified_pred is_alnum(const locale& Loc=locale());
is_classified_pred is_alpha(const locale& Loc=locale());
is_classified_pred is_cntrl(const locale& Loc=locale());
is_classified_pred is_digit(const locale& Loc=locale());
is_classified_pred is_graph(const locale& Loc=locale());
is_classified_pred is_lower(const locale& Loc=locale());
is_classified_pred is_print(const locale& Loc=locale());
is_classified_pred is_punct(const locale& Loc=locale());
is_classified_pred is_upper(const locale& Loc=locale())
is_classified_pred is_xdigit(const locale& Loc=locale());

Class is_classified_pred is a predicate that checks whether a character belongs to a character class specified in constructor. Character classes are defined in the ctype_base class. The given locales are used to perform the actual classification.

The functions is_* are the preferred form of construction for the is_classified_pred class.

5.1.3 is_any_of

Synopsis:

template<typename Char> 
class is_any_of_pred : public predicate_facade<is_any_of_pred<Char> >
{
public:
  // Constructor
  template<typename Range> is_any_of_pred(const Range& r);

  // Operation
  template<typename Char2> bool operator()(Char2 c) const;
};

template<typename Range>
is_any_of_pred<typename range_value<Range>::type > is_any_of(const Range& r);

is_any_of_pred predicate checks whether a character is member of a given set. A range specified at the construction is used to initialize the control set.

is_any_of constructs the is_from_range_pred predicate.

5.1.4 is_from_range

Synopsis:

template<typename Char> 
class is_from_range_pred : public predicate_facade<is_from_range_pred<Char> >
{
public:
  // Constructor
  is_from_range_pred(Char from, Char to);

  // Operation
  template<typename Char2> bool operator()(Char2 c) const;
};

template<typename Char>
is_from_range_pred<Char> is_from_range(Char from, Char to);

is_from_range_pred checks whether a character belongs to a specified range of characters. In other words, it checks whether from <= c <= to holds.

Precondition: from <= to

5.2 Combinators

Combinators allow to create an expression consisting of several classification predicates.

Example:

is_space() || is_any_of(as_literal("\"_"));

The result of this expression is a predicate that recognizes spaces, quotes and underscores.

5.2.1 and

Synopsis:

template<typename Pred1, typename Pred2>
class and_pred : public predicate_facade< and_pred<Pred1,Pred2> >
{
public:
  // Constructor
  and_pred(Pred1 p1, Pred2 p2);

  // Operation
  template<typename Char> bool operator()(Char c) const;
};


template<typename Pred1, typename Pred2>
and_pred<Pred1, Pred2> operator&&(Pred1 p1, Pred2 p2);

Constructs a predicate that is a logical and of predicates p1 and p2.

5.2.2 or

Synopsis:

template<typename Pred1, typename Pred2>
class or_pred : public predicate_facade<or_pred<Pred1,Pred2> >
{
public:
  // Constructor
  or_pred(Pred1 p1, Pred2 p2);

  // Operation
  template<typename Char> bool operator()(Char c) const;
};

template<typename Pred1, typename Pred2>
or_pred<Pred1, Pred2> operator||(Pred1 p1, Pred2 p2);

Constructs a predicate that is a logical or of predicates p1 and p2.

5.2.3 not

Synopsis:

template<typename Pred>
class not_pred : public predicate_facade<not_pred<Pred> >
{
public:
  // Constructor
  not_pred(Pred p);

  // Operation
  template<typename Char> bool operator()(Char c) const;
};

template<typename Pred>
not_pred<Pred> operator!(Pred p);

Constructs a predicate that is a logical negation of predicate p.

5.3 Comparison predicates

The comparison predicates are primarily designed to be used by algorithms that perform character-level comparison during their computation [7].

Unlike the functional objects that are already present in the standard library, these predicates have a templated function operator. Therefore they are easier to use since it is not required to explicitly specify the argument type in the construction. Rather, C++ type matching is used.

So instead of writing:

string s1, s2;
...
equals(s1, s2, equal_to<char>());

a user can write

equals(s1, s2, is_equal());

The difference becomes even more apparent when the input types are more complex.

In addition, the old function objects do not directly support comparison between different argument types (unless they are implicitly convertible to each other).

The proposed predicates do not support type identification facilities provided by unary_function and binary_function. This is primarily because it is not possible to support it with a dynamic operation type; secondarily, these facilities (unary_function, etc.) becomes obsolete when auto and decltype will become a part of standard.

5.3.1 is_equal

Synopsis:

class is_equal
{
public:
  // Operator
  template<typename T1, typename T2>
  bool operator ()(const T1& arg1, const T2& arg2) const
};

Effects: operator() returns arg1==arg2

5.3.2 is_iequal

Synopsis:

class is_iequal
{
public:
  // Constructor
  is_iequal(const locale& loc=locale())
  // Operator
  template<typename T1, typename T2>
  bool operator ()(const T1& Arg1, const T2& Arg2) const
};

Effects: operator() returns tolower(arg1,loc)==tolower(arg2,loc)

5.3.3 is_less

Synopsis:

class is_less
{
public:
  // Operator
  template<typename T1, typename T2>
  bool operator ()(const T1& arg1, const T2& arg2) const
};

Effects: operator() returns arg1<arg2

5.3.4 is_iless

Synopsis:

class is_iless
{
public:
  // Constructor
  is_iless(const locale& loc=locale())
   // Operator
  template<typename T1, typename T2>
  bool operator ()(const T1& Arg1, const T2& Arg2) const
};

Effects: operator() returns tolower(arg1,loc)<tolower(arg2,loc)

5.3.5 is_not_greater

Synopsis:

class is_not_greater
{
public:
  // Operator
  template<typename T1, typename T2>
  bool operator ()(const T1& Arg1, const T2& Arg2) const
};

Effects: operator() returns arg1<=arg2

5.3.6 is_not_igreater

Synopsis:

class is_not_igreater
{
public:
  // Constructor
  is_not_igreater(const locale& loc=locale())
  // Operator
  template<typename T1, typename T2>
  bool operator ()(const T1& Arg1, const T2& Arg2) const
};

Effects: operator() returns tolower(arg1,loc)<=tolower(arg2,loc)

[7]

In this proposal we suggested to add only operators for =, < and <=. These operators should be sufficient for the purposes they are targeted. Reverse operations can be achieved by simply revering the arguments in the algorithm call.

Yet it is easy to add the negation operator or simply complete the set if such a requirement arises.

6 Literal helpers

The Range library does not have support for character literals and c-style strings. Two utilities are provided to allow these to be used with string algorithms.

The difference between the two utilities is in the way they handle Char[] literals. as_literal treats it as an ordinary null-terminated string, while as_array uses the array dimensions to determine the length.

In the following example:

char str[]="Hello";
range<char*> lit=as_literal(str);
range<char*> arr=as_array(str);

variable lit will delimit elements 'h','e','l','l','o'. arr on the other end will point to elements 'h','e','l','l','o',0 including the terminating zero.

To pass a literal to an algorithm, one of the adapters must be used

if( starts_with(str, as_literal("start") )
{
  // Do something
}

6.1 as_literal

as_literal constructs a range that delimits the range specified as parameter. Char* and Char[] are considered to be null-terminate strings. strlen is used to determine the length of the string. Other range type are passed unmodified.

Synopsis:

template<typename Range>
range<typename range_value<Range>::type> as_literal(const Range& r);

template<typename Char>
range<Char*> as_literal(Char* pch);

template<typename Char, size_t sz>
range<Char*> as_literal(Char (&arr)[sz])

Requires:

Range must satisfy the range concept requirements (N1871).
Char* must point to a null-terminated string.
char_traits<Char> is defined.

Effect: Construct a range that delimits the input string.

6.2 as_array

as_array constructs a range that delimits the range specified as parameter. Char[] literals are considered to be C-arrays. The array dimension is used to determine the length of the string. Other range types are passed unmodified.

Synopsis:

template<typename Range>
range<typename range_value<Range>::type> as_array(const Range& r);

template<typename Char, size_t sz>
range<Char*> as_array(Char (&arr)[sz])

Requires: Range must satisfy the range concept requirements (Range library proposal)

Effect: Constructs a range that delimits the input string.

7 Discussion and open issues

7.1 String affinity

This issue is rather philosophical, but it might have an impact on several issues discussed later. The proposal declares that it is targeted mainly for string processing. Algorithms are designed with this as a primary goal in mind. However some of them can be also used outside of this scope.

So the question is if the string affinity should be strongly declared or if on the other hand, the usage outside of this scope should be encouraged.

7.2 Placement

This proposal does not explicitly state the header files where the algorithms should be located. There are several possible options.

Distributed approach

No new headers will be created. Rather, the entities from this proposal will be distributed among the existing headers. Algorithms will go into <algorithm> header, predicates to <functional>.

This approach is a good candidate if string affinity (discussed in as the first issue) is not strongly declared.
Monolithic approach

New header will be created, named <stringalgo> that will encompass all the entities presented here. Optionally the entities can be added to <string> header.

Again, this approach is the natural choice if the string affinity of the algorithms is strengthened.

7.3 Range Library and char literals handling

During the writing of this proposal, the Range Library proposal had not been finished. There are some issues that arise due to the interaction with this library. These issues will have to be solved before adopting these two proposals together.

Probably the most problematic is the support for character literals. There are two different views on the literal:

As Null-terminated string
As C-array

The compiler treats literals as C-arrays with the length that includes the terminating zero. In contrast, when using 'null-terminated string' view, zero at the end is only a terminator and is not considered to be a part of the string. Unfortunately, there is no facility in the language that can be used to specify this intention. Therefor the user must know what the algorithm expects and provide the input in the required form.

Prior versions of Boost Range Library contain a special handling for character literal. char[] and wchar_t[] were treated as null-terminated strings. There was also support for char* and wchar_t*. Later this special handling was considered dangerous, since the other C-array types were handled differently (using the array size for determining the boundaries). Current proposal supports only C-arrays and it does so uniformly.

However, this is not natural for string processing. For example contains(str, "hello") will look for 'h','e','l','l','o',0 rather then for "hello".

We suggest two possible solutions: explicit and implicit.

7.3.1 Explicit solution

As the name suggests, all literal handling should be explicit. When using a literal as_literal of as_array will have to be used. To avoid problems arising from unexpected behaviour, range library should be adapted to disallow implicit usage of C-arrays.

For example:

starts_with(str, as_literal("hello"));

char arr[]={'b','y','e'};
starts_with(str, as_array(arr));

The advantage of this approach is clarity and correctness. It is not possible to confuse a C-array and a literal, because the true intention must be specified explicitly. Passing an argument directly will generate a compiler error.

Disadvantages are mainly of cosmetic nature. Code becomes more verbose. Using a manipulator at every place might become tedious.

7.3.2 Implicit solution

In this solution, Range Library would support C-arrays directly. However, algorithms will be able to provide support for literals if appropriate. as_literal is designed in such a way that it processes only null-terminated strings and C-arrays. For all other range types it is transparent, acting as an identity function. Therefore it is possible to write an algorithm as follows:

template<typename Range>
bool AlgoImpl(const Range& r);

template<typename Range>
bool Algo(const Range& r)
{
  return AlgoImpl(as_literal(r));
}

This simple construct extends the set of accepted types by character literals, even when there is no direct support for them in the Range Library. In addition null-terminated strings (char* and wchar_t* and character based C-arrays will be treated as C-strings.

When this handling is inappropriate, user can explicitly cast the input to array using as_array construct:

char arr[]={'b','y','e'};
Algo(as_array(arr));

as_array will convert the input to range. Because that is not a type that as_literal directly processes, it will be passed through unchanged.

This way, the algorithms will need to be clearly marked, as to what types of arguments they accept and how they process them in the documentation. String affinity of algorithms will be strengthened.

The advantage of this approach lies mainly in the simplification of code from the user perspective. In majority of cases the user will not be forced to make explicit specification, implicit handling will work just fine.

On the other hand, there remains the possibility for errors. Different algorithms will use the same arguments in different ways. Such behaviour might be confusing.

7.4 Selection of algorithms

The choice of algorithms is not based on some rigorous study, rather it comes from programming experience and various discussions. Even if we believe that the selection covers most essential ones, there is always place for improvements and possible additions.

This topic is therefore open for discussion.

8 References

Boost String Algorithms Library, http://www.boost.org/doc/html/string_algo.html
Regular Expressions proposal, http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2003/n1429.htm
RValue reference proposal (N1770) and related documents, http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2005/n1770.html
Range Library proposal, http://www.open-std.org/JTC1/SC22/WG21/docs/papers/2005/n1871.html
Boost Range Library, http://www.boost.org/libs/range/index.html