Document number: P0540R2
Project: Programming Language C++
Audience: Library Evolution Working Group
Laurent NAVARRO <ln@altidev.com>
Date: 2018-04-29
Splitting a string into multiple strings based on a separator, and the reverse operation of aggregating a collection of strings with a separator, are very common operations, but there is no standardized, easy-to-use solution in the existing std::basic_string class or in the proposed std::basic_string_view class.
split("C++/is/fun","/") => ["C++","is","fun"]
join(["C++","is","fun"],"/") => "C++/is/fun"
The purpose of this simple proposal is to fill this gap.
These features are available in the standard string class of the following languages : D, Python, Java, C#, Go, Rust and some others.
This proposal is a pure library extension. It does not require changes in the language core itself.
It does require adding new methods to the std::basic_string_view class and the std::basic_string class (or to std::basic_string_view alone, if implemented only there).
Alternatively, it could be a function added to the algorithms library, if this option is preferred.
It has been implemented in standard C++17.
Several options have been discussed in the thread [1]; below is a summary of them. As several alternatives have been discussed, we let the committee choose which option is preferred.
Probably the simplest option is to add methods to std::basic_string and std::basic_string_view.
Example on std::basic_string_view (std::basic_string is much the same):
vector<basic_string<CharT, Traits> > splits(const basic_string_view<CharT, Traits> &Separator) const
vector<basic_string_view<CharT, Traits> > splitsv(const basic_string_view<CharT, Traits> &Separator) const
The purpose of these methods is to return a vector of string or of string_view.
auto MyResult= "my,csv,line"s.split(",");
The s and sv suffixes are derived from the standard literal suffixes.
splitsv has the advantage of being efficient in terms of CPU (no copy to make) and RAM (no memory to allocate for the substrings, only for the vector).
splits is useful when splitsv can't be used. For instance, it is needed when splitting a temporary object.
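The difference between the two variants can be sketched with free functions, since members cannot be added to std::basic_string_view from user code. The names splitsv_sketch and splits_sketch are hypothetical, and the sketch assumes a non-empty separator and keeps empty fields; neither point is fixed by the text above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for splitsv: the pieces are views into `input`,
// so they are only valid while the underlying characters are alive.
std::vector<std::string_view> splitsv_sketch(std::string_view input,
                                             std::string_view separator) {
    assert(!separator.empty());               // assumption of this sketch
    std::vector<std::string_view> parts;
    if (input.empty()) return parts;          // empty input: empty container
    std::size_t start = 0;
    for (;;) {
        const std::size_t pos = input.find(separator, start);
        if (pos == std::string_view::npos) {
            parts.push_back(input.substr(start));
            return parts;
        }
        parts.push_back(input.substr(start, pos - start));
        start = pos + separator.size();
    }
}

// Hypothetical stand-in for splits: same pieces, but each one is copied,
// so the result stays valid after the input has been destroyed.
std::vector<std::string> splits_sketch(std::string_view input,
                                       std::string_view separator) {
    std::vector<std::string> parts;
    for (std::string_view piece : splitsv_sketch(input, separator))
        parts.emplace_back(piece);
    return parts;
}
```

With a temporary input such as splits_sketch(std::string("C++/is/fun"), "/"), only the copying variant yields a result that is safe to keep.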
Several options presented here are methods in both std::basic_string and std::basic_string_view.
It could make sense to implement them only in std::basic_string_view for several reasons :
basic_string_view is a new class, so it is probably simpler to amend it to integrate these features in C++17. They could be backported later to std::basic_string if needed.
The splitf method splits the input string and calls a unary functor with a std::basic_string_view as its parameter.
template <class F> void splitf(const basic_string_view &Separator,F functor) const
Some people do not wish to have a container returned, in order to avoid its memory allocation; splitf is one possible way to address this concern.
Some people wish to run some processing on each value; splitf is a more direct way to do so than iterating over the splitsv result with a range-based for.
The string_view passed to the functor makes it possible to compute the position of the substring in the initial string, which was highlighted as a potential need.
Example of usage displaying the substring, its initial position and its length
strsv.splitf(" ", [&](const string_view &s) {
    cout << s << " ,Pos=" << (s.data() - strsv.data()) << " ,Len=" << s.length() << endl;
});
Example of implementation in [2]
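A minimal sketch of the callback variant follows, again as a free function with hypothetical names (splitf_sketch, piece_positions) and assuming a non-empty separator. It is not the implementation referenced in [2]; it only illustrates that no container is allocated by the split itself and that the offset of each piece can be recovered from the view.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <utility>
#include <vector>

// Hypothetical stand-in for splitf: calls `functor` once per piece.
template <class F>
void splitf_sketch(std::string_view input, std::string_view separator, F functor) {
    assert(!separator.empty());               // assumption of this sketch
    if (input.empty()) return;                // the callback is never called
    std::size_t start = 0;
    for (;;) {
        const std::size_t pos = input.find(separator, start);
        if (pos == std::string_view::npos) {
            functor(input.substr(start));
            return;
        }
        functor(input.substr(start, pos - start));
        start = pos + separator.size();
    }
}

// Collects (offset, length) pairs. The offset comes from pointer arithmetic,
// which is valid because every piece points into the buffer of `input`.
std::vector<std::pair<std::size_t, std::size_t>>
piece_positions(std::string_view input, std::string_view separator) {
    std::vector<std::pair<std::size_t, std::size_t>> out;
    splitf_sketch(input, separator, [&](std::string_view piece) {
        out.emplace_back(static_cast<std::size_t>(piece.data() - input.data()),
                         piece.size());
    });
    return out;
}
```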
The splitc method splits the input string and appends the substrings to the container passed as output parameter.
template <class T> void splitc(const basic_string_view<CharT, Traits> &Separator,T &Result) const
The append operation can be done using emplace_back (preferred option) or push_back.
Some people do not wish to use a vector container; this option accepts a wide range of containers to address this concern.
This option also makes it possible to fill a container holding another string type, provided that type can be built from a string_view.
Example of usage
vector<string_view> vector5;
strsv.splitc(" ", vector5);
vector<string> vector6;
strsv.splitc(" ", vector6);
Example of implementation in [2]
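The container variant can be sketched in the same way. The free function splitc_sketch is a hypothetical name; the sketch assumes a non-empty separator and a container that offers emplace_back with a value type constructible from std::string_view.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for splitc: appends the pieces to `result`
// without clearing what the container already holds.
template <class Container>
void splitc_sketch(std::string_view input, std::string_view separator,
                   Container& result) {
    assert(!separator.empty());               // assumption of this sketch
    if (input.empty()) return;
    std::size_t start = 0;
    for (;;) {
        const std::size_t pos = input.find(separator, start);
        if (pos == std::string_view::npos) {
            result.emplace_back(input.substr(start));
            return;
        }
        result.emplace_back(input.substr(start, pos - start));
        start = pos + separator.size();
    }
}
```

The same call fills a std::vector<std::string_view> without copying characters, or a std::list<std::string> with owned copies; only the container type changes.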
Instead of using suffixes to select the version, it would be nice to have automatic selection of the right split version.
splits & splitsv have only one parameter and can easily be differentiated from splitf & splitc.
The choice between splits & splitsv is made by parameterizing the return type.
str.split<string>(" "); ==> Returns vector<string>
str.split<string_view>(" "); ==> Returns vector<string_view>
This is not very user-friendly, but it gets better if we set a default type.
I selected basic_string_view as the default type, as it is the most interesting in most cases and the longest to type (string_view = 11 chars).
When another type is needed (e.g. string), it can be specified; it can also be another string type if it is convertible from string_view.
template<class StringType=basic_string_view<CharT, Traits>,class SeparatorType >
vector<StringType > split(const SeparatorType &Separator) const
str.split<string>(" "); ==> Returns vector<string>
str.split(" "); ==> Returns vector<string_view>
An alternate solution was proposed by Jakob Riedle, with a specialization based on whether the invoking object is an lvalue or an rvalue.
This solution has the advantage that it automatically rules out the use of string_view on temporary values, but it makes things a bit more complex if you really want the split as copies in a vector<string>; that can, however, be done using splitc(MyStringVector). I've tested it successfully with this prototype on GCC & VS2017
template<class SeparatorType > vector<basic_string_view> split(const SeparatorType &Separator) const &;
template<class SeparatorType > vector<basic_string > split(const SeparatorType &Separator) const &&;
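The ref-qualified selection can be illustrated with a small wrapper class. Text and split_as are hypothetical names introduced only because members cannot be added to std::string from user code; the sketch assumes a non-empty separator and is not the prototype mentioned above.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <type_traits>
#include <utility>
#include <vector>

// Hypothetical wrapper standing in for a string class with a split member.
class Text {
public:
    explicit Text(std::string s) : data_(std::move(s)) {}

    // Called on an lvalue: views into data_ stay valid while *this lives.
    std::vector<std::string_view> split(std::string_view sep) const & {
        return split_as<std::string_view>(sep);
    }
    // Called on a temporary: the pieces are copied so nothing dangles.
    std::vector<std::string> split(std::string_view sep) const && {
        return split_as<std::string>(sep);
    }

private:
    template <class Piece>
    std::vector<Piece> split_as(std::string_view sep) const {
        assert(!sep.empty());                 // assumption of this sketch
        std::vector<Piece> parts;
        const std::string_view input = data_;
        if (input.empty()) return parts;
        std::size_t start = 0;
        for (;;) {
            const std::size_t pos = input.find(sep, start);
            if (pos == std::string_view::npos) {
                parts.emplace_back(input.substr(start));
                return parts;
            }
            parts.emplace_back(input.substr(start, pos - start));
            start = pos + sep.size();
        }
    }

    std::string data_;
};
```

Calling split on a named Text object selects the view-returning overload, while Text("A B").split(" ") selects the copying one, without the caller naming a return type.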
To distinguish between the splitf & splitc versions, enable_if can be used.
I implemented it by detecting whether the parameter has a clear method: if it does, I assume it is a container; otherwise I assume it is a functor.
Example of implementation in [6]
If the separator is an empty string "" or the separator char is 0, then the split methods split on every char: "abc" => ["a","b","c"]
If the input string is an empty string "", then the split methods return an empty container or never call the callback function.
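These two edge cases can be sketched as follows, using a hypothetical free function split_with_edge_cases; only the string-separator form is covered, and the case of a separator char equal to 0 is left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical sketch of the edge-case rules stated above.
std::vector<std::string_view> split_with_edge_cases(std::string_view input,
                                                    std::string_view separator) {
    std::vector<std::string_view> parts;
    if (input.empty()) return parts;          // empty input: empty container
    if (separator.empty()) {                  // empty separator: one piece per char
        for (std::size_t i = 0; i < input.size(); ++i)
            parts.push_back(input.substr(i, 1));
        return parts;
    }
    std::size_t start = 0;
    for (;;) {
        const std::size_t pos = input.find(separator, start);
        if (pos == std::string_view::npos) {
            parts.push_back(input.substr(start));
            return parts;
        }
        parts.push_back(input.substr(start, pos - start));
        start = pos + separator.size();
    }
}
```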
It was highlighted that splitting by a single char can be optimized compared with splitting by a string. An overload for a single-char separator was therefore suggested.
A counter-argument was that this optimization can (and should) be detected at runtime, or provided by a non-standardized overload added by the implementer.
But why spend time checking at runtime something that can be determined at compile time?
For information, the string::find method has a standardized char overload too.
It was decided to add the char overload, as it also avoids the creation of a temporary string_view object.
Implementers can decide to forward it to the string_view version if they do not want to optimize this case.
Specifying it as part of the standard is an incentive to optimize this case, which is probably the most common one.
Example of implementation in [2]
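A possible shape for the single-char overload is sketched here with a hypothetical free function split_by_char. It relies on the char overload of std::string_view::find, so no temporary string_view is built for the separator; the special case of a separator char equal to 0 is not handled in this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical single-char overload: the separator always has length 1,
// so the next search simply starts one position after the match.
std::vector<std::string_view> split_by_char(std::string_view input, char separator) {
    std::vector<std::string_view> parts;
    if (input.empty()) return parts;
    std::size_t start = 0;
    for (;;) {
        const std::size_t pos = input.find(separator, start);
        if (pos == std::string_view::npos) {
            parts.push_back(input.substr(start));
            return parts;
        }
        parts.push_back(input.substr(start, pos - start));
        start = pos + 1;
    }
}
```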
It was highlighted that splitting by a set of single chars can be interesting: any single character appearing in the set is then considered a separator.
The main question is how to pass this set of characters.
- Using a string, which may be confused with splitting by a string (unless splitting by a string is considered uninteresting)
- Using a std::initializer_list
template <class F>
void split(std::initializer_list<CharT> SeparatorList, F functor) const
usage : string_view("a.b-c,. d, e .f-").split({'.',',','-'})
- Using a std::vector, but the call has to be heavier, as the previous form of call causes ambiguity errors
usage : string_view("a.b-c,. d, e .f-").split(vector<char>({'.',',','-'}))
- Using a variadic, but if unified naming is used, it may complicate the selection of the 2-parameter methods
usage : string_view("a.b-c,. d, e .f-").split('.',',','-')
- Using an encapsulating class like ByAnyChar, as is done in abseil
usage : string_view("a.b-c,. d, e .f-").split(ByAnyChar(".,-"))
- Using an extra parameter with an enum value like SplitByAnyChar
usage : string_view("a.b-c,. d, e .f-").split(".,-",SplitByAnyChar)
Example of implementation using initializer_list in [6]
My personal preferences are initializer_list and the encapsulating class, as they are simple to use and implement.
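The initializer_list option could look like the following sketch. split_any_of is a hypothetical free function taking a callback, and it is not the implementation referenced in [6]; it simply delegates the search to std::string_view::find_first_of.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <string_view>
#include <vector>

// Hypothetical sketch: any character of `separators` ends the current piece.
template <class F>
void split_any_of(std::string_view input,
                  std::initializer_list<char> separators, F functor) {
    if (input.empty()) return;
    // An initializer_list stores its elements contiguously, so it can be
    // viewed as a character set without copying.
    const std::string_view set(separators.begin(), separators.size());
    std::size_t start = 0;
    for (;;) {
        const std::size_t pos = input.find_first_of(set, start);
        if (pos == std::string_view::npos) {
            functor(input.substr(start));
            return;
        }
        functor(input.substr(start, pos - start));
        start = pos + 1;
    }
}
```

In this sketch two adjacent separators produce an empty piece; whether such pieces should be kept is not settled by the discussion above.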
It was highlighted that splitting by a regexp can be useful; it can also be a way to implement splitting by a set of separators, so an overload of split taking a regexp could make sense.
It may introduce a dependency on regexp, which is perhaps not a good idea.
But it could be implemented as a regexp_split function.
Perhaps a dependency on another standard class is not an issue. In that case, a member-function overload may make sense.
This option is part of the proposed text.
Example of implementation in [2]
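One way to build such an overload on top of the existing regex library is sketched below, using std::cregex_token_iterator with submatch index -1, which selects the text between matches. split_by_regex is a hypothetical name and this is not the implementation referenced in [2]; the empty-input case is handled explicitly to follow the rule that an empty input yields an empty container.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string_view>
#include <vector>

// Hypothetical regex variant: the returned views point into `input`.
std::vector<std::string_view> split_by_regex(std::string_view input,
                                             const std::regex& separator) {
    std::vector<std::string_view> parts;
    if (input.empty()) return parts;
    const char* first = input.data();
    const char* last = first + input.size();
    // Index -1 asks for the parts of the input that were NOT matched.
    std::cregex_token_iterator it(first, last, separator, -1);
    const std::cregex_token_iterator end;
    for (; it != end; ++it)
        parts.emplace_back(it->first, static_cast<std::size_t>(it->length()));
    return parts;
}
```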
In the N3593 proposal [5] there is the idea of using a template delimiter object which must implement a find method.
The proposal suggests providing 4 built-in classes for char, string, any_of_string, regex.
It's a smart idea, as it reduces the number of overloads and makes it possible to extend the concept.
However, I think that the benefit of having fewer overloads is lost by the introduction of 4 new classes, and the interest of adding a new kind of splitter is limited.
So I consider this solution to be a bit more complex for a reduced benefit.
A string_split function algorithm may replace the member functions, as it doesn't require special access to the class.
Pros are :
* It avoids implementing it in both classes.
It's true that the methods have to be declared in both classes; however, in basic_string they can be generic forwarders (the examples below are for the unified split).
// SPLIT Version replacing splits & splitsv for any separator
template<class StringType=basic_string_view<CharT, Traits>,class SeparatorType >
vector<StringType > split(const SeparatorType &Separator) const {
    basic_string_view17<CharT, Traits> sv(this->c_str(),this->size());
    return sv.template split<StringType>(Separator);
}
// SPLIT Version replacing splitf & splitc for any separator
template<class SeparatorType , class TargetType>
void split(const SeparatorType &Separator, TargetType Target) const {
    basic_string_view17<CharT, Traits> sv(this->c_str(), this->size());
    return sv.split(Separator, Target);
}
* It can be used with alternate string classes; however, it will probably require that they have a standardized way to extract a substring through a substr method (which is not the case for qstring, Cstring, AnsiString), making this argument less valuable.
const char *MyCharStr="My char* to split";
vector4 = string_split(MyCharStr," ");
vector4 = string_view(MyCharStr).split(" ");
Passing a char* does not work with this definition (CharT deduction is impossible):
template <class CharT,class Traits = std::char_traits<CharT>, class StringType = basic_string_view<CharT, Traits> >
vector<StringType > string_split(const basic_string_view17<CharT, Traits> &InputStr, const basic_string_view<CharT, Traits> &Separator)
But it was possible when I dropped CharT, with this definition instead (though it requires defining a function for each possible CharT, which goes against the first Pros argument):
template <class StringType = string_view >
vector<StringType > string_split(const string_view17 &InputStr, const string_view &Separator)
mystring.split(',') //seems to me more OOP
string_split(mystring,',') //seems to me more C/Procedural
A string_split function algorithm would return a range and could potentially replace all the previously discussed methods.
Example of usage
string MyStr("my,csv,line");
vector<string> MyResult(string_split(MyStr, ","));
// Efficient conversion from string to string_view may require an explicit initial casting
vector<string_view> MyResult(string_split(string_view(MyStr), ","));
// splitf replacement
std::for_each(string_split(string_view(MyStr), ","), callback);
Example of implementation in [3]&[4]
A splitr method returning a range may also make sense.
The Temporary object issue
During the evaluation of the prototype, it clearly appeared that there is a problem with temporary objects.
string string_toLower(string Str) {
    std::transform(Str.begin(), Str.end(), Str.begin(), ::tolower);
    return Str;
}
int main() {
    for(auto x : split_string(vstr1,regex("\\s"))) // <== (1)
    . . . . .
    for(auto x : split_string(string_toLower("C++ Is Very Fun!!!")," ")) // <== (2)
    . . . . .
}
Code (1) will not work because split_string uses a reference to the regex, and since the regex object no longer exists when split_string returns, the returned iterator refers to a destroyed object.
Code (2) will not work either, and it's a bigger problem. The reason is the same: when we use the returned iterator, the lowered string no longer exists. A workaround could be to systematically make a copy of the input string, but it doesn't make sense from a memory and performance perspective. It would also invalidate the returned string_view in several other usages, as the returned string view would not refer to the initial object but to the saved copy.
A complementary test done using the split action of the range-v3 library seems to confirm this :
vector<string> vect1= string_toLower(str8) | view::split(',');
This instruction triggers this assertion :
error: static assertion failed: You can't pipe an rvalue container into a view. First, save the container into a named variable, and then pipe it to the view.
This suggests that, by design, ranges cannot handle temporary values.
Solutions Comparison
Since the first answer in the proposal thread on Google Groups, there has been a debate between methods and functions. Below is a factual comparison on some use cases.
Note : Other options like the unified name split or a non-member function can be considered, but they are pretty close to the method option.
Usage | Range function Option | Method Option | Remarks |
---|---|---|---|
Split in vector of string_view | vector<string_view> vec =string_split(str,","); | auto vec=str.splitsv(","); | The function solution allocates 1 extra split_range_string and 3 extra iterators (first_, past_last_, the internal one of for) |
Split in vector of string | vector<string> vec=string_split(str,","); | auto vec=str.splits(","); | The function solution allocates 1 extra split_range_string and 3 extra iterators (first_, past_last_, the internal one of for) + creation of an intermediate string_view |
Split in list of string_view | list<string_view> lst =string_split(str,","); | list<string_view> lst; str.splitc(",",lst); | The function solution uses the same notation for every container. The method solution requires different code for other containers but is more efficient, as it doesn't require extra object creation. |
Split over a function | for_each(string_split(str,","),MyFunction); | str.splitf(",",MyFunction); | The function solution allocates extra objects |
Split over a loop (a) | for(s :string_split(str,",")) | for(s :str.splitsv(",")) | The method solution allocates an extra vector of string_view |
Split over a loop (b) | for(s :string_split(str,",")) | strsv.splitf(" ", [&](const string_view &s) { cout << s << endl; }); | The method solution allocates neither the extra vector nor the extra objects of the function solution, but the notation is rather less conventional |
Temporary object | vec=string_split(GetTmpObject(),","); | vec=GetTmpObject().splits(","); | Function solution : doesn't work (see the Temporary object issue paragraph). Method solution : works fine |
From alternate string class (Must be convertible from string_view) | vec=string_split(MyOtherString,","); | vec=string_view(MyOtherString).splitsv(","); | Both options in fact use the string_view implementation. string_split is not instantiated for the alternate string class. |
C++20 will bring a new feature named ranges that will simplify operations on many things, thanks to Eric Niebler for his strong commitment to it.
Range-V3 (ancestor of the proposal) includes a split action allowing containers to be split by an element, a range or a predicate.
This solution would fit the need, but with some restrictions, I think.
Ranges are designed to handle general containers with general-purpose content; by contrast, this proposal is string oriented and integrates easy-to-use features.
Splitting by a single char with ranges is easy to do and can be a reasonable alternative : vector<string>vect1= str8 | view::split(',');
Using ranges to split by a substring or a regex will be significantly more complex and require significant pieces of technical code.
Taking advantage of string_view is possible with ranges, but it requires an explicit cast of the input string, whereas I consider it should be the default case.
As described in the previous paragraph (range function), range usage causes the creation of several intermediate objects (the range, iterators).
Consider the case of parsing a 100 000 line CSV file of 10 columns where each line is handled using the random access operator in the application: I have the feeling that using the splitc variant on a vector will be simpler to write and cause significantly fewer allocations of internal objects.
I've done performance tests using range::split: it is 14 times slower, which is quite normal as it is not designed to handle this simple case.
Originally the STL used iterators to handle containers; this choice brings a lot of flexibility but also a bit of verbosity.
To copy vector V1 into deque D1 we write copy(V1.begin(), V1.end(),std::back_inserter(D1));
when it could be written : copy(V1,D1);
Iterators make it possible to read only a subset of V1, to work without a container, or to write through various adapters, but if you just want to copy one container into another, it's a bit longer to write.
For a few years, we have been talking about ranges as a solution to simplify invoking STL algorithms (it's not the only motivation of this concept), allowing us to write sort(V1) instead of sort(V1.begin(), V1.end()); the idea here is to hide iterators when we don't need them (like the C++11 range-based for).
In this proposal all parameters are containers and not iterators.
The motivation of this proposal is to bring an easy-to-use solution to split/join strings, so I have naturally decided to take containers as parameters and not iterators, in order to have a simpler invocation.
I wish we could use strsv.splitc(" ", vector5); instead of strsv.splitc(" ", back_inserter(vector5));
This style is not common in the STL. As this is not a common/generic algorithm like copy, I think it's a good opportunity to introduce this way of working, and perhaps in the future, with the help of modern metaprogramming tools, we will be able to write copy(V1,D1);
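The container-to-container style wished for here can be approximated today with a small helper. copy_all is a hypothetical name, not a standard function; it only hides the iterator pair and the back_inserter behind a container interface.

```cpp
#include <algorithm>
#include <cassert>
#include <deque>
#include <iterator>
#include <vector>

// Hypothetical helper: appends every element of `from` to `to`.
template <class Input, class Output>
void copy_all(const Input& from, Output& to) {
    std::copy(std::begin(from), std::end(from), std::back_inserter(to));
}
```

A call such as copy_all(V1, D1) then reads like the copy(V1,D1) form discussed above, at the cost of losing the flexibility of sub-ranges and custom output adapters.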
However, it can be considered important to still provide an iterator interface so as not to break the iterator logic; in this case, introducing spliti could be a solution.
template<class SeparatorType,class OutputIterator >
void spliti(const SeparatorType &Separator,OutputIterator it) const {
    // Write each piece through the output iterator, then advance it
    splitf(Separator,[&](const basic_string_view<CharT, Traits> &s){ *it++ = s; });
}
There are already several solutions providing a split function, but generally in an alternate string class.
Abseil
Provides an alternate string class and an alternate string_view class. absl::StrSplit() provides string splitting.
Abseil only allows splitting into a container (of multiple kinds, using an adaptor); no iterator or range is provided.
Delimiters are specified by specific classes
absl::ByString() (default for std::string arguments)
absl::ByChar() (default for a char argument)
absl::ByAnyChar() (for mixing delimiters)
absl::ByLength() (for applying a delimiter a set number of times)
absl::MaxSplits() (for splitting a specific number of times on a single char)
Abseil also provides the concept of filters and ships 2 of them (SkipEmpty and SkipWhitespace)
POCO
Provides an alternate StringTokenizer class allowing a std::string to be split using a set of single-char separators.
It acts as a vector of strings.
It provides options to skip empty tokens and to trim results.
Boost:split
Provides a solution to fill a container from a string using a set of single-char separators (using a predicate).
It provides an option to skip empty tokens.
The options provided by some solutions to skip empty tokens or trim results can be implemented using the split functor
strsv.split(",", [&](string_view s) {
    if(s!=""){ // filter
        s.remove_suffix(1); // Transform
        vector4.emplace_back(s);
    }
});
but integrating such options as method parameters could make sense if we consider they will be used frequently.
The join static method joins a list of input strings passed in an iterable container and adds a delimiter between each value.
template<class T,class U> static basic_string<CharT, Traits, Allocator> join(T &InputStringList, U Separator)
Example of usage (simple but so useful)
cout << "Join of string vector=" << string::join(vector6, "_") << endl;
cout << "Join of string_view vector=" << string::join(vector5, "_") << endl;
Example of implementation in [2]
The join method uses the current string as the separator to join the list; this is how join is used in Python.
template<class T> basic_string<CharT, Traits, Allocator> join(T &InputStringList)
Example of usage
cout << "pythonic Join of string =" << "-"s.join(vector6) << endl;
This option could also be part of the std::basic_string_view class, where the static option makes less sense.
The string_join function acts exactly like the static function but is more suitable if the string_split function is the option selected for split.
template<class T,class U> basic_string<typename T::value_type::value_type, typename T::value_type::traits_type> string_join(T &InputStringList, U Separator)
Example of usage (simple but so useful)
cout << "Join of string_view vector with string_join=" << string_join(vector5, "_") << endl;
Example of implementation in [2]
It's possible to optimize the processing by iterating over the container a first time to compute the size of the final string and reserve it; below is an example:
template<class T>
basic_string<typename T::value_type::value_type, typename T::value_type::traits_type>
string_join(const T &InputStringList,
            const basic_string_view<typename T::value_type::value_type, typename T::value_type::traits_type> Separator) {
    basic_string<typename T::value_type::value_type, typename T::value_type::traits_type> result_string;
    size_t StrLen = 0;
    if (InputStringList.empty()) return result_string;
    // First pass: compute an upper bound of the final length
    auto it = InputStringList.begin();
    for (; it != InputStringList.end(); ++it) StrLen += it->size() + Separator.size();
    result_string.reserve(StrLen);
    // Second pass: append the first string, then separator + string for the others
    result_string += *InputStringList.begin();
    for (it = ++InputStringList.begin(); it != InputStringList.end(); ++it) {
        result_string += Separator;
        result_string += *it;
    }
    return result_string;
}
However, this requires being able to obtain the length of each string in InputStringList and the length of Separator.
It is quite common for the separator to be a char*, which has no size() member, so as a workaround the separator is specified as a const string_view.
The problem is the same if InputStringList is a vector<char*>, but in this case the problem is bigger, as it seems impossible to me to specify the type of the returned string.
For several functions like string_join, it was highlighted that the separator could come first, as it is smaller than the container (or the input string).
In some languages the standard implementation puts the separator first : PHP, C#, Java
In some others, the separator comes last : Go, Rust, boost::algorithm::join, LibC strtok
Python has a different logic, as the separator is the "caller" object (like the classic join method described earlier)
Having the separator as the 2nd parameter would allow it to be optional, defaulting to ""
This has to be analyzed later, but the consensus seems to be for the separator last.
vector<basic_string<CharT, Traits> > splits(const basic_string_view &Separator) const;
vector<basic_string<CharT, Traits> > splits(const typename basic_string_view::value_type Separator) const;
vector<basic_string<CharT, Traits> > splits(const basic_regex<CharT> &Separator) const;
vector<basic_string_view> splitsv(const basic_string_view &Separator) const;
vector<basic_string_view> splitsv(const typename basic_string_view::value_type Separator) const;
vector<basic_string_view> splitsv(const basic_regex<CharT> &Separator) const;
template <class F> void splitf(const basic_string_view &Separator,F functor) const;
template <class F> void splitf(const typename basic_string_view::value_type Separator,F functor) const;
template <class F> void splitf(const basic_regex<CharT> &Separator,F functor) const;
template <class T> void splitc(const basic_string_view &Separator,T &Result) const;
template <class T> void splitc(const typename basic_string_view::value_type Separator,T &Result) const;
template <class T> void splitc(const basic_regex<CharT> &Separator,T &Result) const;
vector<basic_string<CharT, Traits> > splits(const basic_string_view &Separator) const
vector<basic_string<CharT, Traits> > splits(const typename basic_string_view::value_type Separator) const
vector<basic_string<CharT, Traits> > splits(const basic_regex<CharT> &Separator) const
Effects: splits a string based on the separator and returns the result in a vector of string
The separator can be :
* a string
* a single char
* a regexp
Returns: vector<string>
Remarks: if the separator string is empty (size()==0) or the separator char is 0, then the string is split on every char
If the input string is an empty string "", then the split methods return an empty container or never call the callback function.
vector<basic_string_view> splitsv(const basic_string_view &Separator) const
vector<basic_string_view> splitsv(const typename basic_string_view::value_type Separator) const
vector<basic_string_view> splitsv(const basic_regex<CharT> &Separator) const
Effects: splits a string based on the separator and returns the result in a vector of string_view
The separator can be :
* a string
* a single char
* a regexp
Returns: vector<string_view>
Remarks: if the separator string is empty (size()==0) or the separator char is 0, then the string is split on every char
If the input string is an empty string "", then the split methods return an empty container or never call the callback function.
template <class F> void splitf(const basic_string_view &Separator,F functor) const
template <class F> void splitf(const typename basic_string_view::value_type Separator,F functor) const
template <class F> void splitf(const basic_regex<CharT> &Separator,F functor) const
Effects: splits a string based on the separator and calls a unary function for each occurrence
The separator can be :
* a string
* a single char
* a regexp
Returns: void
Remarks: if the separator string is empty (size()==0) or the separator char is 0, then the string is split on every char
If the input string is an empty string "", then the split methods return an empty container or never call the callback function.
template <class T> void splitc(const basic_string_view &Separator,T &Result) const
template <class T> void splitc(const typename basic_string_view::value_type Separator,T &Result) const
template <class T> void splitc(const basic_regex<CharT> &Separator,T &Result) const
Effects: splits a string based on the separator and appends the result to the container passed as output parameter
The separator can be :
* a string
* a single char
* a regexp
Returns: void
Remarks: if the separator string is empty (size()==0) or the separator char is 0, then the string is split on every char
If the input string is an empty string "", then the split methods return an empty container or never call the callback function.
vector<basic_string> splits(const basic_string_view<CharT, Traits> &Separator) const;
vector<basic_string> splits(const typename basic_string_view<CharT, Traits>::value_type Separator) const;
vector<basic_string> splits(const basic_regex<CharT> &Separator) const;
vector<basic_string_view<CharT, Traits> > splitsv(const basic_string_view<CharT, Traits> &Separator) const;
vector<basic_string_view<CharT, Traits> > splitsv(const typename basic_string_view<CharT, Traits>::value_type Separator) const;
vector<basic_string_view<CharT, Traits> > splitsv(const basic_regex<CharT> &Separator) const;
template <class F> void splitf(const basic_string_view<CharT, Traits> &Separator,F functor) const;
template <class F> void splitf(const typename basic_string_view<CharT, Traits>::value_type Separator,F functor) const;
template <class F> void splitf(const basic_regex<CharT> &Separator,F functor) const;
template <class T> void splitc(const basic_string_view<CharT, Traits> &Separator,T &Result) const;
template <class T> void splitc(const typename basic_string_view<CharT, Traits>::value_type Separator,T &Result) const;
template <class T> void splitc(const basic_regex<CharT> &Separator,T &Result) const;
template<class T,class U> static basic_string<CharT, Traits, Allocator> join(T &InputStringList, U Separator);
The split methods have exactly the same behavior as their basic_string_view equivalents; they are shortcuts.
template<class T,class U> static basic_string<CharT, Traits, Allocator> join(T &InputStringList, U Separator)
Effects: returns a string that joins all the strings contained in InputStringList, adding a separator between each string. If there are N strings, N-1 separators are inserted.
Returns: the aggregated string
Remarks: if InputStringList is empty, the method returns an empty string.
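The join semantics specified here (N strings, N-1 separators, empty list gives an empty string) can be sketched as a free function. join_sketch is a hypothetical name, restricted to char-based strings, and it assumes that the container elements are convertible to std::string_view.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical stand-in for the proposed static join.
template <class Container>
std::string join_sketch(const Container& pieces, std::string_view separator) {
    std::string result;
    std::size_t total = 0;
    std::size_t count = 0;
    for (const auto& piece : pieces) {        // first pass: final length
        total += std::string_view(piece).size();
        ++count;
    }
    if (count == 0) return result;            // empty list: empty string
    result.reserve(total + (count - 1) * separator.size());
    bool first = true;
    for (const auto& piece : pieces) {        // second pass: N-1 separators
        if (!first) result += separator;
        result += std::string_view(piece);
        first = false;
    }
    return result;
}
```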