ISO/IEC JTC1 SC22 WG21 N2534 = 08-0044 - 2008-03-17
Alisdair Meredith,
alisdair.meredith@codegear.com
Hans Boehm,
Hans.Boehm@hp.com
Lawrence Crowl,
Lawrence@Crowl.org,
crowl@google.com
Peter Dimov,
pdimov@pdimov.com,
The current definition of basic_string
allows only very limited concurrent access to strings.
Such limited concurrency
will inhibit performance in multi-threaded applications.
We categorize string operations as follows:
These operations must be single-threaded and define the temporal bounds of all other operations on an object. However, copy construction may induce a shared representation on the argument string in copy-on-write implementations.
These operations will (probably) change the state of a string variable regardless of the implementation. Furthermore, some may introduce a shared representation on the argument string in copy-on-write implementations.
operator= resize reserve clear operator+= append assign pushback insert erase popback replace swap operator<< getline
(Any library function that accepts a string by non-const reference.)
These operations provide pointers or references to the internal string representation.
begin
end
rbegin
rend
operator[]
at
front
back
For non-const strings, iterator and element access operations may require copy-on-write implementation to 'unshare' the string to enable write access to the referenced characters.
Furthermore, the following operations may also require 'unsharing' the string data to provide contiguous null-terminated character arrays.
c_str
data
For const strings, iterators and element access operations do not inhibit concurrent accesses, nor do references through their results.
Iterators and references may be invalidated by:
Since only the first 'unsharing' operation following a 'mutating' must make the string not sharable, further reference or iterator creations cannot invalidate anything.
Thus the only real anomaly seems to be that
creating the initial non-const reference
invalidates previously generated const references.
This anomaly has unfortunate consequences.
Typically, when v
offers "container thread safety", we can do
#pragma omp parallel for for( size_t i = 0, n = v.size(); i < n; ++i ) { v[i] = 0; }
However, for a basic_string v, we must insert
v[0];
before the pragma to render the code correct.
Such a subtlety is difficult to teach and difficult to review.
Similarly, we would expect to be able to write s[0]+s[1]
to
add the first two characters of a string. And indeed this is correct if
either s
is a string
or const string
.
However similar examples become incorrect if only one access to s
is as a const string
, and the other access is through
a non-const reference to the same string. In that case, the second
operator[]
invocation may invalidate the first character
reference before the character value can be obtained.
As these examples illustrate, this issue is not completely new with the introduction of threads, but it is aggravated by it.
There are also performance implications to the current design. In a copy-on-write implementation,
const string empty(""); vector<string> v; #pragma omp parallel for for( size_t i = 0, n = v.size(); i < n; ++i ) { v[i] = empty; }
may run very slowly, since each iteration writes to the representation of empty, and thus is likely to generate a conflict cache miss. This issue may be secondary, but it argues that it is really hard to write efficient code if you do not know whether you have a copy-on-write implementation or not.
Any operation that may change the size of a string, or any standard library function that accepts a string by non-const reference, may potentially write to the string representation and thus is not suitable for concurrent operation.
The goal is that any operation that returns an internal reference or pointer can be called concurrently safely, but any write through the reference or pointer is not safe. However, this is not possible with a shared-buffer implementation, such as the classic reference counted variant.
Change c_str
and data
to not invalidate iterators and references.
This change effectively requires null-terminated buffers. [Note: We think the wording is sufficient, but need confirmation.] For implementations that inline critical operations in user code, this change may affect the Application Binary Interface (ABI).
We chose to leave swap
as an invalidating operation
to enable the small string optimization.
In paragraph 4, edit
References, pointers, and iterators referring to the elements of a
basic_string
sequence may be invalidated by the following uses of thatbasic_string
object:
- As an argument to any standard library function taking a reference to non- const
basic_string
as an argument(footnote X)As an argument to non-member functionsswap()
(21.3.8.8),operator>>()
(21.3.8.9), andgetline()
(21.3.8.9).As an argument tobasic_string::swap()
.Callingdata()
andc_str()
member functions.- Calling non-const member functions, except
operator[]
,at
,front
,back
,begin
,rbegin
,end
, andrend
.- Following construction or any of the above uses, except the forms of
insert
anderase
that return iterators, the first call to non-const member functionsoperator[]
,at
,begin
,rbegin
,end
, orrend
.
Insert a footnote:
footnote X : For example as an argument to non-member functions
swap()
(21.3.8.8), operator>>()
(21.3.8.9), and
getline()
(21.3.8.9), or as an argument to
basic_string::swap()
.
Edit paragraphs 1 through 4 as follows.
const charT* c_str() const;
const charT* data() const;
Returns: A pointer to the initial element of an array of length
size() + 1
whose firstsize()
elements equal the corresponding elements of the string controlled by*this
and whose last element is a null character specified bycharT()
.
Requires: The program shall not alter any of the values stored in the array. Nor shall the program treat the returned value as a valid pointer value after any subsequent call to a non-const member function of the classbasic_string
that designates the same object as this.
const charT* data() const;Returns: If
size()
is nonzero, the member returns a pointer to the initial element of an array whose firstsize()
elements equal the corresponding elements of the string controlled by*this
. Ifsize()
is zero, the member returns a non-null pointer that is copyable and can have zero added to it.Requires: The program shall not alter any of the values stored in the character array.
Nor shall the program treat the returned value as a valid pointer value after any subsequent call to a non-const member function ofbasic_string
that designates the same object asthis
.
In addition to the weak proposal, we propose to make all iterator and element access operations safely concurrently executable.
This change disallows copy-on-write implementations. For those implementations using copy-on-write implementations, this change would also change the Application Binary Interface (ABI).
We are increasing the stability of operations even in sequential code.
In paragraph 4, edit
References, pointers, and iterators referring to the elements of a
basic_string
sequence may be invalidated by the following uses of thatbasic_string
object:
- As an argument to any standard library function taking a reference to non- const
basic_string
as an argument(footnote X)As an argument to non-member functionsswap()
(21.3.8.8),operator>>()
(21.3.8.9), andgetline()
(21.3.8.9).As an argument tobasic_string::swap()
.Callingdata()
andc_str()
member functions.- Calling non-const member functions, except
operator[]
,at
,front
,back
,begin
,rbegin
,end
, andrend
.Following construction or any of the above uses, except the forms ofinsert
anderase
that return iterators, the first call to non-const member functionsoperator[]
,at
,begin
,rbegin
,end
, orrend
.
Delete paragraph 5.
[ Note: These rules are formulated to allow, but not require, a reference counted implementation. A reference counted implementation must have the same semantics as a non-reference counted implementation. [ Example:string s1("abc"); string::iterator i = s1.begin(); string s2 = s1; *i = 'a'; // Must modify only s1
—end example ] —end note ]
Insert a footnote:
footnote X : For example as an argument to non-member functions
swap()
(21.3.8.8), operator>>()
(21.3.8.9), and
getline()
(21.3.8.9), or as an argument to
basic_string::swap()
.
Edit paragraphs 1 through 4 as follows.
const charT* c_str() const;
const charT* data() const;
Returns: A pointer to the initial element of an array of length
size() + 1
whose firstsize()
elements equal the corresponding elements of the string controlled by*this
and whose last element is a null character specified bycharT()
.
Requires: The program shall not alter any of the values stored in the array. Nor shall the program treat the returned value as a valid pointer value after any subsequent call to a non-const member function of the classbasic_string
that designates the same object as this.
const charT* data() const;Returns: If
size()
is nonzero, the member returns a pointer to the initial element of an array whose firstsize()
elements equal the corresponding elements of the string controlled by*this
. Ifsize()
is zero, the member returns a non-null pointer that is copyable and can have zero added to it.Requires: The program shall not alter any of the values stored in the character array.
Nor shall the program treat the returned value as a valid pointer value after any subsequent call to a non-const member function ofbasic_string
that designates the same object asthis
.
The largest potential loss in performance due to a switch away from copy-on-write implementations is the increased consumption of memory for applications with very large read-mostly strings. However, we believe that for those applications ropes are a better technical solution, and recommend a rope proposal be considered for inclusion in Library TR2.