Document number | P3223R1 |
Date | 2024-07-03 |
Project | Programming Language C++, Library Evolution Working Group |
Reply-to | Jonathan Wakely <cxx@kayari.org> |
Changes in R1:
to_int_type details.
std::istream not std::basic_istream.
std::istream::ignore(n, delim) has surprising behaviour if delim is a char
with a negative value.
We can remove the surprise and make code more robust.
Passing a char to std::istream::ignore as the delimiter is a bug.
The delimiter argument should be an int_type value, so correct code
must do is.ignore(n, std::char_traits<char>::to_int_type(delim)) instead.
There is no guarantee that an implicit conversion from a char value to
int_type yields the same value as calling to_int_type. On platforms where
char is signed, if the value of EOF fits in char then there is always at
least one char value (the one equal to EOF) that cannot be used as a
delimiter to std::istream::ignore, because c == to_int_type(c) is false.
For example, on a platform where char is signed, EOF equals -1, and the
literal encoding is ISO-8859-1 or Windows-1252, the call is.ignore(n, 'ÿ')
will never match the delimiter, because 'ÿ' == EOF is true.
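The mismatch can be checked directly. The following is a minimal sketch, assuming the common case where to_int_type(c) behaves like (int)(unsigned char)c; the helper names implicit_delim, converted_delim and looks_like_eof are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <string>   // std::char_traits

using traits = std::char_traits<char>;

// The value that reaches ignore() when a char is passed directly
// (an ordinary implicit conversion from char to int_type).
inline traits::int_type implicit_delim(char c) { return c; }

// The value the stream compares against for the same character.
inline traits::int_type converted_delim(char c) { return traits::to_int_type(c); }

// True if passing c directly would be indistinguishable from eof(),
// i.e. would mean "no delimiter at all".
inline bool looks_like_eof(char c)
{
    // Assumption: a converted character never compares equal to eof().
    assert(!traits::eq_int_type(converted_delim(c), traits::eof()));
    return traits::eq_int_type(implicit_delim(c), traits::eof());
}
```

Where char is signed and EOF is -1, looks_like_eof reports true for the character with all bits set, and implicit_delim and converted_delim disagree for every negative char value.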
In theory, the to_int_type mapping could be ~c or c + 256, where the
former means that most char values will match unintended characters, and the
latter means that no char value passed directly to ignore will ever match.
In practice, real implementations use a more straightforward mapping, such
that calling to_int_type(c) is equivalent to (int)(unsigned char)c and
iostream functions that take an int_type argument expect values between -1
and 255.
This matches how the C functions like isalpha(int) and toupper(int) work.
So for all known implementations, an implicit conversion from char to
int_type is incorrect for any negative char value.
This section assumes an implementation where to_int_type is equivalent to
casting to unsigned char, but even for a hypothetical implementation where
that isn't true, is.ignore(n, ch) is still wrong in general, as shown above.
A delim value passed to std::istream::ignore is matched using
traits_type::eq_int_type(rdbuf()->sgetc(), delim). The sgetc() function
never returns negative values (except at EOF). It extracts a character c
from the input sequence and then calls traits_type::to_int_type(c) to
convert it to int_type, which produces a non-negative value. So if delim
is negative, then the eq_int_type comparison always fails.
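A small sketch of what sgetc() reports for a character with the most significant bit set, assuming an implementation where to_int_type is equivalent to casting to unsigned char. The function name first_sgetc is hypothetical.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Returns the value sgetc() reports for the first character of `input`.
inline int first_sgetc(const std::string& input)
{
    assert(!input.empty());
    std::stringbuf buf(input);
    return buf.sgetc();   // already converted with traits_type::to_int_type
}
```

For a string whose first character is '\x80', this is expected to return 128 whether or not char is signed, so it can never compare equal to a negative delim.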
The std::char_traits<char>::to_int_type(char c) function converts the
character c to a non-negative integer, as if by (int)(unsigned char)c.
This allows the value (int_type)-1 to be reserved for EOF without worrying
about whether char is signed and whether (char)-1 can equal EOF.
But it means that any code dealing with raw char values from the stream must
consistently use to_int_type to convert the (possibly negative) char value
into a non-negative int_type so that all characters are represented in the
same form and can be compared like-for-like.
So if a negative delimiter char can never match, this means that users should
call std::cin.ignore(n, std::char_traits<char>::to_int_type(c)) or
std::cin.ignore(n, (unsigned char)c) in case char is signed on their
platform, and c happens to have the most significant bit set, i.e., is a
negative value. For example, on a typical x86_64 Linux system where char
is signed,
std::cin.ignore(std::numeric_limits<std::streamsize>::max(), '\x80')
will always discard all input up to EOF, even if \x80 is present in the
input.
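The difference between the two call forms can be reproduced with a string stream. This is a sketch only: rest_after_ignore is a hypothetical helper, and the unconverted branch uses an explicit static_cast<int> to reproduce the value that the implicit conversion yields, so that the result does not depend on which ignore overloads a given library provides.

```cpp
#include <cassert>
#include <limits>
#include <sstream>
#include <string>

// Skips everything up to and including `delim`, then returns what is left.
// `convert` selects between the correct and the fragile way to pass delim.
inline std::string rest_after_ignore(const std::string& input, char delim, bool convert)
{
    std::istringstream in(input);
    assert(in.good());
    constexpr auto all = std::numeric_limits<std::streamsize>::max();
    if (convert)
        in.ignore(all, std::char_traits<char>::to_int_type(delim));
    else
        in.ignore(all, static_cast<int>(delim)); // same value as the implicit conversion
    std::string rest;
    std::getline(in, rest);   // stays empty if ignore() already hit EOF
    return rest;
}
```

With input "ab\x80tail" and delim '\x80', the converted call leaves "tail". Where char is signed, the unconverted call consumes the whole input and leaves nothing.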
In generic code that works with streams of either char or wchar_t
you need to use the more verbose to_int_type form rather than casting to
unsigned char, because casting a wchar_t to unsigned char would be wrong.
That's unfortunate, because std::char_traits<wchar_t>::to_int_type doesn't
have the same "cast to unsigned" behaviour, and so passing a wchar_t
directly to std::wistream::ignore without conversion works correctly.
But to work with both char and wchar_t, generic code must assume the worst
and defend against the signed char trap.
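One way generic code might defend against the trap today is a small wrapper that always goes through the stream's traits type. The name ignore_until is hypothetical and is not part of any library or of this proposal.

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: discard up to n characters, stopping after delim.
// The conversion to int_type is done here, not by the caller.
template <class CharT, class Traits>
std::basic_istream<CharT, Traits>&
ignore_until(std::basic_istream<CharT, Traits>& in, std::streamsize n, CharT delim)
{
    return in.ignore(n, Traits::to_int_type(delim));
}
```

The same call, ignore_until(in, n, c), then works for std::istream with a possibly negative char and for std::wistream with a wchar_t.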
Even in non-generic code that only works with char, you still need to
remember that the trap exists, and remember to avoid it.
It's rare that I see anybody get that right for std::isalpha(c) et al,
and I wasn't even aware of the need to do it for ignore until this week.
I suspect that most users are not aware of the need to use to_int_type here,
which means that the ignore function is surprisingly fragile.
It's also quite ugly to have to cast or deal with the traits type directly in
a high-level istream API like ignore, which is not a low-level streambuf
member function. Several basic_streambuf members use int_type for arguments
and return values, but at the basic_istream level the other unformatted input
functions that take a delimiter (get and getline) take a char_type.
That means they are agnostic to whether char is signed or unsigned,
and any conversion to int_type is done by the stream, not expected to be
done by the caller. Arguments of type int_type
ignore to handle negative delim values
We could modify the spec for basic_istream::ignore so that negative values
(except for -1 which must be reserved for EOF)
are automatically fed through to_int_type to "clean" them, so that they're
in the same domain as the values returned by sgetc().
The GNU C library takes this approach for its <ctype.h> functions,
which are specified to take int, but which have undefined behaviour if the
argument is not an unsigned char value or EOF. So isalpha('\x80') has
undefined behaviour if char is signed. But with Glibc, it works.
Values in the range [-128,-1) are handled as if converted to unsigned char
automatically, so that negative char values don't produce undefined
behaviour. Obviously it's still possible to misuse those functions, e.g., by
passing an int value outside the range of char or unsigned char, e.g.,
isalpha(1000). But the apparently simple isalpha(c) for a char value isn't
undefined just because c happens to be a negative value of a signed type.
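Code that has to be portable beyond Glibc cannot rely on that leniency. A minimal sketch of the usual defensive wrapper follows; the name is_alpha is hypothetical.

```cpp
#include <cassert>
#include <cctype>

// Hypothetical wrapper: well-defined for every char value, signed or not,
// because the argument is first mapped into the unsigned char range.
inline bool is_alpha(char c)
{
    return std::isalpha(static_cast<unsigned char>(c)) != 0;
}
```

This is the same "convert before calling" burden that ignore currently places on its callers.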
Taking the same approach for ignore would remove the trap for cases like
cin.ignore(n, '\x80'), making it behave as
cin.ignore(n, std::char_traits<char>::to_int_type('\x80')).
However, it would also "fix" cases like cin.ignore(n, -10) which are less
likely to be correct. There would be no way to tell the difference between
a char with the value -10 and an int with the value -10, but the latter
seems odd, and possibly a bug. Depending on how we specified it, this
solution might also give defined behaviour to cin.ignore(n, +1000) and
cin.ignore(n, -9999) which are just nonsense.
The other downside of this solution is that it only fixes negative delimiters
less than or equal to -2, because -1 still has to be reserved to mean EOF.
So on a platform where char is signed, cin.ignore(n, '\xfe') would work,
but cin.ignore(n, '\xff') would not, because that value is
traits_type::eof(). On a platform where char is unsigned, both work.
So this solution removes most surprises, but not all, and doesn't have
portable guarantees. Users should really still use
in.ignore(n, std::char_traits<char>::to_int_type(c)) to work with arbitrary
delimiter characters on arbitrary platforms.
ignore into two functions and change delim to char_type
We could change delim to char_type and make ignore convert that to
int_type internally by using to_int_type.
basic_istream& ignore(streamsize n = 1, int_type = traits_type::eof());
basic_istream& ignore(streamsize n, char_type delim);
This would preserve the same behaviour for in.ignore(n), i.e., ignore up
to n characters or up to EOF, whichever happens first. But it would break
explicit uses of eof as the second argument, e.g.,
in.ignore(n, traits::eof()). This would implicitly convert the eof()
value to char_type and treat it as a delimiter. If (char)-1 happens to be
present in the next n characters of the input sequence, it would match and
we would stop ignoring too soon.
This would break too much code, and we can do better.
ignore that takes a char_type
We don't need to change the existing ignore, we can just add an overload
that does the correct conversion to int_type:
basic_istream& ignore(streamsize n, char_type delim)
{ return ignore(n, traits_type::to_int_type(delim)); }
This does exactly what users expect when calling ignore(n, c) with a
char_type argument (even (char)-1), with no alarms and no surprises.
The behaviour is entirely consistent for signed or unsigned char and always
matches the given delim if it occurs in the input sequence.
The downside of this solution is that calls that pass a delimiter that is
neither int_type nor char_type become ambiguous,
e.g. std::cin.ignore(n, 1ULL) is valid today but would become ambiguous
with this new overload. Arguably, that's a good thing.
What is this code trying to do? What if it passes a value that doesn't even
fit in int_type, e.g. numeric_limits<int_type>::max()+1LL?
Maybe it's good for such calls to not compile.
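Which argument types would become ambiguous can be explored at compile time with a stand-in class. This is a sketch in C++17: mock_stream and can_ignore are hypothetical names, and mock_stream only mimics the two signatures for char streams; it is not std::istream.

```cpp
#include <ios>          // std::streamsize
#include <type_traits>
#include <utility>      // std::declval

// Hypothetical stand-in with the existing and the additional overload.
struct mock_stream
{
    void ignore(std::streamsize, int) {}    // existing int_type overload
    void ignore(std::streamsize, char) {}   // additional char_type overload
};

// Detects whether ignore(n, Arg) resolves to exactly one overload.
template <class Arg, class = void>
struct can_ignore : std::false_type {};

template <class Arg>
struct can_ignore<Arg, std::void_t<decltype(
    std::declval<mock_stream&>().ignore(std::streamsize(1), std::declval<Arg>()))>>
    : std::true_type {};

static_assert(can_ignore<char>::value, "exact match for the char overload");
static_assert(can_ignore<int>::value, "exact match for the int overload");
static_assert(can_ignore<unsigned char>::value, "promotes to int");
static_assert(!can_ignore<unsigned long long>::value, "two equally good conversions");
```

Types that promote to int keep working, while types that need an integral conversion to reach either parameter type no longer resolve.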
Since the problem only exists for char istreams and not wchar_t,
this new overload could be specified as present only when char_type is
char. That would avoid introducing any ambiguities for std::wistream.
basic_istream& ignore(streamsize n, same_as<char_type> auto delim)
{ return ignore(n, traits_type::to_int_type(delim)); }
This will only be selected by overload resolution when called with an argument
of type char_type. For all other argument types, the existing overload will
be selected and if the argument isn't of type int_type it will implicitly
convert to int_type, exactly as happens today.
This doesn't change the meaning of any existing code, except for calls with
negative char_type values which would start to work as users probably
expected them to all along. The downside is the additional complexity of
using a constrained overload, which would need to be emulated with SFINAE if
vendors wanted to backport this fix to older standards modes.
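The SFINAE emulation might look roughly like the following sketch. It uses a hypothetical stand-in class (mock_stream17) with char as the character type and a hypothetical enum (picked) to report which overload was chosen; a real library would apply the same constraint inside basic_istream.

```cpp
#include <cassert>
#include <ios>          // std::streamsize
#include <type_traits>

enum class picked { int_type, char_type };

// Hypothetical stand-in for a char stream; not std::istream.
struct mock_stream17
{
    // Existing overload: anything that is not exactly char lands here.
    picked ignore(std::streamsize, int) { return picked::int_type; }

    // Pre-C++20 emulation of a same_as<char_type> constraint:
    // participates in overload resolution only when C is exactly char.
    template <class C, std::enable_if_t<std::is_same<C, char>::value, int> = 0>
    picked ignore(std::streamsize, C) { return picked::char_type; }
};
```

A call with a char argument selects the template as an exact match, and every other argument type converts to int exactly as before, so a call such as ignore(n, 1ULL) stays valid.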
Personally, I think the non-constrained overload is the best option, and that
the cases which become ambiguous should probably be fixed to clarify what
they're intending to do. Making the conversion to int_type explicit
(and using to_int_type if appropriate) would probably be an improvement.
Boo! Users deserve better. Well, most of them. Some don't.
The edits are shown relative to N4981.
Modify the class synopsis in [istream.general] as shown:
// [istream.unformatted], unformatted input
streamsize gcount() const;
int_type get();
basic_istream& get(char_type& c);
basic_istream& get(char_type* s, streamsize n);
basic_istream& get(char_type* s, streamsize n, char_type delim);
basic_istream& get(basic_streambuf<char_type, traits>& sb);
basic_istream& get(basic_streambuf<char_type, traits>& sb, char_type delim);
basic_istream& getline(char_type* s, streamsize n);
basic_istream& getline(char_type* s, streamsize n, char_type delim);
basic_istream& ignore(streamsize n = 1, int_type delim = traits::eof());
basic_istream& ignore(streamsize n, char_type delim);
int_type peek();
basic_istream& read(char_type* s, streamsize n);
streamsize readsome(char_type* s, streamsize n);
Modify [istream.unformatted] as shown:
basic_istream& ignore(streamsize n = 1, int_type delim = traits::eof());
-25- Effects: Behaves as an unformatted input function (as described above). After constructing a sentry object, extracts characters and discards them. Characters are extracted until any of the following occurs:
(25.1) — n != numeric_limits<streamsize>::max() ([numeric.limits]) and n characters have been extracted so far
(25.2) — end-of-file occurs on the input sequence (in which case the function calls setstate(eofbit), which may throw ios_base::failure ([iostate.flags]));
(25.3) — traits::eq_int_type(traits::to_int_type(c), delim) for the next available input character c (in which case c is extracted).
[Note 1: The last condition will never occur if traits::eq_int_type(delim, traits::eof()). — end note]
-26- Returns: *this.
basic_istream& ignore(streamsize n, char_type delim);
-?- Constraints: is_same_v<char_type, char> is true.
-?- Effects: Equivalent to:
return ignore(n, traits_type::to_int_type(delim));
Thanks to Iain Sandoe and Ulrich Drepper for inspiring this. Thanks to Tom Honermann for getting me to state the problem in the general case.
N4981, Working Draft - Programming Languages -- C++, Thomas Köppe, 2024.