| Document number | P3223R0 |
| Date | 2024-04-04 |
| Project | Programming Language C++, Library Evolution Working Group |
| Reply-to | Jonathan Wakely <cxx@kayari.org> |
std::istream::ignore(n, delim) has surprising behaviour
if delim is a char with a negative value.
We can remove the surprise and make code more robust.
A negative delim value passed to std::istream::ignore can never match
any character in the stream's input sequence, because the comparison is done
using traits_type::eq_int_type(rdbuf()->sgetc(), delim) and sgetc() never
returns negative values (except at EOF). sgetc() extracts a character c
from the input sequence and then calls traits_type::to_int_type(c)
to convert it to int_type, which produces a non-negative value.
The std::char_traits<char>::to_int_type(char c) function converts the
character c to a non-negative integer, as if by (int)(unsigned char)c.
This allows the value (int_type)-1 to be reserved for EOF without worrying
about whether char is signed and whether (char)-1 can equal EOF.
But it means that any code dealing with raw char values from the stream must
consistently use to_int_type to convert the (possibly negative) char value
into a non-negative int_type so that all characters are represented in the
same form and can be compared like-for-like.
So if a negative delimiter char can never match, this means that users should
call std::cin.ignore(n, std::char_traits<char>::to_int_type(c)) or
std::cin.ignore(n, (unsigned char)c)
in case char is signed on their platform, and c happens to have the most
significant bit set, i.e., is a negative value. For example, on a typical
x86_64 Linux system where char is signed,
std::cin.ignore(std::numeric_limits<std::streamsize>::max(), '\x80')
will always discard all input up to EOF, even if \x80 is present in the
input.
In generic code that works with streams of either char or wchar_t
you need to use the more verbose to_int_type form rather than casting to
unsigned char, because casting a wchar_t to unsigned char would be wrong.
That's unfortunate, because std::char_traits<wchar_t>::to_int_type doesn't
have the same "cast to unsigned" behaviour, and so passing a wchar_t
directly to std::wistream::ignore without conversion works correctly.
But to work with both char and wchar_t, generic code must assume the worst
and defend against the signed char trap.
Even in non-generic code that only works with char, you still need to
remember that the trap exists, and remember to avoid it.
It's rare that I see anybody get that right for std::isalpha(c) et al,
and I wasn't even aware of the need to do it for ignore until this week.
I suspect that most users are not aware of the need to use to_int_type here,
which means that the very useful ignore function is surprisingly fragile.
It's also quite ugly to have to cast or deal with the traits type directly in
a high-level istream API like ignore, which is not a low-level streambuf
member function. The other unformatted input functions that take a delimiter
(get and getline) take a char_type, so they are agnostic about whether
char is signed or unsigned, and any conversion to int_type is done by
the stream, not expected to be done by the caller.
ignore to handle negative delim valuesWe could modify the spec for basic_istream::ignore so that negative values
(except for -1 which must be reserved for EOF)
are automatically fed through to_int_type to "clean" them, so that they're
in the same domain as the values returned by sgetc().
The GNU C library takes this approach for its <ctype.h> functions,
which are specified to take int, but which have undefined behaviour if the
argument is not an unsigned char value or EOF. So isalpha('\x80') has
undefined behaviour if char is signed. But with Glibc, it works.
Values in the range [-128,-1) are handled as if converted to unsigned char automatically,
so that negative char values don't produce undefined behaviour. Obviously
it's still possible to misuse those functions, e.g., by passing an int value
outside the range of char or unsigned char, e.g., isalpha(1000). But
the apparently simple isalpha(c) for a char value isn't undefined just
because c happens to be a negative value of a signed type.
Taking the same approach for ignore would remove the trap for cases like
cin.ignore(n, '\x80'), making it behave as
cin.ignore(n, std::char_traits<char>::to_int_type('\x80')).
However, it would also "fix" cases like cin.ignore(n, -10) which are less
likely to be correct. There would be no way to tell the difference between
a char with the value -10 and an int with the value -10, but the latter
seems odd, and possibly a bug. Depending how we specified it, this solution
might also give defined behaviour to cin.ignore(n, +1000) and
cin.ignore(n, -9999) which are just nonsense.
The other downside of this solution is that it only fixes negative delimiters
less than or equal to -2, because -1 still has to be reserved to mean EOF.
So on a platform where char is signed, cin.ignore(n, '\xfe') would work,
but cin.ignore(n, '\xff') would not, because that value is
traits_type::eof(). On a platform where char is unsigned, both work.
So this solution removes most surprises, but not all, and doesn't have
portable guarantees. Users should really still use
in.ignore(n, std::char_traits<char>::to_int_type(c)) to work with arbitrary
delimiter characters on arbitrary platforms.
ignore into two functions and change delim to char_typeWe could change delim to char_type and make ignore convert that to
int_type internally by using to_int_type.
basic_istream& ignore(streamsize n = 1, int_type = traits_type::eof());
basic_istream& ignore(streamsize n, char_type delim);
This would preserve the same behaviour for in.ignore(n), i.e., ignore up
to n characters or up to EOF, whichever happens first. But it would break
explicit uses of eof as the second argument, e.g.,
in.ignore(n, traits::eof()). This would implicitly convert the eof()
value to char_type and treat is as a delimiter. If (char)-1 happens to be
present in the next n characters of the input sequence, it would match and
we would stop ignoring too soon.
This would break too much code, and we can do better.
ignore that takes a char_typeWe don't need to change the existing ignore, we can just add an overload
that does the correct conversion to int_type:
basic_istream& ignore(streamsize n, char_type delim)
{ return ignore(n, traits_type::to_int_type(delim)); }
This does exactly what users expect when calling ignore(n, c) with a
char_type argument (even (char)-1), with no alarms and no surprises.
The behaviour is entirely consistent for signed or unsigned char and always
matches the given delim if it occurs in the input sequence.
The downside of this solution is that calls that pass a delimiter that is
neither int_type nor char_type become ambiguous,
e.g. std::cin.ignore(n, 1ULL) is valid today but would become ambiguous
with this new overload. Arguably, that's a good thing.
What is this code trying to do? What if it passes a value that doesn't even
fit in int_type, e.g. numeric_limits<int_type>::max()+1LL?
Maybe it's good for such calls to not compile.
Since the problem only exists for char istreams and not wchar_t,
this new overload could be specified as present only when char_type is
char. That would avoid introducing any ambiguities for std::wistream.
basic_istream& ignore(streamsize n, same_as<char_type> auto delim)
{ return ignore(n, traits_type::to_int_type(delim)); }
This will only be selected by overload resolution when called with an argument
of type char_type. For all other argument types, the existing overload will
be selected and if the argument isn't of type int_type it will implicitly
convert to int_type, exactly as happens today.
This doesn't change the meaning of any existing code, except for calls with
negative char_type values which would start to work as users probably
expected them to all along. The downside is the additional complexity of
using a constrained overload, which would need to be emulated with SFINAE if
vendors wanted to backport this fix to older standards modes.
Personally, I think the non-constrained overload is the best option, and that
the cases which become ambiguous should probably be fixed to clarify what
they're intending to do. Making the conversion to int_type explicit
(and using to_int_type if appropriate) would probably be an improvement.
Boo! Users deserve better. Well, most of them. Some don't.
Modify the class synopsis in [istream.general] as shown:
// [istream.unformatted], unformatted input
streamsize gcount() const;
int_type get();
basic_istream& get(char_type& c);
basic_istream& get(char_type* s, streamsize n);
basic_istream& get(char_type* s, streamsize n, char_type delim);
basic_istream& get(basic_streambuf<char_type, traits>& sb);
basic_istream& get(basic_streambuf<char_type, traits>& sb, char_type delim);
basic_istream& getline(char_type* s, streamsize n);
basic_istream& getline(char_type* s, streamsize n, char_type delim);
basic_istream& ignore(streamsize n = 1, int_type delim = traits::eof());
basic_istream& ignore(streamsize n, char_type delim);
int_type peek();
basic_istream& read(char_type* s, streamsize n);
streamsize readsome(char_type* s, streamsize n);
Modify [istream.unformatted] as shown:
basic_istream& ignore(streamsize n = 1, int_type delim = traits::eof());-25- Effects: Behaves as an unformatted input function (as described above). After constructing a sentry object, extracts characters and discards them. Characters are extracted until any of the following occurs:
(25.1) —
n != numeric_limits<streamsize>::max()([numeric.limits]) andncharacters have been extracted so far ;
(25.2) — end-of-file occurs on the input sequence (in which case the function callssetstate(eofbit), which may throwios_base::failure([iostate.flags]));
(25.3) —traits::eq_int_type(traits::to_int_type(c), delim)for the next available input characterc(in which casecis extracted).[Note 1: The last condition will never occur if
traits::eq_int_type(delim, traits::eof()). — end note]-26- Returns:
*this.
basic_istream& ignore(streamsize n = 1, char_type delim);-?- Effects: Equivalent to:
return ignore(n, traits_type::to_int_type(delim));
Thanks to Iain Sandoe and Ulrich Drepper for inspiring this.
N4971, Working Draft - Programming Languages -- C++, Thomas Köppe, 2023.