[SG16-Unicode] SG16 approval for LEWG to review std::filesystem::path_view
Tom Honermann
tom at honermann.net
Sat Jul 6 23:59:17 CEST 2019
On 7/4/19 1:01 PM, Niall Douglas wrote:
> On 04/07/2019 17:23, Lyberta wrote:
> Thing is, can you name a real world situation where reading the byte(s)
> after the end of a path character range would blow up?
>
> Remember, these are filesystem paths. They don't have the diversity of
> sources that a string_view would have. The chances, for example, of a
> path_view being constructed from a memory mapped region where the tail
> byte is exactly at the end of the mapped region is virtually nil. Any
> reasonably likely generation of path data is going to, at worst, have
> the character after the input be indeterminate, and not a SIGSEGV to
> read. And the standard library can legally do stuff banned in end user
> code, such as reading indeterminate bytes. This restriction on the kinds
> of input would not hold for string_view, where wrapping a whole 4 KiB
> page in a string_view is an eminently sensible thing to do.
Unfortunately, we don't have the means to audit the worldwide code base to
determine what programmers do and don't do.
I don't find it at all unlikely that string_view instances will be
implicitly constructed from temporaries of string types that don't
provide a null terminator.
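As a minimal sketch of the concern (the names to_null_terminated and
demo_unterminated_view are hypothetical, made up for illustration and not
part of any proposal), consider a view over a buffer that has no
terminator. A consumer that needs a null-terminated string cannot safely
peek at the character after the view; copying is the only portable way to
obtain a terminator:

```cpp
#include <cassert>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper: build a null-terminated copy of the viewed characters
// without ever reading sv.data()[sv.size()].
std::string to_null_terminated(std::string_view sv) {
    // std::string copies exactly sv.size() characters and supplies its own '\0'.
    return std::string(sv);
}

// Hypothetical demo: a view over storage that holds no terminator.
std::string demo_unterminated_view() {
    // Six path characters and nothing after them that the program may read.
    std::vector<char> raw = {'/', 't', 'm', 'p', '/', 'a'};
    std::string_view sv(raw.data(), raw.size());

    // Reading sv.data()[sv.size()] here would index past the vector's
    // elements, so it is not done; the copy below is well defined.
    std::string z = to_null_terminated(sv);
    assert(z.c_str()[z.size()] == '\0');
    return z;
}
```

The copy costs an allocation for longer strings, which is presumably the
cost that peeking at the following character is meant to avoid; the sketch
only shows that the view itself carries no guarantee about that character.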
> And besides, this is a *documentation* thing. If the API documentation
> says "the user must guarantee that the character after is readable",
> then violating that is on the user. We can even add it as a contract
> precondition. I think that's okay, personally. It's in the same category
> as vector::operator[](vector::size()). Just don't do that.
I don't think this is ok as it is inconsistent with user expectations.
I'm sensing a contradiction here as well. You have been advocating for
omitting a char based interface because programmers sometimes use it
incorrectly, but here you are claiming that documenting "don't do that"
is sufficient.
> It's at least a decade or more away. POSIX wants C to implement strings
> properly first. C are still umming and ahhing about the best design for
> built in string objects, and Martin Uecker is working on a formal
> proposal for that.
>
> They're actually very much currently stuck on whether built-in C strings
> ought to be bags of chars, or always in UTF-8. Committee is split right
> down the middle on that. I suggested to them that they first formalise
> dynamic array objects, then build a UTF-8 string object on top, then
> everybody is happy. I pointed them at Zach's C++ UTF-8 string library
> for study.
I follow WG14 loosely, but haven't seen any proposals in this area so
far. Is there any existing work you can point me towards?
Tom.