[SG16-Unicode] SG16 approval for LEWG to review std::filesystem::path_view

Lyberta lyberta at lyberta.net
Fri Jul 5 03:37:00 CEST 2019


Niall Douglas:
> Thing is, can you name a real world situation where reading the byte(s)
> after the end of a path character range would blow up?

Running under valgrind.

> 
> Remember, these are filesystem paths. They don't have the diversity of
> sources that a string_view would have. The chances, for example, of a
> path_view being constructed from a memory mapped region where the tail
> byte is exactly at the end of the mapped region is virtually nil. Any
> reasonably likely generation of path data is going to, at worst, have
> the character after the input be indeterminate, and not a SIGSEGV to
> read. And the standard library can legally do stuff banned in end user
> code, such as reading indeterminate bytes. This restricted kinds of
> input would not be the case for string_view, where wrapping a whole 4Kb
> page into a string_view is an eminently sensible thing to do.

Again, this assumes some explicit anti-UB system. Under normal
circumstances I would assume that the compiler will see the UB and apply
"optimizations" that can blow up completely.
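
To make the concern concrete, here is a minimal sketch of when the byte
after a character range is part of the same storage at all. The helper
byte_after_is_readable and its parameters are hypothetical names made up
for this illustration, and it assumes the caller knows which allocation
the view points into, which a path_view generally would not.

```cpp
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Hypothetical helper, for illustration only. Precondition (an assumption the
// caller must uphold): `v` points into [storage, storage + storage_size).
bool byte_after_is_readable(std::string_view v, const char* storage,
                            std::size_t storage_size) {
    const std::size_t offset = static_cast<std::size_t>(v.data() - storage);
    // The byte after the view is inside the storage only if the view ends
    // strictly before the end of the storage.
    return offset + v.size() < storage_size;
}

// Usage: a heap buffer holding exactly the path bytes, with no terminator.
void example_exact_buffer() {
    std::vector<char> buf{'/', 't', 'm', 'p'};
    std::string_view whole(buf.data(), buf.size());
    std::string_view head(buf.data(), 2);
    assert(!byte_after_is_readable(whole, buf.data(), buf.size()));
    assert(byte_after_is_readable(head, buf.data(), buf.size()));
}
```

For the "whole" view, peeking at the next byte would be an out-of-bounds
read, which is the kind of access valgrind and sanitizers report.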

> 
> And besides, this is a *documentation* thing. If the API documentation
> says "the user must guarantee that the character after is readable",
> then violating that is on the user. We can even add it as a contract
> precondition. I think that's okay, personally. It's in the same category
> as vector::operator[](vector::size()). Just don't do that.

I think any requirement that lives only in documentation and is not
enforced by the type system is a very bad idea.

How hard is it to require users who want maximum performance to #ifdef
for their platform and manually NUL-terminate their path_views? The
remaining 99% of users will rely on current ranges semantics.
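
As a rough sketch of what "enforced by the type system" could look like:
the default overload treats its input as a plain range and copies it
before handing it to a C API, while an explicit opt-in type carries the
NUL-termination guarantee. Everything here (zpath_view, open_path,
c_api_length, route, result) is a hypothetical name invented for this
illustration and not part of any proposal.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <string_view>

// Hypothetical type: a view that is NUL-terminated by construction, because
// it can only be built from std::string, whose c_str() is always terminated.
class zpath_view {
public:
    explicit zpath_view(const std::string& s) noexcept
        : data_(s.c_str()), size_(s.size()) {}
    const char* c_str() const noexcept { return data_; }
    std::size_t size() const noexcept { return size_; }

private:
    const char* data_;
    std::size_t size_;
};

enum class route { copied, direct };
struct result { route how; std::size_t length; };

// Stand-in for a C API that requires a NUL-terminated string.
std::size_t c_api_length(const char* path) { return std::strlen(path); }

// Default: plain range semantics. Reads only [data, data + size) and never
// touches the byte after the range; the copy supplies the terminator.
result open_path(std::string_view p) {
    std::string tmp(p);
    return {route::copied, c_api_length(tmp.c_str())};
}

// Opt-in fast path: no copy, the type guarantees the terminator.
result open_path(zpath_view p) {
    return {route::direct, c_api_length(p.c_str())};
}

// Usage: most callers get the safe default, performance-minded callers opt in.
void example_open_path() {
    std::string s = "/tmp/file";
    assert(open_path(s).how == route::copied);
    assert(open_path(zpath_view(s)).how == route::direct);
    assert(open_path(std::string_view("abcdef", 3)).length == 3);
}
```

The constructor is explicit on purpose, so the fast path is never picked
by accident and passing a std::string stays unambiguous.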

> It's at least a decade or more away. POSIX wants C to implement strings
> properly first. C are still umming and ahhing about the best design for
> built in string objects, and Martin Uecker is working on a formal
> proposal for that.

This is getting slightly off-topic, but in C I would expect this to be
almost perfect (minus some naming):

#include <stddef.h> /* size_t */
#include <uchar.h>  /* char16_t, char32_t; char8_t since C23 */

struct utf8_string
{
	size_t size;
	char8_t* buffer;
};

struct utf16_string
{
	size_t size;
	char16_t* buffer;
};

struct utf32_string
{
	size_t size;
	char32_t* buffer;
};

Of course, that would pollute the std:: namespace in C++. I hope we can
get the C library relocated to somewhere like std::c. That way both C
and C++ can get good names.
