<br><br><div class="gmail_quote"><div dir="ltr">On Sun, Apr 28, 2019, 10:01 PM <<a href="mailto:keld@keldix.com">keld@keldix.com</a>> wrote:<br></div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">On Sun, Apr 28, 2019 at 11:04:58AM +0300, Henri Sivonen wrote:<br>
> On Sat, Apr 27, 2019 at 2:15 PM Lyberta <<a href="mailto:lyberta@lyberta.net" target="_blank">lyberta@lyberta.net</a>> wrote:<br>
> > Where is SIMD applicable?<br>
> <br>
> The most common use cases are skipping over ASCII in operations where<br>
> ASCII is neutral and adding leading zeros or removing leading zeros<br>
> when converting between different code unit widths. However, there are<br>
> other operations, not all of them a priori obvious, that can benefit<br>
> from SIMD. For example, I've used SIMD to implement a check for<br>
> whether text is guaranteed-left-to-right or potentially-bidirectional.<br>
> <br>
> > Ranges are a generalization of std::span. Since no major compiler<br>
> > implements them right now, nobody except authors of ranges is properly<br>
> > familiar with them.<br>
> <br>
> If a function takes a ContiguousRange and is called with two different<br>
> concrete argument types in two places of the program, does the binary<br>
> end up with one copy of the function or two copies? That is, do Ranges<br>
> monomorphize per concrete type?<br>
> <br>
> > For transcoding you don't need contiguous memory and<br>
> > with Ranges you can do transcoding straight from and to I/O using<br>
> > InputRange and OutputRange. Not sure how useful in practice, but why<br>
> > prohibit it outright?<br>
> <br>
> For the use case I designed for, the converter wasn't allowed to pull<br>
> from the input stream but instead the I/O subsystem hands the<br>
> converter buffers and the event loop potentially spins between buffers<br>
> arriving. At the very least it would be prudent to allow for designs<br>
> where the conversion is suspended in such a way while the event loop<br>
> spins. I don't know if this means anything for evaluating Ranges.<br>
> <br>
> > From what I know, only 8-bit, 16-bit, and 32-bit byte systems actually<br>
> > support modern C++.<br>
> <br>
> Do systems with 16-bit or 32-bit bytes need to process text, or are<br>
> they used for image/video/audio processing only?<br>
> <br>
> On Sat, Apr 27, 2019 at 3:01 PM Ville Voutilainen<br>
> <<a href="mailto:ville.voutilainen@gmail.com" target="_blank">ville.voutilainen@gmail.com</a>> wrote:<br>
> ><br>
> > On Sat, 27 Apr 2019 at 13:28, Henri Sivonen <<a href="mailto:hsivonen@hsivonen.fi" target="_blank">hsivonen@hsivonen.fi</a>> wrote:<br>
> > > Having types that enforce Unicode validity can be very useful when the<br>
> > > language has good mechanisms for encapsulating the enforcement and for<br>
> > > clearly marking cases where for performance reasons the responsibility<br>
> > > of upholding the invariant is transferred from the type<br>
> > > implementation to the programmer. This kind of thing requires broad<br>
> > > vision and buy-in from the standard library.<br>
> > ><br>
> > > Considering that the committee has recently<br>
> > > * Added std::u8string without UTF-8 validity enforcement<br>
> > > * Added std::optional in such a form that the most ergonomic way of<br>
> > > extracting the value, operator*(), is unchecked<br>
> > > * Added std::span in a form that, relative to gsl::span, removes<br>
> > > safety checks from the most ergonomic way of indexing into the span,<br>
> > > operator[]()<br>
> > > what reason is there to believe that validity-enforcing Unicode types<br>
> > > could make it through the committee?<br>
> ><br>
> > Both std::optional and std::span provide 'safe' ways for extracting<br>
> > and indexing.<br>
> > The fact that the most-ergonomic way of performing those operations is<br>
> > zero-overhead<br>
> > rather than 'safe' should be of no surprise to anyone.<br>
> <br>
> Indeed, I'm saying that the pattern suggests that unchecked-by-default<br>
> is what the committee consistently goes with, so I'm not suggesting<br>
> that anyone be surprised.<br>
> <br>
> > The reason to<br>
> > 'believe' that<br>
> > validity-enforcing Unicode types could make it through the committee depends<br>
> > on the rationale for such types, not on strawman arguments about<br>
> > things completely<br>
> > unrelated to the success of proposals for such types.<br>
> <br>
> The pattern of unchecked-by-default suggests that it's unlikely that<br>
> validity-enforcing Unicode types could gain pervasive buy-in<br>
> throughout the standard library and that the unchecked types could<br>
> fall out of use in practice. Having validity-enforcing Unicode types<br>
> _in addition to_ unchecked Unicode types is considerably less valuable<br>
> and possibly even anti-useful compared to only having<br>
> validity-enforcing types or only having unchecked types.<br>
> <br>
> For example, consider some function taking a view of guaranteed-valid<br>
> UTF-8 and what you have is std::u8string_view that you got from<br>
> somewhere else. That situation does not compose well if you need to<br>
> pass the possibly-invalid view to an API that takes a guaranteed-valid<br>
> view. The value of guaranteed-valid views is lost if you end up doing<br>
> validation in random places instead of UTF-8 validation having been<br>
> consistently pushed to the I/O boundary such that everything inside<br>
> the application uses guaranteed-valid views.<br>
> <br>
> (Being able to emit the error condition branch when iterating over<br>
> UTF-8 by scalar value is not the only benefit of guaranteed-valid<br>
> UTF-8 views. If you can assume UTF-8 to be valid, you can also use<br>
> SIMD in ways that check for the presence of lead bytes in certain<br>
> ranges without having to worry about invalid sequences fooling such<br>
> checks. Either way, if you often end up validating the whole view<br>
> immediately before performing such an operation, the validation<br>
> operation followed by the optimized operation is probably less<br>
> efficient than just performing a single-pass operation that can deal<br>
> with invalid sequences.)<br>
> <br>
> On Sat, Apr 27, 2019 at 3:13 PM Tom Honermann <<a href="mailto:tom@honermann.net" target="_blank">tom@honermann.net</a>> wrote:<br>
> ><br>
> > On 4/27/19 6:28 AM, Henri Sivonen wrote:<br>
> > > I'm happy to see that so far there has not been opposition to the core<br>
> > > point on my write-up: Not adding new features for non-UTF execution<br>
> > > encodings. With that, let's talk about the details.<br>
> ><br>
> > I see no need to take a strong stance against adding such new features.<br>
> > If there is consensus that a feature is useful (at least to some subset<br>
> > of users), implementors are not opposed,<br>
> <br>
> On the flip side are there implementors who have expressed interest in<br>
> implementing _new_ text algorithms that are not in terms of Unicode?<br>
> <br>
> > and the feature won't<br>
> > complicate further language evolution, then I see no reason to be<br>
> > opposed to it.<br>
> <br>
> Text_view as proposed complicates language evolution for the sake of<br>
> non-Unicode numberings of abstract characters by making the "character<br>
> type" abstract.<br>
> <br>
> >There are, and will be for a long time to come, programs<br>
> > that do not require Unicode and that need to operate in non-Unicode<br>
> > environments.<br>
> <br>
> How seriously do such programs need _new_ text processing facilities<br>
> from the standard library?<br>
> <br>
> On Sat, Apr 27, 2019 at 7:43 PM JeanHeyd Meneide<br>
> <<a href="mailto:phdofthehouse@gmail.com" target="_blank">phdofthehouse@gmail.com</a>> wrote:<br>
> > By now, people who are using non-UTF encodings have already rolled their own libraries for it: they can continue to use those libraries. The standard need not promise arbitrary range-based to_lower/to_upper/casefold/etc. based on wchar_t and char_t: those are dead ends.<br>
> <br>
> Indeed.<br>
> <br>
> > I am strongly opposed to ALL encodings taking std::byte as the code unit. This interface means that implementers must now be explicitly concerned with endianness for anything that uses code units wider than 8 bits and is a multiple of 2 (UTF-16 and UTF-32). We work with the natural width and endianness of the machine by using the natural char8_t, char16_t, and char32_t. If someone wants bytes in / bytes out, we should provide encoding-form wrappers that put it in Little Endian or Big Endian on explicit request:<br>
> ><br>
> > encoding_form<utf16, little_endian> ef{}; // a wrapper that makes it so it works on a byte-by-byte basis, with the specified endianness<br>
> <br>
> I think it is a design error to try to accommodate UTF-16 or UTF-32 as<br>
> Unicode Encoding Forms in the same API position as Unicode Encoding<br>
> Schemes and other encodings. Converting to/from byte-oriented I/O or<br>
> narrow execution encoding is a distinct concern from converting<br>
> between Unicode Encoding Forms within the application. Notably, the<br>
> latter operation is less likely to need streaming.<br>
> <br>
> Providing a conversion API for non-UTF wchar_t makes the distinction<br>
> less clear, though. Again, that's the case of z/OS causing abstraction<br>
> obfuscation for everyone else. :-(<br>
> <br>
> On Sat, Apr 27, 2019 at 2:59 PM <<a href="mailto:keld@keldix.com" target="_blank">keld@keldix.com</a>> wrote:<br>
> ><br>
> > well, I am much against leaving the principle of character set neutrality in c++,<br>
> > and I am working to enhance character set features in a pan-character set way<br>
> <br>
> But why? Do you foresee a replacement for Unicode for which<br>
> non-commitment to Unicode needs to be kept alive? What value is there<br>
> from pretending, on principle, that Unicode didn't win with no<br>
> realistic avenue for getting replaced--especially when other<br>
> programming languages, major GUI toolkits, and the Web Platform have<br>
> committed to the model where all text is conceptually (and<br>
> implementation-wise internally) Unicode but may be interchanged in<br>
> legacy _encodings_?<br>
<br>
I believe there are a number of encodings in East Asia that will still be<br>
developed for quite some time.<br>
<br>
Major languages, toolkits, and operating systems are still character set independent.<br>
Some people believe that Unicode has not won,</blockquote></div><div><br></div><div>Some people are wrong.</div><div><br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">and some people are not happy with
the unicode consortium. </blockquote></div><div><br></div><div>Some people will never be happy. Yet it is incredibly unlikely that someone would come up with a character set that is a strict superset of what Unicode offers, and nothing short of that would make a replacement suitable for handling text.</div><div><br></div><div>Operating systems that are encoding independent are mostly a myth at this point; they probably always were. Linux is mostly UTF-8, macOS is Unicode, Windows is slowly getting there, and so on.</div><div><br></div><div>All of that is driven by market forces: users don't tolerate mojibake, and the _only_ way to avoid it is to use Unicode.</div><div><br></div><div>None of this means that C++ would be unable to transcode inputs from all kinds of encodings at the I/O boundary.</div><div><br></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">why abandon a model that still delivers for all?</blockquote></div><div class="gmail_quote"><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">
<br>
keld<br>
<br>
_______________________________________________<br>
SG16 Unicode mailing list<br>
<a href="mailto:Unicode@isocpp.open-std.org" target="_blank">Unicode@isocpp.open-std.org</a><br>
<a href="http://www.open-std.org/mailman/listinfo/unicode" rel="noreferrer" target="_blank">http://www.open-std.org/mailman/listinfo/unicode</a><br>
</blockquote></div>