<div dir="ltr"><div class="gmail_extra"><div class="gmail_quote">On Wed, May 30, 2018 at 7:36 PM, ThePhD <span dir="ltr"><<a href="mailto:phdofthehouse@gmail.com" target="_blank">phdofthehouse@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div><div><div>I've been bikeshedding and looking at a wide variety of languages and implementations of text. Many still use code units and provide iterators to code units as their default level of abstraction. Others provide code points, and a few newer languages and libraries provide grapheme clusters.<br><br></div>I'm beginning to think `std::text` -- in whatever form it takes -- should include no defaults and instead just pack itself with member functions such as `.codepoints()`, `.graphemes(/*options*/)`, `.words(/*options*/)` and let the user decide at what level they want to be working. I think encoding and normalization should be part of the type name, because those two are very important, but after that we should simply be handing out views and letting people pick whatever abstraction level they want.<br></div></div></div></div></blockquote><div><br></div><div><span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">There is no compelling reason for those things to be members. Boost.Text will have all that functionality (and mostly does already), separated out as free-function algorithms. 
I find that this works quite well.</span><br></div><div><span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline"><br></span></div><div><span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">Moreover, a text type is a sequence of something. Or rather, I can't get much use out of it if it isn't. So, what is it a sequence of? I think that graphemes, code points, and code units are the most essential units of work that one might want to use. 
However, a sequence of exactly *one* of these should be the kind of range that a text type models.</span></div><div><span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline"><br></span></div><div><span style="color:rgb(34,34,34);font-family:arial,sans-serif;font-size:small;font-style:normal;font-variant-ligatures:normal;font-variant-caps:normal;font-weight:400;letter-spacing:normal;text-align:start;text-indent:0px;text-transform:none;white-space:normal;word-spacing:0px;background-color:rgb(255,255,255);text-decoration-style:initial;text-decoration-color:initial;float:none;display:inline">I think most users that work with text will want graphemes as the essential view of text. That may prove to be untrue.</span></div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div><div></div>At the very least, it means that we don't arm the wrong gun for the user by accident / by default. And at some level, the user will have to read through even the synopsis of what each of these functions will bring to them. Some are obvious (words, sentences, etc. text segmentation algorithms), but others might require some thinking (codepoints, graphemes in particular).<br></div></div></div></blockquote><div><br></div><div>Not picking one is a lot worse. One of the many problems with Unicode is that it is too damn complicated for experienced users, much less new ones. We need types with the right default so that people can just pick up a new version of their compiler and get to work without taking a week first to do research. 
To the extent possible, it should "just work."</div><div> </div><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div></div>I can understand that not having defaults could be a frustrating experience to start with, however. I feel like it's justifiable given that there's no 100% right answer. Maybe we can get 80%, but it will only make the 20% more surprising. I feel like having docs / tables that describe the various segmentation algorithms, what they bring to the table, and what the user can get out of them might be more worthwhile.<br><br></div>Is this a viable path?<br></div>
</blockquote></div><br></div><div class="gmail_extra">I don't think it is. Again, the future reactions to Boost.Text will bear that out (or not!).</div><div class="gmail_extra"><br></div><div class="gmail_extra">Zach</div><div class="gmail_extra"><br></div></div>