[SG16-Unicode] Performance of C interfaces (was: Re: SG16 meeting summary for August 21st, 2019)

JeanHeyd Meneide phdofthehouse at gmail.com
Mon Sep 2 01:00:23 CEST 2019


On Sun, Sep 1, 2019 at 12:07 PM Steve Downey <sdowney at gmail.com> wrote:
>
> That was, if I recall correctly, about the C standard library interfaces in the Null-terminated multibyte strings section. Basically that the character at a time interfaces are not amenable to vectorization.
>

     Yes. The C interfaces for multi-byte-to-UTFx conversion
(mbrtoc16, etc.) and back currently encode one character at a time,
with a function that is often hidden behind a DLL function call or in
object code. The former prevents anything from being done about it;
the latter is just a prayer that LTO can optimize _so well_ that it
takes your loop over the one-by-one code point conversion functions
and turns the whole thing into a really, really nice loop which
converts things very quickly.
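
     To make the shape of such a loop concrete, here is a minimal
sketch of a caller driving mbrtoc16 one code unit at a time. The
helper name convert_one_by_one and its error convention are
hypothetical, invented for illustration only; the sketch also assumes
that a converted null character consumes exactly one input byte. Every
iteration is a separate call into the C library, which is the part an
optimizer cannot see through when the function lives in a DLL.

```c
#include <stddef.h>
#include <string.h>
#include <uchar.h>
#include <wchar.h>

/* Hypothetical helper (not a standard function): converts `len` bytes of
   multibyte input to UTF-16 with one mbrtoc16 call per output code unit.
   Returns the number of char16_t written, or (size_t)-1 on error. */
size_t convert_one_by_one(const char *src, size_t len,
                          char16_t *dst, size_t dst_cap)
{
    mbstate_t state;
    size_t written = 0;

    memset(&state, 0, sizeof state); /* start in the initial state */

    /* Keep going while input remains, or while a second surrogate is
       still pending inside the conversion state. */
    while (len > 0 || !mbsinit(&state)) {
        char16_t unit;
        size_t rc = mbrtoc16(&unit, src, len, &state);

        if (rc == (size_t)-1 || rc == (size_t)-2)
            return (size_t)-1; /* invalid or truncated sequence */
        if (written == dst_cap)
            return (size_t)-1; /* output buffer too small */

        dst[written++] = unit;

        if (rc == (size_t)-3)
            continue;          /* second surrogate: no input consumed */
        if (rc == 0)
            rc = 1;            /* assumption: a null character is one byte */

        src += rc;
        len -= rc;
    }
    return written;
}
```

     A caller would pass a buffer and its length, for example
convert_one_by_one("hello", 5, out, 8) with char16_t out[8]. How the
input bytes are interpreted depends on the current locale.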

     I have not observed this to ever happen, and I'm working on a
benchmarking suite of various methods of conversion that will help
quantify these results in tangible ways.

     With ptr + length, someone can optimize the resulting call as
much as they like. With null-terminated versions of the function, I am
skeptical the same performance can be achieved without first calling
strlen(), but I have no experience or data to back up that intuition.
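
     The difference between the two interface shapes can be sketched
as follows. All three function names are hypothetical, and the
"conversion" is a deliberately trivial stand-in (widening each byte to
one 16-bit unit, which is only meaningful for ASCII or Latin-1 input)
so that the loop structure is the only thing on display. No
performance claim is made here; the sketch only shows where the
terminator check sits in each version.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical ptr + length interface. The trip count is known before
   the loop starts, so the body has no data-dependent exit. */
void widen_counted(const unsigned char *src, size_t len,
                   uint_least16_t *dst)
{
    for (size_t i = 0; i < len; ++i)
        dst[i] = src[i];
}

/* Hypothetical null-terminated interface. Every iteration must inspect
   the current byte before it knows whether to continue. */
size_t widen_terminated(const char *src, uint_least16_t *dst)
{
    size_t i = 0;
    while (src[i] != '\0') {
        dst[i] = (unsigned char)src[i];
        ++i;
    }
    return i;
}

/* Null-terminated input handled by paying for strlen() up front and
   then reusing the counted loop. */
size_t widen_strlen_first(const char *src, uint_least16_t *dst)
{
    size_t len = strlen(src);
    widen_counted((const unsigned char *)src, len, dst);
    return len;
}
```

     Both null-terminated variants produce the same output; the only
difference is whether the length is discovered during the conversion
loop or in a separate pass beforehand.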

Sincerely,
JeanHeyd Meneide

