"The Tao is constantly moving, the path is always changing." ― Lao Tzu
1. Introduction
[P1636] "Formatters for library types" proposed adding a number of
std::formatter specializations, including the one for std::filesystem::path.
However, SG16 recommended removing it because of quoting and localization
concerns. The current paper addresses these concerns and proposes adding an
improved std::formatter specialization for std::filesystem::path.
2. Changes from R0
- Added a reference to [format.string] for the productions fill-and-align and width.
- Replaced range-format-spec with path-format-spec in the Effects clause of the format function.
- Added missing transcoding to the definition of the format function.
3. Problems
[P1636] proposed defining a std::formatter specialization for
std::filesystem::path in terms of the ostream insertion operator which, in
turn, formats the native representation wrapped in std::quoted. For example:
    std::cout << std::format("{}", std::filesystem::path("/usr/bin"));
would print
    "/usr/bin"
with quotes being part of the output.
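The quoting step can be reproduced without std::filesystem. The following is a minimal sketch, assuming the default delimiter and escape character of std::quoted; quote_with_std_quoted is a hypothetical helper name used only for illustration.

```cpp
#include <cassert>
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: formats s the way a stream insertion operator built on
// std::quoted would, using the default delimiter (") and escape character (\).
std::string quote_with_std_quoted(const std::string& s) {
    std::ostringstream os;
    os << std::quoted(s);
    return os.str();
}
```

Calling quote_with_std_quoted on a string wraps it in double quotes and prefixes any embedded double quote or backslash with a backslash. Every other character, including a newline, is copied through unchanged.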
Unfortunately this has a number of problems, some of them raised in the LWG discussion of the paper.
First, std::quoted only escapes the delimiter (") and the escape character
itself (\). As a result, the output may not be usable if the path contains
control characters such as newlines. For example:
    std::cout << std::format("{}", std::filesystem::path("multi\nline"));
would print
    "multi
    line"
which is not a valid string literal in C++ or in many other languages, most importantly shell languages. Such output is pretty much unusable and interferes with the formatting of ranges of paths.
Another problem is encoding. The native member function returns
std::basic_string<value_type> where
    "value_type is a typedef for the operating system dependent encoded character type used to represent pathnames."
value_type is normally char on POSIX and wchar_t on Windows.
This function may perform encoding conversion per [fs.path.type.cvt].
On POSIX, when the target code unit type is char, no conversion is normally
performed:
    "For POSIX-based operating systems path::value_type is char so no conversion from char value type arguments or to char value type return values is performed."
This usually gives the desired result.
On Windows, when the target code unit type is char, the encoding conversion
would result in invalid output. For example, trying to print the following path
in Belarusian
    std::print("{}\n", std::filesystem::path(L"Шчучыншчына"));
would result in the following output in the Windows console even though all code pages and localization settings are set to Belarusian and both the source and literal encodings are UTF-8:
"�����������"
The problem is that, although std::print and std::filesystem::path both
support Unicode, the intermediate conversion goes through CP1251 (the code
page used for Belarusian), which is not even valid for printing in the
console, which uses the legacy CP866 code page.
This has been discussed at length in [P2093] "Formatted output".
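Which of the two cases applies on a given platform can be checked at compile time. The following is a minimal sketch, assuming a C++17 standard library; native_code_unit_name and needs_conversion_to_char are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <filesystem>
#include <string>
#include <type_traits>

// Hypothetical helper: names the native code unit type of
// std::filesystem::path on the platform this is compiled for.
std::string native_code_unit_name() {
    using value_type = std::filesystem::path::value_type;
    if constexpr (std::is_same_v<value_type, wchar_t>) {
        return "wchar_t";
    } else if constexpr (std::is_same_v<value_type, char>) {
        return "char";
    } else {
        return "other";
    }
}

// True when producing char-based output from a path involves a different
// native code unit type, so some encoding conversion has to take place.
constexpr bool needs_conversion_to_char =
    !std::is_same_v<std::filesystem::path::value_type, char>;
```

On a POSIX system native_code_unit_name is expected to return "char" and needs_conversion_to_char to be false; on Windows the expected values are "wchar_t" and true.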
4. Proposal
Both of the problems discussed in the previous section have already been solved. An escaping mechanism that can handle invalid code units was introduced in [P2286] "Formatting Ranges", and encoding issues have been addressed in [P2093] and other papers. We apply those solutions to the formatting of paths.
This paper proposes adding a std::formatter specialization for
std::filesystem::path that does escaping similarly to [P2286] and Unicode
transcoding on Windows.
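Code | Before | After |
---|---|---|
std::cout << std::format("{}", std::filesystem::path("multi\nline")); | "multi [line break] line" | "multi\nline" |
std::print("{}\n", std::filesystem::path(L"Шчучыншчына")); | "�����������" | "Шчучыншчына" |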
This leaves only the question of how to handle invalid Unicode. Plain strings handle it by formatting ill-formed code units as hexadecimal escapes, e.g.
    // invalid UTF-8, s has value: ["\x{c3}("]
    std::string s = std::format("[{:?}]", "\xc3\x28");
This is useful because it doesn’t lose any information. In the case of paths, however, it is a bit more complicated because the string is in a different form, and the mapping from ill-formed code units in one form to the other may not be well-defined.
The current paper proposes applying hexadecimal escapes to the original ill-formed data because it gives a more intuitive result and doesn’t require non-standard mappings such as WTF-8 ([WTF]).
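One way to picture this rule is a single pass over the native UTF-16 code units that transcodes well-formed input to UTF-8 and escapes everything else in place. The following is a minimal sketch, not the proposed wording or the {fmt} implementation. The names escape_utf16_path, append_utf8 and append_hex_escape are hypothetical, and the sketch assumes only a small subset of the escaping rules in [format.string.escaped]: the named escapes, ASCII control characters and lone surrogates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>
#include <string_view>

// Appends the UTF-8 encoding of a Unicode scalar value to out.
void append_utf8(std::string& out, std::uint32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Appends a hexadecimal escape such as \u{d800} to out.
void append_hex_escape(std::string& out, std::uint32_t value) {
    char buf[16];
    std::snprintf(buf, sizeof buf, "\\u{%x}", static_cast<unsigned>(value));
    out += buf;
}

// Hypothetical helper: quotes, escapes and transcodes a UTF-16 string.
// Escapes are computed from the original code units, before transcoding.
std::string escape_utf16_path(std::u16string_view s) {
    std::string out = "\"";
    for (std::size_t i = 0; i < s.size(); ++i) {
        std::uint32_t u = s[i];
        bool is_high = u >= 0xD800 && u <= 0xDBFF;
        if (is_high && i + 1 < s.size() &&
            s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            // Well-formed surrogate pair: decode it, then transcode.
            std::uint32_t low = s[++i];
            append_utf8(out, 0x10000 + ((u - 0xD800) << 10) + (low - 0xDC00));
            continue;
        }
        if (u >= 0xD800 && u <= 0xDFFF) {
            // Lone surrogate: escape the original ill-formed code unit.
            append_hex_escape(out, u);
            continue;
        }
        switch (u) {
            case '\t': out += "\\t"; break;
            case '\n': out += "\\n"; break;
            case '\r': out += "\\r"; break;
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            default:
                if (u < 0x20 || u == 0x7F) append_hex_escape(out, u);
                else append_utf8(out, u);
        }
    }
    out += '"';
    return out;
}
```

Because the escape is produced from the UTF-16 code unit itself, no mapping of ill-formed UTF-16 into an intermediate 8-bit form is needed.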
For example:
    auto p = std::filesystem::path(L"\xd800"); // a lone surrogate
    std::print("{}\n", p);
prints
    "\u{d800}"
5. Wording
Add to "Header <filesystem> synopsis" [fs.filesystem.syn]:
    // [fs.path.fmt], formatter
    template<class charT> struct formatter<filesystem::path, charT>;
Add a new section "Formatting" [fs.path.fmt] under "Class path" [fs.class.path]:
    template<class charT> struct formatter<filesystem::path, charT> {
      constexpr format_parse_context::iterator parse(format_parse_context& ctx);

      template<class FormatContext>
        typename FormatContext::iterator
          format(const filesystem::path& path, FormatContext& ctx) const;
    };

    constexpr format_parse_context::iterator parse(format_parse_context& ctx);
Effects: Parses the format specifier as a path-format-spec and stores the
parsed specifiers in *this.
path-format-spec:
    fill-and-align_opt width_opt
where the productions fill-and-align and width are described in [format.string].
Returns: An iterator past the end of the path-format-spec.
    template<class FormatContext>
      typename FormatContext::iterator
        format(const filesystem::path& p, FormatContext& ctx) const;
Effects: Writes escaped ([format.string.escaped]) p.native() into ctx.out(),
adjusted according to the path-format-spec.
If charT is char, path::value_type is wchar_t, and the literal encoding
is UTF-8, then the escaped path is transcoded from the native encoding for wide
character strings to UTF-8. Otherwise, transcoding is implementation-defined.
Returns: An iterator past the end of the output range.
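As a non-normative illustration of the path-format-spec grammar, the sketch below parses an optional fill-and-align followed by an optional width, then pads an already-escaped string accordingly. It is a simplification under stated assumptions: the fill is a single char, the width is a plain decimal number counted in code units rather than display columns, nested replacement fields are not handled, and the default alignment is left. The names path_format_spec, parse_path_format_spec and apply_path_format_spec are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Hypothetical in-memory form of a parsed path-format-spec.
struct path_format_spec {
    char fill = ' ';
    char align = '<';       // one of '<', '>', '^'
    std::size_t width = 0;  // 0 means no minimum width
};

constexpr bool is_align(char c) { return c == '<' || c == '>' || c == '^'; }

// Parses "fill-and-align_opt width_opt" from the text between ':' and '}'.
path_format_spec parse_path_format_spec(std::string_view spec) {
    path_format_spec result;
    std::size_t i = 0;
    if (spec.size() >= 2 && is_align(spec[1])) {
        // A fill character followed by an alignment character.
        result.fill = spec[0];
        result.align = spec[1];
        i = 2;
    } else if (!spec.empty() && is_align(spec[0])) {
        // An alignment character with the default fill.
        result.align = spec[0];
        i = 1;
    }
    for (; i < spec.size() && spec[i] >= '0' && spec[i] <= '9'; ++i) {
        result.width = result.width * 10 + static_cast<std::size_t>(spec[i] - '0');
    }
    return result;
}

// Pads already-escaped text to the requested minimum width.
std::string apply_path_format_spec(std::string_view escaped,
                                   const path_format_spec& spec) {
    if (escaped.size() >= spec.width) return std::string(escaped);
    std::size_t padding = spec.width - escaped.size();
    std::size_t left = 0;
    if (spec.align == '>') left = padding;
    else if (spec.align == '^') left = padding / 2;
    std::string out(left, spec.fill);
    out += escaped;
    out.append(padding - left, spec.fill);
    return out;
}
```

With these helpers, the spec "*>6" applied to an escaped path that is three code units long adds three asterisks on the left, while an empty spec leaves the text unchanged.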
6. Implementation
The proposed formatter for std::filesystem::path has been implemented in
{fmt} ([FMT]).
7. Acknowledgements
Thanks to Mark de Wever for reviewing an early version of the paper and suggesting a number of fixes and improvements.