"The Tao is constantly moving, the path is always changing." ― Lao Tzu
1. Introduction
[P1636] "Formatters for library types" proposed adding a number of
`std::formatter` specializations, including the one for
`std::filesystem::path`.
However, SG16 recommended removing it because of quoting and localization
concerns. The current paper addresses these concerns and proposes adding an
improved `std::formatter` specialization for `std::filesystem::path`.
2. Changes from R6
- Added SG16 poll results for R6.
3. Changes from R5
- Added generic format support per LWG feedback.
4. Changes from R4
- Replaced "invalid code units" with a more specific "maximal subparts of ill-formed subsequences" per LEWG feedback.
- Added LEWG poll results for R4.
5. Changes from R3
- Added SG16 poll results.
6. Changes from R2
- Added missing `:?` to the escaping example in Proposal.
- Changed the wording around the escaping example to not mention hexadecimal escapes since Unicode escapes may be produced as well.
7. Changes from R1
- Provided control over escaping via format specifiers per SG16 feedback.
8. Changes from R0
- Added a reference to [format.string] for the productions fill-and-align and width.
- Replaced range-format-spec with path-format-spec in the Effects clause of the `format` function.
- Added missing transcoding to the definition of the `format` function.
9. SG16 Poll Results (R6)
POLL: Forward P2845R6 to LEWG.
Outcome: Unanimous consent to forward.
10. LEWG Poll Results (R4)
POLL: Forward P2845R4 (Formatting of std::filesystem::path) with modified wording for Effects to use the term "replacement of a maximal subpart" to LWG for C++26 to be confirmed with a Library Evolution electronic poll.
SF | F | N | A | SA |
---|---|---|---|---|
11 | 9 | 0 | 0 | 0 |
Outcome: Unanimous consent to forward.
11. SG16 Poll Results (R2)
POLL: Forward P2845R2, Formatting of `std::filesystem::path`, to LEWG with a
recommended target of C++26.
SF | F | N | A | SA |
---|---|---|---|---|
5 | 2 | 1 | 0 | 0 |
Outcome: Strong consensus.
(The poll states P2845R2, but the revision of the paper that was reviewed was a draft of P2845R3 that addressed some minor issues.)
12. Problems
[P1636] proposed defining a `std::formatter` specialization for
`std::filesystem::path` in terms of the `ostream` insertion operator which, in
turn, formats the native representation wrapped in `std::quoted`. For example:
std::cout << std::format("{}", std::filesystem::path("/usr/bin"));
would print `"/usr/bin"` with quotes being part of the output.
Unfortunately this has a number of problems, some of them raised in the LWG discussion of the paper.
First, `std::quoted` only escapes the delimiter (`"`) and the escape character
itself (`\`). As a result the output may not be usable if the path
contains control characters such as newlines. For example:
std::cout << std::format("{}", std::filesystem::path("multi\nline"));
would print
"multi
line"
which is not a valid string in C++ and many other languages, most importantly shell languages. Such output is practically unusable and interferes with the formatting of ranges of paths.
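The limitation can be reproduced without a formatter at all. The following is a minimal sketch, assuming a hypothetical helper `quote_like_ostream` (not part of any proposal) that applies `std::quoted` to a plain `std::string`, the same manipulator the stream insertion operator for paths relies on:

```cpp
#include <iomanip>
#include <sstream>
#include <string>

// Hypothetical helper: wraps s with std::quoted, which escapes only the
// delimiter (") and the escape character (\) and nothing else.
std::string quote_like_ostream(const std::string& s) {
  std::ostringstream os;
  os << std::quoted(s);
  return os.str();
}
```

Calling `quote_like_ostream("multi\nline")` returns a string that still contains the raw newline between the quotes, whereas a `"` or `\` in the input is preceded by a backslash.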
Another problem is encoding. The `native` member function returns
`basic_string<value_type>` where

"`value_type` is a `typedef` for the operating system dependent encoded character type used to represent pathnames."

`value_type` is normally `char` on POSIX and `wchar_t` on Windows.
This function may perform encoding conversion per [fs.path.type.cvt].
On POSIX, when the target code unit type is `char` no conversion is normally
performed:

"For POSIX-based operating systems `path::value_type` is `char` so no conversion from `char` value type arguments or to `char` value type return values is performed."

This usually gives the desired result.
On Windows, when the target code unit type is `char` the encoding conversion
would result in invalid output. For example, trying to print the following path
in Belarusian
std::print("{}\n", std::filesystem::path(L"Шчучыншчына"));
would result in the following output in the Windows console even though all code pages and localization settings are set to Belarusian and both the source and literal encodings are UTF-8:
"�����������"
The problem is that although `std::print` and `std::filesystem::path` both
support Unicode, the intermediate conversion goes through CP1251 (the code page
used for Belarusian), which is not even valid for printing in the console, which
uses legacy CP866.
This has been discussed at length in [P2093] "Formatted output".
13. Proposal
Both of the problems discussed in the previous section have already been solved. The escaping mechanism that can handle invalid code units has been introduced in [P2286] "Formatting Ranges" and encoding issues have been addressed in [P2093] and other papers. We apply those solutions to the formatting of paths.
This paper proposes adding a `std::formatter` specialization for
`std::filesystem::path` that does escaping similarly to [P2286] and Unicode
transcoding on Windows.
Additionally, it proposes giving the user control over escaping via format
specifiers. The debug format (`?`) gives the escaped representation while the
default is unescaped and minimally processed with only invalid code units
substituted with replacement characters if necessary. This is consistent with
formatting of strings. The default format can be useful for displaying paths in
a UI and gives the user control over whether and how to handle special characters.
The debug format is useful for displaying paths as parts of a larger structure
such as a range and prevents interfering with its formatting.
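Code | P1636 | This proposal |
---|---|---|
`format("{}", path("/usr/bin"))` | "/usr/bin" | /usr/bin |
`format("{}", path("multi\nline"))` | "multi⏎line" | multi⏎line |
`format("{:?}", path("multi\nline"))` | ill-formed | "multi\nline" |
`format("{}", path(L"Шчучыншчына"))` | "�����������" | Шчучыншчына |
(⏎ marks a literal line break in the output.)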
This leaves only one question: how to handle invalid Unicode. Plain strings handle it by formatting ill-formed code units as hexadecimal escapes, e.g.
// invalid UTF-8, s has value: ["\x{c3}("]
std::string s = std::format("[{:?}]", "\xc3\x28");
This is useful because it doesn't lose any information. In the case of paths it is a bit more complicated because the string is in a different form and the mapping of ill-formed code units from one form to another may not be well-defined.
When escaping, the current paper proposes applying it to the original ill-formed data because this gives a more intuitive result and doesn't require non-standard mappings such as WTF-8 ([WTF]).
For example:
auto p = std::filesystem::path(L"\xd800"); // a lone surrogate
std::print("{:?}\n", p);
prints
"\u{d800}"
When not escaping, the paper proposes substituting invalid code units with replacement characters, which is the recommended Unicode practice ([UNICODE-SUB]).
For example:
auto p = std::filesystem::path(L"\xd800"); // a lone surrogate
std::print("{}\n", p);
prints
�
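The two behaviours shown above can be sketched in portable code. This is a simplified illustration and not the proposed implementation: it assumes the native encoding is UTF-16 held in a `std::u16string` (standing in for `std::wstring` on Windows), all function names are hypothetical, and the escaping covers only a few named escapes, ASCII control characters and lone surrogates rather than the full rules of [format.string.escaped].

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

// Appends the UTF-8 encoding of a Unicode scalar value to out.
void append_utf8(std::string& out, char32_t cp) {
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

bool is_high_surrogate(char32_t u) { return u >= 0xD800 && u <= 0xDBFF; }
bool is_low_surrogate(char32_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Decodes the code point starting at in[i] and advances i past it.
// Returns false for a lone surrogate; cp then holds that code unit.
bool decode_utf16(const std::u16string& in, std::size_t& i, char32_t& cp) {
  char32_t u = in[i++];
  if (is_high_surrogate(u) && i < in.size() && is_low_surrogate(in[i])) {
    cp = 0x10000 + ((u - 0xD800) << 10) + (in[i++] - 0xDC00);
    return true;
  }
  cp = u;
  return !is_high_surrogate(u) && !is_low_surrogate(u);
}

// Default ("{}") behaviour: each lone surrogate becomes U+FFFD.
std::string to_utf8_replacing(const std::u16string& in) {
  std::string out;
  for (std::size_t i = 0; i < in.size();) {
    char32_t cp;
    if (!decode_utf16(in, i, cp)) cp = 0xFFFD;
    append_utf8(out, cp);
  }
  return out;
}

// Debug ("{:?}") behaviour: escaping is applied to the original UTF-16
// data, so a lone surrogate is written as \u{...} instead of being replaced.
std::string to_utf8_escaped(const std::u16string& in) {
  std::string out = "\"";
  for (std::size_t i = 0; i < in.size();) {
    char32_t cp;
    bool ok = decode_utf16(in, i, cp);
    if (cp == U'\t') out += "\\t";
    else if (cp == U'\n') out += "\\n";
    else if (cp == U'\r') out += "\\r";
    else if (cp == U'"') out += "\\\"";
    else if (cp == U'\\') out += "\\\\";
    else if (!ok || cp < 0x20 || cp == 0x7F) {
      char buf[16];
      std::snprintf(buf, sizeof buf, "\\u{%x}", static_cast<unsigned>(cp));
      out += buf;
    } else {
      append_utf8(out, cp);
    }
  }
  out += '"';
  return out;
}
```

For a string holding the single code unit 0xD800, `to_utf8_escaped` produces `"\u{d800}"` and `to_utf8_replacing` produces the three-byte UTF-8 encoding of U+FFFD, mirroring the two examples above.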
14. Wording
Add to "Header <filesystem> synopsis" [fs.filesystem.syn]:
// [fs.path.fmt], formatter
template<class charT> struct formatter<filesystem::path, charT>;
Add a new section "Formatting" [fs.path.fmt] under "Class path" [fs.class.path]:
template<class charT> struct formatter<filesystem::path, charT> {
  constexpr format_parse_context::iterator parse(format_parse_context& ctx);

  template<class FormatContext>
    typename FormatContext::iterator
      format(const filesystem::path& path, FormatContext& ctx) const;
};

`formatter<filesystem::path, charT>` is debug-enabled ([format.formatter.spec]).

constexpr format_parse_context::iterator parse(format_parse_context& ctx);
Effects: Parses the format specifier as a path-format-spec and stores the
parsed specifiers in `*this`.

path-format-spec:
  fill-and-align(opt) width(opt) ?(opt) g(opt)

where the productions fill-and-align and width are described in [format.string]. If the `?` option is
used then the path is formatted as an escaped string ([format.string.escaped]).
Returns: An iterator past the end of the path-format-spec.
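To make the grammar concrete, here is a minimal sketch of a parser for it. It is an illustration under simplifying assumptions, not the proposed wording or any library's implementation: the names `path_spec` and `parse_path_spec` are hypothetical, the input is the text between `:` and `}`, the fill is a single `char`, and width is a literal positive integer only (nested `{}` width arguments and overflow checks are omitted).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical container for the parsed specifiers.
struct path_spec {
  char fill = ' ';
  char align = '\0';     // '<', '>', '^', or '\0' when absent
  int width = 0;         // 0 when absent
  bool debug = false;    // the ? option
  bool generic = false;  // the g option
};

bool is_align(char c) { return c == '<' || c == '>' || c == '^'; }

// Parses fill-and-align(opt) width(opt) ?(opt) g(opt), in that order.
path_spec parse_path_spec(const std::string& s) {
  path_spec spec;
  std::size_t i = 0;
  // fill-and-align: an optional fill character followed by an alignment.
  if (s.size() >= 2 && is_align(s[1])) {
    spec.fill = s[0];
    spec.align = s[1];
    i = 2;
  } else if (!s.empty() && is_align(s[0])) {
    spec.align = s[0];
    i = 1;
  }
  // width: a positive integer, so it must not start with '0'.
  if (i < s.size() && s[i] >= '1' && s[i] <= '9') {
    while (i < s.size() && s[i] >= '0' && s[i] <= '9') {
      spec.width = spec.width * 10 + (s[i] - '0');
      ++i;
    }
  }
  if (i < s.size() && s[i] == '?') { spec.debug = true; ++i; }
  if (i < s.size() && s[i] == 'g') { spec.generic = true; ++i; }
  if (i != s.size()) throw std::invalid_argument("invalid path-format-spec");
  return spec;
}
```

With this sketch, `parse_path_spec("*^20?g")` yields fill `*`, centered alignment, width 20 and both options set, while `"g?"` is rejected because the grammar fixes the order of the two options.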
template<class FormatContext>
  typename FormatContext::iterator
    format(const filesystem::path& p, FormatContext& ctx) const;
Effects: Let `s` be `p.generic_string<filesystem::path::value_type>()` if the
`g` option is used, otherwise `p.native()`. Writes `s` into `ctx.out()`,
adjusted according to the path-format-spec. If `charT` is `char`,
`path::value_type` is `wchar_t` and the
literal encoding is UTF-8 then the escaped path is transcoded from the native
encoding for wide character strings to UTF-8 with maximal subparts of ill-formed
subsequences substituted with U+FFFD REPLACEMENT CHARACTER per the Unicode
Standard, Chapter 3.9 U+FFFD Substitution in Conversion.
If `charT` and `path::value_type` are the same then no transcoding is performed.
Otherwise, transcoding is implementation-defined.
Returns: An iterator past the end of the output range.
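The first step of the Effects clause, choosing between the generic and the native representation, maps directly onto existing `std::filesystem::path` members. A small sketch, assuming C++17 and a hypothetical helper name `select_representation`; padding, escaping and transcoding are deliberately left out:

```cpp
#include <filesystem>

// Returns the string the path would be formatted from: the generic-format
// string when the g option is set, otherwise the native string.
std::filesystem::path::string_type
select_representation(const std::filesystem::path& p, bool generic) {
  using value_type = std::filesystem::path::value_type;
  if (generic) {
    return p.generic_string<value_type>();
  }
  return p.native();
}
```

On systems where the native and generic formats coincide the two results are identical; they can differ where the native format uses a different directory separator.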
15. Implementation
The proposed `formatter` for `std::filesystem::path` has been implemented in the
open-source {fmt} library ([FMT]).
16. Acknowledgements
Thanks to Mark de Wever, Roger Orr and Tom Honermann for reviewing an early version of the paper and suggesting a number of fixes and improvements. Thanks to Jonathan Wakely for wording suggestions.