Formatting Ranges

Document #: P2286R1
Date: 2021-02-19
Project: Programming Language C++
Audience: LEWG
Reply-to: Barry Revzin
<>

1 Revision History

[P2286R0] suggested making all the formatting implementation-defined. Several people reached out to me suggesting in no uncertain terms that this is unacceptable. This revision lays out options for such formatting.

2 Introduction

[LWG3478] addresses the issue of what happens when you split a string and the last character in the string is the delimiter that you are splitting on. One of the things I wanted to look at in research in that issue is: what do other languages do here?

For most languages, this is a pretty easy proposition. Do the split, print the results. This is usually only a few lines of code.

Python:

print("xyx".split("x"))

outputs

['', 'y', '']

Java (where the obvious thing prints something useless, but there’s a non-obvious thing that is useful):

import java.util.Arrays;

class Main {  
  public static void main(String args[]) { 
    System.out.println("xyx".split("x"));
    System.out.println(Arrays.toString("xyx".split("x")));
  } 
}

outputs

[Ljava.lang.String;@76ed5528
[, y]

Rust (a couple options, including also another false friend):

use itertools::Itertools;

fn main() {
    println!("{:?}", "xyx".split('x'));
    println!("[{}]", "xyx".split('x').format(", "));
    println!("{:?}", "xyx".split('x').collect::<Vec<_>>());
}

outputs

Split(SplitInternal { start: 0, end: 3, matcher: CharSearcher { haystack: "xyx", finger: 0, finger_back: 3, needle: 'x', utf8_size: 1, utf8_encoded: [120, 0, 0, 0] }, allow_trailing_empty: true, finished: false })
[, y, ]
["", "y", ""]

Kotlin:

fun main() {
    println("xyx".split("x"));
}

outputs

[, y, ]

Go:

package main
import "fmt"
import "strings"

func main() {
    fmt.Println(strings.Split("xyx", "x"));
}

outputs

[ y ]

JavaScript:

console.log('xyx'.split('x'))

outputs

[ '', 'y', '' ]

And so on and so forth. What we see across these languages is that printing the result of split is pretty easy. In most cases, whatever the print mechanism is just works and does something meaningful. In other cases, printing gave me something other than what I wanted but some other easy, provided mechanism for doing so.

Now let’s consider C++.

#include <iostream>
#include <string>
#include <ranges>
#include <format>

int main() {
    // need to predeclare this because we can't split an rvalue string
    std::string s = "xyx";
    auto parts = s | std::views::split('x');
    
    // nope
    std::cout << parts;
    
    // nope (assuming std::print from P2093)
    std::print("{}", parts);
    
    
    std::cout << "[";
    char const* delim = "";
    for (auto part : parts) {
        std::cout << delim;
        
        // still nope
        std::cout << part;
        
        // also nope
        std::print("{}", part);
        
        // this finally works
        std::ranges::copy(part, std::ostream_iterator<char>(std::cout));
        
        // as does this
        for (char c : part) {
            std::cout << c;
        }
        delim = ", ";
    }
    std::cout << "]\n";
}

This took me more time to write than any of the solutions in any of the other languages. Including the Go solution, which contains 100% of all the lines of Go I’ve written in my life.

Printing is a fairly fundamental and universal mechanism to see what’s going on in your program. In the context of ranges, it’s probably the most useful way to see and understand what the various range adapters actually do. But none of these things provides an operator<< (for std::cout) or a formatter specialization (for format). And the further problem is that as a user, I can’t even do anything about this. I can’t just provide an operator<< in namespace std or a very broad specialization of formatter - none of these are program-defined types, so it’s just asking for clashes once you start dealing with bigger programs.

The only mechanisms I have at my disposal to print something like this is either

  1. nested loops with hand-written delimiter handling (which are tedious and a bad solution), or
  2. at least replace the inner-most loop with a ranges::copy into an output iterator (which is more differently bad), or
  3. Write my own formatting library that I am allowed to specialize (which is not only bad but also ridiculous)
  4. Use fmt::format.

2.1 Implementation Experience

That’s right, there’s a fourth option for C++ that I haven’t shown yet, and that’s this:

#include <ranges>
#include <string>
#include <fmt/ranges.h>

int main() {
    std::string s = "xyx";
    auto parts = s | std::views::split('x');

    fmt::print("{}\n", parts);
    fmt::print("[{}]\n", fmt::join(parts, ","));
}

outputting

{{}, {'y'}}
[{},{'y'}]

And this is great! It’s a single, easy line of code to just print arbitrary ranges (include ranges of ranges).

And, if I want to do something more involved, there’s also fmt::join, which lets me specify both a format specifier and a delimiter. For instance:

std::vector<uint8_t> mac = {0xaa, 0xbb, 0xcc, 0xdd, 0xee, 0xff};
fmt::print("{:02x}\n", fmt::join(mac, ":"));

outputs

aa:bb:cc:dd:ee:ff

fmt::format (and fmt::print) solves my problem completely. std::format does not, and it should.

3 Proposal Considerations

The Ranges Plan for C++23 [P2214R0] listed as one of its top priorities for C++23 as the ability to format all views. Let’s go through the issues we need to address in order to get this functionality.

3.1 What types to print?

The standard library is the only library that can provide formatting support for standard library types and other broad classes of types like ranges. In addition to ranges (both the conrete containers like vector<T> and the range adaptors like views::split), there are several very commonly used types that are currently not printable.

The most common and important such types are pair and tuple (which ties back into Ranges even more closely once we adopt views::zip and views::enumerate). fmt currently supports printing such types as well:

fmt::print("{}\n", std::pair(1, 2));

outputs

(1, 2)

Another common and important set of types are std::optional<T> and std::variant<Ts...>. fmt does not support printing any of the sum types. There is not an obvious representation for them in C++ as there might be in other languages (e.g. in Rust, an Option<i32> prints as either Some(42) or None, which is also the same syntax used to construct them).

However, the point here isn’t necessarily to produce the best possible representation (users who have very specific formatting needs will need to write custom code anyway), but rather to provide something useful. And it’d be useful to print these types as well. However, given that optional and variant are both less closely related to Ranges than pair and tuple and also have less obvious representation, they are less important.

3.2 What representation?

There are several questions to ask about what the representation should be for printing.

  1. Should vector<int>{1, 2, 3} be printed as {1, 2, 3} (as fmt currently does and as the type is constructed in C++) or [1, 2, 3] (as is typical representationally outside of C++)?
  2. Should pair<int, int>{3, 4} be printed as {3, 4} (as the type is constructed) or as (3, 4) (as fmt currently does and is typical representationally outside of C++)?
  3. Should char and string in the context of ranges and tuples be printed as quoted (as fmt currently does) or unquoted (as these types are typically formatted)?

What I’m proposing is the following:

std::vector<int> v = {1, 2, 3};
std::map<std::string, char> m = {{"hello", 'h'}, {"world", 'w'}};

std::print("v={}. m={}\n", v, m);

print:

v=[1, 2, 3]. m=[("hello", 'h'), ("world", 'w')]

That is: ranges are surrounded with []s and delimited with ", ". Pairs and tuples are surrounded with ()s and delimited with ", ". Types like char, string, and string_view are printed quoted.

It is more important to me that ranges and tuples are visually distinct (in this case [] vs (), but the way that fmt currently does it as {} vs () is also okay) than it would be to quote the string-y types. My rank order of the possible options for the map m above is:

Ranges Tuples Quoted? Formatted Result
[] () ✔️ [("hello", 'h'), ("world", 'w')]
{} () ✔️ {("hello", 'h'), ("world", 'w')}
[] () [(hello, h), (world, w)]
{} () {(hello, h), (world, w)}
{} {} {{hello, h}, {world, w}}

My preference for avoiding {} in the formatting is largely because it’s unlikely the results here can be used directly for copying and pasting directly into initialization anyway, so the priority is simply having visual distinction for the various cases.

With quoting, the question is how does the library choose if something is string-like and thus needs to be quoted. Currently, fmtlib makes this determination by checking for the presence of certain operations (p->length(), p->find('a'), and p->data()). We could instead add a variable template called enable_formatting_as_string to allow for opting into this kind of formatting.

3.3 What additional functionality?

There’s three layers of potential functionality:

  1. Top-level printing of ranges: this is fmt::print("{}", r);

  2. A format-joiner which allows providing a format specifier for each element and a delimiter: this is fmt::print("{:02x}", fmt::join(r, ":")).

  3. A more involved version of a format-joiner which takes a delimiter and a callback that gets invoked on each element. fmt does not provide such a mechanism, though the Rust itertools library does:

    let matrix = [[1., 2., 3.],
              [4., 5., 6.]];
    let matrix_formatter = matrix.iter().format_with("\n", |row, f| {
                                    f(&row.iter().format_with(", ", |elt, g| g(&elt)))
                                 });
    assert_eq!(format!("{}", matrix_formatter),
               "1, 2, 3\n4, 5, 6");

This paper suggests the first two and encourages research into the third.

3.4 What about format specifiers?

The implementation experience in fmt is that directly formatting ranges does not support any format specifiers, but fmt::join supports providing a specifier per element as well as providing the delimiter and wrapping brackets.

We could add the same format specifier support for direct formatting of ranges as fmt::join supports, but it doesn’t seem especially worthwhile. If you don’t care about formatting, "{}" is all you need. If you do care about formatting, it’s likely that you care about more than just the formatting of each individual element — you probably care about other things do. At which point, you’d likely need to use fmt::join anyway.

That seems like the right mix of functionality to me.

3.5 How to support those views which are not const-iterable?

There are several C++20 range adapters which are not const-iterable. These include views::filter and views::drop_while. But std::format takes its arguments by Args const&..., so how could they be printable?

fmt handles this just fine even with its formatter specialization taking by const by converting the range to a view first.

So we can do something like:

template <range R, typename Char>
struct formatter<R, Char>
{
    template <typename FormatContext>
    auto format(R const& rng, FormatContext& ctx) {
        auto values = std::views::all(rng);
        // ...
    }
};

But, this still doesn’t cover all the cases. If R is a type such that R const is a view but R is move-only, that won’t compile. We can work around this case. But if R is a type that such that view<R> and not range<R const> and not copyable<R>, there’s really no way of getting this to work without changing the API.

Notably, the proposed std::generator<T> is one such type [P2168R0]

If we do want to support this case (and fmt does not), then we will need to change the API of format to take by forwarding reference instead of by const reference. That is, replace

template<class... Args>
string format(string_view fmt, const Args&... args);

with:

template<class... Args>
string format(string_view fmt, Args&&... args);

But it’s not that easy. While we would need std::generator<T> to bind to Args&& rather than Args const&, there are other cases which have the exact opposite problem: bit-fields need to bind to Args const& and cannot bind to Args&&.

So we can’t really change the API to support this use-case without breaking other use-cases. But there are workarounds for this case:

For example:

auto ints_coro(int n) -> std::generator<int> {
    for (int i = 0; i < n; ++i) {
        co_yield i;
    }
}

fmt::print("{}", ints_coro(10)); // error
fmt::print("[{}]", fmt::join(ints_coro(10), ", ")); // ok

auto v = ints_coro(10);
fmt::print("{}", std::ranges::ref_view(v)); // ok, if tedious

fmt::print("{}", ints_coro(10) | ranges::to<std::vector>); // ok, but expensive
fmt::print("{}", views::iota(0, 10)); // ok, but harder to implement

You can see an example of formatting a move-only, non-const-iterable range on compiler explorer [fmt-non-const].

3.6 Specifying formatters for ranges

It’s quite important that a std::string whose value is "hello" gets printed as hello rather than something like [h, e, l, l, o].

This would basically fall out no matter how we approach implementing such a thing, so in of itself is not much of a concern. However, for users who have either custom containers or want to customize formatting of a standard container for their own types, they need to make sure that they can provide a specialization which is more constrained than the standard library’s for ranges. To ensure that they can do that, I think we need to be clear about the specific constraint we use when we specify this.

3.7 format or std::cout?

Just format is sufficient.

3.8 What about vector<bool>?

Nobody expected this section.

The value_type of this range is bool, which is formattable. But the reference type of this range is vector<bool>::reference, which is not. In order to make the whole type formattable, we can either make vector<bool>::reference formattable (and thus, in general, a range is formattable if both its value and reference types are formattable) or allow formatting to fall back to constructing a value for each reference (and thus, in general, a range is formattable if at least its value is).

For most ranges, the value_type is remove_cvref_t<reference>, so there’s no distinction here between the two options. And even for zip, there’s still not much distinction since it just wraps this question in tuple since again for most ranges the types will be something like tuple<T, U> vs tuple<T&, U const&>, so again there isn’t much distinction.

vector<bool> is one of the very few ranges in which the two types are truly quite different. So it doesn’t offer much in the way of a good example here, since bool is cheaply constructible from vector<bool>::reference. Though it’s also very cheap to provide a formatter specialization for vector<bool>::reference. We might as well.

4 Proposal

The standard library should add specializations of formatter for:

Ranges should be formatted as [x, y, z] while tuples should be formatted as (a, b, c). For types that satisfy both (e.g. std::array), they’re treated as ranges. In the context of formatting ranges, types that are string-like (e.g. char, string, string_view) should be formatted as being quoted (with string-like being determined via variable template trait).

Formatting ranges does not support any additional format specifiers.

The standard library should also add a utility std::format_join (or any other suitable name, knowing that std::views::join already exists), following in the footsteps of fmt::join, which allows the user to provide more customization in how ranges and tuples get formatted.

For types like std::generator<T> (which are move-only, non-const-iterable ranges), users will have to either use std::format_join facility or use something like ranges::ref_view as shown earlier.

5 References

[fmt-non-const] Barry Revzin. 2021. Formatting a move-only non-const-iterable range.
https://godbolt.org/z/149YqW

[LWG3478] Barry Revzin. views::split drops trailing empty range.
https://wg21.link/lwg3478

[P2168R0] Corentin Jabot, Lewis Baker. 2020-05-16. generator: A Synchronous Coroutine Generator Compatible With Ranges.
https://wg21.link/p2168r0

[P2214R0] Barry Revzin, Conor Hoekstra, Tim Song. 2020-10-15. A Plan for C++23 Ranges.
https://wg21.link/p2214r0

[P2286R0] Barry Revzin. 2021-01-15. Formatting Ranges.
https://wg21.link/p2286r0