std::hash<X>
hash_append
?
hash_append
for vector<T, A>
hash_append
for std::pair<T, U>
hash_append
for int
is_contiguously_hashable<T>
:hash_append
the same thing as boost::hash_combine
?hash_append
the same thing as serialization?hash_append
?hash_append
Pimpl designs?hash_combine
solution?type_erased_hasher
debugHasher
This paper proposes a new hashing infrastructure that completely decouples hashing algorithms from individual types that need to be hashed. This decoupling divides the hashing computation among 3 different programmers who need not coordinate with each other:
Authors of hashable types (keys of type K
) write their hashing
support just once, using no specific hashing algorithm. This code resembles
(and is approximately the same amount of work as) operator==
and
swap
for a type.
Authors of hashing algorithms write a functor (e.g. H
) that
operates on a contiguous chunk of generic memory, represented by a void
const*
and a number of bytes. This code has no concept of a specific key
type, only of bytes to be hashed.
Clients who want to hash keys of type K
using hashing algorithm
H
will form a functor of type std::uhash<H>
to
give to an unordered container.
unordered_set<K, uhash<H>> my_set;
Naturally, there could be a default hashing algorithm supplied by the std::lib:
unordered_set<K, uhash<>> my_set;
To start off with, we emphasize: there is nothing in this proposal that changes
the existing std::hash
, or the unordered containers. And there is
also nothing in this proposal that would prohibit the committee from standardizing
both this proposal, and either one of
N3333
or
N3876.
N3333
and
N3876
contradict each other, and thus compete with each other. Both cannot be
standardized. This proposal, on the other hand, addresses a problem not addressed
by
N3333
or
N3876.
Nor does this proposal depend upon anything in
N3333
or
N3876.
This paper simply takes a completely different approach to producing hash
codes from types, in order to solve a problem that was beyond the scope of
N3333
and
N3876.
The problem solved herein is how to support the hashing of N different types of
keys using M different hashing algorithms, using an amount of source code that
is proportional to N+M, as opposed to the current system based on
std::hash<T>
which requires an amount of source code
proportional to N*M. And consequently in practice today M==1, and the single
hashing algorithm is supplied only by the std::lib implementor. As it is too
difficult and error prone for the client to supply alternative algorithms for
all of the built-in scalar types (int
, long
,
double
, etc.). Indeed, it has even been too difficult for the
committee to supply hashing support for all of the types our clients might
reasonably want to use as keys: pair
, tuple
,
vector
, complex
, duration
,
forward_list
etc.
This paper makes ubiquitous hash support for most types as easy and as
practical as is today's support for swap
and
operator==
.
This paper starts with an assertion:
Types should not know how to hash themselves.
The rest of this paper begins with demonstrating the problems created when software systems assume that types do know how to hash themselves, and what can be done to solve these problems.
Instead of starting with a basic example like std::string
or
int
, this paper will introduce an example class X
that is meant to be representative of a type that a programmer would write,
and would want to create a hash code for:
namespace mine { class X { std::tuple<short, unsigned char, unsigned char> date_; std::vector<std::pair<int, int>> data_; public: X(); // ... friend bool operator==(X const& x, X const& y) { return std::tie(x.date_, x.data_) == std::tie(y.date_, y.data_); } }; } // mine
How do we write the hash function for X?
std::hash<X>
If we standardize
N3876
which gives us hash_combine
and hash_val
from
boost,
then this is relatively doable:
} // mine namespace std { template <> struct hash<mine::X> { size_t operator()(mine::X const& x) const noexcept { size_t h = hash<tuple_element<0, decltype(x.date_)>::type>{}(get<0>(x.date_)); hash_combine(h, get<1>(x.date_), get<2>(x.date_)); for (auto const& p : x.data_) hash_combine(h, p.first, p.second); return h; } }; } // std
First we need to break out of our own namespace, and then specialize
std::hash
in namespace std
. And we also need to add
a friend
statement to our class X:
friend struct std::hash<X>;
Without hash_combine
from
N3876
we would have to write our own hash_combine
. This could easily
result in a bad hash function as aptly described in
N3876.
In our first attempt to use the tools presented in
N3333,
we were surprised at the difficulty, as we were expecting it to be easier.
However after studying the reference implementation in
LLVM, we succeeded in writing the following
friend
function of X:
friend std::hash_code hash_value(X const& x) { using std::hash_value; return std::hash_combine ( hash_value(std::get<0>(x.date_)), hash_value(std::get<1>(x.date_)), hash_value(std::get<2>(x.date_)), std::hash_combine_range(x.data_.begin(), x.data_.end()) ); }
We also strongly suspect that with a little more work on the proposal, this could be simplified down to:
friend std::hash_code hash_value(X const& x) { using std::hash_value; return std::hash_combine(hash_value(x.date_), std::hash_combine_range(x.data_.begin(), x.data_.end())); }
Or possibly even:
friend std::hash_code hash_value(X const& x) noexcept { using std::hash_value; return std::hash_combine(hash_value(x.date_), hash_value(x.data_)); }
The reduced burden on the author of X on writing the code to hash X is very much welcomed! However, hashing algorithms are notoriously difficult to write. Has the author of X written a good hashing algorithm?
The answer is that the author of X does not know, until he experiments with his data. The hashing algorithm is supplied by the std::lib implementor. If testing reveals that the algorithm chosen by the std::lib implementor is not appropriate for the client's data set, then everything offered by both N3333 and N3876 is for naught. The author of X is on his own, starting from scratch, to build an alternate hashing algorithm -- even if just to experiment.
This concern is not theoretical. If the keys to be hashed can be influenced by a malicious attacker, it is quite possible for the attacker to arrange for many distinct keys that all hash to the same hash code. Even some seeded hashing algorithms are vulnerable to such an attack.
Here is a very short and fast C++ program that can generate as many distinct keys as you like which all hash to the same hash code using MurmurHash3, even with a randomized seed. Here is another such C++ program demonstrating a similar seed-independent attack on CityHash64. These attacks do not mean that these are bad hashing algorithms. They simply are evidence that it is not wise to tie yourself down to a single hashing algorithm. And if the effort to change, or experiment with, hashing algorithms takes effort that is O(N) (where N is the number of types or sub-types to be hashed), then one is tied down.
This paper demonstrates infrastructure allowing the author of X to switch hashing algorithms with O(1) work, regardless of how many sub-types of X need to be hashed. No matter what hashing algorithm is used, the C++ code to hash X is the same:
template <class HashAlgorithm> friend void hash_append(HashAlgorithm& h, X const& x) noexcept { using std::hash_append; hash_append(h, x.date_, x.data_); }
With this proposal, the author of X gets simplicity, without being heavily invested
in any single hashing algorithm. The hashing algorithm is completely encapsulated
in the templated parameter HashAlgorithm
, and the author of X remains
fully and gratefully ignorant of any specific hashing algorithm.
The key to solving this problem is the recognition of one simple observation:
Types should not know how to hash themselves. However types do know what parts of their state should be exposed to a hashing algorithm.
The question now becomes: How do you present X to a general purpose hashing algorithm without binding it to any specific algorithm?
Just as an example, here is a very simple hashing algorithm that many have used with great success:
std::size_t fnv1a (void const* key, std::size_t len) { unsigned char const* p = static_cast<unsigned char const*>(key); unsigned char const* const e = p + len; std::size_t h = 14695981039346656037u; for (; p < e; ++p) h = (h ^ *p) * 1099511628211u; return h; }
Although most modern hashing algorithms are much more complicated than
fnv1a
shown above, there are similarities among them.
void const*
and a size_t
length.Not all, but most of the algorithms also have the property that they consume bytes in the order that they are received, possibly with a fixed sized internal buffer. This characteristic can be taken advantage of in order to hash discontiguous memory.
For example consider this minor repackaging of the FNV-1a algorithm:
class fnv1a { std::size_t state_ = 14695981039346656037u; public: using result_type = std::size_t; void operator()(void const* key, std::size_t len) noexcept { unsigned char const* p = static_cast<unsigned char const*>(key); unsigned char const* const e = p + len; for (; p < e; ++p) state_ = (state_ ^ *p) * 1099511628211u; } explicit operator result_type() noexcept { return state_; } };
Now the algorithm can be accessed in 3 stages:
operator()(void const* key,
std::size_t len)
function. Note that this function can be called any
number of times. In each call the hashed memory is contiguous. But there is
no requirement at all that separate calls refer to a single block of memory.
On each call, the state of the algorithm is recalled from the previous call
(or from the initialization step) and updated with the new len
bytes located at key
.result_type
(in this case a size_t
).
This is the finalization stage, which in this
case is trivial, but could be arbitrarily complex.With the FNV-1a algorithm divided into its 3 stages like this, one can call it in various ways, for example:
fnv1a::result_type hash_contiguous(int (&data)[3]) { fnv1a h; h(data, sizeof(data)); return static_cast<fnv1a::result_type>(h); }
or
fnv1a::result_type hash_discontiguous(int data1, int data2, int data3) { fnv1a h; h(&data1, sizeof(data1)); h(&data2, sizeof(data2)); h(&data3, sizeof(data3)); return static_cast<fnv1a::result_type>(h); }
But either way it is called, given the same inputs, the algorithm outputs the exact same result:
int data[] = {5, 3, 8}; assert((hash_contiguous(data) == hash_discontiguous(5, 3, 8)));
We can say that fnv1a
meets the requirements of a
HashAlgorithm
. A HashAlgorithm
is a class type
that can be constructed (default, or possibly with seeding), has an
operator()
member function with the signature represented above.
The operator()
member function processes bytes, updating the
internal state of the HashAlgorithm
. This internal state can be
arbitrarily complex. Indeed an extreme example of internal state could be a
copy of every chunk of memory supplied to the HashAlgorithm
.
And finally a HashAlgorithm
can be explicitly converted to the
nested type result_type
, which when used with the unordered
containers should be an alias for size_t
.
At all times during its lifetime, a HashAlgorithm
is
CopyConstructible
and CopyAssignable
, with each
copy getting an independent copy of the state from the right-hand-side of the
copy (value semantics -- no aliasing among copies). Thus if one knew that
two sequences of data shared a common prefix, one could hash the prefix in
just one sequence, make a copy of the HashAlgorithm
, and then
continue after the prefix in each sequence with the two independent
HashAlgorithm
s. This would be a pure optimization, getting the
same results as if one had hashed each sequence in full.
Given the concept of HashAlgorithm
, a universal hash functor,
which takes any type T
can now be written (almost):
template <class HashAlgorithm> struct uhash { using result_type = typename HashAlgorithm::result_type; template <class T> result_type operator()(T const& t) const noexcept { HashAlgorithm h; using std::hash_append; hash_append(h, t); return static_cast<result_type>(h); } };
Now one can use uhash<fnv1a>
as the hash function for
std::unordered_map
, for example:
std::unordered_map<MyKey, std::string, uhash<fnv1a>> the_map;
First note several important attributes of uhash
:
uhash
depends only on the hashing algorithm, which is
encapsulated in the HashAlgorithm
. uhash
does not
depend upon the type T
being hashed.
uhash
is simple. Though such a utility should certainly be
supplied by the std::lib, any programmer can very easily implement their own
variant of uhash
for desired customizations (e.g.
random seeding,
salting,
or padding),
without having to revisit the hashing code for distinct types.
size_t
. For example, this could come in
handy for computing a
SHA-256 result. And all
without having to revisit each individual type!
Let's walk through uhash
one step at a time.
The HashAlgorithm
is constructed (default constructed in this
example, but that is not the only possibility). This step initializes the
hashing algorithm encapsulated in the HashAlgorithm
.
It is appended to using t
as a key. The function
hash_append
is implemented for each type that supports hashing.
We will see below that such support code need be written only once per type in
order to support many hashing algorithms. It is implemented in the type's own
namespace, but there are implementations in namespace std for most scalars
(just like swap
).
And then the HashAlgorithm
is explicitly converted to the
desired result. This is where the algorithm is "finalized."
The above hash
functor is accessing the generic hashing algorithm
by its 3 distinct phases. Additionally, this hash
functor could
even be defaulted to use your favorite hash algorithm:
template <class HashAlgorithm = fnv1a> struct uhash;
The question usually arises now: Are you proposing that uhash<>
replace hash<T>
as the default hash functor in the
unordered containers? The answer is it really almost doesn't matter. With
templated using declarations, it is just so easy for programmers to specify
their own defaults:
namespace my { template <class Key, class T, class Hash = std::uhash<>, class Pred = std::equal_to<Key>, class Alloc = std::allocator<std::pair<Key const, T>>> using unordered_map = std::unordered_map<Key, T, Hash, Pred, Alloc>; } // my // ... my::unordered_map<MyKey, std::string> the_map; // uses std::uhash<> instead of std::hash<MyKey>
hash_append
?
The hash_append
function is the way that individual types
communicate with the HashAlgorithm
.
Each type T
is responsible only for exposing its hash-worthy state
to the HashAlgorithm
in the function hash_append
.
T
is not responsible for combining hash codes. Nor is it
responsible for any hashing arithmetic whatsoever. It is only responsible for
pointing out where its data is, how many different chunks of data there are,
and what order they should be presented to the HashAlgorithm
.
For example, here is how X might implement hash_append
:
class X { std::tuple<short, unsigned char, unsigned char> date_; std::vector<std::pair<int, int>> data_; public: // ... friend bool operator==(X const& x, X const& y) { return std::tie(x.date_, x.data_) == std::tie(y.date_, y.data_); } // Hook into the system like this template <class HashAlgorithm> friend void hash_append(HashAlgorithm& h, X const& x) noexcept { using std::hash_append; hash_append(h, x.date_); hash_append(h, x.data_); } }
Like swap
, hash_append
is a customization point for
each type. Only a type knows what parts of itself it should expose to a
HashAlgorithm
, even though the type has no idea what algorithm
is being used to do the hashing. Note that X need not concern itself with
details like whether or not its sub-types are contiguously hashable.
Those details will be handled by the hash_append
for the
individual sub-types. The only information the
hash_append
overload for X must implement is what sub-types need
to be presented to the HashAlgorithm
, and in what order.
Furthermore the hash_append
function is intimately tied to the
operator==
for the same type. For example if for some reason
x.data_
did not participate in the equality computation, then it
should also not participate in the hash_append
computation.
hash_append
to operator==
For all combination of two values of X, x
and y
, there
are two rules to follow in designing hash_append
for type X.
Actually the second rule is more of a guideline. But it should be followed as
closely as possible:
If x == y
, then both x
and y
shall send the same message to the HashAlgorithm
in
hash_append
.
If x != y
, then x
and y
should
send different messages to the HashAlgorithm
in
hash_append
.
It is very important to keep these two rules in mind when designing the
hash_append
function for any type, or for any instantiation of a
class template. Failure to follow the first rule will mean that equal values
hash to different codes. Clients such as unordered containers will simply
fail to work, resulting in run time errors if this rule is violated. Failure
to follow the second guideline will result in hash collisions for the two
different values that send identical messages to the
HashAlgorithm
, and will thus degrade the performance of clients
such as unordered containers.
hash_append
for vector<T, A>
For example std::vector<T, A>
would
never expose its capacity()
, since capacity()
can be
different for vector
's that otherwise compare equal. Likewise
it should not expose its allocator_type
to hash_append
,
since this value also does not participate in the equality computation.
Should vector
expose its size()
to the
HashAlgorithm
? To find out, lets look closer at the
operator==
for vector
:
Two
vector
'sx
andy
compare equal ifx.size() == y.size()
and ifx[i] == y[i]
fori
in the range of 0 tox.size()
.
To meet rule 1, it is sufficient that every
element in the vector
be sent to the HashAlgorithm
as part of the vector
's message. A logical convention is that
the elements will be sent in order from begin()
to
end()
. But this alone will not satisfy rule 2. Consider:
std::vector<std::vector<int>> v1{}; std::vector<std::vector<int>> v2{1}; assert(v1 != v2);
v1
and v2
are not equal. v1.size() ==
0
and v2.size() == 1
. However v2.front().size()
== 0
. If an empty vector<int>
sends no message at
all to the HashAlgorithm
, then v2
, even though it
is not empty, also sends no message to the HashAlgorithm
. And
therefore v1
and v2
send the same (0 length)
message to the HashAlgorithm
, violating rule 2.
One idea for fixing this is to special case 0-length vector
s to
output a special value such as "empty" or 0. However in the first case the
result would be ambiguous with a vector<string>
of length
1 containing the string "empty". The second case has the exact same problem
but for vector<int>
.
The right way to fix this problem is to have vector<T>
send its size()
in addition to sending all of its members to the
HashAlgorithm
. Now the only question is: Should it send its
size
before or after sending its members to the
HashAlgorithm
?
To answer this last question, consider another sequence container:
forward_list<T>
. It has the exact same issues as we have
been discussing for vector<T>
, but
forward_list<T>
has no size()
member. In
order to send its size()
, forward_list<T>
has
to loop through all of its members to first compute size()
. In
order to avoid the requirement that hash_append
for
forward_list<T>
make two passes through the list, we
should specify that the size()
of the container is sent to the
HashAlgorithm
after all of the elements are sent. And
for consistency, we should do this for all std-containers for which
hash_append
is defined.
template <class HashAlgorithm, class T, class Alloc> void hash_append(HashAlgorithm& h, std::vector<T, Alloc> const& v) noexcept { for (auto const& t : v) hash_append(h, t); hash_append(h, v.size()); }
I.e. vector
considers itself a message composed of 0 or more
sub-messages, and appends each sub-message (in order) to the state of the
generic HashAlgorithm
. And this is followed with a final
message consisting of the size()
of the vector
.
Note that as
N3333
and
N3876
both stand today, this critically important but subtle detail is not treated,
and is left up to the client (the author of X) to get right. This proposal
states that this is a detail that the hash_append
for
vector
(and every other hashable std-container) is responsible for.
Emphasis
The message a type sends to a
HashAlgorithm
is part of its public API. E.g. whether or not a container includes itssize()
in itshash_append
message, and if so, whether thesize()
is prepended or appended to the message, is critical information a type's client needs to know, in order to ensure that their composition of some type's message with another type's message doesn't produce an ambiguous message (doesn't create collisions).The standard should clearly document the message emanating from every
hash_append
it defines, to the extent possible. It might not be possible to nail down that an implementation is using IEEE floating point or two's complement signed integers. But the standard can certainly document the message produced by avector
or any other std-defined class type.
hash_append
for std::pair<T, U>
The situation is simpler for std::pair<T, U>
:
template <class HashAlgorithm, class T, class U> void hash_append (HashAlgorithm& h, std::pair<T, U> const& p) noexcept { hash_append (h, p.first); hash_append (h, p.second); }
All there is to do is to just hash_append
the first and second
members of the pair.
hash_append
for int
Eventually hash_append
will drill down to a scalar type such as
int
:
template <class HashAlgorithm> void hash_append(HashAlgorithm& h, int const& i) noexcept { h(&i, sizeof(i)); }
Whereupon a contiguous chunk of memory is actually accumulated by the
HashAlgorithm
, using the HashAlgorithm
's
operator()
. Recall that the HashAlgorithm
has a
member function operator()(const void* key, std::size_t len)
noexcept
. And the int
is just a chunk of
contiguous memory that is hashable. It is now prudent to
deeply consider what it means to say that a type (such as int
)
is contiguously hashable.
A type T
is contiguously hashable if for all combinations of
two values of a type, say x
and y
, if x ==
y
, then it must also be true that memcmp(addressof(x),
addressof(y), sizeof(T)) == 0
. I.e. if x == y
, then
x
and y
have the same bit pattern representation. A
2's complement int
satisfies this property because every bit
pattern an int
can have results in a distinct value (rule 2). And there are no "padding bits" which might take on
random values. This property is necessary because if two values are equal, then
they must hash to the same hash code (rule 1).
is_contiguously_hashable<T>
:With that in mind we can easily imagine a type trait:
template <class T> struct is_contiguously_hashable;
which derives from either true_type
or false_type
. And
on 2's complement systems,
is_contiguously_hashable<int>::value
is true
.
And we might anticipate that some other types, such as char
and
long long
are also contiguously hashable. With this
tool we can now easily write hash_append
for all contiguously
hashable types:
template <class HashAlgorithm, class T> inline std::enable_if_t < is_contiguously_hashable<T>::value > hash_append(HashAlgorithm& h, T const& t) noexcept { h(addressof(t), sizeof(t)); }
Now the task remains to specialize is_contiguously_hashable
properly for those scalars we want to use this implementation of
hash_append
for, and for any other scalars, implement
hash_append
appropriately. As an example of the latter, consider
IEEE floating point types.
An IEEE floating point type is not contiguously hashable because 0.
== -0.
but these two values are represented with different bit patterns.
Rule 1 would be violated if hashed contiguously. Therefore the
hash_append
for IEEE floating point types must go to extra effort
to ensure that 0.
and -0.
hash to identical hash
codes, but without dictating a specific hash algorithm. This could be
done like so:
template <class HashAlgorithm, class T> inline std::enable_if_t < std::is_floating_point<T>::value > hash_append(HashAlgorithm& h, T t) noexcept { if (t == 0) t = 0; h(&t, sizeof(t)); }
I.e. if the value is -0., reset the value to 0., and then contiguously hash the resulting bits.
N3333
also introduced a very similar is_contiguous_layout
trait.
Although the paper did not make it perfectly clear, we believe
is_contiguously_hashable
is
approximately the same trait, but with a better name. Just because a type
has a contiguous layout does not necessarily imply that a type is
contiguously hashable. IEEE floating point is a case in point. IEEE
floating point does have a contiguous layout (and is trivially copyable, and
has a standard layout). And yet still it is not contiguously hashable
because of how its operator==
works with signed zeros (violating
rule 1).
Class types that are composed of only contiguously hashable types
and that have no padding bytes, may also be considered to be
contiguously hashable. For example consider this specialization of
is_contiguously_hashable<std::pair<T, U>>
:
template <class T, class U> struct is_contiguously_hashable<std::pair<T, U>> : public std::integral_constant<bool, is_contiguously_hashable<T>::value && is_contiguously_hashable<U>::value && sizeof(T) + sizeof(U) == sizeof(std::pair<T, U>)> { };
In English: If the pair
's two types are both contiguously
hashable, and if the size of the two members is the same size as the
pair
(so there are no padding bytes), then the entire
pair
itself is contiguously hashable!
This same logic can be applied to array
, tuple
, and
possibly user-defined types as well (but only with the user-defined type's
author's permission). Consequently, a great many types can be easily and
safely classified as contiguously hashable. This is important because with
modern hash algorithm implementations, the bigger the chunk of contiguous
memory you can send to the HashAlgorithm
at one time, the higher
the performance (in terms of bytes-hashed/second) the
HashAlgorithm
is likely to perform.
With that in mind (the bigger the memory chunk the better), consider again
hash_append
for vector
:
template <class HashAlgorithm, class T, class Alloc> inline std::enable_if_t < !is_contiguously_hashable<T>::value > hash_append(HashAlgorithm& h, std::vector<T, Alloc> const& v) noexcept { for (auto const& t : v) hash_append(h, t); hash_append(h, v.size()); } template <class HashAlgorithm, class T, class Alloc> inline std::enable_if_t < is_contiguously_hashable<T>::value > hash_append(HashAlgorithm& h, std::vector<T, Alloc> const& v) noexcept { h(v.data(), v.size()*sizeof(T)); hash_append(h, v.size()); }
I.e. if the T
is contiguously hashable, then even though
vector
itself is not, there can still be a huge optimization
made by having vector
send its contiguous data
buffer to hash_append
.
Note that this is a pure optimization. I.e. the
HashAlgorithm
sees the exact same sequence of bytes, in
the same order, whether or not this optimization for vector
is
done. But if it is done, then the HashAlgorithm
sees almost all
of the bytes at once.
This optimization could be made for vector
without any help from the
std::lib
. Other optimizations are possible, but could only be made
from within the std::lib
. For example, what if T
is
bool
in the above example? vector<bool>
doesn't
follow the usual vector
rules. What about
deque<T>
? It could hash its internal contiguous buffers all
at once, but there is no way to implement that without intimate knowledge of the
internals of the deque
implementation. Externally, the best one
can do for deque<T>
is to send each individual T
to hash_append
one at a time. This still gives the very same
correct message, but is just much slower.
Because only the std::lib implementor can fully implement this optimization for
types such as deque
, bitset
and
vector<bool>
, it is important that we standardize
is_contiguously_hashable
and hash_append
instead of
asking the programmer to implement them (for std-defined types).
If you believe your type to be contiguously hashable, you should
specialize is_contiguously_hashable<YourType>
appropriately,
as has been shown for pair
. This would mean that not only is
hashing YourType
optimized, but hashing
vector<YourType>
, et. al. is also optimized! But note that
there is no bullet proof way to automate the registration of
YourType
with is_contiguously_hashable
as IEEE
floating point so ably demonstrates. To do so requires an in depth analysis
of operator==
for YourType
, which only the author
of YourType
is qualified to do.
hash_append
the same thing as
boost::hash_combine
?No!
boost::hash_combine
is used to combine an already computed hash
code with an object that is to be hashed with boost::hash<T>
(and this is also the
N3876
hash_combine
, modulo using std::hash<T>
).
The
N3333
hash_combine
takes two objects, hashes both of them with
std::hash<T>
, and combines those two hash codes into one.
In contrast hash_append
is used to expose an object's hashable
state to an arbitrary hashing algorithm. It is up to the generic
hashing algorithm to decide how to combine later bytes with earlier bytes.
hash_append
the same thing as
serialization?
It is very closely related. Close enough that there may be a way to
elegantly combine the two. Each type can expose its state to a
HashAlgorithm
or Serializer
. However there
are differences. IEEE floating point is our poster-child for the
difference. For hashing, IEEE floating point needs to hide the difference
between -0. and 0. For serialization one needs to keep these two values
distinct. Combining these two functions, for now, remains beyond the scope
of this paper.
hash_append
?Yes, this is easily written as:
template <class HashAlgorithm, class T0, class T1, class ...T> inline void hash_append (HashAlgorithm& h, T0 const& t0, T1 const& t1, T const& ...t) noexcept { hash_append (h, t0); hash_append (h, t1, t...); }
This allows hash_append
for X (for example) to be rewritten as:
template <class HashAlgorithm> friend void hash_append(HashAlgorithm& h, X const& x) noexcept { using std::hash_append; hash_append(h, x.date_, x.data_); }
Algorithms such as CityHash are not efficiently adapted to this infrastructure, because as currently coded, CityHash actually hashes the end of the buffer first. However SpookyHash, which is reported to have quality comparable to CityHash is trivial to incorporate:
#include "SpookyV2.h" class spooky { SpookyHash state_; public: using result_type = std::size_t; spooky(std::size_t seed1 = 1, std::size_t seed2 = 2) noexcept { state_.Init(seed1, seed2); } void operator()(void const* key, std::size_t len) noexcept { state_.Update(key, len); } explicit operator result_type() noexcept { std::uint64_t h1, h2; state_.Final(&h1, &h2); return h1; } };
MurmurHash2, MurmurHash3, and the cryptographically secure algorithms SipHash and the SHA-2 family are also efficiently adaptable to this framework. Indeed, CityHash is the only hashing algorithm we have come across to date which is not efficiently adapted to this framework.
Given the class X shown above, with its complex state distributed among at
least two different contiguous chunks of memory, and potentially many more
if the container switched from vector
to deque
or
list
, one can create an unordered container with the default
hash function like so:
std::unordered_set<X, std::uhash<>> my_set;
If one instead wanted to specify FNV-1a, the code is easily modified to:
std::unordered_set<X, std::uhash<fnv1a>> my_set;
This would change the hash code algorithm for every vector
,
every deque
, every string
, every char
,
every int
, etc. for which X considered part of its hash-worthy state.
That is, hashing algorithms are controlled at the top of the data structure
chain, at the point where the client (e.g. unordered_map
) asks for
the hash. It is not controlled at all down at the bottom of the data structure
chain. I.e. int
has no clue how to hash itself. It only knows
what state needs to be exposed to a hashing algorithm.
And there is no combining step. The hash algorithm works identically as if you had copied all of the various discontiguous chunks of state into one big contiguous chunk of memory, and fed that one big chunk to the hash algorithm.
If one wants to use spooky
instead, simply change in one
place:
std::unordered_set<X, std::uhash<spooky>> my_set;
If a new hashing algorithm is invented tomorrow, and you want to use it, all that needs to be done is to write an adaptor for it:
class new_hash_function { public: using result_type = std::size_t; new_hash_function() noexcept; void operator()(void const* key, std::size_t len) noexcept; explicit operator result_type() noexcept; };
And then use it:
std::unordered_set<X, std::uhash<new_hash_function>> my_set;
You do not need to revisit the hash_append
for X, nor for any
of X's sub-types. The the N hashing algorithms x M sub-types problem has
been solved!
hash_append
Pimpl designs?
So far, every hash_append
function shown must be templated on
HashAlgorithm
so as to handle any hashing algorithm requested by
some unknown, far away client. But with the
Pimpl design, one
can not send a templated HashAlgorithm
past the implementation
firewall.
Or can you ... ?
With the help of std::function
one can type erase the
templated HashAlgorithm
, adapting it to a type with a concrete
type, and pass that concrete HashAlgorithm
through the
implementation firewall. Imagine a class as shown
here.
Here is how it can support arbitrary hash algorithms with the proposed
infrastructure:
class Handle { struct CheshireCat; // Not defined here CheshireCat* smile; // Handle public: // Other operations... // Hash support using type_erased_hasher = acme::type_erased_hasher<std::size_t>; friend void hash_append(type_erased_hasher&, CheshireCat const&); template <class HashAlgorithm> friend void hash_append(HashAlgorithm& h, Handle const& x) { using std::hash_append; if (x.smile == nullptr) hash_append(h, nullptr); else { type_erased_hasher temp(std::move(h)); hash_append(temp, *x.smile); h = std::move(*temp.target<HashAlgorithm>()); } } };
So you still have to implement a templated hash_append
for
Handle
, but the implementation of that function forwards to a
non-template function which can be implemented in the source, within
the definition of CheshireCat
:
friend void hash_append(Handle::type_erased_hasher& h, CheshireCat const& c) { using std::hash_append; hash_append(h, c.data1_, c.data2_, etc. ...); }
Besides the type of the HashAlgorithm
, hash_append
for CheshireCat
looks just like any other
hash_append
.
The magic is in acme::type_erased_hasher<std::size_t>
, not
proposed (thus the namespace acme
).
Appendix A outlines exactly how to code
acme::type_erased_hasher<std::size_t>
. In a nutshell, this
is a HashAlgorithm
adaptor, which takes any
HashAlgorithm
, stores it in a std::function<void(void
const*, std::size_t)>
, and makes the std::function
behave like a HashAlgorithm
.
Think about what has just happened here. You've compiled CheshireCat.cpp
today. And tomorrow, when somebody invents a brand new hash algorithm,
your CheshireCat.cpp uses it, with no re-compile necessary, for the cost of a
virtual function call (or many such calls) to the HashAlgorithm
.
And yet no other client of this new HashAlgorithm
(outside of
those called by CheshireCat
), is forced to access the new
hashing algorithm via a virtual function call. That borders on magic!
It is this very concern (hashing of Pimpl's) that decided the name of the
member function of HashAlgorithm
s which appends state to the
hash algorithm:
void operator()(void const* key, std::size_t len) noexcept;
Had this member function been given any other name, such as:
void append(void const* key, std::size_t len) noexcept;
then programmers would not be able to use std::function
to
create a type-erased wrapper around a templated HashAlgorithm
.
Many hash algorithms can be randomly seeded during the initialization stage
in such a way that the hash code produced for a type is constant between
invocations by a single client (just like a non-seeded algorithm), but varies
between clients. The variance might be per-process, but could also be as
frequent as per-hash-functor construction, excluding copy or move
construction. In the latter case one might have two distinct
unordered_set
s (for example) of the same type, and even
containing the same data, and yet have the two containers result in different
hash codes for the same values. Doing so can help harden an application from
attacks when the application must hash keys supplied by an untrusted source.
This is remarkably easily done with this proposal. One codes one new
hash functor, which can be used with any HashAlgorithm
which accepts a seed, and for any type which already has
hash_append
implemented (even those CheshireCat
s
which have already been compiled, and can not be recompiled).
Here is one possible implementation for a hash functor that is randomly seeded by a seed selected on a per-process basis:
std::tuple<std::uint64_t, std::uint64_t> get_process_seed(); template <class HashAlgorithm = acme::siphash> class process_seeded_hash { public: using result_type = typename HashAlgorithm::result_type; template <class T> result_type operator()(T const& t) const noexcept { std::uint64_t seed0; std::uint64_t seed1; std::tie(seed0, seed1) = get_process_seed(); HashAlgorithm h(seed0, seed1); using std::hash_append; hash_append(h, t); return static_cast<result_type>(h); } };
And then in a source:
namespace { std::tuple<std::uint64_t, std::uint64_t> init_seeds() { std::mt19937_64 eng{std::random_device{}()}; return std::tuple<std::uint64_t, std::uint64_t>{eng(), eng()}; } } // unnamed std::tuple<std::uint64_t, std::uint64_t> get_process_seed() { static std::tuple<std::uint64_t, std::uint64_t> seeds = init_seeds(); return seeds; }
And then use it:
std::unordered_set<MyType, process_seeded_hash<>> my_set;
In this example, the hashing algorithm is initialized with a random seed when
process_seeded_hash
is invoked. The same seed is used to
initialize the algorithm on each hash functor invocation, and for all copies
of the functor, for the life of the process.
Alternatively, one could randomly seed the hash functor on each default construction:
template <class HashAlgorithm = acme::siphash> class randomly_seeded_hash { private: static std::mutex mut_s; static std::mt19937_64 rand_s; std::size_t seed0_; std::size_t seed1_; public: using result_type = typename HashAlgorithm::result_type; randomly_seeded_hash() { std::lock_guard<std::mutex> _(mut_s); seed0_ = rand_s(); seed1_ = rand_s(); } template <class T> result_type operator()(T const& t) const noexcept { HashAlgorithm h(seed0_, seed1_); using std::hash_append; hash_append(h, t); return static_cast<result_type>(h); } }; template <class HashAlgorithm> std::mutex randomly_seeded_hash<HashAlgorithm>::mut_s; template <class HashAlgorithm> std::mt19937_64 randomly_seeded_hash<HashAlgorithm>::rand_s{std::random_device{}()};
Perhaps using it like:
std::unordered_set<MyType, randomly_seeded_hash<acme::spooky>> my_set;
One uses the same technique to apply
salting,
or padding
to a type to be hashed. E.g. one would prepend and/or append the
salt or
padding
to the message of T
by using additional calls to
hash_append
in the operator()(T const& t)
of the
hash functor.
Emphasis
There is no need for the standard to specify a random seeding policy or interface, because using this infrastructure the client can very easily specify his own random seeding policy without having to revisit every type that needs to be hashed, and without having to heavily invest in any given hashing algorithm. It can be done with only a few dozen lines of code. And he can easily do so in a per-use manner: I.e. in use-case A we need to randomly seed the hashing of types X and Y. And in use-case B we need to not seed the hashing of types Y and Z. Type Y is correctly handled in both use-cases, and without having to revisit Y or Y's sub-types. Y remains ignorant of the detail as to whether it is being hashed with a random seed or not (or even with what hashing algorithm).
Flexibility is built into this system in exactly the right places so as to achieve maximum options for the programmer with an absolute minimum of programmer intervention. The std::lib merely has to set up the right infrastructure, and provide a simple default.
The unordered containers present a problem. The problem is not specific to this infrastructure. Neither N3333 nor N3876 solve this problem either. But we highlight the problem here so as to definitively state that we do not solve the problem here either.
Given two unordered_set<int>
:
std::unordered_set<int> s1{1, 2, 3}; std::unordered_set<int> s2{3, 2, 1};
One can assert that s1 == s2
, and yet if one iterates over
s1
and s2
, one will not (in general) come upon the
same elements in the same order. So in what order do you hash the elements
of an unordered sequence? Since s1 == s2
, then hash(s1)
== hash(s2)
must also be true.
There are several answers to this dilemma that will work. However there is no
answer that is definitely better than all other answers. Therefore we recommend
that we not standardize a hash_append
overload for any of
the unordered containers. If a client really wants to hash an unordered
container, then they can choose a technique that works for them, and do so.
For example, one could hash each element using a copy of the
HashAlgorithm
, and then append the sum of all hash codes to the
state of the original HashAlgorithm
:
template <class HashAlgorithm, class Key, class Hash, class Pred, class Alloc> void hash_append(HashAlgorithm& h, std::unordered_set<Key, Hash, Pred, Alloc> const& s) { using result_type = typename HashAlgorithm::result_type; result_type k{}; for (auto const& x : s) { HashAlgorithm htemp{h}; hash_append(htemp, x); k += static_cast<result_type>(htemp); } hash_append(h, k, s.size()); }
Or one could sort all the elements and hash the elements in sorted order:
template <class HashAlgorithm, class Key, class Hash, class Pred, class Alloc> void hash_append(HashAlgorithm& h, std::unordered_set<Key, Hash, Pred, Alloc> const& s) { hash_append(h, std::set<Key>(s.begin(), s.end())); }
And there are various other schemes. They are all implementable. But they each
have their advantages and disadvantages. Therefore this proposal proposes
none of them. Should the future expose the ideal hash_append
specification for unordered containers, it can always be added at that time.
The answer to this question is nothing. But to demonstrate this answer, X has been given a randomized default constructor:
std::mt19937_64 eng; X::X() { std::uniform_int_distribution<short> yeardata(1914, 2014); std::uniform_int_distribution<unsigned char> monthdata(1, 12); std::uniform_int_distribution<unsigned char> daydata(1, 28); std::uniform_int_distribution<std::size_t> veclen(0, 100); std::uniform_int_distribution<int> int1data(1, 10); std::uniform_int_distribution<int> int2data(-3, 3); std::get<0>(date_) = yeardata(eng); std::get<1>(date_) = monthdata(eng); std::get<2>(date_) = daydata(eng); data_.resize(veclen(eng)); for (auto& p : data_) { p.first = int1data(eng); p.second = int2data(eng); } }
Given this, one can easily create a great number of random X's and specify any hash algorithm. Herein I'm testing 7 implementations of hashing 1,000,000 X's:
Using std::hash
augmented with
N3876
as shown in Solution 1.
Using llvm::hash_value
as shown here
which is intended to be representative of
N3333.
Using uhash<fnv1a>
where fnv1a
is derived from
here.
Using uhash<jenkins1>
where jenkins1
is derived from
here.
Using uhash<MurmurHash2A>
where MurmurHash2A
is derived from
here.
Using uhash<spooky>
where spooky
is derived from
here.
Using uhash<siphash>
where siphash
is derived from
here.
The hash function quality tester suite used herein is described below.
This test looks at each 64 bit hash code as a collection of 16 hex-digits. The expectation is that each hex-digit should be roughly equally represented in each hexadecimal place of the hash code. The test returns maximum deviation of the average, from the expected average. An ideal score is 0.
This test simply counts the number of duplicate hashes. A score of 0 indicates each hash code is unique. A score of 1 indicates that all hash codes are the same.
This test is TestDistribution gratefully borrowed from the smhasher test suite. An ideal score is 0.
This test hashes the hash codes into a list of buckets sized to the number of hash codes (load factor == 1). It then scores each bucket with the number of comparisons required to look up each element in the bucket, and then averages the number of comparisons per lookup. Given a randomized hash, the result should be lf/2+1, where lf is load factor. This test returns the percent difference above/below the ideal randomized result.
This test hashes the hash codes into a list of buckets sized to the number of hash codes (load factor == 1). It then returns the max collision count among all of the buckets. This represents the maximum cost for a lookup of an element not found. Assuming the not-found-key hashes to a random bucket, the average cost of looking up a not-found-key is simply the load factor (i.e. independent of the quality of the hash function distribution).
A million hash codes are generated from a million randomized but unique X's,
randomized by a default constructed std::mt19937_64
, and fed to
these tests. For each test, the smaller the result the better.
Test Results -- smaller is better test 1 test 2 test 3 test 4 test 5 total time (sec) std::hash<X> 0.273744 0.0000460148 0.966285 -0.000257333 8 0.472963s llvm::hash_value 0.012 0 0.000913715 0.000254667 8 0.545283s fnv1a 0.012576 0 0.00111228 0.000248 9 0.84607s jenkins1 0.0121775 0 0.00121271 -0.000572667 9 1.1119s MurmurHash2A 15 0.000110984 6.36293 0.000393333 9 0.467501s spooky 0.010816 0 0.000968923 -0.000359333 9 0.628721s siphash 0.011072 0 0.00113216 -0.000162667 8 0.584353s
The intent in showing the above table is two-fold: To show that running times
are competitive, and that with the exception of MurmurHash2A
, and
possibly std::hash<X>
, the quality results are competitive.
If one insists on picking "the best algorithm" from this table, we caution
you with one additional test.
The table below represents the same test except that X's data members have been changed as shown:
std::tuple<shortint, unsigned char, unsigned char> date_; std::vector<std::pair<int,intshort>> data_;
Other than this change in types, no other change has been made. Not even the random values assigned to each type. Here are the results:
Test Results (alternative) -- smaller is better test 1 test 2 test 3 test 4 test 5 total time (sec) std::hash<X> 0.273744 0.0000460148 0.966285 -0.000257333 8 0.476888s llvm::hash_value 0.0111664 0 0.00105212 0.000167333 8 1.10327s fnv1a 0.011456 0 0.00132384 -0.00015 9 0.708467s jenkins1 0.010512 0 0.000864258 -0.000433333 10 0.9079s MurmurHash2A 15 0.000115991 6.36293 -0.000863333 8 1.09589s spooky 0.013328 0 0.00127677 -0.000210667 9 1.33591s siphash 0.01224 0 0.00111516 0.000342667 8 1.63656s
While none of the quality metrics changes that much with this minor change,
the timing tests did vary considerably. For example
llvm::hash_value
was previously one of the faster algorithms
(fastest among those with good quality results), and is now 50% slower than
the fastest among those with quality results. The take-away point is that
there is no "best algorithm". But there is a lot of value in being able to
easily change algorithms for testing, performance and security purposes.
This paper presents an infrastructure that decouples types from hashing algorithms. This decoupling has several benefits:
template <class T> struct is_contiguously_hashable; // A type property trait template <class HashAlgorithm> void hash_append(HashAlgorithm& h, T const& t) noexcept; // overloaded for each type T template <class HashAlgorithm = unspecified-default-hasher> struct uhash; // A hashing functor
There is an example implementation and lots of example code using the example implementation here. See hash_append.h for the example implementation.
type_erased_hasher
Though type_erased_hasher
is not proposed, it easily could be if
the committee so desires. Here is how it is implemented, whether by the
programmer, or by a std::lib implementor:
template <class ResultType> class type_erased_hasher { public: using result_type = ResultType; private: using function = std::function<void(void const*, std::size_t)>; function hasher_; result_type (*convert_)(function&); public: template <class HashAlgorithm, class = std::enable_if_t < std::is_constructible<function, HashAlgorithm>{} && std::is_same<typename std::decay_t<HashAlgorithm>::result_type, result_type>{} > > explicit type_erased_hasher(HashAlgorithm&& h) : hasher_(std::forward<HashAlgorithm>(h)) , convert_(convert<std::decay_t<HashAlgorithm>>) { } void operator()(void const* key, std::size_t len) { hasher_(key, len); } explicit operator result_type() noexcept { return convert_(hasher_); } template <class T> T* target() noexcept { return hasher_.target<T>(); } private: template <class HashAlgorithm> static result_type convert(function& f) noexcept { return static_cast<result_type>(*f.target<HashAlgorithm>());; } };
type_erased_hasher
must be templated on result_type
(or have a concrete result_type
), otherwise it can not have an
explicit conversion operator to that type.
The type_erased_hasher
stores a std::function<void(void
const*, std::size_t)>
, and a pointer to a function taking such a
function
and returning a result_type
. The latter is
necessary to capture the type of the HashAlgorithm
in the
type_erased_hasher
constructor, so that the same
HashAlgorithm
type can later be used in the conversion to
result_type
.
The constructor is naturally templated on HashAlgorithm
, which
can be perfectly forwarded to the underlying std::function
. The
constructor also initialized the function pointer convert_
using
the decayed type of HashAlgorithm
. The pointed-to function will
extract the HashAlgorithm
from the std::function
and explicitly convert it to the result_type
.
Note that the conversion to result_type
isn't explicitly used in
the Pimpl example of this paper. However the
hash_append
of the private implementation may need to copy the
type_erased_hasher
, and possibly convert a copy to a
result_type
as part of its hash computation. Such code has been
prototyped in motivating examples, such as the hash_append
for
an unordered sequence container.
The Pimpl example does need access to the stored
HashAlgorithm
after the call to hash_append
to
recover its state. This is accomplished with the target
function which simply forwards to std::function
's
target
function.
In Elements of Programming, Stepanov uses the term uniquely represented for
the property this paper refers to as contiguously hashable.
Therefore another good name name for is_contiguously_hashable
is
is_uniquely_represented
.
debugHasher
Another interesting "hash algorithm" is debugHasher
. This is a
small utility that can be used to help type authors debug their
hash_append
function. This utility is not proposed. It is simply
presented herein to illustrate the utility of this overall hashing infrastructure
design.
#include <iostream> #include <iomanip> #include <vector> class debugHasher { std::vector<unsigned char> buf_; public: using result_type = std::size_t; void operator()(void const* key, std::size_t len) noexcept { unsigned char const* p = static_cast<unsigned char const*>(key); unsigned char const* const e = p + len; for (; p < e; ++p) buf_.push_back(*p); } explicit operator std::size_t() noexcept { std::cout << std::hex; std::cout << std::setfill('0'); unsigned int n = 0; for (auto c : buf_) { std::cout << std::setw(2) << (unsigned)c << ' '; if (++n == 16) { std::cout << '\n'; n = 0; } } std::cout << '\n'; std::cout << std::dec; std::cout << std::setfill(' '); return buf_.size(); } };
debugHasher
is a fake "hashing algorithm" that does nothing but
collect the bytes sent to a hash by the entire collection of the calls to
hash_append
made by a key and all of its sub-types. The collection
of bytes is output to cout
when the hasher is converted to its
result_type
.
As can be readily seen, it is not difficult to create such a debugging tool. It is then used, just as easily:
std::vector<std::vector<std::pair<int, std::string>>> v {{{1, "abc"}}, {{2, "bca"}, {3, "cba"}}, {}}; std::cout << uhash<debugHasher>{}(v) << '\n';
Assuming a 32 bit int
, 64 bit size_t
, and little
endian, this will reliably output:
01 00 00 00 61 62 63 00 01 00 00 00 00 00 00 00 02 00 00 00 62 63 61 00 03 00 00 00 63 61 62 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 03 00 00 00 00 00 00 00 56
The last line is simply the number of bytes that have been sent to the
HashAlgorithm
. The first 4 lines are those bytes, formatted as
two hex digits per byte, with bytes separated by a space, and 16 bytes per
line for readability. If one carefully inspects this byte stream and
compares it to the data structure which has been "hashed", and to the
proposed hash_append
above for vector
,
string
and pair
, one can verify that the byte
stream is consistent with the specification.
Improving debugHasher
to collect useful statistics such as the
number of times called, and the average number of bytes hashed per call, is left
as a fun exercise for the reader.
Add a new section to [hash.requirements]:
HashAlgorithm requirements [hash.algo.rqmts]
A type
H
meets theHashAlgorithm
requirements if all of the following are met:
H::result_type
is valid and denotes aMoveConstrutible
type (14.8.2 [temp.deduct]).
H
is either default constructible, or constructible by some documented seed. This construction shall initializeH
to a deterministic state such that if two instances are constructed with the same arguments, then they have equivalent state.
H
isCopyConstructible
. Updates to the state of one copy shall have no impact on any other copy.
H
isCopyAssignable
. Updates to the state of one copy shall have no impact on any other copy. void operator()(void const* key, std::size_t len);Requires: If
len > 0
,key
points tolen
contiguous bytes to be consumed by theHashAlgorithm
. The conversion toresult_type
has not been called on this object since construction, or since*this
was assigned to.Effects: Updates the state of the
HashAlgorithm
using thelen
bytes referred to by{key, len}
.If
len == 0
thenkey
is not dereferenced, and there are no effects.Consider two keys
{k1, len1}
and{k2, len2}
, withlen1 > 0
andlen2 > 0
. Iflen1 != len2
, the two keys are considered not equivalent. Iflen1 == len2
and ifmemcmp(k1, k2, len1) == 0
, the two keys are equivalent, else they are not equivalent.If two instances of
HashAlgorithm
(e.g.h1
andh2
) have the same state prior to an update operation, and given two equivalent keys{k1, len}
and{k2, len}
, then afterh1(k1, len)
andh2(k2, len)
, thenh1
andh2
shall have the same updated state. If{k1, len1}
and{k2, len2}
are not equivalent, then afterh1(k1, len1)
andh2(k2, len2)
,h1
andh2
should have different updated state.Given a key
{k, len}
withlen > 0
, one can create multiple keys each with length li, where the first key k0== k
, and subsequent keys ki == ki-1 + li-1. Combined with a constraint that ∑ li== len
, the single key{k, len}
shall be equivalent to the application of all of the keys{
ki, li}
applied in order.The
HashAlgorithm
shall not access this memory range after the update operation returns. explicit operator result_type();Requires: This operation has not been called on this object since construction or since
*this
was assigned to.Effects: Converts the state of the
HashAlgorithm
to aresult_type
. Two instances of the same type ofHashAlgorithm
, with the same state, shall return the same value. It is unspecified if this operation changes the state of theHashAlgorithm
.Returns: The converted state.
Add a new section to [hash.requirements]:
HashAlgorithm
-basedHash
requirements [hash.algo.hash.rqmts]A type
H
meets theHashAlgorithm
-basedHash
requirements if all of following are met:
H
meets theHash
requirements ([hash.requirements]).
H
is a class template instantiation of the formtemplate <class HashAlgorithm, class Args> struct H;where Args is zero or more type arguments, and the first template parameter meets the
HashAlgorithm
requirements ([hash.algo.rqmts]). TheHashAlgorithm
parameter may be defaulted.
H
has the nested type:using result_type = typename HashAlgorithm::result_type;
H
is either default constructible, or constructible by some documented seed. This construction shall initializeH
.H
may be stateless or have state. If not stateless, different default constructions, and different seeded constructions (even with the same seeds), are not required to initializeH
to the same state.
H
isCopyConstructible
.
H
isCopyAssignable
. template <class T> result_type operator()(T const& t) const;Requires:
HashAlgorithm
shall be constructible as specified by a concreteH
type.Effects: Constructs a
HashAlgorithm h
with automatic storage. Each concreteH
type shall specify howh
is constructed. Howeverh
shall be constructed to the same state for every invocation of(*this)(t)
. Updates the state of theHashAlgorithm
in an unspecified manner, except that there shall be exactly one call to:using std::hash_append; hash_append(h, t);at some time during the update operation. Furthermore, subsequent calls shall update the the local
h
with exactly the same state every time, except as changed by different values fort
, unless there is an intervening assignment to*this
between calls to this operator.Returns:
static_cast<result_type>(h)
.[Note: For the same value of
t
, the same value is returned on subsequent calls unless there is an intervening assignment to*this
between calls to this operator. — end note]
Add a new row to Table 49 — Type property predicates in [meta.unary.prop]:
Template Condition Preconditions template <class T> struct is_contiguously_hashable;A type T
is contiguously hashable if for any two valuesx
andy
of a type, ifx == y
, then it must also be true thatmemcmp(addressof(x), addressof(y), sizeof(T)) == 0
. ItT
is an array type, thenT
is contiguously hashable ifremove_extent_t<T>
is contiguously hashable.T
shall be a complete object type. A program may specialize this trait ifT
is a user-defined type and the specialization conforms to the Condition.
Add a new section to [unord.hash]:
hash_append
[unord.hash_append]template <class HashAlgorithm, class T> void hash_append(HashAlgorithm& h, T const& t);Remarks: This function shall not participate in overload resolution, unless
is_contiguously_hashable<T>::value
istrue
.Effects:
h(addressof(t), sizeof(t))
.For any scalar types
T
, except for member pointers, for whichis_contiguously_hashable<T>{}
evaluates tofalse
, there shall exist an overload ofhash_append
similar to that shown above for contiguously hashable types. For each of these overloads for scalar typesT
, the implementation shall ensure that for two values ofT
(e.g.t1
andt2
), ift1 == t2
, thenhash_append(h, t1)
shall update the state ofh
to the same state as doeshash_append(h, t2)
. And ift1 != t2
, thenhash_append(h, t1)
should update the state ofh
to a different state as doeshash_append(h, t2)
. It is unspecified exactly what signature such overloads will have, so it is not portable to form function pointers to these overloads.[Note:
For example, here is a plausible implementation of
hash_append
for IEEE floating point:template <class HashAlgorithm, class T> enable_if_t < is_floating_point<T>{} > hash_append(HashAlgorithm& h, T t) { if (t == 0) t = 0; h(&t, sizeof(t)); }This implementation accepts the
T
by value instead of byconst&
, and gives -0. and 0. the same bit representation prior to forwarding the value to theHashAlgorithm
(since these two values compare equal).And here is a plausible definition for
nullptr_t
:template <class HashAlgorithm> void hash_append(HashAlgorithm& h, nullptr_t) { void const* p = nullptr; h(&p, sizeof(p)); }— end note]
template <class HashAlgorithm, class T, size_t N> void hash_append(HashAlgorithm& h, T (&a)[N]);Remarks: This function shall not participate in overload resolution, unless
is_contiguously_hashable<T>::value
isfalse
.Effects:
for (auto const& t : a) hash_append(h, t);[Note: It is intentional that the
hash_append
for built-in arrays behave in exactly this way, sending a "message" to theHashAlgorithm
of each element, in order, and nothing else. This "message" to theHashAlgorithm
is considered part of a built-in array's API. It is also intentional that for arrays ofT
that are contiguously hashable, the exact same message is sent to theHashAlgorithm
, except in one call instead of many. — end note]template <class HashAlgorithm, class T0, class T1, class ...T> inline void hash_append (HashAlgorithm& h, T0 const& t0, T1 const& t1, T const& ...t);Effects:
hash_append (h, t0); hash_append (h, t1, t...);
Add a new section to [unord.hash]:
uhash
[unord.hash.uhash]template <class HashAlgorithm = unspecified> struct uhash { using result_type = typename HashAlgorithm::result_type; template <class T> result_type operator()(T const& t) const; };Instantiations of
uhash
meet theHashAlgorithm
-basedHash
requirements ([hash.algo.hash.rqmts]).The template parameter
HashAlgorithm
meets theHashAlgorithm
requirements ([hash.algo.rqmts]). The unspecified default for this parameter refers to an implementation provided defaultHashAlgorithm
.template <class HashAlgorithm> template <class T> typename HashAlgorithm::result_type uhash<HashAlgorithm>::operator()(T const& t) const;Effects: Default constructs a
HashAlgorithm
with automatic storage duration (for example namedh
), and callshash_append(h, t)
(unqualified).Returns:
static_cast<result_type>(h)
.
Add to [type.info]:
class type_info { ... }; template <HashAlgorithm> void hash_append(HashAlgorithm& h, type_info const& t);...
template <HashAlgorithm> void hash_append(HashAlgorithm& h, type_info const& t);Effects: Updates the state of
h
with data that is unique tot
with respect to all othertype_info
s that compare not equal tot
.
Add to the synopsis in [syserr]:
template <class HashAlgorithm> void hash_append(HashAlgorithm& h, error_code const& ec)
Add to [syserr.hash]:
template <class HashAlgorithm> void hash_append(HashAlgorithm& h, error_code const& ec)Effects:
hash_append(h, ec.value(), &ec.category());
Add to the synopsis in [utility]:
template <class T, class U> struct is_contiguously_hashable<pair<T, U>> : public integral_constant<bool, is_contiguously_hashable<T>{} && is_contiguously_hashable<U>{} && sizeof(T) + sizeof(U) == sizeof(pair<T, U>)> {}; template <class HashAlgorithm, class T, class U> void hash_append(HashAlgorithm& h, pair<T, U> const& p)
Add a new section to [pairs]: [pairs.hash]:
Hashing pair [pairs.hash]
template <class HashAlgorithm, class T, class U> void hash_append(HashAlgorithm& h, pair<T, U> const& p)Remarks: This function shall not participate in overload resolution, unless
is_contiguously_hashable<pair<T, U>>::value
isfalse
.Effects:
hash_append(h, p.first, p.second);
Add to the synopsis in [tuple.general]:
template <class ...T> struct is_contiguously_hashable<tuple<T...>>; template <class HashAlgorithm, class ...T> void hash_append(HashAlgorithm& h, tuple<T...> const& t)
Add to [tuple.special]:
template <class ...T> struct is_contiguously_hashable<tuple<T...>>;Publicly derives from
true_type
if for eachType
inT...
,is_contiguously_hashable<Type>{}
istrue
, and if the sum of allsizeof(Type)
is equal tosizeof(tuple<T...>)
, else publicly derives fromfalse_type
.template <class HashAlgorithm, class ...T> void hash_append(HashAlgorithm& h, tuple<T...> const& t)Remarks: This function shall not participate in overload resolution, unless
is_contiguously_hashable<tuple<T...>>::value
isfalse
.Effects: Calls
hash_append(h, get<I>(t))
for eachI
in the range [0, sizeof...(T)). If sizeof...(T) is 0, the function has no effects.
Add to the synopsis in [template.bitset]:
template <class HashAlgorithm, size_t N> void hash_append(HashAlgorithm& h, bitset<N> const& bs)
Add to [bitset.hash]:
template <class HashAlgorithm, size_t N> void hash_append(HashAlgorithm& h, bitset<N> const& bs)Effects: Calls
hash_append(h, w)
successively for somew
integral type for which each bit inw
corresponds to a bit value contained inbs
. The lastw
may contain padding bits which shall be set to 0. After all bits have been appended toh
, callshash_append(h, bs.size())
.
Add to the synopsis in [memory.syn]:
template <class HashAlgorithm, class T, class D> void hash_append(HashAlgorithm& h, unique_ptr<T, D> const& p); template <class HashAlgorithm, class T> void hash_append(HashAlgorithm& h, shared_ptr<T> const& p);
Add to [util.smartptr.hash]:
template <class HashAlgorithm, class T, class D> void hash_append(HashAlgorithm& h, unique_ptr<T, D> const& p); template <class HashAlgorithm, class T> void hash_append(HashAlgorithm& h, shared_ptr<T> const& p);Effects:
hash_append(h, p.get());
Add to the synopsis in [time.syn]:
template <class Rep, class Period> struct is_contiguously_hashable<duration<Rep, Period>> : public integral_constant<bool, is_contiguously_hashable<Rep>{}> {}; template <class Clock, class Duration> struct is_contiguously_hashable<time_point<Clock, Duration>> : public integral_constant<bool, is_contiguously_hashable<Duration>{}> {}; template <class HashAlgorithm, class Rep, class Period> void hash_append(HashAlgorithm& h, duration<Rep, Period> const& d); template <class HashAlgorithm, class Clock, class Duration> void hash_append(HashAlgorithm& h, time_point<Clock, Duration> const& tp);
Add a new section to [time.duration], [time.duration.hash]:
duration hash [time.duration.hash]
template <class HashAlgorithm, class Rep, class Period> void hash_append(HashAlgorithm& h, duration<Rep, Period> const& d);Remarks: This function shall not participate in overload resolution, unless
is_contiguously_hashable<duration<Rep, Period>>::value
isfalse
.Effects:
hash_append(h, d.count())
.
Add a new section to [time.point], [time.point.hash]:
time_point hash [time.point.hash]
template <class HashAlgorithm, class Clock, class Duration> void hash_append(HashAlgorithm& h, time_point<Clock, Duration> const& tp);Remarks: This function shall not participate in overload resolution, unless
is_contiguously_hashable<time_point<Clock, Duration>>::value
isfalse
.Effects:
hash_append(h, tp.time_since_epoch())
.
Add to the synopsis in [type.index.synopsis]:
template <class HashAlgorithm> void hash_append(HashAlgorithm& h, type_index const& ti);
Add to [type.index.hash]:
template <class HashAlgorithm> void hash_append(HashAlgorithm& h, type_index const& ti);Effects:
hash_append(h, *ti.target);
Add to the synopsis in [string.classes]:
template <class HashAlgorithm, class CharT, class Traits, class Alloc> void hash_append(HashAlgorithm& h, basic_string<CharT, Traits, Alloc> const& s);
Add to [basic.string.hash]:
template <class HashAlgorithm, class CharT, class Traits, class Alloc> void hash_append(HashAlgorithm& h, basic_string<CharT, Traits, Alloc> const& s);Effects:
for (auto c : s) hash_append(h, c); hash_append(h, s.size());[Note: If
is_contiguously_hashable<CharT>{}
istrue
, then the following may replace the loop (as an optimization):h(s.data(), s.size()*sizeof(CharT));— end note]
Add to the synopsis of <array>
in [sequences.general]:
template <class T, size_t N> struct is_contiguously_hashable<array<T, N>> : public integral_constant<bool, is_contiguously_hashable<T>{} && sizeof(T)*N == sizeof(array<T, N>)> {}; template <class HashAlgorithm, class T, size_t N> void hash_append(HashAlgorithm& h, array<T, N> const& a);
Add to the synopsis of <deque>
in [sequences.general]:
template <class HashAlgorithm, class T, class Allocator> void hash_append(HashAlgorithm& h, deque<T, Allocator> const& x);
Add to the synopsis of <forward_list>
in [sequences.general]:
template <class HashAlgorithm, class T, class Allocator> void hash_append(HashAlgorithm& h, forward_list<T, Allocator> const& x);
Add to the synopsis of <list>
in [sequences.general]:
template <class HashAlgorithm, class T, class Allocator> void hash_append(HashAlgorithm& h, list<T, Allocator> const& x);
Add to the synopsis of <vector>
in [sequences.general]:
template <class HashAlgorithm, class T, class Allocator> void hash_append(HashAlgorithm& h, vector<T, Allocator> const& x); template <class HashAlgorithm, class Allocator> void hash_append(HashAlgorithm& h, vector<bool, Allocator> const& x);
Add to [array.special]:
template <class HashAlgorithm, class T, size_t N> void hash_append(HashAlgorithm& h, array<T, N> const& a);Remarks: This function shall not participate in overload resolution, unless
is_contiguously_hashable<array<T, N>>::value
isfalse
.Effects:
for (auto const& t : a) hash_append(h, t);
Add to [deque.special]:
template <class HashAlgorithm, class T, class Allocator> void hash_append(HashAlgorithm& h, deque<T, Allocator> const& x);Effects:for (auto const& t : x) hash_append(h, t); hash_append(h, x.size());[Note: When
is_contiguously_hashable<T>{}
istrue
, an implementation may optimize by callingh(p, s)
on suitable contiguous sub-blocks of thedeque
. — end note]
Add to [forwardlist.spec]:
template <class HashAlgorithm, class T, class Allocator> void hash_append(HashAlgorithm& h, forward_list<T, Allocator> const& x);Effects:typename forward_list<T, Allocator>::size_type s{}; for (auto const& t : x) { hash_append(h, t); ++s; } hash_append(h, s);
Add to [list.special]:
template <class HashAlgorithm, class T, class Allocator> void hash_append(HashAlgorithm& h, list<T, Allocator> const& x);Effects:for (auto const& t : x) hash_append(h, t); hash_append(h, x.size());
Add to [vector.special]:
template <class HashAlgorithm, class T, class Allocator> void hash_append(HashAlgorithm& h, vector<T, Allocator> const& x);Effects:for (auto const& t : x) hash_append(h, t); hash_append(h, x.size());[Note: When
is_contiguously_hashable<T>{}
istrue
, an implementation may optimize by callingh(x.data(), x.size()*sizeof(T))
in place of the loop. — end note]
Add to [vector.bool]:
template <class HashAlgorithm, class Allocator> void hash_append(HashAlgorithm& h, vector<bool, Allocator> const& x);Effects: Calls
hash_append(h, w)
successively for somew
integral type for which each bit inw
corresponds to a bit value contained inx
. The lastw
may contain padding bits which shall be set to 0. After all bits have been appended toh
, callshash_append(h, x.size())
.
Add to the synopsis of <map>
in [associative.map.syn]:
template <class HashAlgorithm, class Key, class T, class Compare class Allocator> void hash_append(HashAlgorithm& h, map<Key, T, Compare, Allocator> const& x); template <class HashAlgorithm, class Key, class T, class Compare class Allocator> void hash_append(HashAlgorithm& h, multimap<Key, T, Compare, Allocator> const& x);
Add to the synopsis of <set>
in [associative.set.syn]:
template <class HashAlgorithm, class Key, class Compare class Allocator> void hash_append(HashAlgorithm& h, set<Key, Compare, Allocator> const& x); template <class HashAlgorithm, class Key, class Compare class Allocator> void hash_append(HashAlgorithm& h, multiset<Key, Compare, Allocator> const& x);
Add to [map.special]:
template <class HashAlgorithm, class Key, class T, class Compare class Allocator> void hash_append(HashAlgorithm& h, map<Key, T, Compare, Allocator> const& x);Effects:
for (auto const& t : x) hash_append(h, t); hash_append(h, x.size());
Add to [multimap.special]:
template <class HashAlgorithm, class Key, class T, class Compare class Allocator> void hash_append(HashAlgorithm& h, multimap<Key, T, Compare, Allocator> const& x);Effects:
for (auto const& t : x) hash_append(h, t); hash_append(h, x.size());
Add to [set.special]:
template <class HashAlgorithm, class Key, class Compare class Allocator> void hash_append(HashAlgorithm& h, set<Key, Compare, Allocator> const& x);Effects:
for (auto const& t : x) hash_append(h, t); hash_append(h, x.size());
Add to [multiset.special]:
template <class HashAlgorithm, class Key, class Compare class Allocator> void hash_append(HashAlgorithm& h, multiset<Key, Compare, Allocator> const& x);Effects:
for (auto const& t : x) hash_append(h, t); hash_append(h, x.size());
Add to the synopsis in [complex.syn]:
template <class HashAlgorithm, class T> void hash_append(HashAlgorithm& h, complex<T> const& x);
Add to [complex.ops]:
template <class HashAlgorithm, class T> void hash_append(HashAlgorithm& h, complex<T> const& x);Effects: Calls
hash_append(h, x.real(), x.imag())
.
Add to the synopsis in [thread.thread.id]:
template <class HashAlgorithm> void hash_append(HashAlgorithm& h, thread::id const& id);
Add to [thread.thread.id]:
template <class HashAlgorithm> void hash_append(HashAlgorithm& h, thread::id const& id);Effects: Updates the state of
h
withid
.
Thanks to Daniel James (et al.) for highlighting the problem of hashing zero-length containers with no message.
Thanks to Dix Lorenz (et al.) for pointing out that the
result_type
of the HashAlgorithm
need not be
size_t
, and indeed, can not be if we want this infrastructure to
fully handle cryptographic hash functions (which produce results larger than
a size_t
).
Thanks to Jeremy Maitin-Shepard for pointing out problems in an earlier scheme
to hash std::string
and arrays of char identically. Also thanks
to Jeremy and Chris Jefferson for their guidance on hashing unordered sequences.
Additional thanks to Walter Brown, Daniel Krügler, and Richard Smith for their invaluable review and guidance.
This research has been generously supported by Ripple Labs. We would especially like to thank our colleagues on the RippleD team.