Document number: N3333=12-0023
Date:
Jeffrey Yasskin <jyasskin@google.com>
Chandler Carruth <chandlerc@google.com>
C++11 defined a set of standard hashing containers, which rely on a specialization of std::hash<KeyType> to exist in order to hash their keys. However, we provided no help to users trying to implement this hash function. We also required them to specialize a template in namespace std, as opposed to the namespace their code actually lives in. This leads to confused users and weak hash functions. For example:
namespace my_namespace {
struct OtherType;  // Defined elsewhere.
struct MyType {
  OtherType field1;
  int field2;
  char field3[500];
};
}
namespace std {
template<> struct hash<::my_namespace::MyType> {
  size_t operator()(const ::my_namespace::MyType& val) const {
    return (hash<::my_namespace::OtherType>()(val.field1) ^
            // Wow, that's verbose, and the xor makes it weak.
            hash<int>()(val.field2) ^
            // Oh noes, a copy:
            hash<std::string>()(std::string(val.field3, val.field3 + 500)));
  }
};
}
This paper proposes an improvement to the situation. But first, a digression on the purpose of "hash" functions. There are several uses for this thing we refer to as hashing:

- Hash tables, where the hash value only needs to be stable within a single execution of a single program, and occasional collisions are acceptable.
- Fingerprints and checksums, where the value is saved to disk or sent between processes, and so must be stable across executions and implementations.
- Cryptographic hashes, where it must additionally be infeasible for an adversary to find collisions.

No single algorithm is optimal for all of the above use cases. If we try to make std::hash<> fill all of those roles, it will be the wrong choice for all of them instead.

This paper proposes that std::hash<> be designed for only the first use in the above list, and focuses on making it as simple as possible for the average user of the standard library's hashing containers. A later paper is likely to explore the more sophisticated interface needed to make user-defined types visible to other "fingerprinting" algorithms.
Hash tables only require that std::hash<T>()(object) return a stable value within a single program execution. In theory this allows the standard library to improve the hash implementation over time. However, we found at Google that without very clear documentation, we upgrade hash implementations rarely enough that developers save the values to disk anyway, which causes upgrades to break code. Therefore, we propose that hash functions return a different value in different processes, not just in different implementations, so that incorrect programs fail fast. This has the secondary benefit that it frustrates the pre-calculated collision tables used for denial-of-service attacks.
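As a minimal illustration (the specific calls are ours, not proposed wording), the anti-pattern that per-process seeding turns into a fast failure looks like this:

#include <fstream>
#include <functional>

int main() {
  // Under this proposal, this value may differ between two executions of
  // the same binary, so persisting it is a bug that now surfaces quickly.
  auto h = std::hash<int>()(42);
  std::ofstream("hashes.txt") << h;  // saving a hash to disk: the anti-pattern
}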
At Google, we find many users who are confused about what kinds of things they're allowed to put in namespace std. Requiring them to specialize std::hash<> encourages them to define new things in there, which causes undefined behavior. Even more users dislike the verbosity caused by the need to close out their current namespace, open the std namespace, specialize hash, close the std namespace, and then re-open their original namespace.

To fix this, we propose finding hash_value() by argument-dependent lookup, like swap. std::hash<T>::operator()(t) is redefined to forward to hash_value(t).
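A minimal sketch of that forwarding, assuming only that hash_value's result is explicitly convertible to size_t (illustrative, not proposed wording):

namespace std {
template<typename T>
struct hash {
  size_t operator()(const T& t) const {
    using std::hash_value;                      // makes the std:: overloads visible
    return static_cast<size_t>(hash_value(t));  // ADL also finds overloads in T's namespace
  }
};
}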
This leaves the problem of how to hash built-in types. In c++std-lib-31719, Peter Dimov observed that a "practical problem with the hash_value design is that types implicitly convertible to bool such as shared_ptr have a hash_value [of] 0 or 1 unless overridden." C++11's std::shared_ptr only has an explicit conversion to bool, so this objection doesn't apply there, but it would be poor form to lay such traps for user-defined types that aren't using new C++11 features. Instead, define:
namespace std {
template<typename Bool>
typename enable_if<is_same<Bool, bool>::value, size_t>::type
hash_value(Bool b) { return ...; }
}
struct BoolConvertible { operator bool() { return true; } };
void f() {
BoolConvertible bc;
using std::hash_value;
hash_value(bc); // error
}
This templating should be done for every hash_value(primitive) overload to avoid similar conversion mistakes. Other library, extension, or language techniques to prevent implicit conversions to the argument types should also be allowed.
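For instance, the integer overload might be guarded the same way (a sketch using the same enable_if pattern; the body is elided):

namespace std {
template<typename Int>
typename enable_if<is_same<Int, int>::value, size_t>::type
hash_value(Int i);  // matches int exactly; types merely convertible to int do not bind here
}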
In order to keep compatibility with existing specializations of std::hash, the standard library should continue calling std::hash<T>()(value) where it needs a hash value, but user code that doesn't need such compatibility can move to the more natural hash_value(value).
Templated user code needs to use the more verbose

using std::hash_value;
hash_value(value);

in order to pick up the definitions for primitive types. Perhaps the standard library should also include an adl_hash_value() with that definition to make this a bit less verbose. On the other hand, only hash table implementations are likely to call hash_value, so adl_hash_value may not be worth it.
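If it were provided, adl_hash_value() would be little more than the two lines above wrapped in a function, roughly:

namespace std {
template<typename T>
hash_code adl_hash_value(const T& t) {
  using std::hash_value;  // bring the overloads for primitive types into scope
  return hash_value(t);   // argument-dependent lookup also searches T's namespace
}
}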
hash_value()
If users are asked to implement a hash function for their own types with no guidance, they generally write bad hash functions. Instead, we should provide a simple function to pass hash-relevant member variables into, in order to define a decent hash function:
namespace my_namespace {
struct OtherType { ... };
struct MyType {
OtherType field1;
int field2;
char field3[500];
float field4; // Does not contribute to equality.
};
std::hash_code hash_value(const MyType& val) {
return std::hash_combine(val.field1,
val.field2,
val.field3);
}
}
std::hash_code will be explained and defined later.
hash_combine()

hash_combine() is declared as:
namespace std {
template<typename T, typename... U>
hash_code hash_combine(const T& arg1, const U& ...args);
}
and has an implementation-defined definition satisfying the following properties:

- hash_combine(args...) interacts with user-defined types by calling unqualified hash_value() on them.
- If all_of((args1 == args2)...), then hash_combine(args1...) == hash_combine(args2...).
- If two calls to hash_combine() aren't defined to return equal values, then their return values must be completely different with high probability. (See http://en.wikipedia.org/wiki/Avalanche_effect for a possible, but not the only possible, meaning of "completely different with high probability".) Situations that don't require equal hash_combine results include:
  - Any change to hash_combine()'s arguments. (This includes changing hash_combine(false) to hash_combine(true).)
  - A call to hash_combine in a different execution of the same binary.
  - Replacing hash_combine(arg1, arg2, arg3) with hash_combine(hash_combine(arg1, arg2), arg3).
A downside to this approach is that modern hash functions contain initialization and finalization phases that will be re-run for each call to hash_combine, where they could be skipped if the hash algorithm knew it was only dealing with an intermediate result. The authors of CityHash (Geoff Pike) and MurmurHash (Austin Appleby) don't think this inefficiency will be a significant problem in practice.
Ensuring that the result of hash_combine is different in different processes requires picking a seed at process start time. We know of three ways to do this:

1. Like <iostream>: define a static object in the header that defines hash_combine(), and have it initialize the global seed if it's the first such object to be constructed.

2. Check the seed on each call to hash_combine, along the lines of:

atomic<seed_t> global_seed;
hash_combine(...) {
  seed_t seed = global_seed.load(memory_order_relaxed);
  if (seed == 0) seed = initialize_global_seed_to_nonzero();
  ...
}

3. Define a get_seed() function as:

seed_t get_seed() {
  static const seed_t seed = initialize_global_seed();
  return seed;
}

The implementation of this is likely to be similar to, but not quite as efficient as, option 2.

Implementations should provide a way for users to set the seed explicitly, primarily to track down bugs that result from assuming a stable hash value.
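Such a hook might look like the following (the name set_hash_seed is hypothetical, not proposed wording):

namespace std {
// Hypothetical debugging aid: pin the per-process seed so a failure that
// depends on a particular seed can be reproduced deterministically.
void set_hash_seed(size_t seed);
}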
This is actually stricter than the current requirements on the Hash concept and std::hash, which only require that k1 == k2 imply h(k1) == h(k2) for a single hash instance h. We're tightening this to require that two std::hash<T> instances in the same process return the same value for equal inputs.
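Concretely, under the tightened requirement:

std::hash<int> h1, h2;
assert(h1(42) == h2(42));  // distinct instances in one process must agree on equal inputs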
It's possible to hash ranges of values using code like:

hash_code hash_value(const MyType& val) {
  hash_code result = 0;
  for (const auto& v : val) {
    result = hash_combine(result, v);
  }
  return result;
}
However, this code is likely to be slower than it needs to be because the hash algorithm needs to initialize and finalize for each call to hash_combine. Instead, we define:
namespace std {
template<typename InputIterator>
hash_code hash_combine_range(InputIterator first,
InputIterator last);
}
which combines hashes for each element in the range. If we had a std::range<Iterator> object, we could define hash_value(std::range) directly in terms of this. The semantics of the combining for both hash_combine and hash_combine_range are the same, leading to the following invariants:
std::vector<int> v = {1, 2, 3};
assert(std::hash_combine(1, 2, 3) ==
std::hash_combine_range(v.begin(), v.end()));
high_probability(std::hash_combine(1, 2, 3) !=
std::hash_combine(
std::hash_combine_range(v.begin(), v.end())));
// And likely following from the above:
assert(std::hash_combine(1, 2, 3) != std::hash_combine(v));
Note that this requires that ranges of equal values hash equally, even if they have different types. This could require more implementation complexity in implementations that want to hash ranges of characters very efficiently, but it appears doable.
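For example (a consequence sketch, since the elements compare equal pairwise):

std::vector<int> vi = {1, 2, 3};
std::vector<long> vl = {1, 2, 3};
assert(std::hash_combine_range(vi.begin(), vi.end()) ==
       std::hash_combine_range(vl.begin(), vl.end()));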
Users may also want to improve efficiency by hashing several sequential standard-layout, trivially-copyable fields as a single byte array, rather than passing them each individually to hash_combine. Unfortunately, those requirements aren't enough: the fields also have to be free of padding. To handle this, we propose the following type trait and a third hashing function.
namespace std {
template<typename T> struct is_contiguous_layout;
template<typename T>
hash_code hash_as_bytes(const T& obj);
}
std::is_contiguous_layout<T> shall be a UnaryTypeTrait whose BaseCharacteristic is std::true_type if all bits of the object representation participate in the value representation, or std::false_type otherwise. In other words, if there are any padding bits in the bytes making up an object of type T, then is_contiguous_layout<T> inherits from false_type. is_contiguous_layout would also be useful in optimizing uses of compare_exchange.
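For illustration (whether padding exists is implementation-specific; on typical ABIs the first struct below has padding after c and the second has none):

struct Padded { char c; int i; };  // usually padding bytes between c and i
struct Dense { int a; int b; };    // usually no padding bits at all
static_assert(!std::is_contiguous_layout<Padded>::value, "padding => false_type");
static_assert(std::is_contiguous_layout<Dense>::value, "all bits participate => true_type");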
hash_as_bytes(obj) is ill-formed (a diagnostic is required) if the type T of obj does not satisfy

is_trivially_copyable<T>::value &&
is_standard_layout<T>::value &&
is_contiguous_layout<T>::value

hash_as_bytes(obj) is implemented as:

const unsigned char* addr = reinterpret_cast<const unsigned char*>(&obj);
return hash_combine_range(addr, addr + sizeof(obj));
If a user-defined type has a subset of fields that satisfy the requirements for hash_as_bytes(), its author can place those fields in a sub-struct.
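For example, MyType from the earlier example might be rearranged like this (assuming the nested struct has no padding on the target implementation):

namespace my_namespace {
struct MyType {
  OtherType field1;
  struct Bytes {  // the trivially-copyable, padding-free fields
    int field2;
    char field3[500];
  } bytes;
};
std::hash_code hash_value(const MyType& val) {
  return std::hash_combine(val.field1,
                           std::hash_as_bytes(val.bytes));  // one pass over the raw bytes
}
}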
Note: We considered allowing users to pass multiple objects to hash_as_bytes and having it treat them as a single long byte array, optimizing cases where the objects are adjacent in memory. However, because fast hash functions operate block-at-a-time, this would be difficult to implement without either copying data or changing the resulting hash value.

It would be possible to perform the above optimization automatically inside some calls to hash_combine() whose arguments are adjacent and of appropriate sizes, but Austin Appleby believed this was likely to surprise people, so we instead give users a way to request it explicitly.
The hash_code type

The authors of this paper initially believed that having all intermediate hash values bottleneck through size_t would increase the collision rate and reduce performance. Geoff Pike and Austin Appleby agreed that it does increase the collision rate and reduce performance slightly, but not enough to justify the extra complexity of avoiding it. So, algorithmically, it would be fine to have all hash functions return size_t.
However, Austin argued that "if hash() doesn't behave like a function in the mathematical 'y = f(x)' sense due to the intentionally unstable implementation, then hash_code should not behave like a number and should be as opaque as possible" in order to avoid user confusion. We've decided to have the three hash utility functions return a new hash_code object:
namespace std {
class hash_code {
  // Implementation-specific state; probably a size_t.
public:
  // For compatibility with existing hash functions.
  hash_code(size_t value);
  // Copyable and assignable.
  hash_code(const hash_code&);
  hash_code& operator=(const hash_code&);
  // Explicit conversion to size_t.
  explicit operator size_t() const;
};
bool operator==(const hash_code&, const hash_code&);
bool operator!=(const hash_code&, const hash_code&);
hash_code hash_value(const hash_code&);
}
For backward compatibility with existing hash tables using std::hash<T>, its operator() still needs to return size_t.
A single hash function that's easy to implement isn't enough for all use cases. The standard library should offer a collection of stable fingerprint algorithms so that users can share their results between different implementations and get more specialized characteristics in the output. These algorithms need a way to traverse user-defined types, and the above hash_value() function doesn't directly generalize. It is outside the scope of this paper to discuss approaches to generic fingerprinting algorithms.
Thanks to Geoff Pike (the author of CityHash) and Austin Appleby (the author of MurmurHash) for providing expert guidance. Thanks to Lawrence Crowl, Matt Austern, and Richard Smith for helping with the C++ interfaces and editing this paper.