Author: | JF Bastien |
---|---|
Contact: | jfb@google.com |
Author: | Olivier Giroux |
Contact: | ogiroux@nvidia.com |
Date: | 2015-10-24 |
Previous: | http://wg21.link/N4523 |
URL: | https://github.com/jfbastien/papers/blob/master/source/P0154R0.rst |
Starting with C++11, the library includes std::thread::hardware_concurrency() to provide an implementation quantity useful in the design of control structures in multi-threaded programs: the extent of threads that do not interfere (to first order). Established practice throughout the industry also relies on a second implementation quantity, used instead in the design of data structures in the same programs. This quantity is the granularity of memory that does not interfere (to first order), commonly referred to as the cache-line size.
Uses of cache-line size fall into two broad categories:
The most significant issue with this useful implementation quantity is the questionable portability of the methods used in current practice to determine its value, despite their pervasiveness and popularity as a group. In the appendix we review several different compile-time and run-time methods. The portability problem with most of these methods is that they expose a micro-architectural detail without accounting for the intent of the implementors (such as we are) over the life of the ISA or ABI.
We aim to contribute a modest invention for this cause, abstractions for this quantity that can be conservatively defined for given purposes by implementations:
In both cases these values are provided on a quality-of-implementation basis, purely as hints that are likely to improve performance. These are ideal portable values to use with the alignas() keyword, for which there currently exist almost no standard-supported portable uses.
Below, substitute the � character with a number the editor finds appropriate for the sub-section. We propose adding the following to the standard:
Under 20.7.2 Header <memory> synopsis [memory.syn]:
namespace std {
  // ...
  // 20.7.� Hardware interference size
  static constexpr size_t hardware_destructive_interference_size = implementation-defined;
  static constexpr size_t hardware_constructive_interference_size = implementation-defined;
  // ...
}
Under 20.7.� Hardware interference size [hardware.interference]:
constexpr size_t hardware_destructive_interference_size = implementation-defined;
This number is the minimum recommended offset between two concurrently-accessed objects to avoid additional performance degradation due to contention introduced by the implementation. It shall be a valid alignment value for any type.
[Example:
struct apart {
  alignas(hardware_destructive_interference_size) atomic<int> flag1, flag2;
};
— end example]
constexpr size_t hardware_constructive_interference_size = implementation-defined;
This number is the minimum recommended alignment of contiguous memory occupied by two objects accessed with temporal locality by concurrent threads. It shall be a valid alignment value for any type.
[Note: This number is also the maximum recommended size of contiguous memory occupied by two objects accessed in this manner. — end note]
[Example:
struct alignas(hardware_constructive_interference_size) colocated {
  atomic<int> flag;
  int tinydata;
};
static_assert(sizeof(colocated) <= hardware_constructive_interference_size);
— end example]
The __cpp_lib_thread_hardware_interference_size feature test macro should be added.
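As an illustration of how user code might consume such a macro, the sketch below uses the implementation's value when the library advertises it and otherwise falls back to a guess. It assumes the spelling that later shipped in C++17, `__cpp_lib_hardware_interference_size`, with the constants declared in `<new>`; both differ from the macro name and header proposed in this paper. The fallback of 64 and the names `destructive_size` and `padded_counter` are hypothetical.

```cpp
#include <atomic>
#include <cstddef>
#include <new>  // assumption: C++17 declares the constants here, not in <memory>

#if defined(__cpp_lib_hardware_interference_size)
// The library advertises the feature: use the implementation's value.
constexpr std::size_t destructive_size =
    std::hardware_destructive_interference_size;
#else
// Hypothetical fallback: 64 is a guess, not a value taken from this paper.
constexpr std::size_t destructive_size = 64;
#endif

// One counter per thread, each placed on its own interference granule so
// that writers on different threads do not contend for the same line.
struct alignas(destructive_size) padded_counter {
  std::atomic<long> value{0};
};

static_assert(alignof(padded_counter) == destructive_size,
              "each counter starts on its own granule");
static_assert(sizeof(padded_counter) % destructive_size == 0,
              "array elements never share a granule");
```

An array such as `padded_counter counters[8];` would then give each of eight threads a slot that does not share a granule with its neighbours, at the cost of the padding.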
We informatively list a few ways in which the L1 cache-line size is obtained in different open-source projects at compile-time.
The Linux kernel defines the __cacheline_aligned macro which is configured for each architecture through L1_CACHE_BYTES. On some architectures this value is determined through the configure-time option CONFIG_<ARCH>_L1_CACHE_SHIFT, and on others the value of L1_CACHE_SHIFT is hard-coded in the architecture’s include/asm/cache.h header.
Many open-source projects from Google contain a base/port.h header which defines the CACHELINE_ALIGNED macro based on an explicit list of architecture detection macros. These header files have often diverged. A token example from the autofdo project is:
// Cache line alignment
#if defined(__i386__) || defined(__x86_64__)
#define CACHELINE_SIZE 64
#elif defined(__powerpc64__)
// TODO(dougkwan) This is the L1 D-cache line size of our Power7 machines.
// Need to check if this is appropriate for other PowerPC64 systems.
#define CACHELINE_SIZE 128
#elif defined(__arm__)
// Cache line sizes for ARM: These values are not strictly correct since
// cache line sizes depend on implementations, not architectures. There
// are even implementations with cache line sizes configurable at boot
// time.
#if defined(__ARM_ARCH_5T__)
#define CACHELINE_SIZE 32
#elif defined(__ARM_ARCH_7A__)
#define CACHELINE_SIZE 64
#endif
#endif
#ifndef CACHELINE_SIZE
// A reasonable default guess. Note that overestimates tend to waste more
// space, while underestimates tend to waste more time.
#define CACHELINE_SIZE 64
#endif
#define CACHELINE_ALIGNED __attribute__((aligned(CACHELINE_SIZE)))
We informatively list a few ways in which the L1 cache-line size can be obtained on different operating systems and architectures at runtime. Libraries such as hwloc perform these queries, and similar run-time queries could also be added to the standard as a separate proposal.
On OS X one would use:
sysctlbyname("hw.cachelinesize", &cacheline_size, &sizeof_cacheline_size, 0, 0)
On Windows one would use:
GetLogicalProcessorInformation(&buf[0], &sizeof_buf);
for (i = 0; i != sizeof_buf / sizeof(SYSTEM_LOGICAL_PROCESSOR_INFORMATION); ++i) {
  if (buf[i].Relationship == RelationCache && buf[i].Cache.Level == 1)
    cacheline_size = buf[i].Cache.LineSize;
}
On Linux one would either use:
p = fopen("/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size", "r");
fscanf(p, "%d", &cacheline_size);
or:
sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
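The sysconf query can be wrapped defensively, since the call may report 0 or -1 when the value is unknown, and the `_SC_LEVEL1_DCACHE_LINESIZE` name is a glibc extension that other C libraries may not define. The following is a minimal sketch under those assumptions; the function name `l1_dcache_line_size` and its `fallback` parameter are hypothetical.

```cpp
#include <cstddef>
#include <unistd.h>  // sysconf

// Returns the L1 data cache line size in bytes as reported by the system,
// or `fallback` (a hypothetical, caller-chosen guess) when nothing usable
// is reported or the query is not available on this platform.
inline std::size_t l1_dcache_line_size(std::size_t fallback = 64) {
#if defined(_SC_LEVEL1_DCACHE_LINESIZE)
  const long reported = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
  if (reported > 0) {
    return static_cast<std::size_t>(reported);
  }
#endif
  return fallback;
}
```

Because the result is only known at run time, it cannot be used with alignas(); it is suited to sizing or padding dynamically allocated structures.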
On x86 one would use the CPUID instruction with EAX = 80000005h, which leaves the result in ECX; further work is needed to extract the line size from that register.
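The extraction step can be sketched as follows. This assumes the AMD-documented layout of leaf 80000005h, in which ECX bits 7:0 hold the L1 data cache line size in bytes; processors that treat this leaf as reserved report zero, so a result of 0 should be read as "unknown". The `<cpuid.h>` header and its `__get_cpuid` helper are provided by GCC and Clang; the names `EXAMPLE_HAVE_CPUID_H`, `line_size_from_ecx`, and `cpuid_l1d_line_size` are hypothetical.

```cpp
#include <cstdint>

#if (defined(__x86_64__) || defined(__i386__)) && \
    (defined(__GNUC__) || defined(__clang__))
#include <cpuid.h>  // __get_cpuid (GCC and Clang)
#define EXAMPLE_HAVE_CPUID_H 1
#endif

// Assumption: ECX[7:0] of leaf 80000005h is the L1 data cache line size
// in bytes, per the AMD-documented layout.
constexpr std::uint32_t line_size_from_ecx(std::uint32_t ecx) {
  return ecx & 0xFFu;
}

// Returns the line size in bytes, or 0 when the leaf is unavailable,
// reports nothing, or the target is not x86.
inline std::uint32_t cpuid_l1d_line_size() {
#if defined(EXAMPLE_HAVE_CPUID_H)
  unsigned int eax = 0, ebx = 0, ecx = 0, edx = 0;
  if (__get_cpuid(0x80000005u, &eax, &ebx, &ecx, &edx)) {
    return line_size_from_ecx(ecx);
  }
#endif
  return 0;
}
```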
On ARM one would use mrs %[ctr], ctr_el0, which needs further work to extract.
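For the ARM case, the extraction can be sketched as below. It assumes the AArch64 layout of CTR_EL0, in which the DminLine field (bits 19:16) holds the base-2 logarithm of the number of 4-byte words in the smallest data cache line, and it assumes the register is readable from user space, which depends on the operating system. The names `dmin_line_bytes` and `ctr_el0_line_size` are hypothetical.

```cpp
#include <cstdint>

// Decodes CTR_EL0.DminLine (bits 19:16): log2 of the number of 4-byte
// words in the smallest data cache line. Returns the size in bytes.
constexpr std::uint64_t dmin_line_bytes(std::uint64_t ctr) {
  return std::uint64_t{4} << ((ctr >> 16) & 0xFu);
}

// Reads CTR_EL0 and decodes it; returns 0 on targets other than AArch64.
// Assumption: the OS permits user-space reads of this register.
inline std::uint64_t ctr_el0_line_size() {
#if defined(__aarch64__) && (defined(__GNUC__) || defined(__clang__))
  std::uint64_t ctr = 0;
  asm volatile("mrs %0, ctr_el0" : "=r"(ctr));
  return dmin_line_bytes(ctr);
#else
  return 0;
#endif
}
```

For instance, a register value whose DminLine field is 4 decodes to 4 << 4 = 64 bytes.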