Doc. No.: | WG21/N4199 |
---|---|
Date: | 2014-10-10 |
Contact: | Hans-J. Boehm (who did not actually take the notes) |
Email: | hboehm@google.com |
We met two full days. We discussed various topics as detailed below:
A major goal of this meeting was to arrive at a consensus about a direction that might allow us to re-integrate executors into the Concurrency TS draft:
Chris Mysen: Executors (N3785++)
Combines earlier proposals. Chris walked through the document.
Hans: Replacement for async, which was agreed to be broken.
Some thread-pool behavior. Includes timers, which are needed for GUI functionality. Wrap was skipped as controversial. Convenience functions such as spawn_at were added.
Question: Is returning zero always a correct implementation? Answer: Yes, it meets the specification, but it is useless: it does not actually make progress, so it doesn't do anything useful.
Code often creates its own executors, and therefore knows how much work there is. Distinction between an executor as a service and as a specialized mechanism. For example, do you need to wait for all work to have drained?
Did not want to discuss Chris Kohlhoff's proposal, due to a conflict of interest. There are some additional use cases in Chris Kohlhoff's proposal, but these should be quite rare. The proposals have mostly converged, but some differences remain.
Chandler: Overloaded free function, spawn with continue-with, and other things from Chris Kohlhoff's paper.
Michael Wong: Are these two proposals converging? Chris Mysen: Some convergence due to feedback, but a few core differences remain.
Begin-work/end-work brackets to keep an executor alive in Chris Kohlhoff's paper, not in Chris Mysen's paper. In theory, could chain operations to get a similar effect.
Scheduling executors are core in Chris Mysen's paper, not in Chris Kohlhoff's paper. Some disagreements as to what is core and what is not.
Interactions with packaged_task.
The Windows GUI has timers in its API, so one might want to use the Windows timers when running on Windows; requiring use of the executor's timers might therefore not be helpful.
Three spawn variants vs. a single spawn method. An email on the reflector from a few hours ago justifies the three methods. The differences come down to optimizations, and whether and which optimizations need to be baked into the core of the proposal. Chris Mysen is concerned that the three methods result in fuzzy semantics. Chris Mysen wants standardized support for the most common use cases. Detlef produced a sample implementation, but defined two of the spawn variants in terms of the third.
Do these three spawn variants provide a minimal and complete set? What about guaranteeing fairness? There are lots of other options that are sometimes useful, and some of them incur non-trivial expense.
The Parallel STL provides an open-ended parameter for these sorts of issues. Are separate APIs really better than this sort of parameter? But that parameter is tightly constrained. Note that std::async had an open-ended parameter, and although std::async didn't do well, the open-ended parameter might still be a good idea.
The optimizations represented by the three spawn methods are valuable in cases involving blocking communication. One could potentially use multiple executors to get the effect of the multiple spawn methods. Some debate over API methods vs. an open-ended parameter to a single API. Reference to an email discussion calling out disadvantages of the open-ended parameter (c++std-parallel-1011). Hints vs. hard-core semantics.
Chandler for Jeffrey Yasskin: Defer doesn't really provide a real semantic difference. Concern about the constraints on where dispatch can be called and what the called function can do. For example, if you invoke dispatch in a loop, you can overflow your stack. Probably want some variant of dispatch, but need to understand how fork-join parallelism is implemented using a work-stealing approach. Otherwise, we don't know enough to complete the design. (Google's executors have dispatch, but it requires use of a special background task that provides a continuation service. Most of these services are provided by dedicated background threads, motivated by earlier concerns about stack overflow. Dispatch is quite useful in networking applications, even without concurrency.)
Microsoft has provided dispatch, and it is frequently used, but frequently misused.
If people are not provided dispatch functionality, they invent their own, usually badly.
Defer is an affinity hint: "this thing will happen after this one." Suppose you have a resource that performs very poorly under contention. Suppose that you need the shared resource, but are not delayed waiting for it. In that case, defer provides a hint that the resource should wait until other work arrives or some time passes before processing the requests.
Example of defer: a fixed-size thread pool. If a thread is immediately running, go ahead; otherwise wait until other work arrives.
dispatch: Can run on current thread immediately.
post: Could run, but not on current thread until thread is done with current task.
defer: Could run, but not on current thread (until thread is done with current task), and only if some other thread is already running.
Think of an executor as being a thread pool. (A toy sketch of the three variants follows.)
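To make the distinction concrete, here is a toy single-threaded sketch, invented for these notes (it is not the API of either paper); it only illustrates the inline-versus-queued distinction, and its defer degenerates to post:

    #include <deque>
    #include <functional>
    #include <iostream>

    // Toy executor, for illustration only; not the N3785++ or Kohlhoff API.
    class toy_executor {
        std::deque<std::function<void()>> queue_;  // work not yet run
        bool in_task_ = false;                     // are we inside run()?
    public:
        // dispatch: may run the closure immediately, on the calling thread.
        void dispatch(std::function<void()> f) {
            if (in_task_) f(); else queue_.push_back(std::move(f));
        }
        // post: never runs inline; queued behind the current task.
        void post(std::function<void()> f) { queue_.push_back(std::move(f)); }
        // defer: like post, but hinted as a continuation of the current task;
        // this toy simply treats it as post.
        void defer(std::function<void()> f) { queue_.push_back(std::move(f)); }
        void run() {
            in_task_ = true;
            while (!queue_.empty()) {
                auto f = std::move(queue_.front());
                queue_.pop_front();
                f();
            }
            in_task_ = false;
        }
    };

    int main() {
        toy_executor ex;
        ex.post([&ex] {
            std::cout << "task 1 begins\n";
            ex.dispatch([] { std::cout << "dispatched work runs inline\n"; });
            ex.post([] { std::cout << "posted work runs after task 1\n"; });
            std::cout << "task 1 ends\n";
        });
        ex.run();
    }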
Chandler: For the continuation-passing variant, use of a separate name seems very unfortunate. Would rather have spawn return a future or not, depending on whether it was passed a continuation. Could be a simple overload if the continuation were passed in as a separate lambda parameter. Could be an empty lambda: silly, but possible.
Discussion of separate name for the case where a future was required.
Hans: Multiple parameters tricky, tend to use a closure in those cases.
Lawrence: Closure can be opaque, but templatization of multiple parameters must be evaluated. Though few care about compiler performance anymore, and few compilers care.
Pablo: In what context is the continuation running? A: In the same thread as the spawning code is running in. Pablo: So why is this needed? A: Possibly spawn on executor. Pablo: That might be useful. Chandler: Similar to defer. But you could also have a post scoped guard.
Hans: Why two lambdas? Why not three or more? A: This leads into the question of task composition. Have a list of lambdas? Zero, one, and many. From a functional-programming perspective, we have a right-fold.
An SG1 email from Detlef discusses this; see c++std-parallel-1074. Possible difference in the expected duration of tasks: Kohlhoff seems to focus on short tasks, and Mysen seems to focus on long tasks. Detlef: About two weeks ago, attached a paper to the wiki and sent email to the reflector with a use case involving a pipeline, in which it is necessary to align the priority of a task with the underlying executor. See c++std-parallel-1002.
Chandler: Provide wrappers that export wrapper-executors with priorities included. Detlef: I don't see how this handles the pipeline. Paul: Priority inversion? Chandler: Executor context (including priority) is part of a larger implementation-specific construct.
Nat: Proposal: Start with Chris Mysen proposal, strip down. Create whatever parameters are desired by binding them into a lambda instead. Detlef: But that requires that I use my specific scheduler.
What should be brought to Urbana-Champaign meeting?
Chandler: OK with dispatch in TR proposal, but not in standard itself until we understand it better. Need proposal for dispatch-like functionality on top of Chris Mysen's proposal. This functionality is supported by Chris Kohlhoff's proposal. Want Chris Mysen paper, but with controversial separable parts removed. At least want something that takes a function and returns a future in order to replace std::async. Lawrence: But might need to change core API to add desired features. Chandler: Which is why this needs to be experimental initially.
Timed executors? Separable. Thomas Rodgers: I am OK as long as I can schedule something to execute at a particular time or after that time. Can timed executors be separate concept/mechanism? Chandler: Yes.
Hans: Start with Chris Mysen's proposal, see what of Chris Kohlhoff's proposal cannot be implemented in terms of Chris Mysen's proposal, and bring that work into the Urbana-Champaign meeting.
Chris Kohlhoff likely to be building networking handling based on executors, which will need time-based executors for timeouts. Potentially use Chris Mysen's proposal to implement Chris Kohlhoff's proposal. Jamie Alsop: Removing facilities and adding them back might lose something. Torvald: Could this be developed in the open?
Peter Sommerlad: Should we leave out timed spawn? Detlef: Why are times more special than priorities? Chandler: Happens more frequently. Detlef: Say what? Hans: Different application areas. Detlef: Instead of adding timing to base proposal, propose base function and way to add both.
Time as source of events, separate from execution portions. Chandler: Implementation experience argues otherwise.
Straw polls:
Herb Sutter: Atomic Smart Pointers (N4058) http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4058.pdf
Motivation: Producer/consumer handoff. Use unique_ptr<X>, as you would in non-atomic code. Just need atomic<> in the shared case.
Peter Sommerlad: Initialization? A: Default initialization to nullptr. Hans: There are open issues about this, currently uninitialized. (See LWG 2334)
Straw poll: Half or so of the room support the code at the bottom of page 2 of N4058.
ABA problem. Requires hazard pointers, GC, TM, or some such (including, of course, RCU), unless there are no cycles. This essentially uses reference counting, which has long been known to work as a GC in the absence of cycles in the data structure.
Currently need to explicitly specify atomics on all uses. Would like to be able to say atomic<shared_ptr<Node>>.
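For flavor, a minimal sketch of the producer/consumer handoff being motivated, written with the atomic<shared_ptr<T>> spelling wished for here (that spelling was standardized later, in C++20; N4058 itself proposes separate atomic_shared_ptr/atomic_unique_ptr types):

    #include <atomic>
    #include <memory>
    #include <thread>

    struct Work { int payload; };

    // Shared slot; the atomic<> wrapper is needed only because it is shared.
    std::atomic<std::shared_ptr<Work>> slot;

    void producer() {
        auto w = std::make_shared<Work>(Work{42});
        slot.store(std::move(w));          // publish (seq_cst by default)
    }

    void consumer() {
        std::shared_ptr<Work> w;
        while (!(w = slot.load())) {       // spin until the producer publishes
            std::this_thread::yield();
        }
        // w->payload == 42 here; the handoff is race-free.
    }

    int main() {
        std::thread t1(producer), t2(consumer);
        t1.join();
        t2.join();
    }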
Anthony Williams: Shared-pointer atomics impose overhead on all operations rather than just on modifications. A: One can always switch to alternative, more scalable implementations. (Restated) My point is that an atomic shared pointer is an improvement over the use of atomic operations on a normal shared pointer.
Section 1.4: Motivating examples for weak_ptr.
Q&A from N4058.
Straw polls:
Request for experimentation for the code on page 7 of N4058:
CPLEX welcomes feedback. The code is intended to work in the intersection of C and C++. CPLEX is leaving open adoption by WG21. No need for compatibility on types yet. Reducers would be an issue. Intended to have a common implementation, if not surface syntax. Intended to use the OpenMP scheduler, which would make the code work with OpenMP as well.
C must implement an implicit closure. There is a potential problem with not having a user-definable closure.
    for (int i = 0; i < N; ++i) {
        _Task_parallel spawn { f(i); /* Race on i */ }
    }
A closure enables writing new parallel control constructs.
Prior art includes the WG14 'blocks' proposal and the WG21 'lambda' proposal. A conforming C compiler is a small task. Lambda might make it harder. Parallelism would also make it harder.
Need task region.
Separate spawn into separate functions with "_Task_parallel call". Do we get an exponential increase in functions when transactional memory is added as well? No, not a problem.
Clang has issues with growing the language. Some doubt about whether the need for parallelism in C is better handled by other solutions around the language.
OpenMP has a task region that is not the same thing as the C++ task region. We probably need to change names.
What about exceptions in spawn? You cannot destroy variables before the sync in a block. Undefined behavior due to races.
    block {
        {
            int a;
            spawn { .... (a); }
            ++a;    // races with use in spawn
        }           // 'a' going out of scope races with use in spawn
        {
            int b;
            ....
        }
    }
Mark Batty: C Concurrency Challenges (Memory Model)
Data-race-freedom does not imply sequential consistency in C11, due to the mix of atomics and non-atomics. (It does hold for finite C++11 programs.)
The thin-air problem. There is no precise definition of "out of thin air" values. C++11 and C11 claim to prohibit them, but don't really. C++14 just states that out-of-thin-air values are forbidden.
Relaxed atomics must allow load buffering: relaxed loads and stores must be allowed to pass each other. But introducing data dependencies should forbid this (see slide 11 of the presentation). The language does not forbid it, but compilers and hardware do not let it happen. The language should forbid it.
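A minimal sketch of the load-buffering / out-of-thin-air shape under discussion (the classic example, not copied from the slides):

    #include <atomic>

    std::atomic<int> x{0}, y{0};
    int r1, r2;

    void thread1() {
        r1 = x.load(std::memory_order_relaxed);
        y.store(r1, std::memory_order_relaxed);   // data dependency on the load
    }

    void thread2() {
        r2 = y.load(std::memory_order_relaxed);
        x.store(r2, std::memory_order_relaxed);   // data dependency on the load
    }

    // The outcome r1 == r2 == 42 would be "out of thin air": 42 appears nowhere
    // in the program. Compilers and hardware do not produce it, but the C11/C++11
    // formal model does not actually forbid it. Breaking either dependency (e.g.
    // storing a constant instead of the loaded value) makes the analogous
    // outcome genuinely allowed, which is what makes the rule hard to state.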
But compiler optimizations can make themselves felt, for example, dependency-breaking optimizations.
Herb, Chandler: But we cannot allow speculative stores of atomics! A: Yes, but we need to specify that. Much discussion about possible rules, optimizations, and so on.
"Solution": Preserve all dependencies among atomics. But it is very convenient to allow dependency-breaking optimizations in separately compiled functions. Could introduce false dependencies, but current compilers destroy the information required to construct the false dependency. Bad solution!!!
C11 mixing of non-atomics and atomics:
C11 and visible side effects. But if non-atomic write in the mix... Example using memcpy().
Non-atomic read condition.
    size_t s = sizeof(atomic_int);
    atomic_int n = 0, x = 0, y = 0, z = 0;

    // Thread 1:
    x.store(1, relaxed);
    z.store(1, release);
    x.store(2, relaxed);
    y.store(&x, release);

    // Thread 2:
    do {
        r1 = z.load(acquire);
        r2 = y.load(consume);
    } while (r1 != 0 && r2 != 0);
    memcpy(&n, r2, s);
    memcpy(&n, &x, s);
If we allow non-atomic memcpy() accesses, then the memcpy()s' loads from x violate modification order. Note that if the memcpy()s are instead relaxed loads, modification order is respected. This is not a problem in C++11, because atomics are not trivially copyable, and thus we cannot (legally!) do the type punning.
We do want to mix atomics and non-atomics, but only if there is sufficient happens-before between any atomic and non-atomic access to the same variable. There is still a desire to do all this portably. One thought is to have a .raw access, and to explicitly change phase between .raw and normal atomic access.
Of course, we still need to solve the out-of-thin-air problem.
And atomic reads don't do any better:
    size_t s = max(sizeof(atomic_int), sizeof(int));
    atomic_int *x = calloc(1, s);
    atomic_int *y = calloc(1, s);
    int *r1 = calloc(1, s);
    int *r2 = calloc(1, s);

    // Thread 1:
    memcpy(r1, x, s);
    if (*r1 != 0) {
        atomic_init(*y, 1);
        store_explicit(*y, 2, seq_cst);
    }

    // Thread 2:
    memcpy(r2, y, s);
    if (*r2 != 0) {
        atomic_init(*x, 1);
        store_explicit(*x, 2, seq_cst);
    }
Chandler: libc++11 is implemented strictly in terms of C++11 atomics. Cannot get C++11 atomic out to a C interface. And this library has atomic_int be a different type than std::atomic<int>. And C++11 atomic_int is a typedef of std::atomic<int>.
Summary: DRF-SC holds in C++11 if there is no atomic/non-atomic mix. Thin-air is still present, and there is no fix in the current framework. Thin-air leads to DRF-SC violations if atomic/non-atomic mixing is allowed. The paper presents different semantics that fix some problems but introduce others.
Chandler: C could implement its atomics in terms of C++.
JF Bastien presents: atomics and types with padding bits.
Overall a minor issue; discussed as a minor issue.
LWG 2334 and WG14 DR 431: fix both in one go
padding bits must be considered
Different options: forbid atomic<structtype> when has_padding_bits<structtype>() is true.
disallow compare_exchange_strong for structs, because of padding bits
JF: it looked easy to fix, but it seems to be harder.
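As an invented illustration (not from the papers) of why it is harder than it looks: compare_exchange_strong compares whole object representations, memcmp-style, so padding bytes participate even though the program cannot see or control them.

    #include <atomic>
    #include <cstdint>

    struct Padded {
        std::uint8_t  a;   // 1 byte
        // typically 3 padding bytes here
        std::uint32_t b;
    };

    std::atomic<Padded> x;

    bool try_update(Padded expected, Padded desired) {
        // Even when x's members equal expected's members, the CAS can fail
        // (possibly forever) if the padding bytes stored inside x differ from
        // the indeterminate padding bytes of 'expected'.
        return x.compare_exchange_strong(expected, desired);
    }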
"let's have a state where you know the value of the padding bits" idea, doesn't work, because a non-atomic version of the struct would not know the padding bits are. while an atomic<structtype> would know the padding bits any structtype value you wouldn't know that.
LWG2334 suggestion doesn't work as all other possible fixes.
Lawrence: load-store work, but not compare_exchange_strong, because of the padding bits value
L: we don't want to have anything but a memcmp/memcpy view of the impelementation
L: compiler should tell if compare_exchange_strong is used with a type with padding bits and warn about, but users might get correct behavior, if they pay specific attention to the value of padding bits
L: problem is that users can write non-working code that they are not notified about. Compilers should tell about without forbidding correct uses
JF: we should synchronize with LWG and tell LWG that the issue cannot be fixed in combination, and LWG should go forward with its side of it.
The standard says it won't work with padding bits, because of memcmp/memcpy.
Should we add wording to document the restrictions, and/or allow divergence from C atomics and allow constructors/memset, or specify that if you have padding bits you have to use a lock? Lawrence: we have the same problem with floats, because values can compare equal with different bit patterns. (Olivier: floats are an exception anyway.)
Olivier: if we have padding bits on one machine, we have to use locks, but the code is still portable; if you don't have padding bits, it's the same.
Hans: is there a difference?
Paul: it doesn't make padding bits
Olivier: nobody should have padding bits. The suggestion is to be able to detect has_padding<T>, and to use a lock and do member-wise comparison if you have padding.
Lawrence suggests instead that compilers must squash padding bits to zero before an atomic operation.
We can not travel back in time
JF: compare_exchange of a struct with padding bits is a bad idea: forbid it or fix it. Paul: make it implementation-defined.
Lawrence: a careful programmer might be able to write working code with compare_exchange and padding bits. JF/Olivier: not sure; the spec allows the padding bits not to be preserved when the value is passed by value.
Lawrence: suggests diagnostics when compare_exchange_strong would not work
Chandler: compilers might actively try not to copy padding bits (when passing by value), but compare_exchange_weak will work, because "expected" is passed by reference (it will converge).
Lawrence: the compiler should warn that with padding bits a memcmp-based comparison wouldn't work (also for float).
Chandler: any floating-point comparison in an atomic context should be memory equality (under which +0 and -0 differ), not floating-point value equality (under which +0 == -0).
Options: emit a warning vs. don't compile if there are padding bits.
has_padding_bits<T> might be useful anyway (for hashing).
Compiler QOI issue: should warn if there are padding bits in an atomic context.
JF: OK, but no warning for floats; Chandler wants bit-pattern equality for floats on compare_exchange.
A discussion on floating-point representation ensues.
Pablo: what about other non-unique representations of true? Chandler objects: implementations must choose a single representation for bool values.
JF: we tend to agree that compare_exchange_strong is bad with structs with padding bits.
Lawrence: NAD, QOI
JF: resolution: WG14 DR 431: NAD QOI
Olivier: would wish for a has_padding_bits<T> trait to detect the bad cases (needed also for hashing and related things).
JF: the LWG 2334 resolution is probably the right fix (invoking the ctor).
JF: a new paper with that content for the next meeting.
Passing "desired" by value can invent values for the padding bits. Chandler: it would be allowed to specify the value of the padding bits, e.g., zero-initialize them, but it is not enforced by the standard.
Paul: a compiler might not copy padding bits in the copy ctor, but such a compiler is allowed to fix the value of the padding bits.
Chandler: we could add a requirement that the by-value parameter "desired" is normalized with respect to its padding bits.
Pablo: could we pass it by reference? JF: that doesn't work, because load returns by value.
Pablo is a bit concerned. JF: stick to suggesting a warning if padding bits are present.
Pablo: can we put LWG 2334 into tentatively-ready state for Urbana, to avoid a lengthy discussion taking up time in Urbana?
break
Lawrence thinks the LWG 2334 proposed resolution is horrible. It seems to make atomics into a container, and it doesn't help with initializing the atomic's locks. One still needs to call ATOMIC_INIT; the spec is not helpful, as otherwise there are different means to initialize an atomic. He prefers that the atomic<T> default ctor be "uninitialized". T must have a trivial copy ctor, even if its default ctor is not trivial. atomic<T> does not contain a T.
atomic<T> does not embed a T object. JF: the expectation is that atomic<T> "contains" a T, but it doesn't.
Chandler: I made the same mistake.
Lawrence: you cannot get a value of T out of atomic<T> unless it is initialized, and then you have to specify the initial value of T that is copied.
Pablo: the locks in an atomic<T> might not be initialized. You cannot read from an uninitialized atomic<T>.
Chandler teaches about unspecified value, undefined value, and undefined behavior
Pablo: it is still a problem.
Olivier: it is a non-issue, because we are not in C-compatible territory.
Lawrence: wording in LWG2334 is broken. We need to know if an atomic<T>
Chandler: the typedefs are for C interoperability; only for those types does the "uninitialized" value make sense, not for any other types. Lawrence: WRONG: there is a C macro that allows you to construct an atomic for an arbitrary C type, and it can be used in C++ as well to construct such (uninitialized) atomics.
Chandler: still, for practical reasons, the C atomics that are useful are restricted to the given typedefs (a theoretical downside with respect to C compatibility). For all other T's, atomic<T> has no C-compatibility issue.
Lawrence: atomic<T*> requires atomic_init
Chandler: C++'s atomic<T> ctor should call atomic_init, except for the T's where there exists a C alias, or if it is constructed with the ATOMIC(T*) macro in C++.
there seems to be disagreement on the C implementation of ATOMIC()
Lawrence: we need some criteria to know when to call atomic_init and when not.
Chandler suggests a resolution beyond the issue: make C++'s atomic<T> behave the C++ way.
Pablo: there are several levels of C compatibility. It is all about link-time compatibility; we don't need source compatibility. Chandler: you want header files to share syntax. In a header file you would declare static locals in inline functions; those can be atomic and should compile in both C and C++. Be aware that there are differences.
Chandler suggests: C++ code written without C compatibility in mind (atomic<T>) shouldn't need atomic_init. Code that shares atomics between C and C++ must use atomic_init, but it should define its atomics in a way that is compatible with C (using the ATOMIC(T) macro/keyword), and the internal locks might not be initialized. Not exactly sure whether that breaks existing code.
A lot of back-and-forth explanation of when and how and whether shared atomics are initialized dynamically.
Lawrence: the LWG 2334 resolution is wrong; it is not tentatively ready. Go back to drafting. Lawrence will work on that.
JF: should the goal be agreed upon?
Chandler: why is this an issue to be addressed: there are code bases with plenty of atomic<T> and no atomic_init. Such code is broken today, because C++ programmers don't expect uninitialized types (it works because the platform doesn't require initialization). Broken code works fine on most platforms and goes undetected. atomic_init should be called by every C++ atomic<T> ctor, but not T's ctor (a different issue).
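For reference, a minimal sketch (not from the discussion) of the C++11/14 state of affairs being debated:

    #include <atomic>

    std::atomic<int> a;      // default-constructed: the value is uninitialized
    std::atomic<int> b{0};   // value-initialized: fine

    void late_init() {
        std::atomic_init(&a, 0);  // the C-compatible initialization call that
                                  // most C++ code never makes, which is the
                                  // "works by accident" concern above
    }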
Chandler: we will never convince C++ programmers to use atomic_init correctly
Lawrence: shared code between C and C++ might re-initialize the embedded locks, but one wouldn't notice, because most platforms don't use them.
Two parts: first, construct the internals of atomic<T> without initializing T; second, value-initialize T if it has a ctor.
Herb suggests always doing both parts, because there shouldn't be a performance problem.
Chandler: always doing the second part doesn't seem to be a problem; it is detectable and teachable.
Herb and Lawrence disagree on whether atomic<T> contains a T or not.
We need to understand what "embedded locks" means. If they all work when zero-initialized, we would be fine.
Disagreement: there is a chance that initializing an embedded lock might require an OS call that cannot be repeated or undone. We need to learn from IBM whether this can be an issue.
Herb reminds us to care about C++ programmers and not C.
Chandler: if there is a function shared between C and C++, it had better behave the same, or the C library stops working for C++ programmers.
Hans: we need more data, from IBM and from WG14. Lawrence: we need to work with WG14. Hans: there is discussion about zero-initialization of atomics in WG14.
Michael needs time to find out within IBM what happens on S/390 internally. Depending on the answer, there might not be an issue.
Chandler: uses of atomic_init are dominated, by about four orders of magnitude, by compiler test cases; there are some in Apache under different names, and two C11 projects within Google use atomic_init (switch-management software, open-mpi). We shouldn't be too concerned about wide-spread existing use of C11 atomics.
Michael will be able to find out by Urbana.
Lawrence will work on a resolution of LWG 2334, depending on what IBM comes back with.
Hans will check with WG14
Coroutines add yield and resume to routines. Stackful coroutines are user-mode threads. Stackless coroutines are presented here. Focus on scalability, efficiency, and openness. Somewhat comparable to Boost context.
Return just finishes; it does not provide a value.
Does the committee want a resumable keyword? Some would rather not unnecessarily constrain the programmer's ability to change code.
Needs a default allocator constructor, so the allocator must be stateless. Maybe we need to either eliminate allocators or think more carefully about how we handle them well.
No objection to 'yield return'.
Change hyphens to underscores.
Interested in generalization of data structures. Need to flesh it out in the library. Generalization of channels to nest.
Backoff procedures have no standard. Propose to provide an abstraction capturing that.
Why not just futex? The goal is to provide something more complete and automatic than futex. Futex by itself embeds a queue, which may be too much. Why not expose futex and spin facilities and compose them? There may be value in accessing the queuing facility of futex. Is there a use case for waiting without spinning? One case: doing a prefetch probe and then blocking, to wake when the prefetch is complete.
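For context, a minimal sketch (not the proposed abstraction) of the kind of hand-rolled spin-then-block backoff such a facility would capture, using a plain atomic flag with a condition variable as the blocking fallback:

    #include <atomic>
    #include <condition_variable>
    #include <mutex>
    #include <thread>

    std::atomic<bool> ready{false};
    std::mutex m;
    std::condition_variable cv;

    void wait_for_ready() {
        // Spin briefly first, hoping to avoid blocking at all...
        for (int i = 0; i < 64; ++i) {
            if (ready.load(std::memory_order_acquire)) return;
            std::this_thread::yield();
        }
        // ...then fall back to the (comparatively expensive) condition variable.
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [] { return ready.load(std::memory_order_acquire); });
    }

    void set_ready() {
        {
            // Store under the mutex so a waiter cannot miss the notification.
            std::lock_guard<std::mutex> lk(m);
            ready.store(true, std::memory_order_release);
        }
        cv.notify_all();
    }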
Possibly notify one versus all as a hint. What about large classes for load_with_update? We could have synchronic bool only, but synchronic int is handy because it does not require side allocation of a synchronic bool.
Suggest a fourth approach with a separate concept, vs. inheritance, etc.
Tradeoff for minimal resource use. Condition variables are expensive because of their guarantees. Probably want different operations for "do the spin" and "after you". Yield this memory location.
MCS lock does not meet lockable concept because it needs state carried from lock to unlock.
Would like to see proposal go forward.
Only a few operators are used to terminate dependency chains.
Type modifier or variable modifier? To be thought about.
Plans do not seem to practically restrict the programs that folks want to write.
Hans raises the issue of divergence with respect to const in C and C++: C++11 allows a load to operate on a const-qualified atomic, while C11 does not.
Lawrence: C takes const from an implementation view, whereas C++ const is logical.
Hans and Lawrence discuss.
Hans suggests contacting the C committee to ask them to "strongly" provide const for atomic loads.
Chandler: continue to keep C++'s const regardless of what C does.
We strongly prefer that C11 add the ability to load from const atomic.
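A minimal illustration of the divergence (the C fragment in the comment is what C11 rejected at the time):

    #include <atomic>

    // Valid C++11: atomic<T>::load is a const member function.
    int read_flag(const std::atomic<int>& flag) {
        return flag.load(std::memory_order_acquire);
    }

    /* The analogous C11 code was a constraint violation at the time, because
       atomic_load_explicit did not accept a pointer to a const-qualified atomic:

       int read_flag(const _Atomic int *flag) {
           return atomic_load_explicit(flag, memory_order_acquire);
       }
    */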
[Note added after the fact: Unbeknownst to us at the time, this is WG14 DR459, which was already moving in the same direction.]
Hans and Lawrence discuss "expected" issue with compare_exchange
If you read *expected after compare_exchange succeeds, it is an error.
The wording should say that *expected is not read after compare_exchange succeeds.
Allowed: a non-atomic read of *expected before the atomic operation, and a non-atomic write of *expected after the atomic compare_exchange failed; no other reads or writes.
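For illustration, a minimal invented sketch (the names are hypothetical) of the kind of hand-off that makes any implementation access to *expected after a successful exchange a data race: once the exchange has taken effect, another thread may legitimately write the object that expected refers to.

    #include <atomic>

    std::atomic<int> turn{0};
    int expected_slot = 0;   // the object passed as 'expected', shared with thread_b

    void thread_a() {
        // If this succeeds, thread_b below may immediately write expected_slot,
        // so the implementation must not touch *expected after the successful
        // atomic operation.
        turn.compare_exchange_strong(expected_slot, 1);
    }

    void thread_b() {
        while (turn.load(std::memory_order_acquire) != 1) { }  // wait for the exchange
        expected_slot = 42;  // races only if the implementation re-reads *expected
    }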
discussion continues
Lawrence: "compiler must not be allowed to move a non-atomic read across an atomic operation"
Paul, Lawrence, Hans, JF continue discussion...
Paul: "what is the downside of fixing the code"
Lawrence: "wording is currently correct, but for subtle reasons"
Hans: "current wording should be fixed"
JF: "if the example implementation is fixed, even then *expected is concurrently modified"
Hans/Paul: no, there is synchronization happening, it is not a problem
Lawrence: the compiler has to do the right thing
Paul: example implementation won't correctly work
Lawrence: the example implementation introduces a data race; if that is removed, the example would work. The implementation reads from *expected twice, before and after the atomic operation; it should only do so before.
Hans: agreement on direction: fix the wording for practical reasons; SG1 won't do it right now. Lawrence will do it.
[Note added after the fact: This is now LWG 2426.]
We wish to thank Microsoft for hosting, and the volunteer note-takers.