Concurrency with Modern C++: What every professional C++ programmer should know about concurrency

Speaker 1

00:00

Welcome to the deep dive. We've got some really interesting material you sent over on concurrency in modern C++. Looks like it's mostly drawing from the book Concurrency with Modern C++.

Speaker 2

00:11

That's right. And our goal today is, well, to unpack the core ideas, maybe find some of those aha moments for you.

Speaker 1

00:20

Yeah, make this whole complex topic a bit more accessible without drowning in the jargon.

Speaker 2

00:25

Exactly. And you know, the book itself kind of hints at the challenge. It mentions how the C++ memory model often runs counter to our intuition.

Speaker 1

00:33

Oh, interesting. So that's our mission then: to navigate that complexity and pull out the essentials you need for writing, you know, solid concurrent C++ code.

Speaker 2

00:41

Precisely. Efficient, dependable code. That's the aim.

Speaker 1

00:46

Okay, let's get started then, right at the foundation: the memory model. In simple terms, what is it we're trying to wrap our heads around here?

Speaker 2

00:53

Well, think of the memory model as like the official rule book. It dictates how different threads in your program see and interact with the computer's memory.

Speaker 1

01:01

Okay, rules for memory interaction.

Speaker 2

01:03

And from a concurrency angle, two basic questions pop up. First, what counts as a single place in memory, a memory location?

Speaker 1

01:11

Right, is it a byte? An integer?

Speaker 2

01:14

According to the source, it's either an object of a basic scalar type, like your ints, floats, pointers, enums, or, if you have bit-fields, it's the largest contiguous sequence of adjacent bit-fields.

Speaker 1

01:25

Got it. Scalar types or contiguous bit-fields. What was the second question?

Speaker 2

01:28

Ah, the big one. What happens when multiple threads try to access that same memory location at around the same time?

Speaker 1

01:35

Okay, and I sense danger here.

Speaker 2

01:37

You got it. That leads us straight to data races. Imagine two threads hitting the same shared variable. It's mutable, and at least one of those threads is trying to write to it, with nothing synchronizing the two accesses.

Speaker 1

01:46

That's a data race?

Speaker 2

01:47

That's a data race, and the result is undefined behavior, the dreaded UB. It's the wild west. Your program might crash, spit out garbage, or maybe even seem to work fine for a while, then just fail later, completely out of the blue.

Speaker 1

02:02

So that's why we need things like mutexes and locks, to coordinate who gets access when.

Speaker 2

02:07

Precisely. They're the traffic cops for shared data access. Essential tools.

Speaker 1

02:11

The book uses thread-safe singleton initialization as a classic example. Why is that such a good illustration? It seems simple, right? Just one instance.

Speaker 2

02:20

Well, it seems simple in a single thread, but imagine multiple threads all deciding, hey, I need the singleton at the exact same time.

Speaker 1

02:28

Ah. So if they all check and see it doesn't exist yet...

Speaker 2

02:31

They might all try to create it, and suddenly you've got multiple singletons, which completely breaks the whole idea.

Speaker 1

02:36

Right. So thread-safe techniques make sure only one thread actually does the creation, even if many try.

Speaker 2

02:42

Exactly. They ensure it's created exactly once.

Speaker 1

02:45

Now, for digging deeper, the book mentions a tool called CppMem. What's that about?

Speaker 2

02:49

Oh, CppMem is fantastic for this. It's like a sandbox or a simulator for the C++ memory model.

Speaker 1

02:57

Okay.

Speaker 2

02:57

You feed it small snippets of concurrent code, and it shows you all the possible ways the operations from different threads could interleave. It visualizes the impact of different memory orderings.

Speaker 1

03:07

So you can actually see how things might go wrong or why a certain ordering works?

Speaker 2

03:12

Precisely. It helps build that intuition for how the memory model behaves, which, as we said, isn't always obvious.

Speaker 1

03:17

Really valuable tool. Okay, memory model basics covered, let's talk about the threads themselves. We've had std::thread for a while. C++20 added std::jthread. What's the leap forward there?

Speaker 2

03:29

The big difference really is resource management safety.

Speaker 1

03:32

How so?

Speaker 2

03:34

With std::thread, you, the programmer, must remember to either join the thread (wait for it to finish) or detach it to run independently.

Speaker 1

03:41

And if you forget?

Speaker 2

03:43

If the std::thread object gets destroyed before you do either, std::terminate is called and your program terminates. It's a common mistake.

Speaker 1

03:48

Ouch. Okay, so how does std::jthread fix that?

Speaker 2

03:51

It's RAII-based: resource acquisition is initialization. When a std::jthread object goes out of scope, its destructor automatically calls join.

Speaker 1

04:00

No more forgetting. Nice. That sounds much safer.

Speaker 2

04:03

It is. Plus, std::jthread has built-in support for cooperative interruption, a clean way to ask a thread to stop.

Speaker 1

04:09

Okay, cooperative interruption. We'll probably circle back to that. Now, the book mentioned something tricky with std::shared_ptr. I thought they were thread-safe.

Speaker 2

04:17

They help with memory management in threads, yes. They prevent leaks by managing the object's lifetime automatically, but the shared pointer itself isn't fully thread-safe for all operations.

Speaker 1

04:28

What's the catch?

Speaker 2

04:28

It's the shared_ptr object itself, not the reference counter, whose increments and decrements are atomic. If you have multiple threads all trying to, say, assign a new shared pointer to the same shared pointer variable, especially if it was passed by reference, you can get a data race on that variable. The book shows an example where this happens when threads modify a shared shared_ptr passed by reference. The object being pointed to might be fine, but the pointer's bookkeeping gets messed up.

Speaker 1

04:53

So if I need multiple threads to safely update which object a shared pointer points to, what's the solution?

Speaker 2

05:00

The book suggests using std::atomic_store for that specific case to make the update atomic, but it also points out that this is kind of a workaround.

Speaker 1

05:07

What's the real fix, then?

Speaker 2

05:09

Ideally we'd use atomic smart pointers like std::atomic<std::shared_ptr>, which C++20 introduced. That handles the atomicity of the pointer operations themselves.

Speaker 1

05:20

Okay, that makes sense. The book also mentioned std::atomic_ref. What's that for?

Speaker 2

05:24

Ah, atomic_ref. That's pretty neat. It lets you perform atomic operations on an existing object that wasn't originally declared std::atomic.

Speaker 1

05:34

So you can temporarily treat a regular variable as atomic?

Speaker 2

05:37

Sort of, yeah. You create an atomic_ref to it, and then you can use atomic operations like fetch_add or compare_exchange_strong directly on that underlying variable through the reference. The example showed incrementing a counter inside some big object without needing locks or making the whole object atomic.

Speaker 1

05:53

Interesting, so careful management is key. This leads us nicely into memory ordering: sequential consistency, acquire-release, relaxed. These sound like different levels of rules.

Speaker 2

06:03

They are. They're different contracts, different guarantees about how memory operations become visible across threads.

Speaker 1

06:09

Let's start with the strictest: sequential consistency, memory_order_seq_cst.

Speaker 2

06:13

Right. That's the default for atomics, and it's the easiest to reason about. It basically guarantees two things. One, all threads agree on a single global order of all sequentially consistent operations, and two, the operations within any single thread happen in the order you wrote them in your code.

Speaker 1

06:33

Like one single timeline for everything.

Speaker 2

06:35

Exactly. Simple model, but it can sometimes have performance costs because the hardware has to work harder to maintain that global order.

Speaker 1

06:42

Okay, what about acquire-release semantics, then? Sounds like it loosens things up a bit.

Speaker 2

06:46

It does. Acquire-release semantics (memory_order_acquire, memory_order_release, memory_order_acq_rel, and also consume, though that's trickier) focus on synchronization between operations on the same atomic variable.

Speaker 1

06:58

Same variable, okay.

Speaker 2

07:00

A release operation, typically a write, ensures that all memory writes that happen before it in the same thread become visible to other threads that later perform an acquire operation, usually a read, on that same atomic variable and see the value the release stored.

Speaker 1

07:13

So the release makes prior writes visible and the acquire sees them?

Speaker 2

07:18

Precisely. This creates what's called a synchronizes-with relationship. It's fundamental. The book points out this is how mutexes, thread joins, condition variables, all the higher-level stuff actually works under the hood. A lock release synchronizes with a subsequent lock acquire.

Speaker 1

07:33

That synchronizes-with relationship sounds important. It establishes order across threads?

Speaker 2

07:38

Yes, it establishes a happens-before relationship. If action A synchronizes with action B, then A happens before B. This guarantees visibility of memory changes.
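
The release/acquire pairing can be sketched as follows. This is an illustrative example with made-up names (`publish_and_read`, `payload`, `ready`), not code from the book.

```cpp
#include <atomic>
#include <thread>

int publish_and_read() {
    int payload = 0;                 // plain, non-atomic data
    std::atomic<bool> ready{false};  // the atomic that carries the synchronization
    int seen = -1;

    std::thread producer([&] {
        payload = 2011;                                // (1) ordinary write
        ready.store(true, std::memory_order_release);  // (2) release
    });
    std::thread consumer([&] {
        while (!ready.load(std::memory_order_acquire)) {  // (3) acquire
            std::this_thread::yield();
        }
        seen = payload;  // (4) safe: (2) synchronizes with (3), so (1) happens before (4)
    });

    producer.join();
    consumer.join();
    return seen;
}
```

If both operations on `ready` used `memory_order_relaxed` instead, the read of `payload` at (4) would be a data race.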

Speaker 1

07:49

Got it. Now, what about the most lenient one, memory_order_relaxed? What guarantees do we lose there?

Speaker 2

07:55

With relaxed ordering, you only get the bare minimum: the operation itself is atomic, it happens indivisibly, and there's a single modification order for that specific atomic variable. All threads will agree on the sequence of values written to that one variable.

Speaker 1

08:08

But no guarantees about other memory operations?

Speaker 2

08:11

Exactly. Relaxed operations don't create synchronizes-with relationships. They don't guarantee anything about the visibility or ordering of other reads and writes, whether to non-atomic data or to different atomic variables.

Speaker 1

08:22

So potentially faster, but much harder to reason about?

Speaker 2

08:26

Much harder. The book shows using fetch_add with relaxed ordering for a simple counter, which is a common use case, but it also warns that you can still get data races on non-atomic variables even if you're reading related atomics with relaxed order, because there's no happens-before relationship established. It's subtle stuff.
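
A minimal sketch of the relaxed counter case. The name `count_with_relaxed` is an assumption for illustration.

```cpp
#include <atomic>
#include <thread>
#include <vector>

long count_with_relaxed(int num_threads, int increments) {
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, increments] {
            for (int i = 0; i < increments; ++i) {
                // Atomicity is all we need here: no other data depends on
                // the order in which these increments become visible.
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& w : workers) {
        w.join();  // join provides the synchronization for the final read
    }
    return counter.load(std::memory_order_relaxed);
}
```

The final value is exact because each increment is indivisible and the joins order the last read after every worker.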

Speaker 1

08:43

And where do memory fences, std::atomic_thread_fence, fit in?

Speaker 2

08:46

Fences act like barriers. They enforce ordering constraints between operations before the fence and operations after the fence, even across different variables or relaxed atomics. A release fence makes prior writes visible to threads that later execute an acquire fence. It's another way to establish that synchronizes-with relationship without tying the ordering to one specific atomic operation, although you still need some atomic store and load, even relaxed ones, to carry the signal between the two fences.

Speaker 1

09:08

Okay, that's a lot to digest on ordering. Let's shift to actually using threads. How do we launch them? What are the options?

Speaker 2

09:13

The main way is std::thread. Its constructor can take basically any callable: a regular function pointer, a function object (you know, an object where you've overloaded the call operator), or, very commonly, a lambda function.

Speaker 1

09:30

Right. Lambdas are handy there.

Speaker 2

09:31

Super handy. You just pass the function or lambda you want to run in the new thread, followed by any arguments it needs. The book shows a simple "Hello from thread" using a lambda.

Speaker 1

09:41

And once it's running, we have to decide what happens when it finishes. Join or detach?

Speaker 2

09:47

Exactly. You have to make a choice before the std::thread object itself is destroyed. Join means the current thread waits right there until the launched thread completes.
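
As a small illustration of launching with a lambda, passing arguments, and joining, here is a sketch. `greet_from_thread` is a hypothetical helper, not the book's listing.

```cpp
#include <functional>
#include <string>
#include <thread>

std::string greet_from_thread(const std::string& name) {
    std::string message;
    // The lambda is the callable. `name` is copied into the new thread, and
    // std::ref hands `message` over by reference so the thread can fill it in.
    std::thread t(
        [](const std::string& who, std::string& out) { out = "Hello from " + who; },
        name, std::ref(message));
    t.join();  // wait for the thread before reading `message`
    return message;
}
```

Calling `greet_from_thread("thread")` yields `"Hello from thread"`. Because `message` lives in the calling function, joining before the function returns is what keeps the reference valid.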

Speaker 1

09:58

Useful if you need its result or need to know it's done before cleaning up resources.

Speaker 2

10:02

Precisely. Detach, on the other hand, lets the thread run completely independently in the background. The original thread continues immediately.

Speaker 1

10:11

But that sounds risky. What if the detached thread needs data that the original thread owns?

Speaker 2

10:16

That's the big danger. If the original thread finishes and its data goes out of scope, but the detached thread is still running and tries to access that data, boom, undefined behavior again. So the book strongly advises joining, especially if the thread interacts with data whose lifetime is tied to the scope where the thread was created. Detaching requires very careful management of lifetimes.

Speaker 1

10:39

Makes sense. What about std::thread::hardware_concurrency?

Speaker 2

10:43

It gives you a hint, basically an estimate of how many threads the hardware can genuinely run in parallel, often related to the number of CPU cores or hyperthreads.

Speaker 1

10:55

A hint, not a rule?

Speaker 2

10:57

Definitely just a hint. The optimal number of threads depends heavily on the specific task, I/O, contention, et cetera. Using exactly this number isn't always best. The book mentions it's just a starting point. As for native_handle, that's an escape hatch. It gives you direct access to the underlying operating system's thread handle, like a pthread_t on Linux or a HANDLE on Windows, if you need to do something platform-specific that the C++ standard library doesn't cover. Use with caution, though.

Speaker 1

11:21

Okay, got it. Let's move on to the tools we use with threads, the synchronization primitives, starting with the most basic: std::mutex.

Speaker 2

11:29

Right, the mutex. Its core job is mutual exclusion, protecting shared data.

Speaker 1

11:34

How does it do that?

Speaker 2

11:35

Think of it as a lock guarding a piece of data. Before a thread can touch that data, it has to lock the mutex. If another thread already holds the lock, the first thread waits. Once it's done, it must unlock the mutex, allowing another waiting thread to proceed.

Speaker 1

11:50

So only one thread gets access at a time. That prevents data races on the protected data?

Speaker 2

11:55

Exactly. Mutexes are your go-to for protecting shared mutable state. First line of defense.
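
A minimal sketch of a mutex guarding shared state. The `Counter` class and `count_with_mutex` are illustrative names; `std::lock_guard` is used so the unlock cannot be forgotten.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical counter whose state is only ever touched under its mutex.
class Counter {
public:
    void increment() {
        std::lock_guard<std::mutex> lock(mutex_);  // locks here, unlocks at scope exit
        ++value_;
    }
    int value() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return value_;
    }

private:
    mutable std::mutex mutex_;
    int value_ = 0;
};

int count_with_mutex(int num_threads, int increments) {
    Counter counter;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, increments] {
            for (int i = 0; i < increments; ++i) {
                counter.increment();
            }
        });
    }
    for (auto& w : workers) {
        w.join();
    }
    return counter.value();
}
```

Without the lock, the `++value_` from different threads would be exactly the data race described earlier.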

Speaker 1

12:00

But the book warns about deadlocks. How do mutexes lead to that?

Speaker 2

12:05

Ah, the classic deadlock scenario. Imagine thread one locks mutex A, then tries to lock mutex B. Simultaneously, thread two locks mutex B, then tries to lock mutex A.

Speaker 1

12:15

Oh. Thread one has A and wants B. Thread two has B and wants A.

Speaker 2

12:19

And they're stuck. Neither can proceed because it's waiting for the resource the other one holds. That's a deadlock. They wait forever.

Speaker 1

12:26

Nasty. How do we avoid that when we need multiple locks?

Speaker 2

12:28

The standard solution is std::lock. You pass it all the mutexes you need to acquire. It uses a deadlock avoidance algorithm internally to lock all of them, in effect atomically.

Speaker 1

12:37

Atomically, meaning it gets all of them or none of them?

Speaker 2

12:40

Essentially, yes. It guarantees it won't end up in a state where it holds some locks while blocking waiting for others in a way that contributes to deadlock. If it can't get all locks, it'll release any it acquired and try again, or perhaps throw an exception, depending on the context.
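
The two-mutex scenario can be sketched like this. The `Account` type and both function names are hypothetical, chosen only to show `std::lock` in use.

```cpp
#include <mutex>
#include <thread>

// Hypothetical account type: each account carries its own mutex.
struct Account {
    std::mutex m;
    int balance = 0;
};

void transfer(Account& from, Account& to, int amount) {
    if (&from == &to) return;  // locking the same std::mutex twice is undefined
    std::lock(from.m, to.m);   // acquires both without risking deadlock
    // adopt_lock: the guards take over the already-held locks and release
    // them at scope exit.
    std::lock_guard<std::mutex> hold_from(from.m, std::adopt_lock);
    std::lock_guard<std::mutex> hold_to(to.m, std::adopt_lock);
    from.balance -= amount;
    to.balance += amount;
}

int total_after_opposite_transfers() {
    Account a;
    Account b;
    a.balance = 1000;
    b.balance = 1000;
    // The two threads name the accounts in opposite order, which is the
    // classic deadlock setup if each mutex were locked one at a time.
    std::thread t1([&] { for (int i = 0; i < 1000; ++i) transfer(a, b, 1); });
    std::thread t2([&] { for (int i = 0; i < 1000; ++i) transfer(b, a, 1); });
    t1.join();
    t2.join();
    return a.balance + b.balance;
}
```

Since C++17, `std::scoped_lock lock(from.m, to.m);` expresses the same thing in one line.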

Speaker 1

12:54

Okay, so std::lock for multiple mutexes. Good tip. We mentioned thread-safe initialization earlier. Besides a simple lock, what other techniques does the book cover?

Speaker 2

13:04

Several good ones. If something can be constexpr, its value is fixed at compile time, so that's inherently thread-safe.

Speaker 1

13:09

Right, no runtime race possible.

Speaker 2

13:11

Then there's std::call_once with std::once_flag. You pass it a flag and a function, like your initialization function. The standard guarantees that function will be executed exactly once, even if many threads call it concurrently. Other threads will wait until the first one is done.
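
A minimal sketch of that guarantee. The function `init_calls_after_race` is a made-up name that simply counts how often the initializer actually runs.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>

int init_calls_after_race(int num_threads) {
    std::once_flag flag;
    std::atomic<int> init_calls{0};
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&] {
            // Every thread asks for initialization; only one call runs the lambda.
            std::call_once(flag, [&] { ++init_calls; });
        });
    }
    for (auto& th : threads) {
        th.join();
    }
    return init_calls.load();
}
```

However many threads race to the `std::call_once` line, the counter ends at one.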

Speaker 1

13:29

Okay, that sounds robust.

Speaker 2

13:30

Very. Another common C++ idiom, especially since C++11, is the Meyers singleton, using a static variable inside a function.

Speaker 1

13:40

Like static MySingleton instance; return instance?

Speaker 2

13:43

Exactly that. The language guarantees that the initialization of that static local variable is thread-safe. The compiler and runtime handle the locking implicitly. It's often the simplest and preferred way.
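
For reference, the idiom looks roughly like this. The checking function `all_threads_see_same_instance` is a hypothetical addition to show that concurrent callers get the same object.

```cpp
#include <thread>
#include <vector>

class MySingleton {
public:
    static MySingleton& instance() {
        static MySingleton inst;  // initialized exactly once; thread-safe since C++11
        return inst;
    }
    MySingleton(const MySingleton&) = delete;
    MySingleton& operator=(const MySingleton&) = delete;

private:
    MySingleton() = default;
};

bool all_threads_see_same_instance(int num_threads) {
    std::vector<const MySingleton*> seen(num_threads, nullptr);
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        // Each thread writes only its own slot, so `seen` needs no lock.
        threads.emplace_back([&seen, t] { seen[t] = &MySingleton::instance(); });
    }
    for (auto& th : threads) {
        th.join();
    }
    for (const MySingleton* p : seen) {
        if (p != seen[0]) return false;
    }
    return true;
}
```

Note that the guarantee covers only the initialization of `inst`; any later mutation of the singleton's state still needs its own synchronization.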

Speaker 1

13:56

Nice, simple is good. Any others?

Speaker 2

13:59

Well, the simplest of all, if your program structure allows it, is to just initialize the shared resource in your main thread before you create any other threads. No concurrency during initialization means no problem.

Speaker 1

14:11

Fair enough. What about signaling between threads, like, hey, the data you're waiting for is ready? That's std::condition_variable?

Speaker 2

14:19

Precisely. Condition variables let threads wait efficiently until some condition becomes true.

Speaker 1

14:23

How do they work? Do they need a mutex?

Speaker 2

14:25

Yes, they always work together with a mutex. A waiting thread must first lock the mutex protecting the shared state of the condition. Then it calls wait on the condition variable. What wait does is atomically release the mutex and put the thread to sleep until another thread notifies it.

Speaker 1

14:42

Notifies it how?

Speaker 2

14:43

By calling notify_one or notify_all on the same condition variable. When the waiting thread wakes up, it automatically reacquires the mutex before wait returns.

Speaker 1

14:54

Okay, it wakes up, gets the lock back. Then it can check the condition?

Speaker 2

14:58

Exactly. And this is crucial: it must check the condition again after waking up.

Speaker 1

15:02

Why? Didn't it get notified because the condition is true?

Speaker 2

15:05

Not necessarily. You can get spurious wakeups, where the thread wakes up even though no notification happened, or the condition may have changed back by the time the thread runs. That's why wait functions usually take a predicate, a lambda or function that checks the actual condition. The wait will only return if the predicate is true or, for the interruptible overloads, if a stop was requested.

Speaker 1

15:21

Ah, so the predicate handles spurious wakeups. Never wait without one.

Speaker 2

15:26

That's the rule. Always wait with the predicate.
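
The wait-with-predicate pattern can be sketched as below. All names (`wait_for_data`, `ready`, `data`) are illustrative assumptions.

```cpp
#include <condition_variable>
#include <mutex>
#include <thread>

int wait_for_data() {
    std::mutex m;
    std::condition_variable cv;
    bool ready = false;  // the actual condition, protected by `m`
    int data = 0;
    int received = -1;

    std::thread consumer([&] {
        std::unique_lock<std::mutex> lock(m);
        // The predicate guards against spurious wakeups and also covers the
        // case where the notification was sent before this thread started waiting.
        cv.wait(lock, [&] { return ready; });
        received = data;
    });
    std::thread producer([&] {
        {
            std::lock_guard<std::mutex> lock(m);
            data = 7;
            ready = true;
        }
        cv.notify_one();  // notify after releasing the lock
    });

    consumer.join();
    producer.join();
    return received;
}
```

`cv.wait(lock, pred)` behaves like `while (!pred()) cv.wait(lock);`, which is why the order in which the two threads start does not matter here.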

Speaker 1

15:28

Now, C++20 brought cooperative interruption: std::stop_source, std::stop_token. How does that fit in, especially with jthread and condition_variable_any?

Speaker 2

15:38

Right. This is a much better way to ask threads to stop than, say, just setting a boolean flag. It's more integrated.

Speaker 1

15:44

How does it work?

Speaker 2

15:45

You create a std::stop_source. This object can request that associated operations stop. From the stop source, you get std::stop_tokens. You pass these tokens to the threads or operations you might want to interrupt.

Speaker 1

15:57

And the thread checks the token?

Speaker 2

16:00

Yes, a thread can periodically call stop_requested on its token. Or, even better, blocking functions like the wait functions on std::condition_variable_any can accept a stop token. They'll automatically wake up if a stop is requested on that token.

Speaker 1

16:17

So jthread uses this automatically?

Speaker 2

16:19

jthread has a stop source built in. If you create a jthread with a function that takes a stop token as its first argument, the jthread passes its own token in, and its destructor will automatically request stop before joining. It makes graceful shutdown much easier.

Speaker 1

16:30

And std::stop_callback?

Speaker 2

16:33

That lets you register a function that gets called immediately when stop is requested on a given token, useful for things like quickly closing a socket or canceling an I/O operation.

Speaker 1

16:42

Okay, a much cleaner stop mechanism. What about std::counting_semaphore, also C++20? How's that different from a mutex?

Speaker 2

16:48

A mutex is about exclusive access only one thread in at a time. A semaphore maintains a counter representing available resources or permits.

Speaker 1

16:57

How does that work?

Speaker 2

16:58

A thread calls acquire to take a permit, decrementing the counter. If the counter is zero, the thread blocks. A thread calls release to return a permit, incrementing the counter, potentially waking up a blocked thread.

Speaker 1

17:11

Can different threads acquire and release?

Speaker 2

17:13

Yes, that's a key difference from mutexes, which have to be unlocked by the same thread that locked them. Semaphores are great for controlling access to a pool of N resources or for producer-consumer scenarios where one thread signals another about available work. They're thread-agnostic.

Speaker 1

17:29

Interesting. Lastly, for basic synchronization, C++20 also gave us std::barrier and std::latch. For coordinating multiple threads?

Speaker 2

17:38

Exactly. Both are for synchronizing a group of threads at a specific point.

Speaker 1

17:41

What's the difference, latch versus barrier?

Speaker 2

17:43

A std::latch is basically a one-shot countdown. You initialize it with a count. Threads call count_down. When the count reaches zero, any threads waiting on the latch using wait are unblocked. After that, the latch is done. It can't be reset.

Speaker 1

17:55

One-time use. And a barrier?

Speaker 2

17:58

A std::barrier is reusable. You initialize it with the number of threads in the group. Each thread calls arrive_and_wait. When all threads have arrived, they are all unblocked simultaneously. Crucially, the barrier resets, ready for the next synchronization phase. You can even run a completion function when all threads arrive but before they're unblocked.

Speaker 1

18:18

So latch for a single sync point, barrier for repeated phases of computation.

Speaker 2

18:24

That's a good way to think about it. The book shows an example of barriers being used across different stages where the number of workers might even change.

Speaker 1

18:30

Okay, let's move up a level to tasks and futures. std::async sounds really convenient for running stuff in the background.

Speaker 2

18:37

It is. It's a high-level way to say, run this function, possibly on another thread, and give me back something I can use to get the result later.

Speaker 1

18:45

That something is the future?

Speaker 2

18:47

Exactly. std::async returns a std::future object, and it handles the thread management for you, on some implementations using an internal thread pool.

Speaker 1

18:53

You mentioned "possibly on another thread"?

Speaker 2

18:57

Right. std::async takes an optional launch policy. The default, std::launch::async | std::launch::deferred, leaves the choice to the implementation, which usually runs it on a new thread: eager evaluation. But you can specify std::launch::async to guarantee a new thread, or std::launch::deferred to make it lazy.

Speaker 1

19:14

Lazy evaluation, meaning?

Speaker 2

19:16

Meaning the function only runs when you actually call get or wait on the future. It runs synchronously in the thread that calls get.
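
The difference between the two policies can be observed through thread ids. This is a sketch with a hypothetical function name, not the book's example.

```cpp
#include <future>
#include <thread>

// Returns true if std::launch::async ran on a different thread and
// std::launch::deferred ran on the thread that called get().
bool policies_behave_as_described() {
    const auto caller = std::this_thread::get_id();

    // Eager: starts running right away, as if on a new thread.
    auto eager = std::async(std::launch::async,
                            [] { return std::this_thread::get_id(); });
    // Lazy: nothing runs until get() or wait() is called on the future.
    auto lazy = std::async(std::launch::deferred,
                           [] { return std::this_thread::get_id(); });

    const bool eager_elsewhere = eager.get() != caller;
    const bool lazy_here = lazy.get() == caller;
    return eager_elsewhere && lazy_here;
}
```

One related detail worth knowing: the destructor of a future returned by `std::async` with `std::launch::async` blocks until the task finishes, so discarding the future makes the call effectively synchronous.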

Speaker 1

19:23

Interesting trade-off. What about std::packaged_task? How does that fit in?

Speaker 2

19:26

std::packaged_task gives you more control. It bundles up a function or callable with a promise, the thing that will eventually hold the result. It gives you back a std::future associated with that promise. But crucially, the task doesn't run yet. You are responsible for invoking the packaged_task object itself, maybe passing it to a thread you manage, or putting it in your own queue for a thread pool.

Speaker 1

19:47

So async is fire and forget with a future back, and packaged_task is prepare and run later.

Speaker 2

19:53

That's a good way to put it. packaged_task decouples defining the work from executing it.
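
A minimal sketch of that decoupling. `run_packaged_sum` is an illustrative name.

```cpp
#include <future>
#include <thread>
#include <utility>

int run_packaged_sum(int a, int b) {
    // Step 1: define the work. Nothing runs yet.
    std::packaged_task<int(int, int)> task([](int x, int y) { return x + y; });
    // Step 2: grab the future tied to the task's internal promise.
    std::future<int> result = task.get_future();
    // Step 3: decide where and when it runs. Here the task is moved into a
    // thread we manage ourselves; it could equally go into a work queue.
    std::thread worker(std::move(task), a, b);
    worker.join();
    return result.get();
}
```

A `packaged_task` is move-only, which is why it is handed to the thread with `std::move`.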

Speaker 1

19:57

And std::future itself? Just a placeholder for the result?

Speaker 2

20:02

Pretty much. It represents a result that will eventually be available from some asynchronous operation. You call get on it to retrieve the value. And yes, get blocks: if the result isn't ready yet, get blocks the calling thread until it is. Also, importantly, you can only call get once on a regular std::future.

Speaker 1

20:22

The result is moved out only once. What if multiple threads need the result?

Speaker 2

20:27

Ah, that's where std::shared_future comes in. You can create a shared_future from a std::future, which consumes the original future. Copies of the shared_future can then be given to multiple threads, and they can all call get to retrieve a copy of the result once it's ready.
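
As a sketch of several readers sharing one result, with the hypothetical name `sum_of_shared_reads`:

```cpp
#include <future>
#include <thread>
#include <vector>

int sum_of_shared_reads(int num_readers) {
    std::promise<int> promise;
    // share() consumes the std::future and yields a copyable shared_future.
    std::shared_future<int> shared = promise.get_future().share();

    std::vector<int> seen(num_readers, 0);
    std::vector<std::thread> readers;
    for (int r = 0; r < num_readers; ++r) {
        // Each reader gets its own copy of the shared_future and its own slot.
        readers.emplace_back([shared, &seen, r] { seen[r] = shared.get(); });
    }

    promise.set_value(10);  // unblocks every reader waiting in get()
    for (auto& th : readers) {
        th.join();
    }

    int sum = 0;
    for (int v : seen) {
        sum += v;
    }
    return sum;
}
```

Giving each thread its own copy of the `shared_future`, rather than a reference to one shared object, is the safe way to use it from several threads.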

Speaker 1

20:41

Okay, so future for single result retrieval, shared_future for multiple. How do futures compare to condition variables for synchronization? The book mentioned a comparison.

Speaker 2

20:51

Correct. They serve different purposes, mostly. Condition variables are more general-purpose, for complex waiting logic, maybe involving multiple conditions or repeated signaling. Futures are primarily designed for getting a single result back from a one-off task.

Speaker 1

21:04

So futures are simpler for the get-a-result case?

Speaker 2

21:07

Often, yes. They bundle the data transmission, the result or exception, with the synchronization. With condition variables, you manage the shared data and locking separately, which can be more error-prone if not done carefully. Tasks and futures are often less susceptible to issues like lost wakeups.

Speaker 1

21:23

Right, that makes sense. Let's switch gears to the parallel algorithms in the STL. Since C++17, many of them can run in parallel.

Speaker 2

21:30

Yes, a huge addition. Many standard algorithms, like for_each, transform, reduce, sort, et cetera, now have overloads that take an execution policy as the first argument.

Speaker 1

21:39

Execution policy, what are the options?

Speaker 2

21:41

The main ones are std::execution::seq for sequential execution, the old default; std::execution::par for parallel execution on multiple threads; and std::execution::par_unseq for parallel and potentially vectorized execution.

Speaker 1

21:57

Vectorized, like SIMD?

Speaker 2

21:59

Exactly. par_unseq gives the implementation the most freedom. It can run chunks in parallel, and within each thread, it can reorder or interleave operations on different elements, often to take advantage of SIMD instructions if the hardware supports it.

Speaker 1

22:10

So potentially the fastest, but maybe harder for the programmer to reason about if side effects are involved.

Speaker 2

22:16

Precisely. par_unseq demands more care regarding thread safety and lack of dependencies between element operations. The book shows for_each with par_unseq using an atomic counter, which is safe.

Speaker 1

22:28

What about algorithms like std::reduce or std::transform_reduce? Any special rules for parallelizing them?

Speaker 2

22:35

Yes, a very important one. The operation you provide (addition for reduce, or multiplication and addition for transform_reduce) must be associative and commutative for the parallel versions to guarantee the same result as the sequential one.

Speaker 1

22:48

Associative and commutative, like addition: (a + b) + c equals a + (b + c), and a + b equals b + a?

Speaker 2

22:56

Exactly. If your operation doesn't have those properties, the result might differ depending on how the parallel execution chunks and combines the data. Floating-point addition, strictly speaking, isn't associative, which can sometimes cause tiny differences.
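
A small sketch of both algorithms with the parallel policy, using integer addition so the result is independent of how the work is chunked. The function names are made up. As a toolchain assumption, some standard libraries back these policies with an external library (GCC's libstdc++ typically uses Intel TBB), so an extra link flag may be needed.

```cpp
#include <execution>
#include <functional>
#include <numeric>
#include <vector>

// Sums 1..n. Integer addition is associative and commutative, so any
// grouping of partial sums gives the same answer.
long long parallel_sum_one_to(int n) {
    std::vector<int> values(n);
    std::iota(values.begin(), values.end(), 1);
    return std::reduce(std::execution::par, values.begin(), values.end(), 0LL);
}

// Sums the squares of 1..n: the transform step squares each element,
// the reduce step adds the results.
long long parallel_sum_of_squares_one_to(int n) {
    std::vector<int> values(n);
    std::iota(values.begin(), values.end(), 1);
    return std::transform_reduce(std::execution::par, values.begin(), values.end(),
                                 0LL, std::plus<>{},
                                 [](int x) { return static_cast<long long>(x) * x; });
}
```

Swapping `std::plus<>` for something like subtraction would make the result depend on how the implementation happens to split the range.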

Speaker 1

23:08

Good point. Are these policies just hints or guarantees of parallelism?

Speaker 2

23:13

They're more like permission slips or strong hints. std::execution::par allows parallel execution, but the library implementation might decide to run it sequentially if it thinks that's faster, especially for very small ranges. It's not a strict guarantee of N threads being used.

Speaker 1

23:29

Does the book give any performance numbers? Is the speedup real?

Speaker 2

23:33

Yes, it shows a test case calculating tangents. The par version on their quad-core machine was significantly faster than seq, close to a 4x speedup. The par_unseq version was similar to par in that specific test. But your mileage may vary, absolutely. Performance depends heavily on the hardware, the compiler, the specific algorithm, the data size, and the operation being performed. Always benchmark your own code.

Speaker 1

23:56

Sound advice. Okay, let's tackle a really modern feature: C++20 coroutines. What's the fundamental difference from a regular function?

Speaker 2

24:03

The key idea is that they are resumable functions. Stackless, specifically.

Speaker 1

24:07

Resumable, meaning they can pause and continue later?

Speaker 2

24:09

Exactly. A regular function runs from start to finish in one go. A coroutine can execute a bit, then co_await some operation or co_yield a value, which suspends its execution. Later, something can resume the coroutine and it picks up right where it left off, with all its local variables intact.

Speaker 1

24:29

So the state is saved somewhere, not just on the stack?

Speaker 2

24:32

Right. The state (local variables, suspension point) lives in a coroutine frame managed by the compiler, typically allocated on the heap, not just on the traditional call stack. That's the stackless part.

Speaker 1

24:43

What makes a function become a coroutine? Is there a special keyword?

Speaker 2

24:46

It becomes a coroutine if its body uses any of the three coroutine keywords: co_return to return a value and finish, co_await to suspend and wait for something, or co_yield to produce a value in a generator-like sequence. Even a range-based for loop using co_await makes it a coroutine.

Speaker 1

25:04

Okay, co_return, co_await, co_yield. The book mentions handles, suspend points, awaitables. Sounds like the machinery behind it.

Speaker 2

25:12

It is. Let's break it down quickly.

Speaker 1

25:13

Go for it. Coroutine handle?

Speaker 2

25:15

That's your remote control for the coroutine. An object you can use to resume it, destroy its state, or check if it's done.

Speaker 1

25:21

Initial and final suspend points?

Speaker 2

25:23

Every coroutine has a promise object associated with it. This promise defines initial_suspend and final_suspend. These return special awaitable objects, like std::suspend_always or std::suspend_never, that control whether the coroutine suspends immediately when called, and whether it suspends when it finishes via co_return or falling off the end.

Speaker 1

25:43

So you can have a coroutine start suspended or run until the first co_await?

Speaker 2

25:45

Exactly, and you can control if it cleans itself up automatically when done, or waits to be destroyed via its handle.

Speaker 1

25:51

And awaitables and awaiters? That's for co_await, right?

Speaker 2

25:55

Right. When you co_await something, that something has to be an awaitable. The compiler calls three key methods on the corresponding awaiter object, often the awaitable itself. await_ready checks if suspension is even needed. If not, it continues. If suspension is needed, await_suspend is called once the coroutine is suspended, and it can schedule the coroutine for resumption later. When it's time to resume, await_resume is called and its return value becomes the result of the co_await expression.

Speaker 1

26:23

Okay, that's the core mechanism. The book had examples: one preparing a job, another using an event coroutine.

Speaker 2

26:29

Yeah, the job example was basic, showing the structure even if it didn't suspend much initially. The event example was more interesting for synchronization.

Speaker 1

26:36

How did the event work?

Speaker 2

26:38

It was a coroutine helper. You could co_await an event object, and that coroutine would suspend. Elsewhere, code could call notify on the event, which would resume the waiting coroutine. It's a way to build synchronization primitives using the coroutine machinery itself.

Speaker 1

26:53

Like a condition variable, but maybe fitting more naturally into async code flow.

Speaker 2

26:57

Kind of. Yeah, it shows how coroutines can help manage asynchronous waiting.

Speaker 1

27:01

Let's look at the case studies. The book compared summing numbers in different ways. What was the fastest concurrent approach?

Speaker 2

27:08

Right. They compared single-threaded, a mutex-protected shared sum, an atomic shared sum with different memory orderings, and finally a local sum approach. The winner, with by far the best concurrent performance, came from having each thread calculate a sum for its own portion of the data into a local, non-shared variable. Then, only at the very end, each thread atomically adds its local sum to the final shared result variable.

Speaker 1

27:34

So minimize the shared operations, do most work locally.

Speaker 2

27:38

Exactly. Contention on the mutex or even the atomic variable in the other approaches really killed performance. Locking or atomic ops on every single addition was very slow compared to the local accumulation.
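The local-sum approach described here can be sketched roughly as follows. The function name parallel_sum, the chunking scheme, and the choice of relaxed ordering for the final add are assumptions for illustration, not the book's benchmark code.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Each thread sums its own slice into a local variable and touches the
// shared atomic exactly once, at the very end.
long long parallel_sum(const std::vector<int>& data, unsigned num_threads) {
    assert(num_threads > 0);
    std::atomic<long long> total{0};
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(data.size(), t * chunk);
        const std::size_t end = std::min(data.size(), begin + chunk);
        workers.emplace_back([&data, &total, begin, end] {
            long long local = 0;  // not shared, so no synchronization needed
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            // One atomic operation per thread instead of one per element.
            total.fetch_add(local, std::memory_order_relaxed);
        });
    }
    for (auto& worker : workers) worker.join();  // join makes all adds visible
    return total.load();
}
```

With four threads and a million elements, this performs four atomic additions in total, where the shared-atomic variant would perform a million.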

Speaker 1

27:49

Makes sense. The dining philosophers problem also came up. What classic concurrency issues does that highlight?

Speaker 2

27:54

Oh, dining philosophers is the poster child for deadlock. It perfectly illustrates how multiple actors (philosophers) competing for multiple shared resources (forks) can easily get into a state where none can proceed because they're all waiting for a resource held by another.

Speaker 1

28:09

The circular wait.

Speaker 2

28:12

Exactly. The book uses it to show how flawed synchronization attempts can lead to deadlock, or maybe livelock, where they're busy trying but making no progress. And it shows solutions like establishing a strict ordering for acquiring the resources (always pick up the lower-numbered fork first, for instance) to break that circular dependency.
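A minimal sketch of the lower-numbered-fork-first rule might look like this. The function name dine, the five-philosopher setup, and the meal counter are assumptions for illustration; the book's own solution may differ.

```cpp
#include <array>
#include <cassert>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

constexpr int kPhilosophers = 5;

// Every philosopher locks the lower-numbered fork first, so a cycle of
// "holds one fork, waits for the next" cannot form.
int dine(int rounds) {
    assert(rounds >= 0);
    std::array<std::mutex, kPhilosophers> forks;
    std::array<int, kPhilosophers> meals{};
    std::vector<std::thread> philosophers;

    for (int id = 0; id < kPhilosophers; ++id) {
        philosophers.emplace_back([&forks, &meals, id, rounds] {
            int first = id;
            int second = (id + 1) % kPhilosophers;
            if (first > second) std::swap(first, second);  // global lock order
            for (int r = 0; r < rounds; ++r) {
                std::lock_guard<std::mutex> lower(forks[first]);
                std::lock_guard<std::mutex> higher(forks[second]);
                ++meals[id];  // "eat"; each philosopher writes only its own slot
            }
        });
    }
    for (auto& p : philosophers) p.join();

    int total = 0;
    for (int m : meals) total += m;
    return total;
}
```

Without the swap, the last philosopher would reach for fork 4 before fork 0 and the circular wait becomes possible. std::scoped_lock with both mutexes is another standard way to avoid the deadlock.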

Speaker 1

28:30

So resource ordering is a key deadlock prevention technique.

Speaker 2

28:34

One of the most common and effective ones.

Speaker 1

28:36

Singleton initialization again: lock-based, double-checked locking, Meyers singleton. What's the verdict?

Speaker 2

28:42

Lock-based is simple but potentially slow under contention. Double-checked locking tries to optimize by checking first before locking, but it's notoriously hard to get right in C++ without hitting subtle memory ordering bugs. Avoid it unless you really know what you're doing.

Speaker 1

28:58

So the Meyers singleton, the static local variable?

Speaker 2

29:01

That's generally the way to go in modern C++: static T instance; return instance;. Since C++11, the language guarantees that this initialization is thread-safe. It's efficient, simple, correct, and usually fast enough.
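For reference, a Meyers singleton is only a few lines. Config and all_threads_see_same_instance are hypothetical names used for this sketch; the thread-safety guarantee applies to the initialization of the static local, not to later use of the object.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

class Config {
public:
    static Config& instance() {
        // Function-local static: initialized exactly once, even if several
        // threads call instance() at the same time (guaranteed since C++11).
        static Config config;
        return config;
    }
    Config(const Config&) = delete;
    Config& operator=(const Config&) = delete;

private:
    Config() = default;
};

// Calls instance() from several threads and checks they all got one object.
bool all_threads_see_same_instance(int thread_count) {
    assert(thread_count > 0);
    std::vector<const Config*> seen(static_cast<std::size_t>(thread_count), nullptr);
    std::vector<std::thread> workers;
    for (int i = 0; i < thread_count; ++i) {
        workers.emplace_back([&seen, i] {
            seen[static_cast<std::size_t>(i)] = &Config::instance();
        });
    }
    for (auto& worker : workers) worker.join();
    for (const Config* p : seen) {
        if (p != &Config::instance()) return false;
    }
    return true;
}
```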

Speaker 1

29:12

Good takeaway. Yeah, prefer the Meyers singleton. CppMem was used again to look at memory ordering and data races.

Speaker 2

29:19

Yes, analyzing small examples. It visually reinforces how, without proper synchronization like mutexes or acquire-release semantics, writes in one thread are simply not guaranteed to be visible to reads in another thread if they access the same non-atomic memory location. It makes the abstract memory ordering rules much more concrete.

Speaker 1

29:40

Seeing is believing, basically.

Speaker 2

29:42

Pretty much. It helps you spot potential data races you might otherwise miss.

Speaker 1

29:45

There was also a comparison: condition variables versus atomic flags for synchronization between two threads. What was faster?

Speaker 2

29:51

In the specific tests shown, which involved repeated ping-pong synchronization, using atomic flags like std::atomic_flag or std::atomic<bool> for the signaling was found to be faster than using condition variables and mutexes.

Speaker 1

30:05

Why would that be?

Speaker 2

30:08

It's likely due to overhead. Atomic operations, especially on flags, can often be implemented very efficiently by the processor, sometimes without involving the operating system kernel. Condition variables and mutexes usually involve system calls for blocking and waking threads, which adds more overhead.
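A ping-pong handoff with std::atomic<bool> can be sketched like this. The function name ping_pong and the yield-based spin loop are assumptions for illustration and not the book's benchmark; a spin loop like this burns CPU while waiting, which is part of the trade-off against condition variables.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Two threads take turns; the atomic flag says whose turn it is.
// Returns the total number of turns taken.
int ping_pong(int rounds) {
    assert(rounds >= 0);
    std::atomic<bool> ping_turn{true};
    int handoffs = 0;  // non-atomic: protected by the acquire/release handoff

    std::thread ping([&] {
        for (int i = 0; i < rounds; ++i) {
            while (!ping_turn.load(std::memory_order_acquire)) {
                std::this_thread::yield();  // spin until it is our turn
            }
            ++handoffs;
            ping_turn.store(false, std::memory_order_release);  // hand over
        }
    });
    std::thread pong([&] {
        for (int i = 0; i < rounds; ++i) {
            while (ping_turn.load(std::memory_order_acquire)) {
                std::this_thread::yield();
            }
            ++handoffs;
            ping_turn.store(true, std::memory_order_release);
        }
    });

    ping.join();
    pong.join();
    return handoffs;
}
```

The release store in one thread pairs with the acquire load in the other, so each increment of handoffs happens-before the next one even though the counter itself is a plain int.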

Speaker 1

30:24

So for simple signaling, atomics might be quicker, but condition variables are more general.

Speaker 2

30:28

That's a reasonable summary.

Speaker 1

30:29

Yeah. And the last case study bit: a coroutine returning a future.

Speaker 2

30:33

Yes, that just showed the nice integration. You can write a function as a coroutine using co_await for internal async operations, and then use co_return to provide the final value. With a suitable promise type in place, the compiler hooks this up, so the coroutine returns a std::future or similar awaitable type that completes with the co_returned value.

Speaker 1

30:51

Seamlessly bridging the two models.

Speaker 2

30:54

Exactly. It lets you use the coroutine syntax internally while still interacting with other code expecting futures.

Speaker 1

31:00

Okay, let's gaze into the crystal ball: C++23 and beyond. Executors are presented as a big deal. What's the core concept?

Speaker 2

31:07

Executors are intended to be a fundamental abstraction for how, where, and when work gets done. They define the execution context.

Speaker 1

31:15

Execution context, like which thread pool, or run on the GPU, or inline?

Speaker 2

31:21

Potentially all of the above. The idea is to separate the what, the function or task you want to run, from the how or when, which is defined by the executor. You'd submit your task to an executor, and the executor decides how to run it.

Speaker 1

31:33

So it's a unified way to handle thread pools, inline execution, maybe event loops.

Speaker 2

31:38

That's the goal, a standard, composable way to represent and manage different execution strategies. The book sees them as foundational for future concurrency libraries, networking, etc.

Speaker 1

31:49

What kind of properties can these executors have?

Speaker 2

31:51

The proposals discuss properties like directionality: is it fire-and-forget (one-way), does it return a future (two-way), does it support continuations (then)? Then cardinality: does it run one task (single) or many (bulk)? And blocking behavior: does submitting work potentially block the caller (possibly, always, never)?

Speaker 1

32:09

And you could potentially query or acquire executors with certain properties.

Speaker 2

32:12

That's the idea, using mechanisms like execution::require or execution::prefer to tailor the execution context to your needs.

Speaker 1

32:20

How might this integrate with things like std::async or parallel algorithms?

Speaker 2

32:25

The vision is you'd be able to pass an executor object to std::async or to the parallel algorithms to control where and how they run, instead of them always using some default mechanism like a hidden global thread pool. More control, more flexibility.

Speaker 1

32:38

What are the main design goals for executors?

Speaker 2

32:41

Usability, both for library writers building on them and for application developers using them; composability, so you can layer and combine executors; and minimality, keeping the core concepts lean and extensible.

Speaker 1

32:53

You mentioned single versus bulk cardinality. What's the difference in how they execute?

Speaker 2

32:57

Single-cardinality functions (one-way execute, two-way execute, then-execute) take one callable and run it once. Bulk functions take a callable and a shape, like a count, and run the callable multiple times, possibly in parallel, passing an index or other info to each invocation. Useful for parallel-for style operations.

Speaker 1

33:14

Got it. The book mentioned some ongoing concerns, like when_all and when_any return types and blocking future destructors.

Speaker 2

33:21

Yeah, those are known complexities. Combining futures with when_all and when_any can lead to complicated return types, and the fact that std::async can return a future that blocks in its destructor if you don't get the result is problematic, as it can accidentally serialize your code. There are active proposals trying to refine these areas.
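The blocking destructor is easy to see in a small sketch. The function name async_destructor_order and the 50 ms delay are assumptions for illustration; the behavior shown applies to futures obtained from std::async with the std::launch::async policy.

```cpp
#include <cassert>
#include <chrono>
#include <future>
#include <string>
#include <thread>

// Shows that a future returned by std::async waits for its task when it is
// destroyed, even though nobody called get() or wait().
std::string async_destructor_order() {
    std::string log;
    {
        auto pending = std::async(std::launch::async, [&log] {
            std::this_thread::sleep_for(std::chrono::milliseconds(50));
            log += "task;";
        });
        assert(pending.valid());
        // No get() or wait() here: leaving the scope destroys the future.
    }  // the destructor blocks here until the task has finished
    log += "after;";
    return log;
}
```

Because the destructor waits, the string is always "task;after;". The same effect makes a sequence of discarded std::async calls run one after another rather than concurrently.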

Speaker 1

33:41

What about synchronized and atomic blocks in C++23? They sound related but different.

Speaker 2

33:47

They are. Both aim for atomic execution of a code block. Synchronized blocks are more relaxed. They act like the block is guarded by a single global mutex, providing a total order. They can contain things like I/O. Atomic blocks (atomic_noexcept, atomic_commit, atomic_cancel) are for true transactions. They have stricter rules about what they can contain, with no non-transaction-safe operations, and explicit handling of exceptions: commit or cancel the transaction.

Speaker 1

34:11

So atomic blocks are closer to database transactions.

Speaker 2

34:14

Conceptually, yes, aiming for that kind of atomicity, though without the durability aspect, usually.

Speaker 1

34:19

And task blocks, a fork-join model?

Speaker 2

34:21

Right. Task blocks provide structured parallelism. You define a block and launch subtasks within it: that's the fork. The thread that started the block automatically waits at the end of the block until all launched subtasks are complete: that's the join. It makes managing parallel task dependencies much simpler.

Speaker 1

34:37

Okay. And lastly, for C++23, the data-parallel vector library, SIMD.

Speaker 2

34:44

This aims to standardize SIMD programming in C++, providing standard vector types that map to hardware SIMD registers and operations on them. Features like masked operations (apply an operation only where a condition is true) and traits to query vector properties like size are part of it, making SIMD more portable and accessible.

Speaker 1

35:03

Lots of interesting stuff potentially coming. Let's switch to synchronization patterns. Architectural patterns, design patterns, idioms: what's the difference?

Speaker 2

35:09

Think levels of abstraction. Architectural patterns like Reactor or Proactor define the high-level structure of a concurrent system. Design patterns like Active Object or Monitor Object describe common interaction solutions between components. Idioms are lower-level, C++-specific techniques like scoped locking.

Speaker 1

35:26

And using patterns helps how?

Speaker 2

35:27

Gives you a shared vocabulary, makes designs clearer, lets you reuse proven solutions instead of reinventing the wheel. They build on best practices but are more specific named solutions to recurring problems.

Speaker 1

35:41

What patterns help with managing shared data?

Speaker 2

35:43

The book mentions things like copying the value, which avoids sharing mutable state altogether and is good for value types; thread-specific storage, where each thread gets its own copy; and using futures to share the result once it's ready.

Speaker 1

35:57

And patterns for handling mutation safely?

Speaker 2

36:04

These include scoped locking, using std::lock_guard or std::unique_lock for RAII mutex management; strategized locking, using templates or polymorphism to vary the locking strategy; thread-safe interface, designing the class itself to handle internal synchronization; and guarded suspension, using condition variables to wait for preconditions. And, as always, the book warns about lifetimes when passing references to threads.
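Scoped locking and compile-time strategized locking can be combined in one small sketch. NullMutex, Counter, count_concurrently, and count_sequentially are hypothetical names for illustration; the book's examples may use runtime polymorphism instead of a template parameter.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// A do-nothing lock for the single-threaded case. It offers lock() and
// unlock(), which is all std::lock_guard requires.
struct NullMutex {
    void lock() noexcept {}
    void unlock() noexcept {}
};

// Strategized locking: the locking policy is a template parameter.
template <typename Mutex>
class Counter {
public:
    void increment() {
        std::lock_guard<Mutex> guard(mutex_);  // scoped locking via RAII
        ++value_;
    }
    long get() const {
        std::lock_guard<Mutex> guard(mutex_);
        return value_;
    }

private:
    mutable Mutex mutex_;
    long value_ = 0;
};

// Real locking for concurrent use.
long count_concurrently(int thread_count, int increments_per_thread) {
    assert(thread_count > 0 && increments_per_thread >= 0);
    Counter<std::mutex> counter;
    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back([&counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) counter.increment();
        });
    }
    for (auto& worker : workers) worker.join();
    return counter.get();
}

// No locking overhead when only one thread is involved.
long count_sequentially(int increments) {
    Counter<NullMutex> counter;
    for (int i = 0; i < increments; ++i) counter.increment();
    return counter.get();
}
```

Counter also follows the thread-safe interface idea: callers never see the mutex, and every public method takes the lock itself.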

Speaker 1

36:22

Right. The architectural patterns: Active Object, Monitor Object, Half-Sync/Half-Async, Reactor, Proactor. Can we get a super quick idea of each?

Speaker 3

36:28

Okay, quick fire. Active Object: decouples method call from execution, uses an internal thread and message queue. Monitor Object: synchronizes access to an object's methods, usually one lock for the whole object.

Speaker 2

36:43

Half-Sync/Half-Async: separates async tasks (e.g., I/O in a thread pool) from sync processing (e.g., the main logic thread), often using a queue. Reactor: a single thread waits for events and synchronously dispatches to handlers. Proactor: waits for completion of async operations, then calls handlers; it leverages async OS features.

Speaker 1

37:04

Got it. The book uses Boost.Asio for a reactor example.

Speaker 2

37:06

Yes, showing how that library implements the event loop and handler dispatch typical of the reactor pattern.

Speaker 1

37:11

Moving on to best practices, what are the absolute top ones?

Speaker 2

37:14

Number one, far and away: minimize shared mutable state. If data isn't shared or isn't mutable, concurrency gets vastly simpler.

Speaker 1

37:21

Avoid the problem if you can.

Speaker 2

37:23

Exactly. If you must have shared mutable state, ensure proper synchronization: mutexes, atomics, et cetera. Minimize waiting time; Amdahl's law limits speedup based on the sequential parts. Use immutability (const, constexpr) where possible. Use RAII for locks (lock_guard). Don't use condition variables without predicates. Prefer higher-level tools (std::async, parallel algorithms) over raw threads when appropriate. And understand the memory model.
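The advice about predicates is worth a short sketch, since the predicate is what protects against both spurious wakeups and a notification that arrives before the wait starts. The function name wait_for_value and the payload of 42 are assumptions for illustration.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// One thread publishes a value; the caller waits for it with a predicate.
int wait_for_value() {
    std::mutex mutex;
    std::condition_variable ready_cv;
    bool ready = false;
    int payload = 0;

    std::thread producer([&] {
        {
            std::lock_guard<std::mutex> lock(mutex);  // RAII lock
            payload = 42;
            ready = true;
        }
        ready_cv.notify_one();
    });

    int received = 0;
    {
        std::unique_lock<std::mutex> lock(mutex);
        // The predicate is re-checked under the lock after every wakeup, and
        // wait returns immediately if ready was already set.
        ready_cv.wait(lock, [&] { return ready; });
        received = payload;
    }
    producer.join();
    return received;
}
```

If the producer runs first, a wait without the predicate would block forever because the notification was already sent; with the predicate, the flag is checked before blocking.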

Speaker 1

37:50

Understand the memory model. It comes full circle on that one. Okay, concurrent data structures: stacks, queues. What are the challenges?

Speaker 2

37:59

The main challenge is maintaining the data structure's internal consistency, its invariants, when multiple threads are operating on it simultaneously. A simple example is a stack. What if one thread tries to pop while another is reading the top? You might get inconsistent results or errors.

Speaker 1

38:13

How do you fix that? Locks?

Speaker 2

38:15

Locks are the first step. Coarse-grained locking, one big lock for the whole structure, is simpler but limits concurrency. Fine-grained locking, multiple locks for different parts, allows more parallelism but is much harder to get right.

Speaker 1

38:27

The book mentioned changing the interface sometimes.

Speaker 2

38:29

Yes. Like, instead of separate top and pop on a stack, provide a single atomic top-and-pop operation. This avoids the race condition between checking the top and removing it.
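A coarse-grained stack with a combined top-and-pop can be sketched as below. ThreadSafeStack and try_pop are hypothetical names, and returning std::optional for the empty case is one design choice among several (throwing or returning a shared_ptr are alternatives).

```cpp
#include <cassert>
#include <mutex>
#include <optional>
#include <utility>
#include <vector>

// Coarse-grained locking: one mutex guards the whole stack.
template <typename T>
class ThreadSafeStack {
public:
    void push(T value) {
        std::lock_guard<std::mutex> guard(mutex_);
        data_.push_back(std::move(value));
    }

    // Reads and removes the top element in one locked step, so no other
    // thread can slip in between "look at top" and "pop".
    std::optional<T> try_pop() {
        std::lock_guard<std::mutex> guard(mutex_);
        if (data_.empty()) return std::nullopt;
        T value = std::move(data_.back());
        data_.pop_back();
        return std::optional<T>(std::move(value));
    }

private:
    std::mutex mutex_;
    std::vector<T> data_;
};
```

With separate top() and pop() calls, two threads could both read the same top element and then both pop, silently dropping an element neither of them saw.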

Speaker 1

38:40

What about lock free structures?

Speaker 2

38:41

That's the next level: avoiding locks entirely using atomic operations like compare-and-swap. This can offer better performance and avoids deadlock, but it's extremely complex. You run into issues like the ABA problem, and memory reclamation becomes very tricky.

38:57

ABA is a problem where a thread reads value A from a location, then some computation happens, then it reads A again, but in the meantime another thread changed it to B and then back to A. Your compare-and-swap might succeed, thinking nothing changed, but the underlying state is different. It needs careful handling, and the book mentions hazard pointers as one solution.
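To show what a compare-and-swap loop looks like without taking on the full reclamation problem, here is a deliberately restricted sketch: a lock-free push plus a pop_all that detaches the whole list with one exchange. LockFreeIntStack and push_concurrently are hypothetical names, and this is not the book's lock-free stack. Because no thread ever pops a single node while others hold pointers into the list, this variant sidesteps the ABA and reclamation issues rather than solving them.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

class LockFreeIntStack {
    struct Node {
        int value;
        Node* next;
    };

public:
    LockFreeIntStack() = default;
    LockFreeIntStack(const LockFreeIntStack&) = delete;
    LockFreeIntStack& operator=(const LockFreeIntStack&) = delete;
    ~LockFreeIntStack() { pop_all(); }  // frees any remaining nodes

    void push(int value) {
        Node* node = new Node{value, head_.load(std::memory_order_relaxed)};
        // If head_ changed since we read it, compare_exchange_weak writes the
        // current head into node->next and we simply try again.
        while (!head_.compare_exchange_weak(node->next, node,
                                            std::memory_order_release,
                                            std::memory_order_relaxed)) {
        }
    }

    // Detaches the entire list in one atomic step; the caller then owns it
    // exclusively, so deleting the nodes is safe.
    std::vector<int> pop_all() {
        Node* node = head_.exchange(nullptr, std::memory_order_acquire);
        std::vector<int> values;
        while (node != nullptr) {
            values.push_back(node->value);
            Node* next = node->next;
            delete node;
            node = next;
        }
        return values;
    }

private:
    std::atomic<Node*> head_{nullptr};
};

// Pushes from several threads at once and reports how many values arrived.
std::size_t push_concurrently(int thread_count, int pushes_per_thread) {
    assert(thread_count > 0 && pushes_per_thread >= 0);
    LockFreeIntStack stack;
    std::vector<std::thread> workers;
    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back([&stack, pushes_per_thread] {
            for (int i = 0; i < pushes_per_thread; ++i) stack.push(i);
        });
    }
    for (auto& worker : workers) worker.join();
    return stack.pop_all().size();
}
```

A single-element pop would have to read head->next from a node that another thread might already have deleted, which is exactly where hazard pointers or atomic smart pointers come in.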

Speaker 1

39:17

Wow, concurrent data structures sound like a deep topic on their own.

Speaker 2

39:20

They really are. The book gives a taste, including a lock-free stack using C++20 atomic smart pointers.

Speaker 1

39:26

What about the time library, chrono? How did that relate?

Speaker 2

39:29

Chrono is essential for measuring performance, setting timeouts, and managing timed waits. It provides clocks (system_clock for wall time, steady_clock for intervals), time points (specific moments), and durations (intervals). Very flexible for representing time.
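The three chrono building blocks fit together in a few lines. The function name timed_sleep is an assumption for illustration; it measures an interval with steady_clock, which is the clock to prefer for elapsed time because it never jumps backwards.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Sleeps for the given duration and returns how long that actually took.
std::chrono::milliseconds timed_sleep(std::chrono::milliseconds pause) {
    assert(pause.count() >= 0);
    const auto start = std::chrono::steady_clock::now();  // a time point
    std::this_thread::sleep_for(pause);                   // a duration
    const auto stop = std::chrono::steady_clock::now();
    // Subtracting two time points yields a duration; duration_cast converts
    // it to whole milliseconds, truncating any remainder.
    return std::chrono::duration_cast<std::chrono::milliseconds>(stop - start);
}
```

The measured value is normally a little larger than the requested pause, since sleep_for only promises a minimum and the scheduler adds its own delay.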

Speaker 1

39:44

Accurately. And atomic operations, transactional memory: any final points there?

Speaker 2

39:50

The book mentions that atomics should ideally be address-free, meaning atomic even across processes sharing memory, and it touches on the ACID properties (atomicity, consistency, isolation, durability) for transactions, noting that C++ transactional memory proposals focus mainly on A, C, and I, with durability being less of a focus than in databases.

Speaker 1

40:10

And finally, a glossary. That sounds useful.

Speaker 2

40:13

Very. Concurrency has a lot of specific terminology: acquire, release, data race, deadlock, sequential consistency, et cetera. The glossary helps nail down those definitions.

Speaker 1

40:21

Wow. Okay, we have definitely covered a lot of ground there based on the material. A real deep dive into modern C++ concurrency.

Speaker 2

40:27

Absolutely. From the memory model's tricky foundations right up to coroutines, parallel algorithms, and a peek at what's coming with executors, C++ provides a pretty powerful set of tools.

Speaker 1

40:39

So for you listening in, I think the big takeaway is that concurrency really forces a shift in thinking compared to sequential code. Timing, interaction, ordering: it all becomes critical.

Speaker 2

40:52

Right. It unlocks performance potential, but also opens up whole new categories of bugs like data races and deadlocks if you're not careful.

Speaker 1

41:00

Vigilance is needed, and this is an evolving field, right?

Speaker 2

41:03

Definitely. The C++ standard keeps adding features, and best practices emerge. So we'd encourage you to maybe pick an area that caught your interest today and dig deeper. Try out the parallel algorithms, maybe write a small program using std::jthread. When C++23 features become available, experiment with executors.

Speaker 1

41:21

Or, if you're feeling brave, even try implementing a simple lock-free structure, just to appreciate the complexity.

Speaker 2

41:27

Exactly. The more you work with it, the better your intuition becomes.

Speaker 1

41:30

Ultimately, understanding these concepts puts you in a much stronger position to build modern software. Being able to reason about concurrency is becoming a non-negotiable skill in our multi-core world.

Speaker 2

41:42

Couldn't agree more.

Speaker 1

41:44

Thanks for joining us on this deep dive.

Transcript source: Provided by creator in RSS feed.
