Concurrency

13.1 Check Your Understanding

  1. Explain the distinctions among concurrent, parallel, and distributed.
  2. Explain the motivation for concurrency. Why do people write concurrent programs? What accounts for the increased interest in concurrency in recent years?
  3. Describe the implementation levels at which parallelism appears in modern systems, and the levels of abstraction at which it may be considered by the programmer.
  4. What is a race condition? What is synchronization?
  5. What is a context switch? Preemption?
  6. Explain the concept of a dispatch loop. What are its advantages and disadvantages with respect to multithreaded code?
  7. Explain the distinction between a multiprocessor and a cluster; between a processor and a core.
  8. What does it mean for memory in a multiprocessor to be uniform? What is the alternative?
  9. Explain the coherence problem for multicore and multiprocessor caches.
  10. What is a vector machine? Where does vector technology appear in modern systems?

13.2 Check Your Understanding

  1. Explain the differences among coroutines, threads, lightweight processes, and heavyweight processes.
  2. What is quasiparallelism?
  3. Describe the bag of tasks programming model.
  4. What is busy-waiting? What is its principal alternative?
  5. Name four explicitly concurrent programming languages.
  6. Why don’t message-passing programs require explicit synchronization mechanisms?
  7. What are the tradeoffs between language-based and library-based implementations of concurrency?
  8. Explain the difference between data parallelism and task parallelism.
  9. Describe six different mechanisms commonly used to create new threads of control in a concurrent program.
  10. In what sense is fork / join more powerful than co-begin?
  11. What is a thread pool in Java? What purpose does it serve? (A small example follows this list.)
  12. What is meant by a two-level thread implementation?
  13. What is a ready list?
  14. Describe the progressive implementation of scheduling, preemption, and true parallelism on top of coroutines.
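
For question 11, the following is a minimal Java illustration of a thread pool built on the standard ExecutorService interface; the class name and task bodies are purely illustrative.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PoolDemo {
    public static void main(String[] args) throws InterruptedException {
        // A fixed-size pool reuses a small number of worker threads
        // rather than creating (and destroying) one thread per task.
        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 10; i++) {
            final int id = i;
            pool.submit(() -> System.out.println(
                "task " + id + " ran on " + Thread.currentThread().getName()));
        }
        pool.shutdown();                             // no new tasks accepted
        pool.awaitTermination(10, TimeUnit.SECONDS); // wait for queued tasks
    }
}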

13.3 Check Your Understanding

  1. What is mutual exclusion? What is a critical section?
  2. What does it mean for an operation to be atomic? Explain the difference between atomicity and condition synchronization.
  3. Describe the behavior of a test_and_set instruction. Show how to use it to build a spin lock. (A minimal sketch follows this list.)
  4. Describe the behavior of the compare_and_swap instruction. What advantages does it offer in comparison to test_and_set?
  5. Explain how a reader-writer lock differs from an “ordinary” lock.
  6. What is a barrier? In what types of programs are barriers common?
  7. What does it mean for an algorithm to be nonblocking? What advantages do nonblocking algorithms have over algorithms based on locks?
  8. What is sequential consistency? Why is it difficult to implement?
  9. What information is provided by a memory consistency model? What is the relationship between hardware-level and language-level memory models?
  10. Explain how to extend a preemptive uniprocessor scheduler to work correctly on a multiprocessor.
  11. What is a spin-then-yield lock?
  12. What is a bounded buffer?
  13. What is a semaphore? What operations does it support? How do binary and general semaphores differ?
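
For question 3, here is a minimal spin-lock sketch. It uses Java’s AtomicBoolean.getAndSet to stand in for a hardware test_and_set instruction; the class name is illustrative.

import java.util.concurrent.atomic.AtomicBoolean;

class SpinLock {
    private final AtomicBoolean held = new AtomicBoolean(false);

    void acquire() {
        // test_and_set: atomically set the flag and return its previous
        // value; the lock is ours once that previous value was false.
        while (held.getAndSet(true)) {
            // spin; a test-and-test_and_set variant would re-read
            // held.get() here before retrying the atomic operation
        }
    }

    void release() {
        held.set(false);
    }
}

A compare_and_swap-based version (question 4) would replace getAndSet(true) with compareAndSet(false, true).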

13.4 Check Your Understanding

  1. What is a monitor? How do monitor condition variables differ from semaphores?
  2. Explain the difference between treating monitor signals as hints and treating them as absolutes.
  3. What is a monitor invariant? Under what circumstances must it be guaranteed to hold?
  4. Describe the nested monitor problem and some potential solutions.
  5. What is deadlock?
  6. What is a conditional critical region? How does it differ from a monitor?
  7. Summarize the synchronization mechanisms of Ada 95, Java, and C#. Contrast them with one another, and with monitors and conditional critical regions. Be sure to explain the features added to Java 5. (A small Java example follows this list.)
  8. What is transactional memory? What advantages does it offer over algorithms based on locks? What challenges will need to be overcome before it enters widespread use?
  9. Describe the semantics of the HPF/Fortran 95 forall loop. How does it differ from do concurrent?
  10. Why might pure functional languages be said to provide a particularly attractive setting for concurrent programming?
  11. What are futures? In what languages do they appear? What precautions must the programmer take when using them?
  12. Explain the difference between AND parallelism and OR parallelism in Prolog.
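
For question 7, a small sketch of Java’s monitor-style synchronized / wait / notifyAll mechanism, using a bounded buffer as the example; the class name and capacity handling are illustrative.

class BoundedBuffer<T> {
    private final Object[] slots;
    private int head = 0, tail = 0, count = 0;

    BoundedBuffer(int capacity) { slots = new Object[capacity]; }

    synchronized void insert(T item) throws InterruptedException {
        while (count == slots.length)      // full: wait for a consumer
            wait();
        slots[tail] = item;
        tail = (tail + 1) % slots.length;
        count++;
        notifyAll();                       // wake any waiting consumers
    }

    @SuppressWarnings("unchecked")
    synchronized T remove() throws InterruptedException {
        while (count == 0)                 // empty: wait for a producer
            wait();
        T item = (T) slots[head];
        head = (head + 1) % slots.length;
        count--;
        notifyAll();                       // wake any waiting producers
        return item;
    }
}

The while loops (rather than if statements) recheck the condition after every wake-up, treating signals as hints rather than absolutes, the distinction raised in question 2.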

13.7 Exercises

  1. Give an example of a “benign” race condition: one whose outcome affects program behavior but not correctness.
  2. We have defined the ready list of a thread package to contain all threads that are runnable but not running, with a separate variable to identify the currently running thread. Could we just as easily have defined the ready list to contain all runnable threads, with the understanding that the one at the head of the list is running? Hint: Think about multiprocessors.
  3. Imagine you are writing the code to manage a hash table that will be shared among several concurrent threads. Assume that operations on the table need to be atomic. You could use a single mutual exclusion lock to protect the entire table, or you could devise a scheme with one lock per hash-table bucket. Which approach is likely to work better, under what circumstances? Why?
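
To make exercise 3’s finer-grained option concrete, here is an illustrative Java sketch that guards each bucket with its own lock (all names are made up); the coarse-grained alternative would wrap every operation in a single table-wide lock.

class StripedHashTable<K, V> {
    private static final int BUCKETS = 64;
    private final java.util.HashMap<K, V>[] buckets;
    private final Object[] locks = new Object[BUCKETS];

    @SuppressWarnings("unchecked")
    StripedHashTable() {
        buckets = (java.util.HashMap<K, V>[]) new java.util.HashMap[BUCKETS];
        for (int i = 0; i < BUCKETS; i++) {
            buckets[i] = new java.util.HashMap<>();
            locks[i] = new Object();
        }
    }

    private int index(Object key) {
        // spread keys across buckets; each bucket has a private lock
        return (key.hashCode() & 0x7fffffff) % BUCKETS;
    }

    V get(K key) {
        int i = index(key);
        synchronized (locks[i]) {          // serializes only this bucket
            return buckets[i].get(key);
        }
    }

    void put(K key, V value) {
        int i = index(key);
        synchronized (locks[i]) {
            buckets[i].put(key, value);
        }
    }
}

Threads that hash to different buckets proceed in parallel; contention arises only on collisions within a bucket, which is what the exercise asks you to weigh against the simplicity of a single lock.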
  4. The typical spin lock holds only one bit of data, but requires a full word of storage, because only full words can be read, modified, and written atomically in hardware. Consider, however, the hash table of the previous exercise. If we choose to employ a separate lock for each bucket of the table, explain how to implement a “two-level” locking scheme that couples a conventional spin lock for the table as a whole with a single bit of locking information for each bucket. Explain why such a scheme might be desirable, particularly in a table with external chaining.
  5. Drawing inspiration from Examples 13.29 and 13.30, design a nonblocking linked-list implementation of a stack using compare_and_swap. When CAS was first introduced, on the IBM 370 architecture, this algorithm was one of the driving applications.
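
As a point of departure for exercise 5, here is a minimal Treiber-style sketch in Java, in which AtomicReference.compareAndSet plays the role of compare_and_swap; class and method names are illustrative, and garbage collection quietly handles the node-reclamation issue taken up in the next exercise.

import java.util.concurrent.atomic.AtomicReference;

class LockFreeStack<T> {
    private static final class Node<T> {
        final T value;
        Node<T> next;
        Node(T value) { this.value = value; }
    }

    private final AtomicReference<Node<T>> top = new AtomicReference<>();

    void push(T value) {
        Node<T> node = new Node<>(value);
        Node<T> old;
        do {
            old = top.get();
            node.next = old;
        } while (!top.compareAndSet(old, node));   // retry if top changed
    }

    T pop() {
        Node<T> old;
        Node<T> next;
        do {
            old = top.get();
            if (old == null) return null;          // empty stack
            next = old.next;
        } while (!top.compareAndSet(old, next));   // retry if top changed
        return old.value;
    }
}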
  6. Building on the previous exercise, suppose that stack nodes are dynamically allocated. If we read a pointer and then are delayed, for example due to preemption, the node to which the pointer refers may be reclaimed and then reallocated for a different purpose. A subsequent compare-and-swap may then succeed when logically it should not. This issue is known as the ABA problem.

Give a concrete example (an interleaving of operations in two or more threads) in which the ABA problem may result in incorrect behavior for your stack. Explain why this behavior cannot occur in systems with automatic garbage collection. Suggest what might be done to avoid it in systems with manual storage management.

  7. We noted in Section 13.3.2 that several processors, including the ARM, MIPS, and Power, provide an alternative to compare_and_swap (CAS) known as load_linked / store_conditional (LL/SC). A load_linked instruction loads a memory location into a register and stores certain bookkeeping information into hidden processor registers. A store_conditional instruction stores the register back into the memory location, but only if the location has not been modified by any other processor since the load_linked was executed. Like compare_and_swap, store_conditional returns an indication of whether it succeeded or not.

(a) Rewrite the code sequence of Example 13.29 using LL/SC.

(b) On most machines, an SC instruction can fail for any of several “spurious” reasons, including a page fault, a cache miss, or the occurrence of an interrupt in the time since the matching LL. What steps must a programmer take to make sure that algorithms work correctly in the face of such failures?

(c) Discuss the relative advantages of LL/SC and CAS. Consider how they might be implemented on a cache-coherent multiprocessor. Are there situations in which one would work but the other would not? Hints: Consider algorithms in which a thread may need to touch more than one memory location. Also consider algorithms in which the contents of a memory location might be changed and then restored, as in the previous exercise.

  8. Starting with the test-and-test_and_set lock of Figure 13.8, implement busy-wait code that will allow readers to access a data structure concurrently. Writers will still need to lock out both readers and other writers. You may use any reasonable atomic instruction or instructions, such as LL/SC. Consider the issue of fairness. In particular, if there are always readers interested in accessing the data structure, your algorithm should ensure that writers are not locked out forever.

  9. Assuming the Java memory model,

(a) Explain why it is not sufficient in Figure 13.11 to label X and Y as volatile.

(b) Explain why it is sufficient, in that same figure, to enclose C’s reads, and similarly those of D, in a synchronized block for some common shared object O.

(c) Explain why it is sufficient, in Example 13.31, to label both inspected and X as volatile, but not to label only one.

Hint: You may find it useful to consult Doug Lea’s Java Memory Model “Cookbook for Compiler Writers.”

  10. Implement the nonblocking queue of Example 13.30 on an x86. Complete pseudocode can be found in the paper by Michael and Scott. Do you need fence instructions to ensure consistency? If you have access to appropriate hardware, port your code to a machine with a more relaxed memory model, such as ARM or Power. What new fences or atomic references do you need?

  11. Consider the implementation of software transactional memory in Figure 13.19.

(a) How would you implement the read_set, write_map, and lock_map data structures? You will want to minimize the cost not only of insert and lookup operations but also of zeroing out the table at the end of a transaction (so it can be used again) and of extending the table if it becomes too full.

(b) The validate routine is called in two different places. Expand these calls in-line and customize them to the calling context. What optimizations can you achieve?

(c) Optimize the commit routine to exploit the fact that a final validation is unnecessary if no other transaction has committed since valid_time.

(d) Further optimize commit by observing that the for loop in the finally clause really needs to iterate over orecs, not over addresses. There may be a difference if more than one address hashes to the same orec. What data, ideally, should lock_map hold?

  12. The code of Example 13.35 could fairly be accused of displaying poor abstraction. If we make desired_condition a delegate (a subroutine or object closure), can we pass it as an extra parameter, and move the signal and scheduler lock management inside sleep_on? Hint: Consider the code for the P operation in Figure 13.15.

  13. The mechanism used in Figure 13.13 to make scheduler code reentrant employs a single OS-provided lock for all the scheduling data structures of the application. Among other things, this mechanism prevents threads on separate processors from performing P or V operations on unrelated semaphores, even when none of the operations needs to block. Can you devise another synchronization mechanism for scheduler-related operations that admits a higher degree of concurrency but that is still correct?

  14. Show how to implement a lock-based concurrent set as a singly linked sorted list. Your implementation should support insert, find, and remove operations, and should permit operations on separate portions of the list to occur concurrently, so a single lock for the entire list will not suffice. Hint: You will want to use a “walking lock” idiom in which acquire and release operations are interleaved in non-LIFO order.

  15. Difficult: Implement a nonblocking version of the set of the previous exercise. Hint: You will probably discover that insertion is easy but deletion is hard. Consider a lazy deletion mechanism in which cleanup (physical removal of a node) may occur well after logical completion of the removal.

  16. To make spin locks useful on a multiprogrammed multiprocessor, one might want to ensure that no process is ever preempted in the middle of a critical section. That way it would always be safe to spin in user space, because the process holding the lock would be guaranteed to be running on some other processor, rather than preempted and possibly in need of the current processor. Explain why an operating system designer might not want to give user processes the ability to disable preemption arbitrarily. Hint: Think about fairness and multiple users. Can you suggest a way to get around the problem?

  17. Show how to use semaphores to construct a scheduler-based n-thread barrier.

  18. Prove that monitors and semaphores are equally powerful. That is, use each to implement the other. In the monitor-based implementation of semaphores, what is your monitor invariant?

  19. Show how to use binary semaphores to implement general semaphores.

  20. In Example 13.38 (Figure 13.15), suppose we replaced the middle four lines of procedure P with

if S.N = 0
     sleep_on(S.Q)
S.N -= 1

and the middle four lines of procedure V with

S.N += 1
if S.Q is nonempty
     enqueue(ready_list, dequeue(S.Q))

What is the problem with this new version? Explain how it connects to the question of hints and absolutes in Section 13.4.1.

  21. Suppose that every monitor has a separate mutual exclusion lock, so that different threads can run in different monitors concurrently, and that we want to release exclusion on both inner and outer monitors when a thread waits in a nested call. When the thread awakens it will need to reacquire the outer locks. How can we ensure its ability to do so? Hint: Think about the order in which to acquire locks, and be prepared to abandon Hoare semantics.

  22. Show how general semaphores can be implemented with conditional critical regions in which all threads wait for the same condition, thereby avoiding the overhead of unproductive wake-ups.

  23. Write code for a bounded buffer using the protected object mechanism of Ada 95.

  24. Repeat the previous exercise in Java using synchronized statements or methods. Try to make your solution as simple and conceptually clear as possible. You will probably want to use notifyAll.

  25. Give a more efficient solution to the previous exercise that avoids the use of notifyAll. Warning: It is tempting to observe that the buffer can never be both full and empty at the same time, and to assume therefore that waiting threads are either all producers or all consumers. This need not be the case, however: if the buffer ever becomes even a temporary performance bottleneck, there may be an arbitrary number of waiting threads, including both producers and consumers.

  26. Repeat the previous exercise using Java Lock variables.

  27. Explain how escape analysis, mentioned briefly in Sidebar 10.3, could be used to reduce the cost of certain synchronized statements and methods in Java.

  28. The dining philosophers problem is a classic exercise in synchronization. Five philosophers sit around a circular table. In the center is a large communal plate of spaghetti. Each philosopher repeatedly thinks for a while and then eats for a while, at intervals of his or her own choosing. On the table between each pair of adjacent philosophers is a single fork. To eat, a philosopher requires both adjacent forks: the one on the left and the one on the right. Because they share a fork, adjacent philosophers cannot eat simultaneously.

Write a solution to the dining philosophers problem in which each philosopher is represented by a process and the forks are represented by shared data. Synchronize access to the forks using semaphores, monitors, or conditional critical regions. Try to maximize concurrency. (A naive semaphore-based sketch, which makes no attempt to avoid deadlock, appears at the end of these exercises.)

  29. In the previous exercise you may have noticed that the dining philosophers are prone to deadlock. One has to worry about the possibility that all five of them will pick up their right-hand forks simultaneously, and then wait forever for their left-hand neighbors to finish eating.

Discuss as many strategies as you can think of to address the deadlock problem. Can you describe a solution in which it is provably impossible for any philosopher to go hungry forever? Can you describe a solution that is fair in a strong sense of the word, that is, one in which no philosopher gets more chances to eat than another over the long term?

  30. In some concurrent programming systems, global variables are shared by all threads. In others, each newly created thread has a separate copy of the global variables, commonly initialized to the values of the globals of the creating thread. Under this “private globals” approach, shared data must be allocated from a special heap. In still other systems, the programmer can specify which global variables are to be private and which are to be shared.

Discuss the tradeoffs between private and shared global variables. Which would you prefer to have available, for which sorts of programs? How would you implement each? Are some options harder to implement than others? To what extent do your answers depend on the nature of processes provided by the operating system?

  31. Rewrite Example 13.51 in Java.

  32. AND parallelism in logic languages is analogous to the parallel evaluation of arguments in a functional language (for example, Multilisp). Does OR parallelism have a similar analog? Hint: Think about special forms. Can you suggest a way to obtain the effect of OR parallelism in Multilisp?

  33. In Section 13.4.5 we claimed that both AND parallelism and OR parallelism were problematic in Prolog, because they failed to adhere to the deterministic search order required by language semantics. Elaborate on this claim. What specifically can go wrong?

  34. Exercises 13.34 through 13.38 can be found In More Depth.
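
Finally, as promised in exercise 28, here is a deliberately naive semaphore-based sketch of the dining philosophers (Java; all names are illustrative). Each fork is a binary semaphore, and every philosopher takes the left fork and then the right, so the sketch exhibits exactly the deadlock risk examined in exercise 29.

import java.util.concurrent.Semaphore;

class DiningPhilosophers {
    static final int N = 5;
    static final Semaphore[] forks = new Semaphore[N];

    public static void main(String[] args) {
        for (int i = 0; i < N; i++) forks[i] = new Semaphore(1);  // one fork per slot
        for (int i = 0; i < N; i++) {
            final int id = i;
            new Thread(() -> {
                try {
                    while (true) {
                        think(id);
                        forks[id].acquire();             // left fork
                        forks[(id + 1) % N].acquire();   // right fork; if all five
                                                         // hold their left forks,
                                                         // everyone blocks here
                        eat(id);
                        forks[(id + 1) % N].release();
                        forks[id].release();
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }).start();
        }
    }

    static void think(int id) { }   // placeholder for thinking
    static void eat(int id) { System.out.println("philosopher " + id + " eats"); }
}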