System Support for Concurrent Software Reliability
Lucia, Brandon M.
Parallel and concurrent software is more complex than sequential code because interactions between concurrent computations and the ordering of program events can vary across executions. This nondeterministic variation is hard to understand and control, introducing the potential for concurrency bugs. This dissertation addresses two challenges related to concurrency bugs, focusing on shared-memory multi-threaded programs. First, concurrency bugs are hard to find, understand, and fix, but debugging is essential to software correctness. Second, concurrency bugs cause schedule-dependent failures that degrade system reliability. Targeting debugging, we develop two new concurrency debugging techniques based on statistical analysis and novel abstractions of inter-thread communication. These techniques isolate communications related to bugs and reconstruct failing executions. We show several hardware and software system designs that efficiently implement these techniques. Targeting the avoidance of schedule-dependent failures, we then develop two techniques for automatically avoiding schedule-dependent failures due to atomicity violations, a common concurrent program failure. We use specialized serializability analyses to identify code that should be atomic and system support to enforce atomicity. We implement these techniques with architecture and system support. Finally, we develop a mechanism for general schedule-dependent failure avoidance. We use a statistical analysis and leverage large communities of deployed systems to learn how to constrain executions to avoid previously seen failures. We show a software-only distributed system implementation that avoids real software failures with overheads low enough for production use.
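As an illustration of the atomicity violations the abstract refers to, the following minimal Python sketch enumerates every schedule of two threads that each increment a shared counter with a separate read and write. It is not taken from the dissertation: the thread names `T1` and `T2` and the helpers `schedules`, `run`, and `is_atomic` are hypothetical, and the simulation is deterministic rather than using real threads. It is meant only to show why a failure can depend on the schedule.

```python
from itertools import permutations

# Each hypothetical thread performs a two-step increment: read, then write.
THREADS = ("T1", "T2")
OPS = [(t, op) for t in THREADS for op in ("read", "write")]

def schedules():
    """Yield every interleaving that preserves each thread's program order."""
    for order in permutations(OPS):
        if all(order.index((t, "read")) < order.index((t, "write")) for t in THREADS):
            yield order

def run(schedule):
    """Execute one schedule against a shared counter and return its final value."""
    shared = 0
    local = {}  # per-thread copy of the value it last read
    for thread, op in schedule:
        if op == "read":
            local[thread] = shared
        else:
            shared = local[thread] + 1
    return shared

def is_atomic(schedule):
    """True when every thread's read and write are adjacent in the schedule."""
    return all(
        schedule.index((t, "write")) - schedule.index((t, "read")) == 1
        for t in THREADS
    )

results = {s: run(s) for s in schedules()}
# Schedules that lose an update (final value is not 2).
failing = [s for s, value in results.items() if value != 2]

for s, value in results.items():
    label = "atomic" if is_atomic(s) else "interleaved"
    print(label, value, [f"{t}.{op}" for t, op in s])
```

In this toy model, the schedules that lose an update are exactly those in which another thread's access falls between a thread's read and its write. Enforcing atomicity over the read-write pair rules those schedules out.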