Abstract
Threads ("lightweight" processes) have become a common element of new languages and
operating systems. This paper examines the performance implications of several data
structure and algorithm alternatives for thread management in shared-memory multi-
processors. Both experimental measurements and analytical model projections are
presented.
For applications with fine-grained parallelism, small differences in thread management
are shown to have significant performance impact, often posing a tradeoff between
throughput and latency. Pre-processor data structures can be used to improve
throughput, and in some circumstances to avoid locking, improving latency as well.
The method used by processors to queue for locks is also shown to affect performance
significantly. Normal methods of critical resource waiting can substantially degrade
performance with moderate numbers of waiting processors. We present an Ethernet-style
backoff algorithm that largely eliminates this effect.