Link to paper
The full paper is available here.
You can also find the paper on PapersWithCode here.
Abstract
- Recent works have shown that Quicksort implementations using vector CPU instructions can outperform non-vectorized algorithms.
- The proposed ‘vqsort’ algorithm integrates into the state-of-the-art parallel sorter ‘ips4o’, with a geometric mean speedup of 1.59.
- It works on seven instruction sets across four platforms, and supports floating-point and 16-128 bit integer keys.
- It is the fastest sort for non-tuple keys on CPUs, up to 20 times as fast as the sorting algorithms implemented in standard libraries.
Paper Content
Introduction
- Fundamental properties of CPUs require software to be designed to utilize SIMD and/or vector extensions
- Replacing Quicksort from a standard library by a vectorized Mergesort implementation can reduce energy usage by a factor of six
- Developing SIMD software involves specialized domain expertise
- There are five major instruction sets across three architectures
- Autovectorization is an appealing option but re-ordering vector lanes is infeasible
- Highway library is used as an abstraction layer over platform-specific intrinsics
- Mergesort is commonly used but typically requires O(N) extra storage
- Radixsort scatters keys to separate arrays but has not been implemented for SIMD/vectors
- Vectorized Quicksort is cache-friendly and requires less memory bandwidth
- vqsort is the fastest sorting implementation known for commercially available shared-memory machines
- vqsort supports seven instruction sets with close-to native performance
- vqsort is open-sourced, tested, boundschecked, documented and works with three major compilers
Vectorized quicksort
- Quicksort is a simple algorithm that recursively sorts arrays
- Performance of Quicksort depends on the choice of pivot element
- Portable partitioning is faster than AVX-512 specific code
- Vectorized, cache-aware, robust pivot sampling
- For small array sizes, alternative sorting algorithms/strategies are used
- Vectorized sorting network sorts 256 keys in several hundred CPU cycles
Partition
- Partitioning the input array is defined as moving elements which compare less than or equal to the pivot argument before the other elements
- This accounts for a large majority of compute time
- An AVX-512 instruction is used to partition
- To maintain the invariant, inputs are loaded from the left or right side
- To establish the invariant before the loop, the first and last vectors of the input are loaded to registers
- Unrolling the loop is crucial for performance
- An additional loop is used to partition small arrays
- To prevent errors, the last vector of inputs is loaded into a register
- The result of these efforts is portable code that outperforms AVX-512-specific code by a factor of 1.7
Pivot selection
- ChoosePivot returns the pivot for Recurse and Partition
- Published Quicksort implementations use medians of constant-sized samples
- Adaptation needed for vectors and caches
- Load nine 64-byte chunks from random 64-byte-aligned offsets
- Reduce elements to single median using medians of three
- 64-byte chunk size corresponds to L1 cache line size
- Generate random bits using SFC64
- Division-free modulo algorithm used to obtain offsets
- Reduce buffer to single median
- Impose limit of 2•log 2 (n)+4 recursions
- Switch to Heapsort if limit exceeded
- Secure random generator used to prevent malicious input
Base case
- Quicksort is commonly optimized by handling small arrays separately.
- Sorting networks built on vector instructions can have lower constant factors than other algorithms.
- Vector instructions require handling input arrays that don’t evenly divide the vector size.
- A buffer is used to store the sorted results and must be large enough to fit nine vectors or four chunks plus two vectors.
Sort order and 128-bit keys
- User-specified comparators interact poorly with runtime dispatch.
- We call the best available implementation through an indirect pointer.
- We generalize comparisons to enable sorting in ascending or descending order.
- We take advantage of Highway’s 128-bit vectors to treat pairs of 64-bit lanes as unsigned 128-bit numbers.
- We reproduce the x86 implementation in Algorithm 2.
- We integrate emulated 128-bit comparisons into the same abstraction.
Sorting networks
- Sorting arrays in the ‘base case’ (n ≤ 256) is done with sorting networks
- Compare-and-exchange modules are the building blocks of sorting networks
- Values in the modules are sorted using min and max operations
- Elements are copied into an aligned buffer and interpreted as a matrix
- Vectorization strategy for sorting networks involves sorting columns with sorting networks and merging sorted columns with vectorized Bitonic Merge networks
- Showcase example has capacity of elements in a vector limited to four and a total of 16 elements to be sorted
- Sorting values within columns of a matrix with sorting networks is vector-friendly
- Vectorized compare-and-exchange operations execute the same compare-and-exchange module in all columns simultaneously
- Merging sorted columns or sorted submatrices involves permuting the values of vectors
- Memory bandwidth is usually the limiting factor for the performance of vectorized software
More bandwidth-friendly algorithms
- Quicksort splits N inputs into two partitions, requiring log 2 (N ) recursions
- Scattering inputs into K partitions changes the base of the logarithm to K
- Compressing vector lanes and storing to each partition is unsuitable for K ≥ 8 and current vector lengths of 512-bits
- With 64-bit keys, throughput would be limited to 6 GB/s
- K = 4 was previously found to be helpful in a non-vectorized context
- Vectorized compress with K = 4 reaches about half the speed of K = 2
- Samplesort is a very large (K = 256) generalization
- ips4o scales better than vqsort but is slower in aggregate for less than 19 threads
- ips4o executes nearly five times as many instructions as vqsort
- Switching to vqsort after initial recursions of ips4o improves scalability
- Single instance of ips4o’s parallel mode using 16 threads is less bandwidth-intensive
- Hybrid is 1.59 times as fast as ips4o in single instance
- Hybrid is 2.89 times as fast as ips4o in single core with near-exclusive usage of L3 cache
- vqsort using AVX-512 is 1.5 to 2.0 times as fast as on AVX2
Performance portability
- Performance portability means running on different platforms and being efficient.
- We tested the same source code and benchmark on an Apple M1 Max system.
- Results are not directly comparable due to different clock rates.
- With the M1’s 128-bit vectors and older NEON instruction set, there was a 3-8x speedup over the standard library.
- vqsort is practical and useful on multiple architectures and instruction sets.
Limitations
- Sort keys must be 16/32/64-bit integers, floating-point numbers, or pairs of 64bit numbers
- VBMI instruction set needed to re-order 8-bit elements across a register
- Excludes applications that need to sort tuples or large items with custom comparators
- Surveyed uses of sorting in Google’s production workloads and found sorting numbers more costly than sorting strings or user-defined types
Conclusions
- Used Highway cross-platform abstraction layer to implement vqsort
- New recursive sorting network for up to 256 elements
- Vector-friendly pivot sampling
- vqsort is fastest sort for individual keys on AVX2 and AVX-512
- 3-8x speedup versus standard library on Apple M1 hardware
- Integrating vqsort into ips4o yields 1.59x speedup
- Supports 32/64-bit floatingpoint and 16/32/64/128-bit integer keys
- Supports seven instruction sets with close-to native performance