Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

Recent works have shown that Quicksort implementations using vector CPU instructions can outperform non-vectorized algorithms.
The proposed ‘vqsort’ algorithm integrates into the state-of-the-art parallel sorter ‘ips4o’, with a geometric mean speedup of 1.59.
It works on seven instruction sets across four platforms, and supports floating-point and 16-128 bit integer keys.
To the best of the authors' knowledge, it is the fastest sort for non-tuple keys on CPUs, up to 20 times as fast as the sorting algorithms implemented in standard libraries.

Paper Content

Introduction

Fundamental properties of CPUs require software to be designed to utilize SIMD and/or vector extensions
Replacing Quicksort from a standard library by a vectorized Mergesort implementation can reduce energy usage by a factor of six
Developing SIMD software involves specialized domain expertise
There are five major instruction sets across three architectures
Autovectorization is an appealing option but re-ordering vector lanes is infeasible
Highway library is used as an abstraction layer over platform-specific intrinsics
Mergesort is commonly used but typically requires O(N) extra storage
Radixsort scatters keys to separate arrays but has not been implemented for SIMD/vectors
Vectorized Quicksort is cache-friendly and requires less memory bandwidth
vqsort is the fastest sorting implementation known for commercially available shared-memory machines
vqsort supports seven instruction sets with close-to native performance
vqsort is open source, tested, bounds-checked, documented, and works with three major compilers

Vectorized quicksort

Quicksort is a simple algorithm that recursively sorts arrays
Performance of Quicksort depends on the choice of pivot element
Portable partitioning is faster than AVX-512 specific code
Vectorized, cache-aware, robust pivot sampling
For small array sizes, alternative sorting algorithms/strategies are used
Vectorized sorting network sorts 256 keys in several hundred CPU cycles
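
The overall control flow described above can be sketched in scalar Python. This is only an illustrative skeleton, not the paper's code: `sort_small`, `partition`, and `quicksort` are hypothetical names, the built-in `sorted` stands in for the vectorized sorting network, and a median of three random keys stands in for the more elaborate pivot sampling.

```python
import random

BASE_CASE_MAX = 256  # small-array threshold, taken from the summary


def sort_small(keys, lo, hi):
    # Stand-in for the vectorized sorting network.
    keys[lo:hi] = sorted(keys[lo:hi])


def partition(keys, lo, hi, pivot):
    # Scalar stand-in: keys <= pivot first, then the rest. Returns the boundary.
    part = keys[lo:hi]
    left = [k for k in part if k <= pivot]
    right = [k for k in part if k > pivot]
    keys[lo:hi] = left + right
    return lo + len(left)


def quicksort(keys, lo=0, hi=None):
    if hi is None:
        hi = len(keys)
    if hi - lo <= BASE_CASE_MAX:
        sort_small(keys, lo, hi)
        return
    # Median of three random samples as a simple stand-in for pivot sampling.
    pivot = sorted(random.sample(keys[lo:hi], 3))[1]
    mid = partition(keys, lo, hi, pivot)
    if mid == hi:
        # The pivot is the maximum, so nothing moved right. Peel off the keys
        # equal to the pivot to guarantee progress.
        smaller = [k for k in keys[lo:hi] if k < pivot]
        keys[lo:hi] = smaller + [pivot] * (hi - lo - len(smaller))
        quicksort(keys, lo, lo + len(smaller))
        return
    quicksort(keys, lo, mid)
    quicksort(keys, mid, hi)
```

Calling `quicksort(data)` sorts the list `data` in place. The explicit handling of a maximal pivot is one simple way to keep the "less than or equal" partition from recursing forever on inputs with many duplicates.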

Partition

Partitioning the input array is defined as moving elements which compare less than or equal to the pivot argument before the other elements
This accounts for a large majority of compute time
An AVX-512 instruction is used to partition
To maintain the invariant, inputs are loaded from the left or right side
To establish the invariant before the loop, the first and last vectors of the input are loaded to registers
Unrolling the loop is crucial for performance
An additional loop is used to partition small arrays
To prevent errors, the last vector of inputs is loaded into a register
The result of these efforts is portable code that outperforms AVX-512-specific code by a factor of 1.7
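
The in-place scheme above (hold the first and last vectors in registers, then always read from the side with less free space) can be modeled with plain lists. The sketch below is a scalar model under stated assumptions: `partition_blocked` and `lanes` are hypothetical names, a list comprehension stands in for the vector compress instruction, and the input length is assumed to be a multiple of `lanes` and at least `2 * lanes`.

```python
def partition_blocked(keys, pivot, lanes=4):
    """Partition keys in place so that keys <= pivot come first.

    Returns the index of the first key greater than the pivot.
    """
    n = len(keys)
    assert n >= 2 * lanes and n % lanes == 0

    # Establish the invariant: buffering the first and last vectors frees
    # `lanes` slots at each end of the array.
    v_left = keys[:lanes]
    v_right = keys[n - lanes:]
    read_l, read_r = lanes, n - lanes  # unread region is [read_l, read_r)
    write_l, write_r = 0, n            # written regions are [0, write_l) and [write_r, n)

    def store(vec):
        nonlocal write_l, write_r
        low = [k for k in vec if k <= pivot]   # models a compress of the "<=" lanes
        high = [k for k in vec if k > pivot]
        keys[write_l:write_l + len(low)] = low
        write_l += len(low)
        keys[write_r - len(high):write_r] = high
        write_r -= len(high)

    while read_l < read_r:
        # Read from the side with less free space, so both sides then have
        # room for a full vector and no unread key is overwritten.
        if read_l - write_l <= write_r - read_r:
            vec = keys[read_l:read_l + lanes]
            read_l += lanes
        else:
            read_r -= lanes
            vec = keys[read_r:read_r + lanes]
        store(vec)

    # Finally place the two vectors that were held back.
    store(v_left)
    store(v_right)
    return write_l
```

The free space on both sides always sums to two vectors, which is why reading from the tighter side is enough to keep every store safe.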

Pivot selection

ChoosePivot returns the pivot for Recurse and Partition
Published Quicksort implementations use medians of constant-sized samples
Adaptation needed for vectors and caches
Load nine 64-byte chunks from random 64-byte-aligned offsets
Reduce elements to single median using medians of three
64-byte chunk size corresponds to L1 cache line size
Generate random bits using SFC64
Division-free modulo algorithm used to obtain offsets
Reduce buffer to single median
Impose limit of $2\log_2(n) + 4$ recursions
Switch to Heapsort if limit exceeded
Secure random generator used to prevent malicious input
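
A rough sketch of the sampling steps above follows. Several details are assumptions rather than facts from the paper: keys are taken to be 64-bit so that a 64-byte chunk holds eight of them, Python's `random.Random` stands in for SFC64, the multiply-and-shift in `reduce_range` is one well-known division-free range reduction (it may not be the exact one the authors use), leftover samples that do not form a full triple are simply dropped, and the depth limit uses the floor of the logarithm.

```python
import random

CHUNK_KEYS = 8   # assumption: 64-byte chunk of 64-bit keys
NUM_CHUNKS = 9


def reduce_range(rand32, n):
    # Map a 32-bit random value to [0, n) with a multiply and a shift, no '%'.
    return (rand32 * n) >> 32


def median3(a, b, c):
    return max(min(a, b), min(max(a, b), c))


def choose_pivot(keys, rng):
    chunks_available = len(keys) // CHUNK_KEYS
    chunks = []
    for _ in range(NUM_CHUNKS):
        index = reduce_range(rng.getrandbits(32), chunks_available)
        start = index * CHUNK_KEYS  # chunk-aligned offset
        chunks.append(keys[start:start + CHUNK_KEYS])

    # Lane-wise median of three chunks at a time: 9 chunks become 3.
    buffer = []
    for g in range(0, NUM_CHUNKS, 3):
        buffer.extend(median3(a, b, c)
                      for a, b, c in zip(chunks[g], chunks[g + 1], chunks[g + 2]))

    # Reduce the buffer to a single value with repeated medians of three.
    while len(buffer) >= 3:
        buffer = [median3(*buffer[i:i + 3]) for i in range(0, len(buffer) - 2, 3)]
    return buffer[0]


def max_recursion_depth(n):
    # 2 * log2(n) + 4, using floor(log2(n)); beyond this, fall back to Heapsort.
    return 2 * (n.bit_length() - 1) + 4
```

For example, `choose_pivot(keys, random.Random(1))` returns one of the input keys for any list with at least eight elements, and `max_recursion_depth(1024)` evaluates to 24.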

Base case

Quicksort is commonly optimized by handling small arrays separately.
Sorting networks built on vector instructions can have lower constant factors than other algorithms.
Vector instructions require handling input arrays that don’t evenly divide the vector size.
A buffer is used to store the sorted results and must be large enough to fit nine vectors or four chunks plus two vectors.
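
One common way to handle an input whose length is not a multiple of the vector size is to copy it into a buffer padded with a value that sorts last. The sketch below illustrates that idea only; `sort_with_padding`, `lanes`, and `sort_full_vectors` are hypothetical names, `math.inf` is an assumed sentinel for ascending order, and the summary does not say that the paper pads in exactly this way.

```python
import math


def sort_with_padding(keys, lanes=4, sort_full_vectors=sorted):
    """Sort keys in place via a buffer whose length is a multiple of `lanes`."""
    n = len(keys)
    padded_len = -(-n // lanes) * lanes  # round up to a whole number of vectors
    # Padding keys sort after every real key, so they end up at the back.
    buffer = list(keys) + [math.inf] * (padded_len - n)
    buffer = sort_full_vectors(buffer)   # stand-in for the sorting network
    keys[:] = buffer[:n]                 # copy back only the real keys
```

With `lanes=4`, a five-key input is sorted through an eight-slot buffer, and the three padding slots are discarded on the way back.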

Sort order and 128-bit keys

User-specified comparators interact poorly with runtime dispatch.
We call the best available implementation through an indirect pointer.
We generalize comparisons to enable sorting in ascending or descending order.
We take advantage of Highway’s 128-bit vectors to treat pairs of 64-bit lanes as unsigned 128-bit numbers.
We reproduce the x86 implementation in Algorithm 2.
We integrate emulated 128-bit comparisons into the same abstraction.
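
The idea of comparing pairs of 64-bit lanes as one unsigned 128-bit number can be written with three lane comparisons combined by bitwise logic. The sketch below is a scalar illustration, not a reproduction of the paper's Algorithm 2; `lt128` and `ordered_pair` are hypothetical names, and each key is represented as a `(hi, lo)` tuple of unsigned 64-bit halves.

```python
MASK64 = (1 << 64) - 1


def lt128(a_hi, a_lo, b_hi, b_lo):
    """True if the 128-bit value (a_hi, a_lo) is less than (b_hi, b_lo)."""
    hi_lt = a_hi < b_hi
    hi_eq = a_hi == b_hi
    lo_lt = a_lo < b_lo
    # The low halves decide only when the high halves are equal.
    return hi_lt | (hi_eq & lo_lt)


def ordered_pair(a, b, ascending=True):
    """Return a and b in the requested sort order.

    Descending order reuses the same comparison with swapped arguments.
    """
    a_first = lt128(*a, *b) if ascending else lt128(*b, *a)
    return (a, b) if a_first else (b, a)
```

Because the sort order is expressed by swapping the arguments of a single comparison, the same partition and sorting-network code can serve both ascending and descending sorts.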

Sorting networks

Sorting arrays in the ‘base case’ (n ≤ 256) is done with sorting networks
Compare-and-exchange modules are the building blocks of sorting networks
Values in the modules are sorted using min and max operations
Elements are copied into an aligned buffer and interpreted as a matrix
Vectorization strategy for sorting networks involves sorting columns with sorting networks and merging sorted columns with vectorized Bitonic Merge networks
The showcase example uses vectors that hold four elements and sorts a total of 16 elements
Sorting values within columns of a matrix with sorting networks is vector-friendly
Vectorized compare-and-exchange operations execute the same compare-and-exchange module in all columns simultaneously
Merging sorted columns or sorted submatrices involves permuting the values of vectors
Memory bandwidth is usually the limiting factor for the performance of vectorized software
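
The 16-element showcase can be modeled in scalar Python to make the two phases concrete. This is a sketch under assumptions: the function names are hypothetical, the column phase uses the standard five-comparator network for four inputs, and the merge phase is written as a plain scalar Bitonic Merge, whereas a vectorized version would express the same exchanges with lane permutations.

```python
def compare_exchange_rows(rows, i, j):
    # Lane-wise min/max of two row vectors: the same compare-and-exchange
    # module runs in every column at once.
    low = [min(a, b) for a, b in zip(rows[i], rows[j])]
    high = [max(a, b) for a, b in zip(rows[i], rows[j])]
    rows[i], rows[j] = low, high


def sort_columns(rows):
    # Five-comparator sorting network for four inputs.
    for i, j in [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]:
        compare_exchange_rows(rows, i, j)


def bitonic_merge(a, b):
    """Merge two ascending lists of equal power-of-two length."""
    seq = a + b[::-1]  # ascending then descending gives a bitonic sequence
    n = len(seq)
    stride = n // 2
    while stride >= 1:
        for start in range(0, n, 2 * stride):
            for k in range(start, start + stride):
                if seq[k] > seq[k + stride]:
                    seq[k], seq[k + stride] = seq[k + stride], seq[k]
        stride //= 2
    return seq


def sort16(values):
    assert len(values) == 16
    rows = [list(values[r * 4:(r + 1) * 4]) for r in range(4)]  # 4x4 matrix
    sort_columns(rows)
    cols = [[rows[r][c] for r in range(4)] for c in range(4)]   # sorted columns
    left = bitonic_merge(cols[0], cols[1])
    right = bitonic_merge(cols[2], cols[3])
    return bitonic_merge(left, right)
```

`sort16` returns a new sorted list. Only the column phase maps directly onto min and max vector instructions; the merges are where the permutations mentioned above would come in.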

More bandwidth-friendly algorithms

Quicksort splits N inputs into two partitions, requiring $\log_2(N)$ recursions
Scattering inputs into K partitions changes the base of the logarithm to K
Compressing vector lanes and storing to each partition is unsuitable for K ≥ 8 and current vector lengths of 512 bits
With 64-bit keys, throughput would be limited to 6 GB/s
K = 4 was previously found to be helpful in a non-vectorized context
Vectorized compress with K = 4 reaches about half the speed of K = 2
Samplesort is a very large (K = 256) generalization
ips4o scales better than vqsort but is slower in aggregate for fewer than 19 threads
ips4o executes nearly five times as many instructions as vqsort
Switching to vqsort after initial recursions of ips4o improves scalability
Single instance of ips4o’s parallel mode using 16 threads is less bandwidth-intensive
Hybrid is 1.59 times as fast as ips4o in single instance
Hybrid is 2.89 times as fast as ips4o in single core with near-exclusive usage of L3 cache
vqsort using AVX-512 is 1.5 to 2.0 times as fast as on AVX2
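
The effect of K on the number of passes over the data can be checked with a few lines of integer arithmetic. The sketch assumes perfectly balanced splits and a base case of 256 keys; `partition_levels` is a hypothetical helper, not part of any library.

```python
def partition_levels(n, k, base_case=256):
    """Count K-way partitioning passes until subarrays fit the base case."""
    levels = 0
    size = n
    while size > base_case:
        size = -(-size // k)  # ceiling division: size of one balanced partition
        levels += 1
    return levels


# For about one million keys: K = 2 needs 12 passes, K = 4 needs 6, K = 256 needs 2.
passes = [partition_levels(2**20, k) for k in (2, 4, 256)]
```

Each pass reads and writes every key once, so fewer passes mean less memory traffic, provided that a single K-way pass does not become proportionally slower.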

Performance portability

Performance portability means running on different platforms and being efficient.
We tested the same source code and benchmark on an Apple M1 Max system.
Results are not directly comparable due to different clock rates.
With the M1’s 128-bit vectors and older NEON instruction set, there was a 3-8x speedup over the standard library.
vqsort is practical and useful on multiple architectures and instruction sets.

Limitations

Sort keys must be 16/32/64-bit integers, floating-point numbers, or pairs of 64-bit numbers
VBMI instruction set needed to re-order 8-bit elements across a register
Excludes applications that need to sort tuples or large items with custom comparators
Surveyed uses of sorting in Google’s production workloads and found sorting numbers more costly than sorting strings or user-defined types

Conclusions

Used Highway cross-platform abstraction layer to implement vqsort
New recursive sorting network for up to 256 elements
Vector-friendly pivot sampling
vqsort is fastest sort for individual keys on AVX2 and AVX-512
3-8x speedup versus standard library on Apple M1 hardware
Integrating vqsort into ips4o yields 1.59x speedup
Supports 32/64-bit floating-point and 16/32/64/128-bit integer keys
Supports seven instruction sets with close-to native performance
