Link to paper

The full paper is available here.

You can also find the paper on PapersWithCode here.

Abstract

  • Recent works have shown that Quicksort implementations using vector CPU instructions can outperform non-vectorized algorithms.
  • The proposed ‘vqsort’ algorithm integrates into the state-of-the-art parallel sorter ‘ips4o’, with a geometric mean speedup of 1.59.
  • It works on seven instruction sets across four platforms, and supports floating-point and 16-128 bit integer keys.
  • It is the fastest sort for non-tuple keys on CPUs, up to 20 times as fast as the sorting algorithms implemented in standard libraries.

Paper Content

Introduction

  • Fundamental properties of CPUs require software to be designed to utilize SIMD and/or vector extensions
  • Replacing Quicksort from a standard library by a vectorized Mergesort implementation can reduce energy usage by a factor of six
  • Developing SIMD software involves specialized domain expertise
  • There are five major instruction sets across three architectures
  • Autovectorization is an appealing option, but it is infeasible for code that must re-order vector lanes
  • Highway library is used as an abstraction layer over platform-specific intrinsics
  • Mergesort is commonly used but typically requires O(N) extra storage
  • Radixsort scatters keys to separate arrays but has not been implemented for SIMD/vectors
  • Vectorized Quicksort is cache-friendly and requires less memory bandwidth
  • vqsort is the fastest sorting implementation known for commercially available shared-memory machines
  • vqsort supports seven instruction sets with close-to-native performance
  • vqsort is open-sourced, tested, bounds-checked, documented and works with three major compilers

Vectorized quicksort

  • Quicksort is a simple algorithm that recursively sorts arrays
  • Performance of Quicksort depends on the choice of pivot element
  • Portable partitioning is faster than AVX-512 specific code
  • Vectorized, cache-aware, robust pivot sampling
  • For small array sizes, alternative sorting algorithms/strategies are used
  • Vectorized sorting network sorts 256 keys in several hundred CPU cycles
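
The control flow outlined above can be sketched in scalar Python. This only illustrates the structure (recursion, partition, separate base case) and is not the vqsort implementation: `sorted` stands in for the vectorized sorting network, a median of three keys stands in for the sampling scheme, the result is a copy rather than an in-place sort, and the names `BASE_CASE_MAX` and `quicksort_sketch` are hypothetical.

```python
BASE_CASE_MAX = 256  # base-case size mentioned in the text, assumed here as the cutoff


def quicksort_sketch(keys):
    """Return a sorted copy of keys (scalar illustration only)."""
    n = len(keys)
    if n <= BASE_CASE_MAX:
        # Stand-in for the vectorized sorting network.
        return sorted(keys)

    # Simplified pivot: median of the first, middle and last key.
    pivot = sorted((keys[0], keys[n // 2], keys[-1]))[1]

    # Partition: keys <= pivot come before the others.
    left = [k for k in keys if k <= pivot]
    right = [k for k in keys if k > pivot]

    if not right:
        # The pivot is the maximum; peel off its duplicates so the input shrinks.
        smaller = [k for k in left if k < pivot]
        equal = [k for k in left if k == pivot]
        return quicksort_sketch(smaller) + equal

    return quicksort_sketch(left) + quicksort_sketch(right)
```

The extra branch for an empty right side is a choice made for this sketch so that recursion always makes progress; it says nothing about how vqsort treats that case.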

Partition

  • Partitioning the input array is defined as moving elements which compare less than or equal to the pivot argument before the other elements
  • This accounts for a large majority of compute time
  • An AVX-512 instruction is used to partition
  • To maintain the invariant, inputs are loaded from the left or right side
  • To establish the invariant before the loop, the first and last vectors of the input are loaded to registers
  • Unrolling the loop is crucial for performance
  • An additional loop is used to partition small arrays
  • To prevent errors, the last vector of inputs is loaded into a register
  • The result of these efforts is portable code that outperforms AVX-512-specific code by a factor of 1.7
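
The in-place scheme described above (preload the first and last vectors, then load from the left or right side so that stores never overwrite unread keys) can be sketched with Python lists standing in for vectors. This is an assumption-laden illustration: `LANES`, `partition_sketch` and `store` are hypothetical names, the rule "load from the side with fewer free slots" follows the common in-place vectorized partition scheme rather than anything stated in this summary, and the input length is required to be a multiple of the vector length, which is what the additional loop for small arrays and remainders would otherwise handle.

```python
LANES = 4  # hypothetical vector length, in keys


def partition_sketch(keys, pivot):
    """Partition keys in place; return the number of keys <= pivot.

    Assumes len(keys) is a multiple of LANES and at least 2 * LANES.
    """
    n = len(keys)
    assert n >= 2 * LANES and n % LANES == 0

    # Preload the first and last vectors so each end has LANES free slots.
    first = keys[:LANES]
    last = keys[n - LANES:]
    read_l, read_r = LANES, n - LANES  # unread region is [read_l, read_r)
    write_l, write_r = 0, n            # free slots: [write_l, read_l) and [read_r, write_r)

    def store(vec):
        nonlocal write_l, write_r
        # Emulates a compress-store: keys <= pivot go left, the others go right.
        lo = [k for k in vec if k <= pivot]
        hi = [k for k in vec if k > pivot]
        keys[write_l:write_l + len(lo)] = lo
        write_l += len(lo)
        write_r -= len(hi)
        keys[write_r:write_r + len(hi)] = hi

    while read_l < read_r:
        # Load from the side with fewer free slots; afterwards both sides
        # have at least LANES free slots, so the store cannot clobber unread keys.
        if read_l - write_l <= write_r - read_r:
            vec = keys[read_l:read_l + LANES]
            read_l += LANES
        else:
            read_r -= LANES
            vec = keys[read_r:read_r + LANES]
        store(vec)

    # The remaining gap holds exactly 2 * LANES slots for the preloaded vectors.
    store(first)
    store(last)
    return write_l
```

The total number of free slots stays at 2 * LANES throughout the loop, which is why loading from the emptier side is enough to keep both stores safe.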

Pivot selection

  • ChoosePivot returns the pivot for Recurse and Partition
  • Published Quicksort implementations use medians of constant-sized samples
  • Adaptation needed for vectors and caches
  • Load nine 64-byte chunks from random 64-byte-aligned offsets
  • Reduce elements to single median using medians of three
  • 64-byte chunk size corresponds to L1 cache line size
  • Generate random bits using SFC64
  • Division-free modulo algorithm used to obtain offsets
  • Reduce buffer to single median
  • Impose limit of 2·log₂(n) + 4 recursions
  • Switch to Heapsort if limit exceeded
  • Secure random generator used to guard against malicious inputs
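
A scalar sketch of the sampling steps listed above, under several assumptions that are not taken from the paper: keys are 64-bit (so a 64-byte chunk holds eight keys), Python's `random.Random` stands in for SFC64, the division-free reduction is the well-known multiply-shift mapping, and the final reduction of the last chunk simply takes its middle element. All function and constant names are hypothetical.

```python
import math
import random

CHUNK_BYTES = 64
KEY_BYTES = 8  # assumption: 64-bit keys
KEYS_PER_CHUNK = CHUNK_BYTES // KEY_BYTES
NUM_CHUNKS = 9


def median3(a, b, c):
    # Median of three using only min/max, which maps well to vector instructions.
    return max(min(a, b), min(max(a, b), c))


def fast_range(rand32, n):
    # Division-free mapping of a 32-bit random value to the range [0, n).
    return (rand32 * n) >> 32


def choose_pivot_sketch(keys, rng):
    """Sample nine chunks at chunk-aligned offsets and reduce them to one pivot."""
    chunks_in_input = len(keys) // KEYS_PER_CHUNK
    assert chunks_in_input >= 1
    chunks = []
    for _ in range(NUM_CHUNKS):
        start = fast_range(rng.getrandbits(32), chunks_in_input) * KEYS_PER_CHUNK
        chunks.append(keys[start:start + KEYS_PER_CHUNK])

    # Lane-wise medians of three: 9 chunks -> 3 chunks -> 1 chunk.
    while len(chunks) > 1:
        chunks = [
            [median3(a, b, c) for a, b, c in zip(*chunks[i:i + 3])]
            for i in range(0, len(chunks), 3)
        ]

    # Simplification: take the middle element of the remaining chunk.
    return sorted(chunks[0])[KEYS_PER_CHUNK // 2]


def max_recursions(n):
    # Recursion budget from the text: 2 * log2(n) + 4 (log rounded down here).
    return 2 * int(math.log2(n)) + 4
```

A caller would pass a depth counter down the recursion and switch to Heapsort once it exceeds `max_recursions(n)`, as the list above describes.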

Base case

  • Quicksort is commonly optimized by handling small arrays separately.
  • Sorting networks built on vector instructions can have lower constant factors than other algorithms.
  • Vector instructions require handling input arrays that don’t evenly divide the vector size.
  • A buffer is used to store the sorted results and must be large enough to fit nine vectors or four chunks plus two vectors.
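
One way to handle inputs that do not evenly divide the vector size is to copy them into a buffer padded up to a whole number of vectors. The sketch below assumes padding with a value that sorts last (`math.inf` for ascending order); this padding choice, the hypothetical `LANES` constant and the use of `list.sort` in place of the sorting network are illustrative assumptions, not details taken from the paper.

```python
import math

LANES = 4  # hypothetical vector length, in keys


def base_case_sketch(keys):
    """Sort a small list in place via a buffer padded to whole vectors."""
    n = len(keys)
    padded_len = -(-n // LANES) * LANES  # round n up to a multiple of LANES
    # Padding sorts to the end, so the first n buffer entries are the real keys.
    buf = list(keys) + [math.inf] * (padded_len - n)
    buf.sort()  # stand-in for the vectorized sorting network
    keys[:] = buf[:n]
```

Because the padding compares greater than or equal to every key, truncating the buffer to the first `n` entries recovers exactly the sorted input.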

Sort order and 128-bit keys

  • User-specified comparators interact poorly with runtime dispatch.
  • We call the best available implementation through an indirect pointer.
  • We generalize comparisons to enable sorting in ascending or descending order.
  • We take advantage of Highway’s 128-bit vectors to treat pairs of 64-bit lanes as unsigned 128-bit numbers.
  • We reproduce the x86 implementation in Algorithm 2.
  • We integrate emulated 128-bit comparisons into the same abstraction.
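
Algorithm 2 is not reproduced in this summary. As a minimal scalar sketch of the logic that such a comparison has to implement, a 128-bit unsigned key can be treated as a (high, low) pair of 64-bit halves; the function names below are hypothetical, and the `descending` flag illustrates one simple way to generalize the order by swapping operands.

```python
MASK64 = (1 << 64) - 1


def split128(x):
    """Split an unsigned 128-bit integer into (high, low) 64-bit halves."""
    return (x >> 64) & MASK64, x & MASK64


def lt128(a_hi, a_lo, b_hi, b_lo):
    """True if (a_hi:a_lo) < (b_hi:b_lo) as unsigned 128-bit numbers."""
    # The high halves decide unless they are equal; then the low halves decide.
    return a_hi < b_hi or (a_hi == b_hi and a_lo < b_lo)


def comes_first(a, b, descending=False):
    """a and b are (high, low) pairs; swap the operands for descending order."""
    if descending:
        a, b = b, a
    return lt128(a[0], a[1], b[0], b[1])
```

In a vectorized setting the equality and less-than results for the two halves would be computed per lane and combined with mask operations; the scalar form above only shows the comparison rule.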

Sorting networks

  • Sorting arrays in the ‘base case’ (n ≤ 256) is done with sorting networks
  • Compare-and-exchange modules are the building blocks of sorting networks
  • Values in the modules are sorted using min and max operations
  • Elements are copied into an aligned buffer and interpreted as a matrix
  • Vectorization strategy for sorting networks involves sorting columns with sorting networks and merging sorted columns with vectorized Bitonic Merge networks
  • The illustrative example uses vectors that hold four elements each and sorts a total of 16 elements
  • Sorting values within columns of a matrix with sorting networks is vector-friendly
  • Vectorized compare-and-exchange operations execute the same compare-and-exchange module in all columns simultaneously
  • Merging sorted columns or sorted submatrices involves permuting the values of vectors
  • Memory bandwidth is usually the limiting factor for the performance of vectorized software
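
The column-sorting step for the 16-element example can be sketched with lists standing in for four-lane vectors. Each row of the matrix is one vector, and a lane-wise min/max pair applies the same compare-and-exchange module to all four columns at once. The five-comparator network for four inputs is a standard one; the function names and sample values are hypothetical, and the subsequent Bitonic Merge of the sorted columns is not shown.

```python
def compare_exchange(u, v):
    """Lane-wise compare-and-exchange of two vectors of equal length."""
    lo = [min(a, b) for a, b in zip(u, v)]
    hi = [max(a, b) for a, b in zip(u, v)]
    return lo, hi


# Standard five-comparator sorting network for four inputs.
NETWORK_4 = [(0, 1), (2, 3), (0, 2), (1, 3), (1, 2)]


def sort_columns(rows):
    """Sort every column of a 4-row matrix; each row is one vector."""
    rows = [list(r) for r in rows]
    for i, j in NETWORK_4:
        # One vector min and one vector max handle all columns simultaneously.
        rows[i], rows[j] = compare_exchange(rows[i], rows[j])
    return rows


matrix = [
    [12, 3, 9, 16],
    [5, 14, 1, 8],
    [10, 7, 15, 2],
    [4, 11, 6, 13],
]
sorted_cols = sort_columns(matrix)
# sorted_cols == [[4, 3, 1, 2], [5, 7, 6, 8], [10, 11, 9, 13], [12, 14, 15, 16]]
```

After this step each column is sorted from top to bottom, but the rows are not, which is why a merge phase that permutes values within vectors is still needed.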

More bandwidth-friendly algorithms

  • Quicksort splits N inputs into two partitions, requiring log₂(N) recursions
  • Scattering inputs into K partitions changes the base of the logarithm to K
  • Compressing vector lanes and storing to each partition is unsuitable for K ≥ 8 at the current vector length of 512 bits
  • With 64-bit keys, throughput would be limited to 6 GB/s
  • K = 4 was previously found to be helpful in a non-vectorized context
  • Vectorized compress with K = 4 reaches about half the speed of K = 2
  • Samplesort is a very large (K = 256) generalization
  • ips4o scales better than vqsort but is slower in aggregate for fewer than 19 threads
  • ips4o executes nearly five times as many instructions as vqsort
  • Switching to vqsort after initial recursions of ips4o improves scalability
  • Single instance of ips4o’s parallel mode using 16 threads is less bandwidth-intensive
  • Hybrid is 1.59 times as fast as ips4o in single instance
  • Hybrid is 2.89 times as fast as ips4o in single core with near-exclusive usage of L3 cache
  • vqsort using AVX-512 is 1.5 to 2.0 times as fast as on AVX2
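
The effect of the partition count K on the number of recursion levels, and hence on the number of passes over memory, is simple arithmetic. The helper below is a hypothetical illustration that ignores the base case and assumes perfectly balanced partitions.

```python
import math


def recursion_levels(n, k):
    """Idealized number of levels when each pass splits into k equal partitions."""
    return math.log2(n) / math.log2(k)


n = 2 ** 20
levels_k2 = recursion_levels(n, 2)      # 20.0 levels for two-way Quicksort
levels_k4 = recursion_levels(n, 4)      # 10.0 levels for K = 4
levels_k256 = recursion_levels(n, 256)  # 2.5 levels for a Samplesort-like K = 256
```

Halving the number of levels only pays off if each K-way pass costs less than twice a two-way pass, which is the trade-off behind the K = 4 measurement above.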

Performance portability

  • Performance portability means running on different platforms and being efficient.
  • We tested the same source code and benchmark on an Apple M1 Max system.
  • Results are not directly comparable due to different clock rates.
  • With the M1’s 128-bit vectors and older NEON instruction set, there was a 3-8x speedup over the standard library.
  • vqsort is practical and useful on multiple architectures and instruction sets.

Limitations

  • Sort keys must be 16/32/64-bit integers, floating-point numbers, or pairs of 64-bit numbers
  • VBMI instruction set needed to re-order 8-bit elements across a register
  • Excludes applications that need to sort tuples or large items with custom comparators
  • Surveyed uses of sorting in Google’s production workloads and found sorting numbers more costly than sorting strings or user-defined types

Conclusions

  • Used Highway cross-platform abstraction layer to implement vqsort
  • New recursive sorting network for up to 256 elements
  • Vector-friendly pivot sampling
  • vqsort is fastest sort for individual keys on AVX2 and AVX-512
  • 3-8x speedup versus standard library on Apple M1 hardware
  • Integrating vqsort into ips4o yields 1.59x speedup
  • Supports 32/64-bit floating-point and 16/32/64/128-bit integer keys
  • Supports seven instruction sets with close-to-native performance