Finding peaks with SIMD

Categories: code, julia
Author: Allen Hill
Published: March 2, 2026

A couple of years ago, I added a benchmarks page to the documentation for my Peaks.jl package (a Julia package for finding peaks in vector data). When I compared the performance of my Peaks.jl functions against other obvious peak finding implementations, I was shocked to discover that some of the other functions were faster![1] 🙀 This completely unacceptable situation was just the nerd-snipe I needed to develop the new fastest™ peak finding function.[2] Considering the relative simplicity of the existing peak finding algorithm, any meaningfully faster solution would have to involve explicit SIMD vectorization.[3]

SIMD stands for single-instruction, multiple-data. CPUs run software by executing one instruction at a time.[4] Normal CPU instructions operate on scalar values (e.g. add one number to another), but many CPUs have SIMD extensions, which add specialized instructions that perform common operations (addition, multiplication, comparison, etc.) on vector values (e.g. add 4 pairs of numbers). This means the CPU does more work at once, which can be very handy for speeding up software! Unfortunately, SIMD forces a different, less flexible style of programming: branching (think if statements) is harder or impossible, and SIMD instructions operate on a fixed-width vector (e.g. 4 pairs of numbers are always added, and if fewer than 4 pairs are available, special handling or a non-SIMD fallback is needed).

Conveniently, I’ve wanted to learn SIMD for a while, and I just needed a personally relevant, real problem to experiment on. Some features can’t be supported in a SIMD algorithm (e.g. no Missing support),[5] but it was non-negotiable that both the SIMD and non-SIMD functions maintain the same basic definition of a peak:

A peak MUST be preceded by lesser element(s), MAY be followed by equal element(s), and MUST be ultimately followed by lesser element(s).

A peak where the “largest” element(s) occurs more than once is called a “plateau”. The location of a plateau is defined as the first largest element; this definition allows appending data without altering the location of an existing peak.

I’m using Julia in this post (since I’m describing the algorithm/code from my Peaks.jl package), but the concepts should transfer to any language that allows explicit use of SIMD. We’ll use this signal to work out examples:

[Figure: an example signal containing a normal peak, a “false” plateau, and a true plateau]

Visually, we can recognize a normal peak at index 2. Next, there is a “false” plateau at index 4, where the signal rises, plateaus, but ultimately rises again (i.e. doesn’t meet the definition for a plateau peak). Finally, there is a true plateau at index 6.

Based on the definition of a peak, we need 3 comparisons for each element. The first part of the function/SIMD loop needs to set up the offsets, load the data, and perform the SIMD comparisons, like this:

offset₋₁ = VecRange{8}(-1)    # (1)
offset   = VecRange{8}( 0)
offset₊₁ = VecRange{8}(+1)

for i in firstindex(x)+1:8:lastindex(x)-1
  xᵢ₋₁ = x[i + offset₋₁]      # (2)
  xᵢ   = x[i + offset]
  xᵢ₊₁ = x[i + offset₊₁]

  rise = bitmask(xᵢ₋₁  < xᵢ)  # (3)
  fall = bitmask(xᵢ₊₁  < xᵢ)
  flat = bitmask(xᵢ₊₁ == xᵢ)

  # peak identification happens here
end

(1) Define SIMD ranges to use as offsets when loading signal elements into SIMD registers.[6] These are SIMD-specific equivalents of normal ranges (e.g. VecRange{8}(-1) is functionally equivalent to -1:6).
(2) Load 8 signal elements from a range of indices into special SIMD vector registers (e.g. during the first loop iteration, xᵢ₋₁ will load elements 1:8 from x, i.e. xᵢ₋₁ == x[2 .+ (-1:6)]).[7]
(3) The comparisons produce SIMD vectors holding 8 Bools, each holding the result (true or false) of a single comparison. The bitmask function converts each Bool in the vector into a single bit (1 or 0) in a byte (or a larger/wider bitset, i.e. a set of bits). Since many CPUs have native hardware support for 64- or 32-bit values, we can later process many more elements/peaks at a time using bitwise and other basic math operators.

Bitwise operators are functions (compiled to single CPU instructions) that are focused on manipulating the bits in a value/bitset; most bitwise operators apply logical operations to every bit in a bitset at once. There are bitwise operators for all common logic operations: AND (operator symbol is &), OR (|), NOT (~), etc. Some other operators are the left (<<) and right (>>) shifts, which insert a number of 0 bits at the least-significant-bit (LSB) end or the most-significant-bit (MSB) end, respectively, and drop the bits that fall off the other end; e.g. 0b00010101 >> 1 shifts 0b00010101 right by one bit, resulting in 0b00001010.
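For a concrete feel, here are the basic operators at work on two 8-bit values (ordinary Julia operators on UInt8 literals, nothing Peaks.jl-specific):

```julia
a = 0b00010101
b = 0b10000001

a & b    # 0b00000001  AND: set only where both bits are set
a | b    # 0b10010101  OR: set where either bit is set
~a       # 0b11101010  NOT: every bit flipped
a << 1   # 0b00101010  left shift: a 0 enters at the LSB end
a >> 1   # 0b00001010  right shift: a 0 enters at the MSB end
```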

Bitwise operations normally affect every bit the same way, but sometimes this is undesirable. A common technique is the use of “bitmasks” to control which bit(s) in a set should or shouldn’t be changed by the code. The only difference between a bitmask and a bitset is purpose/intent.

Using the above code on our example signal produces the following bitsets for the three comparisons:

# results shown as bitstrings
julia> rise = bitmask(xᵢ₋₁ < xᵢ)
0b00010101

julia> fall = bitmask(xᵢ₊₁ < xᵢ)
0b10000001

julia> flat = bitmask(xᵢ₊₁ == xᵢ)
0b01110100

Bit ordering in bitsets follows index ordering, such that the least-significant bit (LSB) in a bitset corresponds to the lowest index in that slice of data. Bitstrings are printed most-significant bit (MSB) to LSB, from left-to-right, so the LSB is the rightmost digit.
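In Julia terms (bitstring is a Base function), using the flat bitset from above:

```julia
flat = 0b01110100
bitstring(flat)          # "01110100": printed MSB to LSB, so the LSB is rightmost
flat & 0b00000100 != 0   # true: bit 2 (counting from the LSB, starting at 0) is
                         # set, which corresponds to the flat at signal index 4
```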

We know from looking at the signal plot that there are peaks at indices 2 and 6, which means that by the end of our SIMD loop, we should have a bitset with set (1/true) bits at 1 and 5: 0b00010001.[8] Study the three bitsets, and you might notice that different approaches are needed to identify peaks and plateaus.

For normal peaks, every given bit position in rise and fall corresponds to a matched scalar comparison, xᵢ₋₁ < xᵢ and xᵢ₊₁ < xᵢ, for a given i. Normal peaks must have the same set bits in rise AND fall to indicate a peak (i.e. the previous and next elements are less than the current element). Using a bitwise & will produce set bits only if both bits in the pair are set (1/true). So, the simple equation rise & fall will keep only those matched bits, and all other bits will be cleared; for our signal, bit 1 is the only set bit in both rise and fall, resulting in 0b00000001. One peak identified, one peak to go.
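As a quick sanity check in plain scalar Julia, using the example bitsets and mapping the surviving bit back to a signal index (trailing_zeros is a Base function):

```julia
rise = 0b00010101
fall = 0b10000001

peaks = rise & fall    # 0b00000001: only matched rise/fall bits survive
trailing_zeros(peaks)  # 0: the lowest set bit is bit position 0, which maps
                       # to signal index 2 (the loop's first chunk starts at i = 2)
```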

Unfortunately, identifying plateaus is trickier. After the normal peaks are identified with rise & fall, fall can be right shifted to align the signal dips with the ends of plateaus. But the set bits in rise and fall for the beginning and end of plateaus will still not be aligned (because the flat part of a plateau might run for multiple elements), so a simple combination like rise & flat & fall won’t work: the same bits might be set in rise and flat, but the set bit in fall could be later.

Every plateau will have a bit run (one or more consecutively set bits, if the plateau has multiple equal values) where the run’s LSB is set in rise and the run’s MSB is set in fall. Multiple plateaus might be present in a flat bitset, and the LSB of each bit run is that peak’s location (bit 5 in our example flat).

We can filter “false” plateaus from flat by first clearing any bit runs that don’t have a set bit in rise matching the start (LSB) bit of the run. Next, clear any bit runs that don’t have a set bit in fall matching the end (MSB) bit of the run. Finally, clear all the remaining bits in flat except for each run’s LSB. We’ll need a few bitwise focused functions to filter the flat bitset as described:

  1. Functions to clear bit runs that don’t begin or end with an LSB/MSB mask
    • lsb_mask(0b01110100, 0b00010101) == 0b01110100
    • msb_mask(0b01110100, 0b01000000) == 0b01110000
  2. A function to clear all bits except for the (potentially multiple) local LSB of bit runs (e.g. run_lsb(0b01110) == 0b00010)

Masking bit runs with the run LSB

This function can clear any bit runs in flat that don’t begin with a set bit in rise: bits & ~(bits + mask). The premise is that adding the LSB of any bit run will (by carrying) invert all previously set bits in the run. So, by adding the LSB of only the bit runs we want and then inverting the bitset, we get a bitmask that can be used with an AND op to clear all unwanted bit runs.
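A minimal scalar sketch (the name follows the algorithm listing below; the real Peaks.jl helper may differ in details):

```julia
# Clear any bit runs in `bits` whose lowest (LSB) bit is not set in `mask`.
# Adding a run's LSB carries through and flips every bit of that run, so
# after inverting the sum, an AND keeps exactly the runs we asked for.
runs_lsb_mask(bits, mask) = bits & ~(bits + mask)

runs_lsb_mask(0b01110100, 0b00010101)  # 0b01110100: both runs start on a mask bit
runs_lsb_mask(0b01110100, 0b00010000)  # 0b01110000: the single-bit run is cleared
```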

(In the original interactive post, you can click on the bits in the bits and mask bitstrings, or change the hex for the byte, and watch how the function clears bit runs that don’t have their LSB set in the mask.)

Masking bit runs with the run MSB

I originally didn’t have a complete understanding of the problem space, and missed the importance of carrying (particularly carry direction) for generalizing the LSB mask function to MSB masks. I made do with bitreverse operations,[9] but the downside was 3 relatively[10] expensive bitreverse calls. Off and on for several years, I thought about how to solve this better, until I recently connected the dots that carrying, and carry direction, were the core of the problem. With the proper language to describe what I needed, some research led me to adder algorithms. With a little prompting, Claude produced a reverse-carry adder based on the Kogge-Stone algorithm, but using right shifts.[11]

That accomplished, we can substitute the reverse-carry adder into the same equation used for the LSB mask: bits & ~(bits ⊞ mask), where I’m using ⊞ as the symbol for the reverse-carry adder. The new reverse-carry function is ~2x faster than the original bitreverse-based function![12] This new function is a direct counterpart of the last one: it can clear any bit runs in flat that don’t end with a set bit in fall.
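Here’s one way such a reverse-carry adder can look. This is my reconstruction from the description above, assuming a Kogge-Stone carry-lookahead structure with the shifts flipped; the name reverse_add is hypothetical, and the actual Claude-written code may differ:

```julia
# Reverse-carry addition: a Kogge-Stone carry-lookahead adder, but with
# every left shift replaced by a right shift, so carries propagate toward
# the least-significant bit instead of the most-significant bit.
function reverse_add(a::T, b::T) where {T<:Unsigned}
    g = a & b    # generate: positions that create a carry
    p = a ⊻ b    # propagate: positions that pass a carry along
    shift = 1
    while shift < 8 * sizeof(T)
        g |= p & (g >> shift)  # carries travel rightward
        p &= p >> shift
        shift <<= 1
    end
    return (a ⊻ b) ⊻ (g >> 1)
end

# The MSB twin of the LSB mask: clear bit runs whose highest (MSB) bit
# is not set in `mask`.
runs_msb_mask(bits, mask) = bits & ~reverse_add(bits, mask)

runs_msb_mask(0b01110100, 0b01000000)  # 0b01110000: only the run ending at bit 6 survives
```

As a sanity check, reverse_add(a, b) should agree with bitreverse(bitreverse(a) + bitreverse(b)) for unsigned values, without the three bitreverse calls.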

Calculating the LSB of bit runs

This function can be achieved with x & ~(x << 1). It is based on the fact that the LSB of every bit run is preceded by an unset bit (or will be after a left shift). Zeros become ones when inverted, so after shifting and inverting, the only set bits in both the original and manipulated bitset are the LSB of each bit run. Calling this function on a flat bitmask (that we’ve already filtered to only contain real plateaus) gives us the plateau locations.
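As a sketch, with the examples from this post:

```julia
# Keep only the lowest set bit (LSB) of each run of consecutive set bits.
run_lsb(x) = x & ~(x << 1)

run_lsb(0b01110)     # 0b00010: only the run's LSB survives
run_lsb(0b01110000)  # 0b00010000: e.g. the filtered flat bitset from our example
```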

Full algorithm

With the new bit-twiddling functions, the complete (ish) SIMD algorithm looks something like:

I’m glossing over the fussy auxiliary code to handle signal lengths that don’t work with the 8 element chunking, and plateaus that span multiple bytes/words.
offset₋₁ = VecRange{8}(-1)
offset   = VecRange{8}( 0)
offset₊₁ = VecRange{8}(+1)

for i in firstindex(x)+1:8:lastindex(x)-1
  xᵢ₋₁ = x[i + offset₋₁]
  xᵢ   = x[i + offset]
  xᵢ₊₁ = x[i + offset₊₁]

  rise = bitmask(xᵢ₋₁  < xᵢ)
  fall = bitmask(xᵢ₊₁  < xᵢ)
  flat = bitmask(xᵢ₊₁ == xᵢ)

  peaks = rise & fall               # (1)

  fall >>= 1                        # (2)
  flat = runs_lsb_mask(flat, rise)  # (3)
  flat = runs_msb_mask(flat, fall)

  peaks |= run_lsb(flat)            # (4)
end

(1) Capture normal peaks.
(2) Right shift the fall bitmask to align signal dips with the ends of plateaus.
(3) Clear bit runs that don’t begin or end with set bits in rise and fall.
(4) Add plateau locations to peaks.
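To see the whole pipeline work end-to-end, here is the body of the loop applied to the example signal’s bitsets in plain scalar Julia. The helper definitions are my own sketches of the functions described in this post (in particular, reverse_add is an assumed name for the reverse-carry adder):

```julia
runs_lsb_mask(bits, mask) = bits & ~(bits + mask)  # keep runs starting on a set mask bit

function reverse_add(a::T, b::T) where {T<:Unsigned}  # right-shifting Kogge-Stone adder
    g, p, shift = a & b, a ⊻ b, 1
    while shift < 8 * sizeof(T)
        g |= p & (g >> shift)
        p &= p >> shift
        shift <<= 1
    end
    return (a ⊻ b) ⊻ (g >> 1)
end
runs_msb_mask(bits, mask) = bits & ~reverse_add(bits, mask)  # keep runs ending on a set mask bit
run_lsb(x) = x & ~(x << 1)  # keep only each run's lowest bit

# Comparison bitsets for the example signal (from earlier)
rise = 0b00010101
fall = 0b10000001
flat = 0b01110100

peaks = rise & fall               # 0b00000001: the normal peak at index 2
fall >>= 1                        # 0b01000000: dips now align with plateau ends
flat = runs_lsb_mask(flat, rise)  # 0b01110100: both plateau runs begin with a rise
flat = runs_msb_mask(flat, fall)  # 0b01110000: the "false" plateau is cleared
peaks |= run_lsb(flat)            # 0b00010001: peaks at signal indices 2 and 6
```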

Ironically, the explicit SIMD was the most straightforward part of the code! Instead, the bitwise manipulation after the SIMD comparisons was the trickiest part. In the full SIMD algorithm, I use a pre-allocated BitVector to store the peaks bitsets. Using a BitVector has the double benefit of taking the peaks bitsets directly[13] (without needing to convert their representation) and avoiding allocations in the core loop.

Benchmarks

How much faster is the SIMD simplemaxima compared to the scalar argmaxima? Using random (normal) data, here’s the average time per peak for each function for different signal lengths. Check out the benchmarks page for more benchmark results.

[Figure: benchmark plot showing simplemaxima is ~7x faster than argmaxima across vector lengths]
🤖 AI disclosure

The algorithms and writing in this post were produced without AI. An LLM (Claude) helped write the code for the reverse-carry adder, the interactive bitwise math widgets, and the benchmark plot (which is adapted from the plot on the benchmarks page).

Footnotes

  1. These comparisons aren’t completely apples-to-apples, and technically disfavor Peaks.jl functions, which have a larger feature set. In particular, none of the compared functions detect plateau peaks.

  2. Even though I have never needed faster peak finding, or used peaks in a real-time/performance sensitive context 🙄

  3. Many compilers, including Julia’s, can auto-vectorize simple/recognizable code patterns. More complex code must be intentionally and manually refactored to explicitly use SIMD.

  4. I realize that’s not strictly true with superscalar processors; don’t @ me.

  5. I decided the strict keyword argument and larger window sizes were too difficult and/or not possible to support.

  6. Registers are where CPUs store data for instructions to immediately use.

  7. Not all hardware will support a so-called vector “width” of e.g. 8 double-precision floats (which would be 512 bits wide). SIMD.jl and LLVM are very helpfully papering over differences between the code and what is actually supported by the hardware.

  8. Given the definition of a peak, the first possible peak location is at index 2, so the first bit corresponds to that index.

  9. bitreverse reverses the order of a bitset, e.g. 0b01110100 becomes 0b00101110. By reversing the bits and mask, we can use the runs_lsb_mask function to achieve the desired behavior:

    bitreverse(
      runs_lsb_mask(
        bitreverse(bits),
        bitreverse(mask)
      )
    )

  10. The bitreversing took all of 7% of the SIMD function runtime!!

  11. I also realized that the lack of this operation (a reverse-carry adder) was the only major function/algorithmic gap blocking support for larger window widths.

  12. Otherwise there’s not much point. The perfect™ abstraction doesn’t matter if it’s slow or inconvenient.

  13. Unfortunately this does require accessing/using BitVector internals.