Semi-random thoughts and tales of tinkering
Section 5 gave you the theory. Now we write the code. By the end of this section, you'll have a SpectrumAnalyzer struct that takes a buffer of raw audio samples and returns an array of bar heights ready for display. Every line maps directly to a concept from the previous section — windowing, FFT, magnitude computation, log-frequency grouping. If anything feels unfamiliar, flip back to Section 5.
We'll also add peak frequency detection and a MIDI-based note name converter, then wire the whole thing into AudioEngine. At the end, the audio engine will publish spectrum data alongside the VU level you've already built.
Create a new Swift file in Xcode: File → New → File → Swift File, name it SpectrumAnalyzer.swift. Here's the struct and its initializer:
import Accelerate
struct SpectrumAnalyzer {
let binCount: Int
let sampleRate: Double
private let fftSize: Int
private let halfSize: Int
private var fftSetup: vDSP.FFT<DSPSplitComplex>?
private var window: [Float]
init(binCount: Int = 48, sampleRate: Double = 44100, fftSize: Int = 4096) {
self.binCount = binCount
self.sampleRate = sampleRate
self.fftSize = fftSize
self.halfSize = fftSize / 2
self.window = vDSP.window(ofType: Float.self,
usingSequence: .hanningDenormalized,
count: fftSize,
isHalfWindow: false)
let log2n = vDSP_Length(log2(Float(fftSize)))
self.fftSetup = vDSP.FFT(log2n: log2n,
radix: .radix2,
ofType: DSPSplitComplex.self)
}
}
Let's walk through each piece.
binCount is the number of bars in our display — 48 by default. That's enough to show meaningful frequency detail without making each bar too thin on a phone screen. sampleRate must match the audio hardware's actual rate — the default of 44,100 Hz is a placeholder; we'll pass the real value from AudioEngine when we wire things together. fftSize is 4096 — the sweet spot we discussed in Section 5.
halfSize is fftSize / 2 (2048). We'll use this everywhere — it's the number of frequency bins the FFT produces.
self.window = vDSP.window(ofType: Float.self,
usingSequence: .hanningDenormalized,
count: fftSize,
isHalfWindow: false)
This pre-computes the 4096-element Hann window we described in Section 5. The Accelerate framework generates the array of window coefficients for us — no manual cosine math needed. We store it as a property because the window never changes: we'll multiply it against every incoming buffer, and recomputing it each time would be wasteful.
.hanningDenormalized is Accelerate's name for the standard Hann window. The "denormalized" part means the values range from 0 to 1 without any energy-correction scaling. isHalfWindow: false means we want the full symmetric window, not just the first half.
let log2n = vDSP_Length(log2(Float(fftSize)))
self.fftSetup = vDSP.FFT(log2n: log2n,
radix: .radix2,
ofType: DSPSplitComplex.self)
This creates an FFT "plan" — a pre-computed data structure that Accelerate uses to execute the FFT efficiently. Think of it like compiling a regular expression before using it in a loop: you pay the setup cost once, then every subsequent FFT runs faster because the plan already knows the optimal memory layout and butterfly operations for this size.
The log2n parameter is the base-2 logarithm of the FFT size. For 4096, that's 12. Accelerate requires FFT sizes that are powers of 2 (hence "radix 2"), which is why we chose 4096 and not, say, 4000.
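If you want to sanity-check those numbers, a couple of lines of plain Swift will do it. (The power-of-two bit trick below is an illustrative aside, not part of the analyzer.)

```swift
import Foundation

let fftSize = 4096
// log2(4096) = 12, so the FFT plan is built with log2n = 12
let log2n = Int(log2(Double(fftSize)))

// A positive integer is a power of two exactly when it has a single bit set
let isPowerOfTwo = fftSize > 0 && (fftSize & (fftSize - 1)) == 0

print(log2n, isPowerOfTwo) // 12 true
```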
Pre-computing the window and FFT setup in init is important for real-time performance. Audio buffers arrive every ~93ms. If we had to allocate arrays and build FFT plans on every callback, we'd introduce jitter and potentially drop frames. By doing all allocation upfront, the per-buffer processing path is fast and allocation-free. This is the same principle as object pooling in C# game development — allocate once, reuse forever.
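The arithmetic behind that ~93 ms figure, as a quick aside:

```swift
// Buffer duration = samples per buffer / samples per second
let bufferDuration = 4096.0 / 44100.0             // ≈ 0.0929 s
let millisecondsPerBuffer = bufferDuration * 1000 // ≈ 92.9 ms
```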
This is the core of the spectrum analyzer. The process method takes a raw audio buffer and returns an array of magnitudes — one per FFT bin. We'll build it in four numbered steps that correspond directly to the pipeline from Section 5.
mutating func process(buffer: [Float]) -> [Float] {
// Step 1: Pad to fftSize if needed, then apply Hann window
var samples = Array(buffer.prefix(fftSize))
if samples.count < fftSize {
samples += Array(repeating: 0, count: fftSize - samples.count)
}
vDSP.multiply(samples, window, result: &samples)
First, we take up to fftSize samples from the incoming buffer. If the buffer is shorter than 4096 (which can happen at startup or with certain audio configurations), we pad it with zeros. Zero-padding doesn't add information, but it ensures the FFT always gets the size it expects.
Then comes the windowing: vDSP.multiply performs element-wise multiplication of the samples with our pre-computed Hann window. Sample 0 gets multiplied by ~0 (the window is near-zero at the edges). Sample 2048 gets multiplied by ~1.0 (the window peaks at the center). This is the spectral leakage prevention from Section 5, implemented as a single function call.
In C#, element-wise array multiplication would look like for (int i = 0; i < N; i++) samples[i] *= window[i]; or maybe a LINQ Zip. The vDSP.multiply call does the same thing but uses SIMD instructions under the hood — it processes 4 or 8 floats per CPU cycle. For 4096 elements, this is essentially free.
// Step 2: Pack into split complex format for vDSP
var reals = [Float](repeating: 0, count: halfSize)
var imags = [Float](repeating: 0, count: halfSize)
let magnitudes: [Float] = reals.withUnsafeMutableBufferPointer { realsBP in
imags.withUnsafeMutableBufferPointer { imagsBP in
var splitComplex = DSPSplitComplex(realp: realsBP.baseAddress!,
imagp: imagsBP.baseAddress!)
samples.withUnsafeBytes { ptr in
let floatPtr = ptr.bindMemory(to: DSPComplex.self)
vDSP_ctoz(floatPtr.baseAddress!, 2, &splitComplex, 1,
vDSP_Length(halfSize))
}
This is the gnarliest part of the code, so let's take it slowly.
Accelerate's FFT doesn't work with a plain array of floats. It wants data in split complex format — separate arrays for real parts and imaginary parts. The vDSP_ctoz function converts from interleaved format to split format: with a stride of 2, it treats each consecutive pair of floats as one complex value, so the even-indexed samples land in the reals array and the odd-indexed samples land in the imags array. That looks strange for a purely real signal, but it's the standard packing trick for vDSP's real-input FFT: the 4096 real samples are handed to the transform as 2048 complex values, and the FFT routine accounts for the packing internally.
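Concretely, here's what that conversion does, sketched in plain Swift without Accelerate (illustration only; the real code uses vDSP_ctoz for speed):

```swift
// What vDSP_ctoz does with stride 2: each consecutive pair of floats is
// treated as one DSPComplex, so even-indexed samples land in the real
// array and odd-indexed samples in the imaginary array.
let interleaved: [Float] = [1, 2, 3, 4, 5, 6, 7, 8]
var splitReals: [Float] = []
var splitImags: [Float] = []
for i in stride(from: 0, to: interleaved.count, by: 2) {
    splitReals.append(interleaved[i])
    splitImags.append(interleaved[i + 1])
}
// splitReals == [1, 3, 5, 7], splitImags == [2, 4, 6, 8]
```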
All the withUnsafeMutableBufferPointer and withUnsafeBytes wrappers are Swift's safety system making you explicitly opt into raw pointer access. Swift doesn't let you take a pointer to an array's memory without these wrappers — they guarantee the array stays alive and pinned in memory for the duration of the closure. If you've used C#'s fixed statement to pin a managed array for P/Invoke, this is the same idea. If you've used Unsafe.As<T> or Span<T> with MemoryMarshal, same family of concepts.
The ! after baseAddress is a force-unwrap. baseAddress returns an optional pointer that's nil only if the buffer is empty — and we just created non-empty arrays, so this is safe. In production code, you might add a guard, but here we know the sizes are correct.
The withUnsafe* pattern is Swift's way of saying: "I know you need raw pointers for this C-level API. I'll let you have them, but only inside this scope, and I'll manage the memory lifetime for you." It's verbose but safe. Once you've written it a couple of times, it becomes muscle memory. The Accelerate framework is a C API with a Swift overlay, so pointer wrangling at the boundary is unavoidable.
// Step 3: Execute the FFT
fftSetup?.forward(input: splitComplex, output: &splitComplex)
One line. All the theory from Section 5 — the decomposition of a time-domain signal into frequency components, the O(N log N) butterfly algorithm, the complex exponentials — happens right here. The 4096 windowed samples go in, and the split complex arrays now contain 2048 frequency bins, each with real and imaginary parts.
We're doing the FFT in-place (output: &splitComplex points to the same memory as the input). This saves an allocation. The reals and imags arrays, which started as our packed input, now hold the FFT output.
// Step 4: Compute magnitudes and normalize
var mags = [Float](repeating: 0, count: halfSize)
vDSP.absolute(splitComplex, result: &mags)
vDSP.multiply(1.0 / Float(fftSize), mags, result: &mags)
return mags
}
}
return magnitudes
}
vDSP.absolute computes sqrt(real² + imag²) for each bin — that's the magnitude, which tells us the amplitude of each frequency component. We then divide by fftSize to normalize the values. Without normalization, the magnitudes scale with the FFT size, and doubling N would double all your values even though the signal didn't change.
The result is an array of 2048 floats. Each float represents the normalized amplitude at its corresponding frequency bin. Bin 0 is 0 Hz (DC), bin 1 is ~10.77 Hz, bin 2 is ~21.5 Hz, and so on up to bin 2047, just below the 22,050 Hz Nyquist limit.
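The bin-to-frequency mapping is worth keeping around as a throwaway helper while you debug (a hypothetical function, not part of the struct):

```swift
import Foundation

// frequency = binIndex * sampleRate / fftSize
func binFrequency(_ bin: Int, sampleRate: Double = 44100, fftSize: Int = 4096) -> Double {
    Double(bin) * sampleRate / Double(fftSize)
}

binFrequency(0)   // 0.0 Hz (DC)
binFrequency(1)   // ≈ 10.77 Hz (the bin width)
binFrequency(41)  // ≈ 441.4 Hz, the loudest bin for an A4 test tone
```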
That's the complete FFT pipeline in four steps. Take a buffer, window it, FFT it, compute magnitudes. About 15 lines of actual logic, wrapped in the pointer-safety boilerplate that Swift requires for C interop.
We now have 2048 linearly-spaced frequency bins. We could display them directly — one thin bar per bin, 2048 bars total. That would be technically accurate and visually useless. Here's why.
FFT bins are linearly spaced: each bin is ~10.77 Hz wide. But human hearing is logarithmic. We perceive the distance between 100 Hz and 200 Hz (one octave) as the same as the distance between 1000 Hz and 2000 Hz (also one octave), even though the second span is ten times wider in Hz. On a linear frequency scale, the entire bass register below 250 Hz occupies only about 1% of the bins.
If you gave each bin equal visual weight, bass would be a thin sliver on the left and treble would dominate the entire display. That's the opposite of how we hear. Log spacing fixes this by giving each octave roughly equal visual width.
Add this method to SpectrumAnalyzer:
private func logSpacedBars(magnitudes: [Float]) -> [Float] {
let minFreq: Float = 60
let maxFreq: Float = 18000
let logMin = log10(minFreq)
let logMax = log10(maxFreq)
return (0..<binCount).map { i in
let logLow = logMin + Float(i) / Float(binCount) * (logMax - logMin)
let logHigh = logMin + Float(i + 1) / Float(binCount) * (logMax - logMin)
let freqLow = pow(10, logLow)
let freqHigh = pow(10, logHigh)
let binLow = Int(freqLow / Float(sampleRate) * Float(fftSize))
let binHigh = Int(freqHigh / Float(sampleRate) * Float(fftSize))
let slice = magnitudes[max(0, binLow)...min(magnitudes.count - 1, max(binLow, binHigh))]
let rms = slice.isEmpty ? 0 : slice.reduce(0, +) / Float(slice.count)
let db = 20 * log10(max(rms, 1e-9))
let normalized = (db + 80) / 80
return max(0, min(1, normalized))
}
}
Let's trace through the logic.
We define a display range of 60 Hz to 18,000 Hz. Below 60 Hz there's mostly rumble and DC offset. Above 18,000 Hz, most adults can't hear anything. These boundaries become the left and right edges of our spectrum display.
The key trick: we divide the frequency range in log space, not linear space. log10(60) ≈ 1.78, log10(18000) ≈ 4.26. We divide this log range into 48 equal slices, so each slice covers the same ratio of frequencies (about 1.126×). Bar 0 covers roughly 60 to 68 Hz. Bar 47 covers roughly 16,000 to 18,000 Hz. In linear Hz, bar 47's range is about 265 times wider than bar 0's. In perceptual terms, they're equivalent: each spans about two semitones.
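You can verify the per-bar numbers with a few lines of arithmetic, using the same constants as logSpacedBars (a standalone sketch, not part of the analyzer):

```swift
import Foundation

let minFreq = 60.0, maxFreq = 18000.0, barCount = 48.0

// Every bar's upper bound is its lower bound times the same constant ratio
let ratioPerBar = pow(maxFreq / minFreq, 1.0 / barCount) // ≈ 1.126

let bar0Range = (low: minFreq, high: minFreq * ratioPerBar)  // ≈ 60...67.6 Hz
let bar47Range = (low: maxFreq / ratioPerBar, high: maxFreq) // ≈ 15,983...18,000 Hz

// In semitones: 12 * log2(1.126) ≈ 2, so each bar spans about two semitones
let semitonesPerBar = 12 * log2(ratioPerBar)
```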
For each display bar, we convert its frequency bounds to FFT bin indices: bin = frequency * fftSize / sampleRate. This tells us which FFT bins fall within this bar's frequency range. We then take the average magnitude of those bins.
For the lowest bars, this might be just 1-2 FFT bins. For the highest bars, it could be dozens or hundreds. That's fine — we're averaging them down to a single value per bar.
The raw magnitudes from the FFT span a huge dynamic range. A quiet room might produce magnitudes of 0.0001 while a loud clap produces 0.5. Displaying these linearly would mean quiet sounds are invisible. So we convert to decibels: 20 * log10(magnitude). This compresses the range to something manageable.
The (db + 80) / 80 normalization maps the range -80 dB to 0 dB onto the range 0.0 to 1.0. Anything below -80 dB (effectively silence) clips to 0. Anything at 0 dB (full scale) maps to 1.0. The result is a value we can directly use as a bar height.
The max(rms, 1e-9) guard prevents log10(0), which would be negative infinity. A magnitude of 1e-9 corresponds to about -180 dB — so far below the noise floor it's irrelevant.
The 80 dB display range is a common choice for spectrum analyzers. It means the tallest bar is about 10,000 times louder (in amplitude) than the shortest visible bar. Professional audio tools sometimes use 90 or 120 dB ranges, but 80 dB is a good balance for a phone screen where you want quiet content to still be visible.
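Here's the magnitude-to-height mapping pulled out as a standalone function so the endpoints are easy to check (a hypothetical helper mirroring the math inside logSpacedBars):

```swift
import Foundation

func barHeight(magnitude: Float) -> Float {
    let db = 20 * log10(max(magnitude, 1e-9)) // guard against log10(0)
    return max(0, min(1, (db + 80) / 80))     // map -80...0 dB onto 0...1
}

barHeight(magnitude: 1.0)    // 1.0   (0 dB, full scale)
barHeight(magnitude: 0.1)    // 0.75  (-20 dB)
barHeight(magnitude: 0.0001) // 0.0   (-80 dB, the display floor)
barHeight(magnitude: 0)      // 0.0   (the guard prevents negative infinity)
```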
Finding the dominant frequency in the spectrum is straightforward: find the bin with the highest magnitude, then convert its index to Hz.
let peakBin = magnitudes.indices.max(by: { magnitudes[$0] < magnitudes[$1] }) ?? 0
let peakHz = Float(peakBin) * Float(sampleRate) / Float(fftSize)
The formula is just the inverse of the bin-to-frequency mapping: frequency = binIndex * (sampleRate / fftSize). With our parameters, that's binIndex * 10.77. If the loudest bin is bin 41, the peak frequency is about 441 Hz — close to the A4 note.
Showing "441 Hz" is accurate but not very musical. Most people think in note names: "that's an A." Converting a frequency to a note name uses the MIDI numbering system, which assigns every semitone on a piano keyboard a unique integer.
private func frequencyToNote(_ hz: Float) -> String {
let notes = ["C","C#","D","D#","E","F","F#","G","G#","A","A#","B"]
let midi = Int(round(12 * log2(hz / 440.0) + 69))
let note = notes[((midi % 12) + 12) % 12]
let octave = (midi / 12) - 1
return "\(note)\(octave)"
}
This formula is worth understanding because it encodes a fundamental fact about music: pitch is logarithmic.
- 12 * log2(hz / 440) gives the distance in semitones from A4. If hz is 880, that's 12 * log2(2) = 12: exactly one octave (12 semitones) above A4.
- Adding 69 shifts from "distance from A4" to the absolute MIDI scale, where A4 is 69.
- round() snaps to the nearest semitone. A frequency of 445 Hz is slightly sharp of A4 but will round to MIDI 69, still "A4".
- The ((midi % 12) + 12) % 12 double-modulo handles negative MIDI numbers (frequencies below C0) gracefully.

The note name comes from a 12-element array indexed by the position within the octave. The octave number is (midi / 12) - 1, following the convention where middle C is C4.
The MIDI numbering system was designed in the 1980s for synthesizers, but it's become the universal way to identify musical pitches in software. If you've ever used a DAW (Digital Audio Workstation), the piano roll view uses MIDI note numbers. The formula above is the standard conversion, used in everything from guitar tuners to music transcription software.
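A few spot checks against known pitches, using a standalone copy of the same formula:

```swift
import Foundation

func noteName(_ hz: Float) -> String {
    let notes = ["C","C#","D","D#","E","F","F#","G","G#","A","A#","B"]
    let midi = Int(round(12 * log2(hz / 440.0) + 69))
    return "\(notes[((midi % 12) + 12) % 12])\((midi / 12) - 1)"
}

noteName(440.0)   // "A4" (concert pitch)
noteName(261.63)  // "C4" (middle C)
noteName(82.41)   // "E2" (lowest guitar string)
noteName(445.0)   // "A4" (slightly sharp, rounds to the same semitone)
```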
Now let's combine everything into a public method that takes a raw buffer and returns all the data the UI needs:
struct AnalysisResult {
let bars: [Float]
let peakHz: Float
let peakNote: String
}
mutating func analyze(buffer: [Float]) -> AnalysisResult {
let magnitudes = process(buffer: buffer)
let peakBin = magnitudes.indices.max(by: { magnitudes[$0] < magnitudes[$1] }) ?? 0
let peakHz = Float(peakBin) * Float(sampleRate) / Float(fftSize)
let peakNote = peakHz > 30 ? frequencyToNote(peakHz) : "—"
let bars = logSpacedBars(magnitudes: magnitudes)
return AnalysisResult(bars: bars, peakHz: peakHz, peakNote: peakNote)
}
The peakHz > 30 guard avoids showing a note name for very low frequencies (below about B0), which are usually just noise or DC offset. We display an em-dash instead.
The SpectrumAnalyzer struct is pure DSP — it knows nothing about AVAudioEngine or SwiftUI. To use it, we need to wire it into our existing AudioEngine class. Open AudioEngine.swift and add three new published properties:
// Add these properties alongside the existing `level` property
var spectrumBars: [Float] = Array(repeating: 0, count: 48)
var peakHz: Float = 0
var peakNote: String = "—"
// The analyzer is created in start(), once we know the real sample rate
private var analyzer: SpectrumAnalyzer?
Notice that analyzer is an optional, not a pre-initialized instance. This is important — we create it inside start() where we can pass the device's actual sample rate:
func start() {
let input = engine.inputNode
let format = input.outputFormat(forBus: 0)
// Create analyzer with the device's actual sample rate
analyzer = SpectrumAnalyzer(binCount: 48, sampleRate: format.sampleRate)
input.installTap(onBus: 0, bufferSize: 4096, format: format) { [weak self] buffer, _ in
guard let self else { return }
guard let channelData = buffer.floatChannelData?[0] else { return }
let samples = Array(UnsafeBufferPointer(start: channelData,
count: Int(buffer.frameLength)))
let rms = Self.computeRMS(samples: samples)
// analyze is mutating, so call it through the stored property via optional chaining
guard let result = self.analyzer?.analyze(buffer: samples) else { return }
DispatchQueue.main.async {
self.level = Self.normalize(rms)
self.spectrumBars = result.bars
self.peakHz = result.peakHz
self.peakNote = result.peakNote
}
}
do {
try engine.start()
isRunning = true
} catch {
print("Audio engine failed to start: \(error)")
}
}
This is a real bug we hit during development. The SpectrumAnalyzer defaults to 44,100 Hz, but most iPhones actually run at 48,000 Hz. If you create the analyzer with the wrong sample rate, every frequency calculation is off by about 8%, which is nearly a semitone and a half. Play an A4 (440 Hz) and the app reports G#4 instead.
The math makes it clear: peakHz = binIndex * sampleRate / fftSize. If sampleRate is 44100 but the real rate is 48000, every computed frequency is multiplied by 44100/48000 ≈ 0.919. That's flat by about 1.47 semitones, enough to land on the wrong note every time.
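The error is easy to quantify with a few lines of arithmetic, using the same numbers:

```swift
import Foundation

let assumedRate = 44100.0
let actualRate = 48000.0

// Every reported frequency is scaled by this factor
let scale = assumedRate / actualRate                    // ≈ 0.919

// Error expressed in semitones: 12 * log2(actual / assumed)
let semitonesFlat = 12 * log2(actualRate / assumedRate) // ≈ 1.47

// An actual 440 Hz tone is reported as about 404 Hz, which snaps
// to the nearest semitone at 415.3 Hz, i.e. G#4
let reportedA4 = 440.0 * scale                          // ≈ 404.25 Hz
```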
The fix is simple: don't assume the sample rate. Query it from the hardware with format.sampleRate and pass it to the analyzer. This is why we create the analyzer in start() rather than at init time — the format isn't available until we access the input node.
A few other things to note about this code:
- The guard at the top of the closure unwraps the weak self reference and bails if self has been deallocated or the analyzer hasn't been created yet.
- DispatchQueue.main.async pushes the UI update to the main thread, just like you'd use Dispatcher.Invoke in WPF. SwiftUI requires all state changes to happen on the main thread.
- The level property still works exactly as before. We're augmenting the audio engine, not replacing anything.
- The audio tap callback runs on a real-time audio thread. Allocating memory, acquiring locks, or doing anything unpredictable on this thread can cause audio glitches. Our SpectrumAnalyzer does allocate arrays internally (the samples, reals, imags, and mags arrays). For a more production-hardened implementation, you'd pre-allocate all of these as reusable buffers. For a tutorial app, the occasional allocation on a modern iPhone is fast enough not to cause audible problems.
Our peak frequency detection — finding the loudest FFT bin — works well for simple, single-tone sounds. Whistle into the mic and you'll get an accurate note reading. Play a sine wave from a tone generator app and it'll nail the frequency within half a bin width (~5 Hz).
But it's a crude technique with real limitations. Many musical sounds carry more energy in a harmonic than in the fundamental, so the loudest bin can sit an octave or more above the note actually being played. The ~10.8 Hz bin resolution is too coarse for low notes, where neighboring semitones are only a few Hz apart. And with multiple simultaneous sources, the peak simply tracks whichever one is loudest at that moment.
Better pitch detection algorithms exist: autocorrelation, YIN, and pYIN are specifically designed to find the fundamental frequency even when harmonics are louder. They work in the time domain rather than the frequency domain and are much more robust for musical signals. We'll revisit this topic in Section 9.
For now, our peak-bin detector is perfectly good as a display feature. It gives the user a rough idea of what note is dominant, and it adds visual interest to the UI. Just don't try to build a guitar tuner with it.
Let's trace the full journey of an audio buffer from microphone to screen:
1. The AVAudioEngine input tap delivers a buffer of raw samples from the microphone.
2. The buffer is padded to 4096 samples if needed and multiplied by the Hann window.
3. vDSP_ctoz packs the windowed samples into split complex format.
4. The FFT transforms the time-domain samples into 2048 frequency bins.
5. vDSP.absolute computes each bin's magnitude, normalized by the FFT size.
6. The 2048 linear bins are grouped into 48 log-spaced bars, converted to decibels, and mapped to 0-1 heights.
7. The loudest bin becomes a peak frequency in Hz and a note name.
8. The results are dispatched to the main thread and published for the UI.
Every step maps to something we discussed in Section 5. Steps 2-5 are the textbook FFT pipeline. Step 6 is the perceptual correction that makes the display match human hearing. Step 7 is a nice-to-have that gives us the note readout. Step 8 is the iOS plumbing that connects DSP to UI.
In this section you:
- Built a SpectrumAnalyzer struct that turns raw audio into display-ready spectrum data.
- Added peak frequency detection and a MIDI-based note name converter.
- Wired it all into AudioEngine so the data flows from mic to model.

The data is ready. In Section 7, we'll build the SwiftUI view that displays it — animated bars, a note readout, and the final dark-themed layout that brings the whole thing to life.
Your project should build without errors. If you add a temporary print(result.bars) in the audio tap callback, you should see arrays of 48 floats scrolling by in the Xcode console when audio is active. The values should be mostly near zero in a quiet room and spike toward 1.0 when you make noise. If you see that, the DSP pipeline is working. The UI comes next.