Semi-random thoughts and tales of tinkering
Section 5 gave you the theory. Now we write the code. By the end of this section, you'll have a SpectrumAnalyzer struct that takes a buffer of raw audio samples and returns an array of bar heights ready for display. Every line maps directly to a concept from the previous section — windowing, FFT, magnitude computation, log-frequency grouping. If anything feels unfamiliar, flip back to Section 5.
We'll also add peak frequency detection and a MIDI-based note name converter, then wire the whole thing into AudioEngine. At the end, the audio engine will publish spectrum data alongside the VU level you've already built.
Create a new Swift file in Xcode: File → New → File → Swift File, name it SpectrumAnalyzer.swift. Here's the struct and its initializer:
import Accelerate
struct SpectrumAnalyzer {
let binCount: Int
let sampleRate: Double
private let fftSize: Int
private let halfSize: Int
private var fftSetup: vDSP.FFT<DSPSplitComplex>?
private var window: [Float]
init(binCount: Int = 48, sampleRate: Double = 44100, fftSize: Int = 4096) {
self.binCount = binCount
self.sampleRate = sampleRate
self.fftSize = fftSize
self.halfSize = fftSize / 2
self.window = vDSP.window(ofType: Float.self,
usingSequence: .hanningDenormalized,
count: fftSize,
isHalfWindow: false)
let log2n = vDSP_Length(log2(Float(fftSize)))
self.fftSetup = vDSP.FFT(log2n: log2n,
radix: .radix2,
ofType: DSPSplitComplex.self)
}
}
Let's walk through each piece.
binCount is the number of bars in our display — 48 by default. That's enough to show meaningful frequency detail without making each bar too thin on a phone screen. sampleRate must match the audio hardware's actual rate — the default of 44,100 Hz is a placeholder; we'll pass the real value from AudioEngine when we wire things together. fftSize is 4096 — the sweet spot we discussed in Section 5.
halfSize is fftSize / 2 (2048). We'll use this everywhere — it's the number of frequency bins the FFT produces.
self.window = vDSP.window(ofType: Float.self,
usingSequence: .hanningDenormalized,
count: fftSize,
isHalfWindow: false)
This pre-computes the 4096-element Hann window we described in Section 5. The Accelerate framework generates the array of window coefficients for us — no manual cosine math needed. We store it as a property because the window never changes: we'll multiply it against every incoming buffer, and recomputing it each time would be wasteful.
.hanningDenormalized is Accelerate's name for the standard Hann window. The "denormalized" part means the values range from 0 to 1 without any energy-correction scaling. isHalfWindow: false means we want the full symmetric window, not just the first half.
let log2n = vDSP_Length(log2(Float(fftSize)))
self.fftSetup = vDSP.FFT(log2n: log2n,
radix: .radix2,
ofType: DSPSplitComplex.self)
This creates an FFT "plan" — a pre-computed data structure that Accelerate uses to execute the FFT efficiently. Think of it like compiling a regular expression before using it in a loop: you pay the setup cost once, then every subsequent FFT runs faster because the plan already knows the optimal memory layout and butterfly operations for this size.
The log2n parameter is the base-2 logarithm of the FFT size. For 4096, that's 12. Accelerate requires FFT sizes that are powers of 2 (hence "radix 2"), which is why we chose 4096 and not, say, 4000.
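If you want to sanity-check those numbers, a couple of lines of plain Swift will do it. (The power-of-two bit trick below is an illustrative aside, not part of the analyzer.)

```swift
import Foundation

let fftSize = 4096
// log2(4096) = 12, so the FFT plan is built with log2n = 12
let log2n = Int(log2(Double(fftSize)))

// A positive integer is a power of two exactly when it has a single bit set
let isPowerOfTwo = fftSize > 0 && (fftSize & (fftSize - 1)) == 0

print(log2n, isPowerOfTwo) // 12 true
```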
Pre-computing the window and FFT setup in init is important for real-time performance. Audio buffers arrive every ~93ms. If we had to allocate arrays and build FFT plans on every callback, we'd introduce jitter and potentially drop frames. By doing all allocation upfront, the per-buffer processing path is fast and allocation-free. This is the same principle as object pooling in C# game development — allocate once, reuse forever.
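The arithmetic behind that ~93 ms figure, as a quick aside:

```swift
// Buffer duration = samples per buffer / samples per second
let bufferDuration = 4096.0 / 44100.0             // ≈ 0.0929 s
let millisecondsPerBuffer = bufferDuration * 1000 // ≈ 92.9 ms
```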
This is the core of the spectrum analyzer. The process method takes a raw audio buffer and returns an array of magnitudes — one per FFT bin. We'll build it in four numbered steps that correspond directly to the pipeline from Section 5.
mutating func process(buffer: [Float]) -> [Float] {
// Step 1: Pad to fftSize if needed, then apply Hann window
var samples = Array(buffer.prefix(fftSize))
if samples.count < fftSize {
samples += Array(repeating: 0, count: fftSize - samples.count)
}
vDSP.multiply(samples, window, result: &samples)
First, we take up to fftSize samples from the incoming buffer. If the buffer is shorter than 4096 (which can happen at startup or with certain audio configurations), we pad it with zeros. Zero-padding doesn't add information, but it ensures the FFT always gets the size it expects.
Then comes the windowing: vDSP.multiply performs element-wise multiplication of the samples with our pre-computed Hann window. Sample 0 gets multiplied by ~0 (the window is near-zero at the edges). Sample 2048 gets multiplied by ~1.0 (the window peaks at the center). This is the spectral leakage prevention from Section 5, implemented as a single function call.
In C#, element-wise array multiplication would look like for (int i = 0; i < N; i++) samples[i] *= window[i]; or maybe a LINQ Zip. The vDSP.multiply call does the same thing but uses SIMD instructions under the hood — it processes 4 or 8 floats per CPU cycle. For 4096 elements, this is essentially free.
// Step 2: Pack into split complex format for vDSP
var reals = [Float](repeating: 0, count: halfSize)
var imags = [Float](repeating: 0, count: halfSize)
let magnitudes: [Float] = reals.withUnsafeMutableBufferPointer { realsBP in
imags.withUnsafeMutableBufferPointer { imagsBP in
var splitComplex = DSPSplitComplex(realp: realsBP.baseAddress!,
imagp: imagsBP.baseAddress!)
samples.withUnsafeBytes { ptr in
let floatPtr = ptr.bindMemory(to: DSPComplex.self)
vDSP_ctoz(floatPtr.baseAddress!, 2, &splitComplex, 1,
vDSP_Length(halfSize))
}
This is the gnarliest part of the code, so let's take it slowly.
Accelerate's FFT doesn't work with a plain array of floats. It wants data in split complex format — separate arrays for real parts and imaginary parts. The vDSP_ctoz function converts from interleaved format to split format: with a stride of 2, it treats each consecutive pair of floats as one complex value, so the even-indexed samples land in the reals array and the odd-indexed samples land in the imags array. That looks strange for a purely real signal, but it's the standard packing trick for vDSP's real-input FFT: the 4096 real samples are handed to the transform as 2048 complex values, and the FFT routine accounts for the packing internally.
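Concretely, here's what that conversion does, sketched in plain Swift without Accelerate (illustration only; the real code uses vDSP_ctoz for speed):

```swift
// What vDSP_ctoz does with stride 2: each consecutive pair of floats is
// treated as one DSPComplex, so even-indexed samples land in the real
// array and odd-indexed samples in the imaginary array.
let interleaved: [Float] = [1, 2, 3, 4, 5, 6, 7, 8]
var splitReals: [Float] = []
var splitImags: [Float] = []
for i in stride(from: 0, to: interleaved.count, by: 2) {
    splitReals.append(interleaved[i])
    splitImags.append(interleaved[i + 1])
}
// splitReals == [1, 3, 5, 7], splitImags == [2, 4, 6, 8]
```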
All the withUnsafeMutableBufferPointer and withUnsafeBytes wrappers are Swift's safety system making you explicitly opt into raw pointer access. Swift doesn't let you take a pointer to an array's memory without these wrappers — they guarantee the array stays alive and pinned in memory for the duration of the closure. If you've used C#'s fixed statement to pin a managed array for P/Invoke, this is the same idea. If you've used Unsafe.As<T> or Span<T> with MemoryMarshal, same family of concepts.
The ! after baseAddress is a force-unwrap. baseAddress returns an optional pointer that's nil only if the buffer is empty — and we just created non-empty arrays, so this is safe. In production code, you might add a guard, but here we know the sizes are correct.
The withUnsafe* pattern is Swift's way of saying: "I know you need raw pointers for this C-level API. I'll let you have them, but only inside this scope, and I'll manage the memory lifetime for you." It's verbose but safe. Once you've written it a couple of times, it becomes muscle memory. The Accelerate framework is a C API with a Swift overlay, so pointer wrangling at the boundary is unavoidable.
// Step 3: Execute the FFT
fftSetup?.forward(input: splitComplex, output: &splitComplex)
One line. All the theory from Section 5 — the decomposition of a time-domain signal into frequency components, the O(N log N) butterfly algorithm, the complex exponentials — happens right here. The 4096 windowed samples go in, and the split complex arrays now contain 2048 frequency bins, each with real and imaginary parts.
We're doing the FFT in-place (output: &splitComplex points to the same memory as the input). This saves an allocation. The reals and imags arrays, which started as our packed input, now hold the FFT output.
// Step 4: Compute magnitudes and normalize
var mags = [Float](repeating: 0, count: halfSize)
vDSP.absolute(splitComplex, result: &mags)
vDSP.multiply(1.0 / Float(fftSize), mags, result: &mags)
return mags
}
}
return magnitudes
}
vDSP.absolute computes sqrt(real² + imag²) for each bin — that's the magnitude, which tells us the amplitude of each frequency component. We then divide by fftSize to normalize the values. Without normalization, the magnitudes scale with the FFT size, and doubling N would double all your values even though the signal didn't change.
The result is an array of 2048 floats. Each float represents the normalized amplitude at its corresponding frequency bin. Bin 0 is 0 Hz (DC), bin 1 is ~10.77 Hz, bin 2 is ~21.5 Hz, and so on up to bin 2047, just below the 22,050 Hz Nyquist limit.
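The bin-to-frequency mapping is worth keeping around as a throwaway helper while you debug (a hypothetical function, not part of the struct):

```swift
import Foundation

// frequency = binIndex * sampleRate / fftSize
func binFrequency(_ bin: Int, sampleRate: Double = 44100, fftSize: Int = 4096) -> Double {
    Double(bin) * sampleRate / Double(fftSize)
}

binFrequency(0)   // 0.0 Hz (DC)
binFrequency(1)   // ≈ 10.77 Hz (the bin width)
binFrequency(41)  // ≈ 441.4 Hz, the loudest bin for an A4 test tone
```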
That's the complete FFT pipeline in four steps. Take a buffer, window it, FFT it, compute magnitudes. About 15 lines of actual logic, wrapped in the pointer-safety boilerplate that Swift requires for C interop.
We now have 2048 linearly-spaced frequency bins. We could display them directly — one thin bar per bin, 2048 bars total. That would be technically accurate and visually useless. Here's why.
FFT bins are linearly spaced: each bin is ~10.77 Hz wide. But human hearing is logarithmic. We perceive the distance between 100 Hz and 200 Hz (one octave) as the same as the distance between 1000 Hz and 2000 Hz (also one octave), even though the second span is ten times wider in Hz. On a linear frequency scale, the entire bass register below 250 Hz occupies only about 1% of the bins.
If you gave each bin equal visual weight, bass would be a thin sliver on the left and treble would dominate the entire display. That's the opposite of how we hear. Log spacing fixes this by giving each octave roughly equal visual width.
Add this method to SpectrumAnalyzer:
private func logSpacedBars(magnitudes: [Float]) -> [Float] {
let minFreq: Float = 60
let maxFreq: Float = 18000
let logMin = log10(minFreq)
let logMax = log10(maxFreq)
return (0..<binCount).map { i in
let logLow = logMin + Float(i) / Float(binCount) * (logMax - logMin)
let logHigh = logMin + Float(i + 1) / Float(binCount) * (logMax - logMin)
let freqLow = pow(10, logLow)
let freqHigh = pow(10, logHigh)
let binLow = Int(freqLow / Float(sampleRate) * Float(fftSize))
let binHigh = Int(freqHigh / Float(sampleRate) * Float(fftSize))
let slice = magnitudes[max(0, binLow)...min(magnitudes.count - 1, max(binLow, binHigh))]
let rms = slice.isEmpty ? 0 : slice.reduce(0, +) / Float(slice.count)
let db = 20 * log10(max(rms, 1e-9))
let normalized = (db + 80) / 80
return max(0, min(1, normalized))
}
}
Let's trace through the logic.
We define a display range of 60 Hz to 18,000 Hz. Below 60 Hz there's mostly rumble and DC offset. Above 18,000 Hz, most adults can't hear anything. These boundaries become the left and right edges of our spectrum display.
The key trick: we divide the frequency range in log space, not linear space. log10(60) ≈ 1.78, log10(18000) ≈ 4.26. We divide this log range into 48 equal slices, so each slice covers the same ratio of frequencies (about 1.126×). Bar 0 covers roughly 60 to 68 Hz. Bar 47 covers roughly 16,000 to 18,000 Hz. In linear Hz, bar 47's range is about 265 times wider than bar 0's. In perceptual terms, they're equivalent: each spans about two semitones.
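You can verify the per-bar numbers with a few lines of arithmetic, using the same constants as logSpacedBars (a standalone sketch, not part of the analyzer):

```swift
import Foundation

let minFreq = 60.0, maxFreq = 18000.0, barCount = 48.0

// Every bar's upper bound is its lower bound times the same constant ratio
let ratioPerBar = pow(maxFreq / minFreq, 1.0 / barCount) // ≈ 1.126

let bar0Range = (low: minFreq, high: minFreq * ratioPerBar)  // ≈ 60...67.6 Hz
let bar47Range = (low: maxFreq / ratioPerBar, high: maxFreq) // ≈ 15,983...18,000 Hz

// In semitones: 12 * log2(1.126) ≈ 2, so each bar spans about two semitones
let semitonesPerBar = 12 * log2(ratioPerBar)
```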
For each display bar, we convert its frequency bounds to FFT bin indices: bin = frequency * fftSize / sampleRate. This tells us which FFT bins fall within this bar's frequency range. We then take the average magnitude of those bins.
For the lowest bars, this might be just 1-2 FFT bins. For the highest bars, it could be dozens or hundreds. That's fine — we're averaging them down to a single value per bar.
The raw magnitudes from the FFT span a huge dynamic range. A quiet room might produce magnitudes of 0.0001 while a loud clap produces 0.5. Displaying these linearly would mean quiet sounds are invisible. So we convert to decibels: 20 * log10(magnitude). This compresses the range to something manageable.
The (db + 80) / 80 normalization maps the range -80 dB to 0 dB onto the range 0.0 to 1.0. Anything below -80 dB (effectively silence) clips to 0. Anything at 0 dB (full scale) maps to 1.0. The result is a value we can directly use as a bar height.
The max(rms, 1e-9) guard prevents log10(0), which would be negative infinity. A magnitude of 1e-9 corresponds to about -180 dB — so far below the noise floor it's irrelevant.
The 80 dB display range is a common choice for spectrum analyzers. It means the tallest bar is about 10,000 times louder (in amplitude) than the shortest visible bar. Professional audio tools sometimes use 90 or 120 dB ranges, but 80 dB is a good balance for a phone screen where you want quiet content to still be visible.
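Here's the magnitude-to-height mapping pulled out as a standalone function so the endpoints are easy to check (a hypothetical helper mirroring the math inside logSpacedBars):

```swift
import Foundation

func barHeight(magnitude: Float) -> Float {
    let db = 20 * log10(max(magnitude, 1e-9)) // guard against log10(0)
    return max(0, min(1, (db + 80) / 80))     // map -80...0 dB onto 0...1
}

barHeight(magnitude: 1.0)    // 1.0   (0 dB, full scale)
barHeight(magnitude: 0.1)    // 0.75  (-20 dB)
barHeight(magnitude: 0.0001) // 0.0   (-80 dB, the display floor)
barHeight(magnitude: 0)      // 0.0   (the guard prevents negative infinity)
```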
Finding the dominant frequency in the spectrum is straightforward: find the bin with the highest magnitude, then convert its index to Hz.
let peakBin = magnitudes.indices.max(by: { magnitudes[$0] < magnitudes[$1] }) ?? 0
let peakHz = Float(peakBin) * Float(sampleRate) / Float(fftSize)
The formula is just the inverse of the bin-to-frequency mapping: frequency = binIndex * (sampleRate / fftSize). With our parameters, that's binIndex * 10.77. If the loudest bin is bin 41, the peak frequency is about 441 Hz — close to the A4 note.
Showing "441 Hz" is accurate but not very musical. Most people think in note names: "that's an A." Converting a frequency to a note name uses the MIDI numbering system, which assigns every semitone on a piano keyboard a unique integer.
private func frequencyToNote(_ hz: Float) -> String {
let notes = ["C","C#","D","D#","E","F","F#","G","G#","A","A#","B"]
let midi = Int(round(12 * log2(hz / 440.0) + 69))
let note = notes[((midi % 12) + 12) % 12]
let octave = (midi / 12) - 1
return "\(note)\(octave)"
}
This formula is worth understanding because it encodes a fundamental fact about music: pitch is logarithmic.
- 12 * log2(hz / 440) gives the distance in semitones from A4. If hz is 880, that's 12 * log2(2) = 12: exactly one octave (12 semitones) above A4.
- Adding 69 shifts from "distance from A4" to the absolute MIDI scale, where A4 is 69.
- round() snaps to the nearest semitone. A frequency of 445 Hz is slightly sharp of A4 but will round to MIDI 69, still "A4".
- The ((midi % 12) + 12) % 12 double-modulo handles negative MIDI numbers (frequencies below C0) gracefully.

The note name comes from a 12-element array indexed by the position within the octave. The octave number is (midi / 12) - 1, following the convention where middle C is C4.
The MIDI numbering system was designed in the 1980s for synthesizers, but it's become the universal way to identify musical pitches in software. If you've ever used a DAW (Digital Audio Workstation), the piano roll view uses MIDI note numbers. The formula above is the standard conversion, used in everything from guitar tuners to music transcription software.
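A few spot checks against known pitches, using a standalone copy of the same formula:

```swift
import Foundation

func noteName(_ hz: Float) -> String {
    let notes = ["C","C#","D","D#","E","F","F#","G","G#","A","A#","B"]
    let midi = Int(round(12 * log2(hz / 440.0) + 69))
    return "\(notes[((midi % 12) + 12) % 12])\((midi / 12) - 1)"
}

noteName(440.0)   // "A4" (concert pitch)
noteName(261.63)  // "C4" (middle C)
noteName(82.41)   // "E2" (lowest guitar string)
noteName(445.0)   // "A4" (slightly sharp, rounds to the same semitone)
```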
Now let's combine everything into a public method that takes a raw buffer and returns all the data the UI needs:
struct AnalysisResult {
let bars: [Float]
let peakHz: Float
let peakNote: String
}
mutating func analyze(buffer: [Float]) -> AnalysisResult {
let magnitudes = process(buffer: buffer)
let peakBin = magnitudes.indices.max(by: { magnitudes[$0] < magnitudes[$1] }) ?? 0
let peakHz = Float(peakBin) * Float(sampleRate) / Float(fftSize)
let peakNote = peakHz > 30 ? frequencyToNote(peakHz) : "—"
let bars = logSpacedBars(magnitudes: magnitudes)
return AnalysisResult(bars: bars, peakHz: peakHz, peakNote: peakNote)
}
The peakHz > 30 guard avoids showing a note name for very low frequencies (below about B0), which are usually just noise or DC offset. We display an em-dash instead.
The SpectrumAnalyzer struct is pure DSP — it knows nothing about AVAudioEngine or SwiftUI. To use it, we need to wire it into our existing AudioEngine class. Open AudioEngine.swift and add three new published properties:
// Add these properties alongside the existing `level` property
var spectrumBars: [Float] = Array(repeating: 0, count: 48)
var peakHz: Float = 0
var peakNote: String = "—"
// The analyzer is created in start(), once we know the real sample rate
private var analyzer: SpectrumAnalyzer?
Notice that analyzer is an optional, not a pre-initialized instance. This is important — we create it inside start() where we can pass the device's actual sample rate:
func start() {
let input = engine.inputNode
let format = input.outputFormat(forBus: 0)
// Create analyzer with the device's actual sample rate
analyzer = SpectrumAnalyzer(binCount: 48, sampleRate: format.sampleRate)
input.installTap(onBus: 0, bufferSize: 4096, format: format) { [weak self] buffer, _ in
guard let self else { return }
guard let channelData = buffer.floatChannelData?[0] else { return }
let samples = Array(UnsafeBufferPointer(start: channelData,
count: Int(buffer.frameLength)))
let rms = Self.computeRMS(samples: samples)
// analyze is mutating, so call it through the stored property via optional chaining
guard let result = self.analyzer?.analyze(buffer: samples) else { return }
DispatchQueue.main.async {
self.level = Self.normalize(rms)
self.spectrumBars = result.bars
self.peakHz = result.peakHz
self.peakNote = result.peakNote
}
}
do {
try engine.start()
isRunning = true
} catch {
print("Audio engine failed to start: \(error)")
}
}
This is a real bug we hit during development. The SpectrumAnalyzer defaults to 44,100 Hz, but most iPhones actually run at 48,000 Hz. If you create the analyzer with the wrong sample rate, every frequency calculation is off by about 8%, which is nearly a semitone and a half. Play an A4 (440 Hz) and the app reports G#4 instead.
The math makes it clear: peakHz = binIndex * sampleRate / fftSize. If sampleRate is 44100 but the real rate is 48000, every computed frequency is multiplied by 44100/48000 ≈ 0.919. That's flat by about 1.47 semitones, enough to land on the wrong note every time.
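The error is easy to quantify with a few lines of arithmetic, using the same numbers:

```swift
import Foundation

let assumedRate = 44100.0
let actualRate = 48000.0

// Every reported frequency is scaled by this factor
let scale = assumedRate / actualRate                    // ≈ 0.919

// Error expressed in semitones: 12 * log2(actual / assumed)
let semitonesFlat = 12 * log2(actualRate / assumedRate) // ≈ 1.47

// An actual 440 Hz tone is reported as about 404 Hz, which snaps
// to the nearest semitone at 415.3 Hz, i.e. G#4
let reportedA4 = 440.0 * scale                          // ≈ 404.25 Hz
```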
The fix is simple: don't assume the sample rate. Query it from the hardware with format.sampleRate and pass it to the analyzer. This is why we create the analyzer in start() rather than at init time — the format isn't available until we access the input node.
A few other things to note about this code:
- The guard at the top of the closure unwraps the weak self reference and bails if self has been deallocated or the analyzer hasn't been created yet.
- DispatchQueue.main.async pushes the UI update to the main thread, just like you'd use Dispatcher.Invoke in WPF. SwiftUI requires all state changes to happen on the main thread.
- The level property still works exactly as before. We're augmenting the audio engine, not replacing anything.
- The audio tap callback runs on a real-time audio thread. Allocating memory, acquiring locks, or doing anything unpredictable on this thread can cause audio glitches. Our SpectrumAnalyzer does allocate arrays internally (the samples, reals, imags, and mags arrays). For a more production-hardened implementation, you'd pre-allocate all of these as reusable buffers. For a tutorial app, the occasional allocation on a modern iPhone is fast enough not to cause audible problems.
Our peak frequency detection — finding the loudest FFT bin — works well for simple, single-tone sounds. Whistle into the mic and you'll get an accurate note reading. Play a sine wave from a tone generator app and it'll nail the frequency within half a bin width (~5 Hz).
But it's a crude technique with real limitations. Many musical sounds carry more energy in a harmonic than in the fundamental, so the loudest bin can sit an octave or more above the note actually being played. The ~10.8 Hz bin resolution is too coarse for low notes, where neighboring semitones are only a few Hz apart. And with multiple simultaneous sources, the peak simply tracks whichever one is loudest at that moment.
Better pitch detection algorithms exist: autocorrelation, YIN, and pYIN are specifically designed to find the fundamental frequency even when harmonics are louder. They work in the time domain rather than the frequency domain and are much more robust for musical signals. We'll revisit this topic in Section 9.
For now, our peak-bin detector is perfectly good as a display feature. It gives the user a rough idea of what note is dominant, and it adds visual interest to the UI. Just don't try to build a guitar tuner with it.
Let's trace the full journey of an audio buffer from microphone to screen:
1. The AVAudioEngine input tap delivers a buffer of raw samples from the microphone.
2. The buffer is padded to 4096 samples if needed and multiplied by the Hann window.
3. vDSP_ctoz packs the windowed samples into split complex format.
4. The FFT transforms the time-domain samples into 2048 frequency bins.
5. vDSP.absolute computes each bin's magnitude, normalized by the FFT size.
6. The 2048 linear bins are grouped into 48 log-spaced bars, converted to decibels, and mapped to 0-1 heights.
7. The loudest bin becomes a peak frequency in Hz and a note name.
8. The results are dispatched to the main thread and published for the UI.
Every step maps to something we discussed in Section 5. Steps 2-5 are the textbook FFT pipeline. Step 6 is the perceptual correction that makes the display match human hearing. Step 7 is a nice-to-have that gives us the note readout. Step 8 is the iOS plumbing that connects DSP to UI.
In this section you:
- Built a SpectrumAnalyzer struct that turns raw audio into display-ready spectrum data.
- Added peak frequency detection and a MIDI-based note name converter.
- Wired it all into AudioEngine so the data flows from mic to model.

The data is ready. In Section 7, we'll build the SwiftUI view that displays it — animated bars, a note readout, and the final dark-themed layout that brings the whole thing to life.
Your project should build without errors. If you add a temporary print(result.bars) in the audio tap callback, you should see arrays of 48 floats scrolling by in the Xcode console when audio is active. The values should be mostly near zero in a quiet room and spike toward 1.0 when you make noise. If you see that, the DSP pipeline is working. The UI comes next.