This is the most important section in the entire tutorial. Everything we've built so far — the VU meter, the decibel conversion, the audio tap — answers a single question: how loud is it? That's useful, but it's also flat. A whisper and a scream look different on a VU meter. A flute and a trombone playing the same note at the same volume? Identical. The meter can't tell them apart.
A spectrum analyzer can. It shows you which frequencies are present in the sound and how strong each one is. The mathematical tool that makes this possible is the Fourier Transform — specifically, its fast implementation, the FFT. This section is pure theory. No code. We'll implement everything in Section 6. Right now, the goal is to build solid intuition so that when you see the code, every line makes sense.
Think about what a microphone actually gives you. You've been working with it for several sections now: a stream of floating-point numbers, one per sample, arriving 44,100 times per second. Each number represents the air pressure at that instant. When you hum a single note, those numbers form a smooth, repeating wave. Simple enough.
But what happens when you strum a guitar? Six strings vibrate simultaneously, each at a different frequency. The microphone doesn't give you six separate signals — it gives you one signal that's the sum of all six vibrations mixed together. Add in the resonance of the guitar body, the harmonics of each string, and the ambient room noise, and you've got a complicated, messy waveform that looks nothing like a clean sine wave.
The Fourier Transform untangles that mixture. Given N samples in the time domain, it tells you the amplitude of every frequency present in the signal. It decomposes a complex waveform into its individual sinusoidal components — the pure tones that, when added together, would perfectly reconstruct the original signal.
The best physical analogy: a prism splits white light into a rainbow of individual colors. The Fourier Transform does the same thing for sound — it splits a complex audio signal into a rainbow of individual frequencies. White light is a mixture of all visible wavelengths; a guitar chord is a mixture of all audible frequencies produced by the strings. The prism and the FFT both reveal what's hiding inside the mixture.
This is a genuinely profound idea. Jean-Baptiste Fourier discovered in the early 1800s that any periodic signal — no matter how complex — can be represented as a sum of sine waves at different frequencies, amplitudes, and phases. A square wave? It's an infinite series of odd harmonics. A sawtooth wave? All harmonics, diminishing in amplitude. Your voice saying the vowel "ah"? A fundamental frequency plus a specific pattern of harmonics that your vocal tract shapes. The Fourier Transform is the mathematical machinery that finds those components.
For our purposes, the practical takeaway is simple: feed in a buffer of audio samples, get back a list of "how much energy is at each frequency." That's exactly what a spectrum analyzer needs.
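To make that takeaway concrete, here's a small illustration in plain Python (the real implementation in Section 6 will be Swift with Accelerate — this naive DFT is for intuition only, and the buffer size of 64 is chosen just to keep it fast). A buffer containing two pure tones goes in; out comes a list of per-frequency magnitudes with peaks exactly at the two tones:

```python
import math, cmath

def dft_magnitudes(samples):
    """Naive O(N^2) DFT: magnitude of each of the first N/2 frequency bins."""
    n = len(samples)
    mags = []
    for k in range(n // 2):
        s = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(s) / n)  # normalize by N
    return mags

# A "chord" of two pure tones: one at bin 5, a quieter one at bin 12.
n = 64
signal = [math.sin(2 * math.pi * 5 * t / n) + 0.5 * math.sin(2 * math.pi * 12 * t / n)
          for t in range(n)]
mags = dft_magnitudes(signal)

# The two largest magnitudes land exactly at bins 5 and 12.
top2 = sorted(sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:2])
print(top2)  # → [5, 12]
```

The mixture goes in as one messy waveform; the output cleanly separates it back into its two ingredients — the prism in action.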
Before and after the FFT, you're looking at the same data — just from two different perspectives. These perspectives have names.
The first perspective is the time domain. This is what the microphone captures and what you've been working with so far. The x-axis is time (or sample index), and the y-axis is amplitude — the instantaneous air pressure at each moment. A buffer of 4096 samples is a time-domain signal. You can see the shape of the waveform, spot loud parts and quiet parts, but you can't easily tell which frequencies are present just by looking at it.
The second perspective is the frequency domain. This is what the FFT produces. The x-axis is frequency (measured in Hz), and the y-axis is magnitude — how much energy is at that frequency. Instead of "what was the air pressure at sample 1000?", you're asking "how loud is the 440 Hz component?" This is the view that makes spectrum analyzers possible.
The time-domain view is like watching a stock ticker — you see the price moving up and down moment by moment. The frequency-domain view is like a portfolio summary — it tells you how much is invested in each sector. Same underlying data, different lens.
One critical detail: the transform is reversible. You can go from time domain to frequency domain and back again without losing any information. The FFT doesn't destroy data — it rearranges it. We only need the forward direction (time to frequency) for a spectrum analyzer, but the inverse FFT is what makes things like equalizers and noise cancellation possible: transform to frequency domain, modify the frequencies you care about, transform back.
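You can verify the reversibility claim directly. This sketch (again plain Python, not the Section 6 code) transforms a random buffer to the frequency domain and back, and recovers the original samples to within floating-point noise:

```python
import math, cmath, random

def dft(x):
    """Forward DFT: time domain -> frequency domain (complex bins)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse DFT: frequency domain -> time domain (real samples)."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

random.seed(1)
x = [random.uniform(-1.0, 1.0) for _ in range(32)]
x_roundtrip = idft(dft(x))

# No information lost: the round trip reproduces the input.
print(max(abs(a - b) for a, b in zip(x, x_roundtrip)) < 1e-9)  # → True
```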
Let's get specific about what goes in and what comes out. The acronym "FFT" stands for Fast Fourier Transform. It's not a different transform — it's just an efficient algorithm for computing the Discrete Fourier Transform (DFT). The DFT is the math; the FFT is a clever shortcut that makes it practical. The naive DFT takes O(N²) operations. The FFT does it in O(N log N). For N = 4096, that's the difference between ~16 million operations and ~49,000. This is why real-time audio analysis is possible at all.
Here's the concrete picture:
In: N float values captured from the microphone. Each value is an audio sample in the range roughly -1.0 to +1.0. Out: N/2 complex values, one per frequency bin. Why N/2 and not N? For real-valued input (which audio always is), the output is symmetric around the midpoint. The second half is a mirror image of the first half, so it carries no new information. Implementations typically only return the first N/2 bins.
Each output is a complex number with real and imaginary components. To get the magnitude (how loud that frequency is), you compute sqrt(real² + imaginary²). The phase (the timing offset of that frequency component) is atan2(imaginary, real). For a spectrum analyzer, we only care about magnitude — phase is irrelevant for visualization.
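Here's that magnitude/phase computation on a single, hypothetical bin value (the numbers are made up for illustration):

```python
import math

# Hypothetical FFT output for one bin: real = 3.0, imaginary = 4.0.
bin_value = complex(3.0, 4.0)

# Magnitude: how loud that frequency is. Same as abs(bin_value).
magnitude = math.sqrt(bin_value.real**2 + bin_value.imag**2)

# Phase: the timing offset of that frequency component (unused for our display).
phase = math.atan2(bin_value.imag, bin_value.real)

print(magnitude)  # → 5.0
```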
If complex numbers make your eyes glaze over, don't worry. You never have to manipulate them directly. Apple's Accelerate framework handles the complex arithmetic internally. You feed in floats, you get back magnitudes. The complex numbers are an implementation detail, not something you need to reason about.
The N/2 output values are called bins. Each bin represents a narrow band of frequencies. The width of each bin — its frequency resolution — is determined by a simple formula:
Bin spacing = sampleRate / N
With our parameters (44,100 Hz sample rate, 4096-sample FFT), the bin spacing is 44,100 / 4096 ≈ 10.77 Hz, and the 2048 bins span from 0 Hz up to a maximum of 22,050 Hz.
That maximum frequency, sampleRate / 2, is the Nyquist frequency. It's the hard ceiling on what frequencies the FFT can detect, and it's set by the sample rate, not the FFT size. If you've read the earlier sections, you've already encountered Nyquist in the context of sampling theory. Here it shows up again: the FFT can't tell you about frequencies above Nyquist because the original samples don't contain that information.
So with a 4096-point FFT at 44.1 kHz, we get 2048 bins spanning 0 Hz to ~22 kHz, each about 10.77 Hz wide. That's enough resolution to distinguish individual musical notes across most of the audible range — the distance between adjacent piano keys in the middle octaves is around 15-30 Hz, so our bins are narrow enough to separate them.
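The bin arithmetic is worth doing once by hand. This snippet computes the spacing, the Nyquist ceiling, and which bin the concert pitch A (440 Hz) lands in — matching the bin 41 mentioned below:

```python
sample_rate = 44_100.0
n = 4096

bin_spacing = sample_rate / n      # ≈ 10.77 Hz per bin
nyquist = sample_rate / 2          # 22,050 Hz — the hard ceiling
bin_for_a440 = round(440.0 / bin_spacing)

print(round(bin_spacing, 2), nyquist, bin_for_a440)  # → 10.77 22050.0 41
```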
It's tempting to think of each bin as detecting a single exact frequency. It's more accurate to think of each bin as a narrow bandpass filter. Bin 40, centered around 430.66 Hz, responds to energy anywhere in the range roughly 425-436 Hz. A pure 440 Hz tone will register primarily in bin 41 (around 441 Hz), with some spillover into neighboring bins. This blurring is inherent to the FFT, and it's closely related to the next topic: spectral leakage.
Here's a problem. The FFT's math assumes the input signal repeats forever — that your 4096 samples are one period of an infinitely repeating waveform. But they're not. They're a finite chunk snipped out of a continuous audio stream. Unless you get astronomically lucky, the signal level at the start of your buffer won't match the signal level at the end. When the FFT tries to treat this chunk as repeating, those mismatched edges create a sharp discontinuity — a sudden jump that looks like the signal is slamming from one value to another between the last sample and the first.
That sharp discontinuity is not actually in the audio. But the FFT doesn't know that. It faithfully reports the frequency content of the signal including the fake high-frequency energy caused by the sharp edge. This phantom energy spreads across the entire spectrum, smearing what should be clean, sharp peaks into broad, noisy humps. This artifact is called spectral leakage.
The solution is elegant: before running the FFT, multiply your samples by a window function — a smooth curve that starts at zero, rises to a peak in the middle, and falls back to zero at the edges. This tapers the signal so that both ends are near zero, eliminating the discontinuity entirely. No sharp edge, no fake high-frequency content, no leakage.
The most common window function for audio analysis is the Hann window (sometimes incorrectly called "Hanning"). It's shaped like a single period of a cosine curve, shifted and scaled so it sits between 0 and 1. It looks like a bell or an arch. Mathematically, for sample index n in a window of length N:
w(n) = 0.5 * (1 - cos(2πn / N))
You don't need to implement this yourself — Apple's Accelerate framework provides it as a one-liner. But understanding why it exists matters.
The windowing operation is just element-wise multiplication. Sample 0 gets multiplied by ~0 (the window is zero at the edges). Sample 2048 (the middle) gets multiplied by 1.0 (the window peaks at the center). Every sample between is scaled by the window's smooth curve. The result: a buffer that smoothly fades in and fades out, with most of the energy concentrated in the middle where the window is strongest.
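That description maps directly onto the formula above. A quick Python sketch of the window and the element-wise multiply (Section 6 will use Accelerate's built-in window instead):

```python
import math

def hann(length):
    """w(n) = 0.5 * (1 - cos(2*pi*n / N)) — the Hann window from the text."""
    return [0.5 * (1 - math.cos(2 * math.pi * i / length)) for i in range(length)]

window = hann(4096)
print(window[0], window[2048])  # → 0.0 1.0  (zero at the edge, peak in the middle)

# Windowing is just element-wise multiplication with the sample buffer:
samples = [1.0] * 4096                                # stand-in buffer
windowed = [s * w for s, w in zip(samples, window)]   # fades in, peaks, fades out
```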
Nothing is free. Windowing dramatically reduces spectral leakage, but it comes at a cost: slightly reduced frequency resolution. The peaks in your spectrum become a bit wider, making it marginally harder to distinguish two very close frequencies. Think of it as a slider between "sharp peaks with noisy leakage" and "clean peaks that are slightly blurred." For a visual spectrum analyzer, the clean-but-slightly-blurred option is overwhelmingly better. The leakage from an un-windowed FFT would make your display look like noise.
There are dozens of window functions beyond Hann — Hamming, Blackman, Kaiser, flat-top — each with slightly different tradeoffs between peak width and leakage suppression. Hann is the default choice for spectrum analyzers because it offers a good balance for visual display. If you were building a precision measurement tool, you might agonize over the choice. For our purposes, Hann is perfect. Use it and move on.
If you want to build intuition, think of windowing as a "focus" operation. Without it, you're asking the FFT to analyze the entire buffer equally, including the messy edges. With it, you're telling the FFT "focus on the middle of this buffer where the signal is cleanest, and gradually ignore the edges." You lose a bit of data at the margins, but the data you keep is much more accurate.
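You can watch the window kill leakage numerically. This illustration (naive DFT again, standing in for the FFT) analyzes a tone that does not complete a whole number of cycles in the buffer — the worst case for leakage — with and without a Hann window, and compares the phantom energy far from the tone:

```python
import math, cmath

def magnitudes(x):
    """Naive DFT magnitudes of the first N/2 bins, normalized by N."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) / n
            for k in range(n // 2)]

n = 128
# A tone at "bin 10.5" — exactly between two bins, so its edges can't line up.
tone = [math.sin(2 * math.pi * 10.5 * t / n) for t in range(n)]
window = [0.5 * (1 - math.cos(2 * math.pi * t / n)) for t in range(n)]

raw_mags = magnitudes(tone)
win_mags = magnitudes([s * w for s, w in zip(tone, window)])

# Leakage: energy in bins far away from the tone (bins 30 and up).
far_raw = sum(raw_mags[30:])
far_win = sum(win_mags[30:])
print(far_win < far_raw / 10)  # → True: the window suppresses distant leakage
```

Without the window, every bin picks up some of the tone's energy; with it, the spectrum collapses back to a narrow peak around bins 10–11.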
The FFT size (N) is the single most important parameter you'll choose for your spectrum analyzer. It controls two things that pull in opposite directions: frequency resolution and time resolution (latency).
More samples in = more bins out = finer frequency detail. The formula again: bin spacing = sampleRate / N.
| FFT Size (N) | Bins (N/2) | Bin Spacing | Can Distinguish Notes... |
|---|---|---|---|
| 512 | 256 | 86.1 Hz | Only octaves apart |
| 1024 | 512 | 43.1 Hz | A few semitones apart |
| 2048 | 1024 | 21.5 Hz | Adjacent notes (above ~200 Hz) |
| 4096 | 2048 | 10.8 Hz | Adjacent notes across most of the range |
| 8192 | 4096 | 5.4 Hz | Fine pitch differences |
Here's the catch. To fill a buffer of N samples at 44,100 samples/second, you have to wait for N/44100 seconds of audio to arrive. Bigger buffer = longer wait.
| FFT Size (N) | Buffer Duration | Feel |
|---|---|---|
| 512 | ~12 ms | Instantaneous — but blurry spectrum |
| 1024 | ~23 ms | Very responsive |
| 2048 | ~46 ms | Quick, slightly soft |
| 4096 | ~93 ms | Responsive enough for visualization |
| 8192 | ~186 ms | Noticeable lag — but razor-sharp spectrum |
This is a fundamental tradeoff in signal processing, sometimes called the time-frequency uncertainty principle. You cannot have both perfect frequency resolution and perfect time resolution simultaneously. A longer window gives you a sharper view of frequency but a blurrier view of time. A shorter window gives you precise timing but fuzzy frequency.
This tradeoff isn't a limitation of the FFT algorithm — it's a fundamental property of signals. It's closely related to the Heisenberg uncertainty principle in physics (you can't know both position and momentum precisely). In the signal processing world, you can't know both "when" and "what frequency" with arbitrary precision. The FFT size is how you choose your compromise point.
For a spectrum analyzer displayed on a phone screen, 4096 is the sweet spot: its ~10.8 Hz bins are narrow enough to separate adjacent musical notes across most of the range, and its ~93 ms buffer keeps the display feeling responsive.
For a guitar tuner, you'd want 8192 or even 16384 to resolve the fine pitch differences between in-tune and slightly-sharp. For a beat detector, you'd want 512 or 1024 for fast response to transients. The right FFT size depends entirely on what you're building.
Let's walk through the complete pipeline that will turn a microphone signal into a spectrum display. No code yet — just the sequence of operations. In Section 6, we'll implement each of these steps.
1. Capture a buffer of 4096 samples from the microphone via AVAudioEngine's tap.
2. Multiply the buffer by a Hann window to prevent spectral leakage.
3. Run the FFT via vDSP. Out come 2048 complex numbers.
4. For each complex number, compute sqrt(real² + imag²) to get the amplitude at that frequency. Normalize by dividing by N.
5. Convert the magnitudes to decibels, because loudness perception is logarithmic.
6. Map the decibel values to bar heights for the display.
7. Draw the bars in SwiftUI.

Steps 1 and 7 are iOS (AVAudioEngine and SwiftUI). Steps 2 through 6 are DSP. Step 5 is where human perception meets math. This is the complete pipeline for a real-time spectrum analyzer, and it's what professional audio apps do. The only difference between our tutorial app and a production spectrum analyzer is polish: peak hold, averaging, more sophisticated windowing, maybe overlap-add processing. The core pipeline is the same.
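The DSP core of the pipeline — window, transform, magnitude, decibels — can be sketched end to end in a few lines. This is plain Python for illustration (a naive DFT standing in for vDSP's FFT, and a 1024-sample buffer instead of 4096 so it runs quickly); the Swift version comes in Section 6:

```python
import math, cmath

SAMPLE_RATE = 44_100.0
N = 1024  # smaller than the tutorial's 4096 so the naive DFT stays fast

def analyze(samples):
    """Pipeline steps 2-5: window -> transform -> magnitude -> decibels."""
    n = len(samples)
    # Step 2: Hann window.
    windowed = [s * 0.5 * (1 - math.cos(2 * math.pi * i / n))
                for i, s in enumerate(samples)]
    # Step 3: transform (naive DFT here; vDSP's FFT in the real app).
    spectrum = [sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n // 2)]
    # Step 4: magnitudes, normalized by N.
    mags = [abs(z) / n for z in spectrum]
    # Step 5: convert to decibels (floored to avoid log(0) on silent bins).
    return [20 * math.log10(max(m, 1e-12)) for m in mags]

# Feed in a 440 Hz tone and find the loudest bin.
tone = [math.sin(2 * math.pi * 440.0 * t / SAMPLE_RATE) for t in range(N)]
db = analyze(tone)
peak_bin = max(range(len(db)), key=lambda k: db[k])
print(peak_bin)  # → 10: at 43.07 Hz/bin, bin 10 (~430.7 Hz) is the closest to 440 Hz
```

Note the peak lands in bin 10 rather than exactly at 440 Hz — with N = 1024 the bins are ~43 Hz wide, a live demonstration of why the tutorial uses 4096.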
You might be thinking: surely there's a library that does all of this in one function call. And there is — Apple's Accelerate framework handles the heavy lifting. But we're using it as a toolkit, not a black box. Understanding what the FFT does and why windowing matters will save you hours of debugging when your spectrum display looks wrong. "The bars are all noisy" — did you forget windowing? "I can't distinguish close notes" — is your FFT size too small? "The display feels laggy" — is it too large? These are questions you can only answer if you understand the theory.
Besides, the theory is beautiful. You're about to take a messy stream of air-pressure measurements and decompose it into pure tones. That's genuinely amazing, and it'll take about 30 lines of Swift.
Here's what to hold in your head going into Section 6:

- The FFT turns N time-domain samples into N/2 frequency bins; each bin is a complex number whose magnitude is sqrt(real² + imag²).
- Bin spacing = sampleRate / N, and the ceiling is the Nyquist frequency, sampleRate / 2.
- Multiply the buffer by a Hann window before the FFT, or spectral leakage will smear the spectrum.
- FFT size is a tradeoff: bigger means finer frequency resolution but more latency. 4096 is our choice.
With this foundation, the code in Section 6 will be refreshingly straightforward. Every line maps directly to one of the concepts above. Let's go build it.
This was a theory section — nothing to build and run. But you should be able to answer these questions: What does the FFT transform from and to? Why do we apply a window function before the FFT? What determines the frequency resolution of the output? If you can answer those three, you're ready for Section 6.