This is the most important section in the entire tutorial. Everything we've built so far — the VU meter, the decibel conversion, the audio tap — answers a single question: how loud is it? That's useful, but it's also flat. A whisper and a scream look different on a VU meter. A flute and a trombone playing the same note at the same volume? Identical. The meter can't tell them apart.
A spectrum analyzer can. It shows you which frequencies are present in the sound and how strong each one is. The mathematical tool that makes this possible is the Fourier Transform — specifically, its fast implementation, the FFT. This section is pure theory. No code. We'll implement everything in Section 6. Right now, the goal is to build solid intuition so that when you see the code, every line makes sense.
Think about what a microphone actually gives you. You've been working with it for several sections now: a stream of floating-point numbers, one per sample, arriving 44,100 times per second. Each number represents the air pressure at that instant. When you hum a single note, those numbers form a smooth, repeating wave. Simple enough.
But what happens when you strum a guitar? Six strings vibrate simultaneously, each at a different frequency. The microphone doesn't give you six separate signals — it gives you one signal that's the sum of all six vibrations mixed together. Add in the resonance of the guitar body, the harmonics of each string, and the ambient room noise, and you've got a complicated, messy waveform that looks nothing like a clean sine wave.
The Fourier Transform untangles that mixture. Given N samples in the time domain, it tells you the amplitude of every frequency present in the signal. It decomposes a complex waveform into its individual sinusoidal components — the pure tones that, when added together, would perfectly reconstruct the original signal.
The best physical analogy: a prism splits white light into a rainbow of individual colors. The Fourier Transform does the same thing for sound — it splits a complex audio signal into a rainbow of individual frequencies. White light is a mixture of all visible wavelengths; a guitar chord is a mixture of all audible frequencies produced by the strings. The prism and the FFT both reveal what's hiding inside the mixture.
This is a genuinely profound idea. Jean-Baptiste Fourier discovered in the early 1800s that any periodic signal — no matter how complex — can be represented as a sum of sine waves at different frequencies, amplitudes, and phases. A square wave? It's an infinite series of odd harmonics. A sawtooth wave? All harmonics, diminishing in amplitude. Your voice saying the vowel "ah"? A fundamental frequency plus a specific pattern of harmonics that your vocal tract shapes. The Fourier Transform is the mathematical machinery that finds those components.
For our purposes, the practical takeaway is simple: feed in a buffer of audio samples, get back a list of "how much energy is at each frequency." That's exactly what a spectrum analyzer needs.
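To make that takeaway concrete, here's a small illustration in plain Python (the real implementation in Section 6 will be Swift with Accelerate — this naive DFT is for intuition only, and the buffer size of 64 is chosen just to keep it fast). A buffer containing two pure tones goes in; out comes a list of per-frequency magnitudes with peaks exactly at the two tones:

```python
import math, cmath

def dft_magnitudes(samples):
    """Naive O(N^2) DFT: magnitude of each of the first N/2 frequency bins."""
    n = len(samples)
    mags = []
    for k in range(n // 2):
        s = sum(samples[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        mags.append(abs(s) / n)  # normalize by N
    return mags

# A "chord" of two pure tones: one at bin 5, a quieter one at bin 12.
n = 64
signal = [math.sin(2 * math.pi * 5 * t / n) + 0.5 * math.sin(2 * math.pi * 12 * t / n)
          for t in range(n)]
mags = dft_magnitudes(signal)

# The two largest magnitudes land exactly at bins 5 and 12.
top2 = sorted(sorted(range(len(mags)), key=lambda k: mags[k], reverse=True)[:2])
print(top2)  # → [5, 12]
```

The mixture goes in as one messy waveform; the output cleanly separates it back into its two ingredients — the prism in action.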
Before and after the FFT, you're looking at the same data — just from two different perspectives. These perspectives have names.
The first perspective is the time domain. This is what the microphone captures and what you've been working with so far. The x-axis is time (or sample index), and the y-axis is amplitude — the instantaneous air pressure at each moment. A buffer of 4096 samples is a time-domain signal. You can see the shape of the waveform, spot loud parts and quiet parts, but you can't easily tell which frequencies are present just by looking at it.
The second perspective is the frequency domain. This is what the FFT produces. The x-axis is frequency (measured in Hz), and the y-axis is magnitude — how much energy is at that frequency. Instead of "what was the air pressure at sample 1000?", you're asking "how loud is the 440 Hz component?" This is the view that makes spectrum analyzers possible.
The time-domain view is like watching a stock ticker — you see the price moving up and down moment by moment. The frequency-domain view is like a portfolio summary — it tells you how much is invested in each sector. Same underlying data, different lens.
One critical detail: the transform is reversible. You can go from time domain to frequency domain and back again without losing any information. The FFT doesn't destroy data — it rearranges it. We only need the forward direction (time to frequency) for a spectrum analyzer, but the inverse FFT is what makes things like equalizers and noise cancellation possible: transform to frequency domain, modify the frequencies you care about, transform back.
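You can verify the reversibility claim directly. This sketch (again plain Python, not the Section 6 code) transforms a random buffer to the frequency domain and back, and recovers the original samples to within floating-point noise:

```python
import math, cmath, random

def dft(x):
    """Forward DFT: time domain -> frequency domain (complex bins)."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(X):
    """Inverse DFT: frequency domain -> time domain (real samples)."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)).real / n
            for t in range(n)]

random.seed(1)
x = [random.uniform(-1.0, 1.0) for _ in range(32)]
x_roundtrip = idft(dft(x))

# No information lost: the round trip reproduces the input.
print(max(abs(a - b) for a, b in zip(x, x_roundtrip)) < 1e-9)  # → True
```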
Let's get specific about what goes in and what comes out. The acronym "FFT" stands for Fast Fourier Transform. It's not a different transform — it's just an efficient algorithm for computing the Discrete Fourier Transform (DFT). The DFT is the math; the FFT is a clever shortcut that makes it practical. The naive DFT takes O(N²) operations. The FFT does it in O(N log N). For N = 4096, that's the difference between ~16 million operations and ~49,000. This is why real-time audio analysis is possible at all.
Here's the concrete picture:
In: N float values captured from the microphone. Each value is an audio sample in the range roughly -1.0 to +1.0. Out: N/2 complex values, one per frequency bin. Why N/2 and not N? For real-valued input (which audio always is), the output is symmetric around the midpoint. The second half is a mirror image of the first half, so it carries no new information. Implementations typically only return the first N/2 bins.
Each output is a complex number with real and imaginary components. To get the magnitude (how loud that frequency is), you compute sqrt(real² + imaginary²). The phase (the timing offset of that frequency component) is atan2(imaginary, real). For a spectrum analyzer, we only care about magnitude — phase is irrelevant for visualization.
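Here's that magnitude/phase computation on a single, hypothetical bin value (the numbers are made up for illustration):

```python
import math

# Hypothetical FFT output for one bin: real = 3.0, imaginary = 4.0.
bin_value = complex(3.0, 4.0)

# Magnitude: how loud that frequency is. Same as abs(bin_value).
magnitude = math.sqrt(bin_value.real**2 + bin_value.imag**2)

# Phase: the timing offset of that frequency component (unused for our display).
phase = math.atan2(bin_value.imag, bin_value.real)

print(magnitude)  # → 5.0
```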
If complex numbers make your eyes glaze over, don't worry. You never have to manipulate them directly. Apple's Accelerate framework handles the complex arithmetic internally. You feed in floats, you get back magnitudes. The complex numbers are an implementation detail, not something you need to reason about.
The N/2 output values are called bins. Each bin represents a narrow band of frequencies. The width of each bin — its frequency resolution — is determined by a simple formula:
Bin spacing = sampleRate / N
With our parameters (44,100 Hz sample rate, 4096-sample FFT), the bin spacing is 44,100 / 4096 ≈ 10.77 Hz, and the 2048 bins span from 0 Hz up to a maximum of 22,050 Hz.
That maximum frequency, sampleRate / 2, is the Nyquist frequency. It's the hard ceiling on what frequencies the FFT can detect, and it's set by the sample rate, not the FFT size. If you've read the earlier sections, you've already encountered Nyquist in the context of sampling theory. Here it shows up again: the FFT can't tell you about frequencies above Nyquist because the original samples don't contain that information.
So with a 4096-point FFT at 44.1 kHz, we get 2048 bins spanning 0 Hz to ~22 kHz, each about 10.77 Hz wide. That's enough resolution to distinguish individual musical notes across most of the audible range — the distance between adjacent piano keys in the middle octaves is around 15-30 Hz, so our bins are narrow enough to separate them.
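The bin arithmetic is worth doing once by hand. This snippet computes the spacing, the Nyquist ceiling, and which bin the concert pitch A (440 Hz) lands in — matching the bin 41 mentioned below:

```python
sample_rate = 44_100.0
n = 4096

bin_spacing = sample_rate / n      # ≈ 10.77 Hz per bin
nyquist = sample_rate / 2          # 22,050 Hz — the hard ceiling
bin_for_a440 = round(440.0 / bin_spacing)

print(round(bin_spacing, 2), nyquist, bin_for_a440)  # → 10.77 22050.0 41
```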
It's tempting to think of each bin as detecting a single exact frequency. It's more accurate to think of each bin as a narrow bandpass filter. Bin 40, centered around 430.66 Hz, responds to energy anywhere in the range roughly 425-436 Hz. A pure 440 Hz tone will register primarily in bin 41 (around 441 Hz), with some spillover into neighboring bins. This blurring is inherent to the FFT, and it's closely related to the next topic: spectral leakage.
Here's a problem. The FFT's math assumes the input signal repeats forever — that your 4096 samples are one period of an infinitely repeating waveform. But they're not. They're a finite chunk snipped out of a continuous audio stream. Unless you get astronomically lucky, the signal level at the start of your buffer won't match the signal level at the end. When the FFT tries to treat this chunk as repeating, those mismatched edges create a sharp discontinuity — a sudden jump that looks like the signal is slamming from one value to another between the last sample and the first.
That sharp discontinuity is not actually in the audio. But the FFT doesn't know that. It faithfully reports the frequency content of the signal including the fake high-frequency energy caused by the sharp edge. This phantom energy spreads across the entire spectrum, smearing what should be clean, sharp peaks into broad, noisy humps. This artifact is called spectral leakage.
The solution is elegant: before running the FFT, multiply your samples by a window function — a smooth curve that starts at zero, rises to a peak in the middle, and falls back to zero at the edges. This tapers the signal so that both ends are near zero, eliminating the discontinuity entirely. No sharp edge, no fake high-frequency content, no leakage.
The most common window function for audio analysis is the Hann window (sometimes incorrectly called "Hanning"). It's shaped like a single period of a cosine curve, shifted and scaled so it sits between 0 and 1. It looks like a bell or an arch. Mathematically, for sample index n in a window of length N:
w(n) = 0.5 * (1 - cos(2πn / N))
You don't need to implement this yourself — Apple's Accelerate framework provides it as a one-liner. But understanding why it exists matters.
The windowing operation is just element-wise multiplication. Sample 0 gets multiplied by ~0 (the window is zero at the edges). Sample 2048 (the middle) gets multiplied by 1.0 (the window peaks at the center). Every sample between is scaled by the window's smooth curve. The result: a buffer that smoothly fades in and fades out, with most of the energy concentrated in the middle where the window is strongest.
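That description maps directly onto the formula above. A quick Python sketch of the window and the element-wise multiply (Section 6 will use Accelerate's built-in window instead):

```python
import math

def hann(length):
    """w(n) = 0.5 * (1 - cos(2*pi*n / N)) — the Hann window from the text."""
    return [0.5 * (1 - math.cos(2 * math.pi * i / length)) for i in range(length)]

window = hann(4096)
print(window[0], window[2048])  # → 0.0 1.0  (zero at the edge, peak in the middle)

# Windowing is just element-wise multiplication with the sample buffer:
samples = [1.0] * 4096                                # stand-in buffer
windowed = [s * w for s, w in zip(samples, window)]   # fades in, peaks, fades out
```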
Nothing is free. Windowing dramatically reduces spectral leakage, but it comes at a cost: slightly reduced frequency resolution. The peaks in your spectrum become a bit wider, making it marginally harder to distinguish two very close frequencies. Think of it as a slider between "sharp peaks with noisy leakage" and "clean peaks that are slightly blurred." For a visual spectrum analyzer, the clean-but-slightly-blurred option is overwhelmingly better. The leakage from an un-windowed FFT would make your display look like noise.
There are dozens of window functions beyond Hann — Hamming, Blackman, Kaiser, flat-top — each with slightly different tradeoffs between peak width and leakage suppression. Hann is the default choice for spectrum analyzers because it offers a good balance for visual display. If you were building a precision measurement tool, you might agonize over the choice. For our purposes, Hann is perfect. Use it and move on.
If you want to build intuition, think of windowing as a "focus" operation. Without it, you're asking the FFT to analyze the entire buffer equally, including the messy edges. With it, you're telling the FFT "focus on the middle of this buffer where the signal is cleanest, and gradually ignore the edges." You lose a bit of data at the margins, but the data you keep is much more accurate.
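You can watch the window kill leakage numerically. This illustration (naive DFT again, standing in for the FFT) analyzes a tone that does not complete a whole number of cycles in the buffer — the worst case for leakage — with and without a Hann window, and compares the phantom energy far from the tone:

```python
import math, cmath

def magnitudes(x):
    """Naive DFT magnitudes of the first N/2 bins, normalized by N."""
    n = len(x)
    return [abs(sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))) / n
            for k in range(n // 2)]

n = 128
# A tone at "bin 10.5" — exactly between two bins, so its edges can't line up.
tone = [math.sin(2 * math.pi * 10.5 * t / n) for t in range(n)]
window = [0.5 * (1 - math.cos(2 * math.pi * t / n)) for t in range(n)]

raw_mags = magnitudes(tone)
win_mags = magnitudes([s * w for s, w in zip(tone, window)])

# Leakage: energy in bins far away from the tone (bins 30 and up).
far_raw = sum(raw_mags[30:])
far_win = sum(win_mags[30:])
print(far_win < far_raw / 10)  # → True: the window suppresses distant leakage
```

Without the window, every bin picks up some of the tone's energy; with it, the spectrum collapses back to a narrow peak around bins 10–11.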
The FFT size (N) is the single most important parameter you'll choose for your spectrum analyzer. It controls two things that pull in opposite directions: frequency resolution and time resolution (latency).
More samples in = more bins out = finer frequency detail. The formula again: bin spacing = sampleRate / N.
| FFT Size (N) | Bins (N/2) | Bin Spacing | Can Distinguish Notes... |
|---|---|---|---|
| 512 | 256 | 86.1 Hz | Only octaves apart |
| 1024 | 512 | 43.1 Hz | A few semitones apart |
| 2048 | 1024 | 21.5 Hz | Adjacent notes (above ~200 Hz) |
| 4096 | 2048 | 10.8 Hz | Adjacent notes across most of the range |
| 8192 | 4096 | 5.4 Hz | Fine pitch differences |
Here's the catch. To fill a buffer of N samples at 44,100 samples/second, you have to wait for N/44100 seconds of audio to arrive. Bigger buffer = longer wait.
| FFT Size (N) | Buffer Duration | Feel |
|---|---|---|
| 512 | ~12 ms | Instantaneous — but blurry spectrum |
| 1024 | ~23 ms | Very responsive |
| 2048 | ~46 ms | Quick, slightly soft |
| 4096 | ~93 ms | Responsive enough for visualization |
| 8192 | ~186 ms | Noticeable lag — but razor-sharp spectrum |
This is a fundamental tradeoff in signal processing, sometimes called the time-frequency uncertainty principle. You cannot have both perfect frequency resolution and perfect time resolution simultaneously. A longer window gives you a sharper view of frequency but a blurrier view of time. A shorter window gives you precise timing but fuzzy frequency.
This tradeoff isn't a limitation of the FFT algorithm — it's a fundamental property of signals. It's closely related to the Heisenberg uncertainty principle in physics (you can't know both position and momentum precisely). In the signal processing world, you can't know both "when" and "what frequency" with arbitrary precision. The FFT size is how you choose your compromise point.
For a spectrum analyzer displayed on a phone screen, 4096 is the sweet spot: its ~10.8 Hz bins are narrow enough to separate adjacent musical notes across most of the range, and its ~93 ms buffer keeps the display feeling responsive.
For a guitar tuner, you'd want 8192 or even 16384 to resolve the fine pitch differences between in-tune and slightly-sharp. For a beat detector, you'd want 512 or 1024 for fast response to transients. The right FFT size depends entirely on what you're building.
Let's walk through the complete pipeline that will turn a microphone signal into a spectrum display. No code yet — just the sequence of operations. In Section 6, we'll implement each of these steps.
1. Capture a buffer of 4096 samples from the microphone via AVAudioEngine's tap.
2. Multiply the buffer by a Hann window to prevent spectral leakage.
3. Run the FFT via vDSP. Out come 2048 complex numbers.
4. For each complex number, compute sqrt(real² + imag²) to get the amplitude at that frequency. Normalize by dividing by N.
5. Convert the magnitudes to decibels, because loudness perception is logarithmic.
6. Map the decibel values to bar heights for the display.
7. Draw the bars in SwiftUI.

Steps 1 and 7 are iOS (AVAudioEngine and SwiftUI). Steps 2 through 6 are DSP. Step 5 is where human perception meets math. This is the complete pipeline for a real-time spectrum analyzer, and it's what professional audio apps do. The only difference between our tutorial app and a production spectrum analyzer is polish: peak hold, averaging, more sophisticated windowing, maybe overlap-add processing. The core pipeline is the same.
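The DSP core of the pipeline — window, transform, magnitude, decibels — can be sketched end to end in a few lines. This is plain Python for illustration (a naive DFT standing in for vDSP's FFT, and a 1024-sample buffer instead of 4096 so it runs quickly); the Swift version comes in Section 6:

```python
import math, cmath

SAMPLE_RATE = 44_100.0
N = 1024  # smaller than the tutorial's 4096 so the naive DFT stays fast

def analyze(samples):
    """Pipeline steps 2-5: window -> transform -> magnitude -> decibels."""
    n = len(samples)
    # Step 2: Hann window.
    windowed = [s * 0.5 * (1 - math.cos(2 * math.pi * i / n))
                for i, s in enumerate(samples)]
    # Step 3: transform (naive DFT here; vDSP's FFT in the real app).
    spectrum = [sum(windowed[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n // 2)]
    # Step 4: magnitudes, normalized by N.
    mags = [abs(z) / n for z in spectrum]
    # Step 5: convert to decibels (floored to avoid log(0) on silent bins).
    return [20 * math.log10(max(m, 1e-12)) for m in mags]

# Feed in a 440 Hz tone and find the loudest bin.
tone = [math.sin(2 * math.pi * 440.0 * t / SAMPLE_RATE) for t in range(N)]
db = analyze(tone)
peak_bin = max(range(len(db)), key=lambda k: db[k])
print(peak_bin)  # → 10: at 43.07 Hz/bin, bin 10 (~430.7 Hz) is the closest to 440 Hz
```

Note the peak lands in bin 10 rather than exactly at 440 Hz — with N = 1024 the bins are ~43 Hz wide, a live demonstration of why the tutorial uses 4096.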
You might be thinking: surely there's a library that does all of this in one function call. And there is — Apple's Accelerate framework handles the heavy lifting. But we're using it as a toolkit, not a black box. Understanding what the FFT does and why windowing matters will save you hours of debugging when your spectrum display looks wrong. "The bars are all noisy" — did you forget windowing? "I can't distinguish close notes" — is your FFT size too small? "The display feels laggy" — is it too large? These are questions you can only answer if you understand the theory.
Besides, the theory is beautiful. You're about to take a messy stream of air-pressure measurements and decompose it into pure tones. That's genuinely amazing, and it'll take about 30 lines of Swift.
Here's what to hold in your head going into Section 6:

- The FFT turns N time-domain samples into N/2 frequency bins; each bin is a complex number whose magnitude is sqrt(real² + imag²).
- Bin spacing = sampleRate / N, and the ceiling is the Nyquist frequency, sampleRate / 2.
- Multiply the buffer by a Hann window before the FFT, or spectral leakage will smear the spectrum.
- FFT size is a tradeoff: bigger means finer frequency resolution but more latency. 4096 is our choice.
With this foundation, the code in Section 6 will be refreshingly straightforward. Every line maps directly to one of the concepts above. Let's go build it.
This was a theory section — nothing to build and run. But you should be able to answer these questions: What does the FFT transform from and to? Why do we apply a window function before the FFT? What determines the frequency resolution of the output? If you can answer those three, you're ready for Section 6.