130 Widgets

Semi-random thoughts and tales of tinkering

2. What Is Sound? DSP

No code in this section. We're going to build a mental model of sound — from the physical vibrations in the air all the way to the single floating-point number our app will display on screen. If you've never worked with audio or signals before, this is the section that makes everything else click.

Every concept here maps directly to code we'll write later. By the end, when you see a buffer of 4096 floats arrive from the microphone, you'll know exactly what those numbers mean and what to do with them.

Sound as Vibration

Sound is pressure waves in air. Something vibrates — a guitar string, a vocal cord, a speaker cone — and pushes air molecules back and forth. Those compressions and rarefactions travel outward at about 343 meters per second (at room temperature) and eventually hit your eardrum, which vibrates in sympathy.

Two properties of these waves matter most to us:

- Frequency — how many complete cycles the wave makes per second, measured in hertz (Hz). We perceive frequency as pitch.
- Amplitude — how far the pressure swings from its resting point. We perceive amplitude as loudness.

In programmer terms: imagine an array of values oscillating between -1.0 and 1.0. The speed of oscillation is frequency. The distance from zero at the peaks is amplitude. That's sound.
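To make that concrete, here is a toy sine-wave generator, a sketch with made-up names (`sineWave` and its parameters are illustrative, not from the app we'll build). Frequency controls how fast the values oscillate; amplitude controls how far they swing from zero:

```swift
import Foundation

// Generate `count` samples of a sine wave. Frequency sets how fast the
// values oscillate; amplitude sets how far they swing from zero.
func sineWave(frequency: Double, amplitude: Double,
              sampleRate: Double, count: Int) -> [Float] {
    (0..<count).map { n in
        Float(amplitude * sin(2.0 * .pi * frequency * Double(n) / sampleRate))
    }
}

// One second of a 440 Hz tone at half amplitude: every value stays
// within ±0.5, and the pattern repeats 440 times.
let wave = sineWave(frequency: 440, amplitude: 0.5, sampleRate: 48_000, count: 48_000)
```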

Pressure
  ^
1.0 |    *         *         *
    |   * *       * *       * *
    |  *   *     *   *     *   *
0.0 |-*-----*---*-----*---*-----*---> Time
    |        * *       * *
-1.0|         *         *

     <-- one cycle -->

frequency = cycles per second (Hz)
amplitude = peak distance from 0

Real-world sound is never a clean sine wave like this. When you speak, dozens of frequencies combine simultaneously — the fundamental frequency of your voice, its harmonics, the resonances of your mouth and throat. But the principle is the same: it's all waves superimposed on each other.

Digital Audio — Sampling

Your iPhone's microphone is a tiny membrane that vibrates with incoming sound. Behind it, a circuit converts that physical vibration into an electrical voltage that rises and falls in the same pattern. So far, we're still in the analog world — continuous, smooth, infinite detail.

To get this signal into the digital world where our code can work with it, we need an Analog-to-Digital Converter (ADC). The ADC does something deceptively simple: at regular intervals, it measures the current voltage and writes down a number.

That's it. That's sampling.

Each measurement is called a sample. The stream of samples is called PCM audio — Pulse Code Modulation. Don't let the name intimidate you. PCM just means "a sequence of numbers representing amplitude values at regular time intervals."

Analog signal (continuous)           Digital samples (discrete)

      ~~~~~                                  *
    ~~     ~~                              *   *
   ~         ~          ADC               *     *
  ~           ~        ---->             *       *
 ~             ~                        *         *
                ~~~~~                                *  *  *

Smooth, infinite resolution          Fixed-rate snapshots
                                     Each * is one Float value

In our app, each sample will be a Float between -1.0 and 1.0, where 0.0 is silence (no pressure change), 1.0 is the maximum positive pressure the system can represent, and -1.0 is the maximum negative. This normalized range is standard in audio programming — it's the equivalent of working with percentages instead of raw pixel values in graphics.

DSP Concept

If you're used to working with integer data types, the float-based range might seem odd. Historically, audio was stored as 16-bit integers (-32768 to 32767), and CD audio still works this way. But modern audio APIs — including Apple's Core Audio — use 32-bit floats internally. Floats give you more headroom, no clipping at exact boundaries, and simpler math. When we do sample * sample later, we don't have to worry about integer overflow.
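If you ever need to bridge the two worlds (reading 16-bit WAV data, for example), the conversion is a single division. A minimal sketch, with `normalize` as a made-up name:

```swift
// Convert 16-bit integer samples to the normalized Float range.
// Dividing by 32768 maps -32768 → -1.0 and 32767 → just under 1.0.
func normalize(_ samples: [Int16]) -> [Float] {
    samples.map { Float($0) / 32768.0 }
}

let floats = normalize([0, 16384, -32768, 32767])
// → [0.0, 0.5, -1.0, 0.99997...]
```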

Sample Rates

How often does the ADC take a snapshot? That's the sample rate, measured in Hz (samples per second). Two numbers dominate:

Sample Rate   Context                                         Samples Per Second
44,100 Hz     CD quality, some older devices                  44,100
48,000 Hz     Modern iOS devices, video, professional audio   48,000

Why these specific numbers? The answer comes from one of the most important theorems in signal processing.

The Nyquist-Shannon Sampling Theorem

In 1928, Harry Nyquist published an insight (later proved rigorously by Claude Shannon) that seems almost too clean to be true: to faithfully capture a frequency, you must sample at least twice per cycle.

Think about it intuitively. If a wave completes one cycle per second (1 Hz), and you only take one sample per second, you might always catch the wave at the same point — say, always at zero. You'd see a flat line. You'd completely miss the oscillation. But if you sample twice per cycle, you catch the peak and the trough, and that's enough information to reconstruct the wave.

Humans can hear frequencies from roughly 20 Hz (a deep rumble) to 20,000 Hz (a high-pitched whine that gets harder to hear with age). To capture 20,000 Hz, we need at least 40,000 samples per second. CD audio uses 44,100 Hz — a bit more than the theoretical minimum to leave room for the hardware anti-aliasing filter (more on that in a moment). 48,000 Hz is even more generous.

The maximum frequency that a given sample rate can represent is called the Nyquist frequency:

Nyquist frequency = sampleRate / 2

For 44,100 Hz: Nyquist = 22,050 Hz
For 48,000 Hz: Nyquist = 24,000 Hz

Both are comfortably above 20 kHz, so both capture the full range of human hearing.

Nyquist Frequency and Aliasing

What happens if a frequency above the Nyquist limit sneaks into the signal? Something strange and bad: aliasing.

Imagine a wheel spinning at 23 revolutions per second, filmed at 24 frames per second. To your eyes (via the camera), the wheel appears to rotate slowly backwards at 1 revolution per second. The sampling rate (frame rate) wasn't high enough to capture the true motion, so the high frequency "folded" into a lower, phantom frequency.

The same thing happens with audio. A 23,000 Hz tone sampled at 44,100 Hz would alias down to a frequency that wasn't in the original signal — a ghost frequency that corrupts your audio. You'd hear something that wasn't there.
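The fold-back arithmetic is simple enough to sketch. This toy helper (name and structure are my own, and it only handles frequencies below the sample rate) shows where a too-high tone lands:

```swift
// For a tone between the Nyquist limit and the sample rate, the alias
// "folds" back to (sampleRate - frequency). Tones at or below Nyquist
// pass through unchanged. A toy model for intuition only.
func aliasedFrequency(_ frequency: Double, sampleRate: Double) -> Double {
    let nyquist = sampleRate / 2
    return frequency <= nyquist ? frequency : sampleRate - frequency
}

let ghost = aliasedFrequency(23_000, sampleRate: 44_100)
// A 23 kHz tone shows up as a 21,100 Hz ghost.
```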

Frequency spectrum with aliasing:

            Real signal     |    Ghost (aliased)
                *           |   *
              *   *         |  *  *
            *       *       | *     *
          *           *     |*        *
        --+------+------+---+------+--------> Frequency
          0    10kHz  20kHz   Nyquist fold-back
                              (22.05kHz)

In practice, you never have to deal with this manually. The ADC hardware includes an anti-aliasing filter — an analog low-pass filter that removes frequencies above the Nyquist limit before sampling occurs. By the time the data reaches your app, aliasing has already been prevented.

DSP Concept

Aliasing matters more when you're generating or resampling audio in software. If you ever downsample (convert 48 kHz to 22 kHz, for example), you need to apply a digital low-pass filter first, or you'll introduce aliasing yourself. For our spectrum analyzer, which only reads from the microphone, the hardware handles it for us.

Buffers and Frames

Audio doesn't arrive one sample at a time. Imagine if your app received a callback 48,000 times per second, once per sample — the overhead would be catastrophic. Instead, samples arrive in chunks called buffers.

A buffer is just an array. A buffer of 1024 frames at 48,000 Hz contains 1024 consecutive samples and represents:

buffer duration = frames / sampleRate

1024 / 48000 = 0.0213 seconds = ~21 ms

In code, you'll get something like a [Float] with 1024 elements, delivered to your callback function roughly every 21 milliseconds. Process it, return, and wait for the next one.
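The duration formula translates directly to Swift. A minimal sketch (the helper name is my own):

```swift
// How long one buffer lasts, and how often the callback fires,
// for a given buffer size and sample rate.
func bufferDuration(frames: Int, sampleRate: Double) -> Double {
    Double(frames) / sampleRate
}

let duration = bufferDuration(frames: 1024, sampleRate: 48_000)
// ≈ 0.0213 seconds per buffer
let callbacksPerSecond = 1.0 / duration
// ≈ 46.9 callbacks per second
```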

The Buffer Size Tradeoff

This is one of the fundamental tradeoffs in audio programming, and it's worth understanding deeply:

                      Small Buffer (256 frames)                   Large Buffer (4096 frames)
Duration              ~5 ms                                       ~85 ms
Latency               Very low — great for real-time effects      Higher — noticeable delay
CPU overhead          High — callback fires ~188 times/sec        Low — callback fires ~12 times/sec
Frequency resolution  Poor — hard to distinguish low frequencies  Good — can resolve fine frequency detail

For a real-time audio effect (like a guitar pedal app), you need tiny buffers — the musician can't tolerate 85 ms of delay. For our visual spectrum analyzer, we don't care about latency in the same way. Nobody notices if the meter updates 85 milliseconds after the sound. What we do care about is having enough samples to calculate accurate frequency content.

We'll use 4096 frames per buffer. At 48,000 Hz, that's about 85 ms of audio per chunk — roughly 12 updates per second. Plenty smooth for a visual display, and enough samples for good frequency analysis when we get to the FFT in later sections.

Tip

If you're thinking "12 updates per second sounds jerky," you're right — for raw values. In practice, we'll use smoothing (exponential moving average) to interpolate between updates and animate at 60 fps. The data arrives at 12 Hz; the display updates at 60 Hz. This is the same trick game engines use when physics runs at a different rate than rendering.
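A minimal sketch of that smoothing trick, assuming a made-up `Smoother` type and an arbitrary `alpha` of 0.2 (the real app may tune this differently). Each display frame moves the shown value a fixed fraction of the way toward the latest measurement:

```swift
// Exponential moving average: on every display frame, step a fraction
// of the remaining distance toward the most recent measured value.
struct Smoother {
    var value: Float = 0
    let alpha: Float  // 0...1: higher = snappier, lower = smoother

    mutating func update(target: Float) -> Float {
        value += alpha * (target - value)
        return value
    }
}

var smoother = Smoother(alpha: 0.2)
// Feed the same target for 30 frames (half a second at 60 fps):
// the displayed value eases in close to 1.0 without ever jumping.
for _ in 0..<30 { _ = smoother.update(target: 1.0) }
```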

RMS — Measuring Loudness

We have a buffer of 4096 floats, each between -1.0 and 1.0. How do we turn that into a single number that represents "how loud is it right now"?

Your first instinct might be to average all the samples. But that won't work. Audio oscillates symmetrically around zero — for every positive sample, there's a roughly equal negative sample. The average of a perfectly oscillating signal is zero, regardless of how loud it is.

You could average the absolute values. That works, sort of. But the standard approach in audio engineering is RMS: Root Mean Square. It's what VU meters and loudness meters have used for decades, and there's a good reason.

The Formula

RMS has three steps, right there in the name:

Given N samples: x[0], x[1], x[2], ... x[N-1]

1. Square each sample:    x[i]²
2. Take the Mean:         sum(x[i]²) / N
3. Take the square Root:  sqrt( sum(x[i]²) / N )

          _______________________________
         /  x[0]² + x[1]² + ... + x[N-1]²
RMS =   /  ──────────────────────────────
      \/                 N

Why squaring? Two reasons. First, it eliminates the sign problem — negative samples become positive after squaring. Second, and more importantly, squaring weights larger values more heavily. A few loud peaks contribute more to the RMS than many quiet samples. This matches human perception — we perceive loudness based on the energy (power) of the signal, and power is proportional to amplitude squared.

If that's too abstract, think of it this way: RMS is the standard deviation of the signal, assuming a mean of zero (which audio has). If you've ever computed standard deviation in a data pipeline, you've already done RMS.
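The three steps map one-to-one onto code. A sketch of the calculation (the function name is my own; Section 4 may do this with Accelerate instead of a plain loop):

```swift
// RMS: square each sample, average the squares, take the square root.
func rms(_ samples: [Float]) -> Float {
    guard !samples.isEmpty else { return 0 }
    let sumOfSquares = samples.reduce(0) { $0 + $1 * $1 }
    return (sumOfSquares / Float(samples.count)).squareRoot()
}

// A small buffer of speech-like values:
let level = rms([0.1, -0.15, 0.2, -0.18, 0.12, -0.1, 0.14, -0.11])
// level ≈ 0.1419
```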

Typical RMS Values

Sound                     Approximate RMS
Silence                   0.0 (or very close)
Quiet room ambiance       0.001 – 0.005
Normal speech             0.01 – 0.1
Loud music close to mic   0.1 – 0.5
Maximum (clipping)        ~0.707 (sine wave at full scale)

Notice the enormous range. Quiet room to loud music spans two orders of magnitude (0.003 to 0.3 is a 100x difference). This makes RMS awkward to display on a linear scale — quiet sounds would be invisible. We need a better scale.

Decibels — The Logarithmic Scale

Human hearing is logarithmic. Doubling the perceived loudness doesn't require double the amplitude — it requires about ten times the power. Our ears compress an absolutely enormous dynamic range into a manageable perception of "quiet" to "loud."

To match this, audio engineers use decibels (dB), a logarithmic unit:

dB = 20 * log10(amplitude)

Where amplitude is our RMS value (0.0 to 1.0)

Let's plug in some values to build intuition:

RMS Amplitude   Decibels   Meaning
1.0             0 dB       Full scale — the loudest representable signal
0.5             -6 dB      Half amplitude
0.1             -20 dB     One-tenth amplitude
0.01            -40 dB     One-hundredth amplitude
0.001           -60 dB     One-thousandth amplitude
0.0             -∞ dB      Digital silence
Two things to note. First, 0 dB is not silence — it's the maximum. This trips up everyone the first time. In digital audio, 0 dB means "the signal is using the full available range." All real signals are negative dB values. This is called dBFS (decibels relative to full scale).

Second, log10(0) is negative infinity. Silence = -∞ dB. In code, we'll need to handle this — you can't display infinity on screen. We'll clamp it to a practical floor.
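Here is one way that clamping might look (the function name and the -80 dB floor are my own choices for this sketch, not necessarily what Section 4 will use):

```swift
import Foundation

// Amplitude (our RMS value) → dBFS, clamped to a floor so that
// log10(0) = -infinity never reaches the UI.
func decibels(_ amplitude: Float, floor floorDB: Float = -80) -> Float {
    guard amplitude > 0 else { return floorDB }
    return max(20 * Float(log10(Double(amplitude))), floorDB)
}

let full = decibels(1.0)      // 0 dB: full scale
let tenth = decibels(0.1)     // -20 dB
let silence = decibels(0.0)   // -80 dB instead of -infinity
```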

From dB to Display

For our VU meter, we need to map a dB value to a position on screen (0.0 to 1.0). We'll define a range of -60 dB to 0 dB. Anything quieter than -60 dB is effectively silence for our purposes. The formula is:

normalized = (dB - minDB) / (maxDB - minDB)

Where minDB = -60, maxDB = 0

Example: -20 dB → (-20 - (-60)) / (0 - (-60)) = 40/60 = 0.667
Example: -60 dB → (-60 - (-60)) / (0 - (-60)) =  0/60 = 0.0
Example:   0 dB → (  0 - (-60)) / (0 - (-60)) = 60/60 = 1.0

Clamp the result to [0, 1], and you have a value you can directly multiply by the meter's height in pixels. Zero means the bar is empty; 1.0 means the bar is full. Simple.
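As a sketch, with the -60...0 range from the text baked in as defaults (the function name is illustrative):

```swift
// Map a dB value onto a 0...1 meter position, clamped at both ends.
func normalized(db: Float, minDB: Float = -60, maxDB: Float = 0) -> Float {
    let value = (db - minDB) / (maxDB - minDB)
    return min(max(value, 0), 1)  // clamp to [0, 1]
}

let mid = normalized(db: -20)     // ≈ 0.667: two-thirds full
let quiet = normalized(db: -90)   // 0.0: below the floor, clamped
let loud = normalized(db: 0)      // 1.0: bar completely full
```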

The Full Pipeline

Let's put everything together. Here's the complete chain from air vibrations to a pixel height on your iPhone screen:

Sound waves        Microphone        ADC             Audio buffer
in the air     →   (membrane     →   (analog to   →  [Float] array
(pressure)         vibrates)         digital)        of N samples

Audio buffer        RMS                 dB conversion      Normalize
[0.02, -0.05,   →   Square each,    →   20 * log10     →   (dB - min)
 0.03, -0.01,       average,            (rms)              / (max - min)
 ...]               square root

Normalized value        Display
0.0 ... 1.0         →   meter height = value * barHeight

That's our VU meter.

In code, the whole thing is about ten lines of Swift (which we'll write in Section 4). But understanding why those ten lines work — that's what this section was for.

A Concrete Example

Let's walk through the math with real numbers. Suppose we have a buffer of 8 samples (in reality it's 4096, but the math is identical):

Samples: [0.1, -0.15, 0.2, -0.18, 0.12, -0.1, 0.14, -0.11]

Step 1: Square each
  [0.01, 0.0225, 0.04, 0.0324, 0.0144, 0.01, 0.0196, 0.0121]

Step 2: Mean
  sum  = 0.161
  mean = 0.161 / 8 = 0.020125

Step 3: Square root
  RMS = sqrt(0.020125) = 0.1419

Step 4: Convert to dB
  dB = 20 * log10(0.1419) = 20 * (-0.848) = -16.96 dB

Step 5: Normalize (range -60 to 0)
  normalized = (-16.96 - (-60)) / (0 - (-60)) = 43.04 / 60 = 0.717

Result: the meter bar fills to 71.7% of its height.

That's a moderately loud signal — maybe someone talking at a normal volume close to the microphone. The meter would be about two-thirds full. Makes sense.

Looking Ahead

This section covered the concepts behind a volume meter — a single bar that goes up and down with loudness. That's what we'll build first (Sections 3–5). But the tutorial is called "Building a Spectrum Analyzer," and a spectrum analyzer does something much more interesting: it shows which frequencies are present in the sound.

To do that, we'll need the FFT (Fast Fourier Transform), which takes our buffer of time-domain samples and transforms it into frequency-domain data — a list of "how much energy is at each frequency." But that comes later. For now, the concepts in this section are the foundation everything else builds on:

- Sound is pressure waves, described by frequency (pitch) and amplitude (loudness).
- Sampling turns the analog signal into a stream of Floats at a fixed rate.
- The Nyquist frequency (sampleRate / 2) caps what a given sample rate can represent.
- Audio arrives in buffers; buffer size trades latency against frequency resolution.
- RMS collapses a buffer into one loudness number, and decibels put it on a perceptual scale.

In the next section, we'll go back to Xcode and capture real audio from the iPhone's microphone using Apple's AVAudioEngine API. The buffer of floats we've been talking about will become an actual variable in your code.

Checkpoint

You should be able to explain, without looking back: what a sample is, why the sample rate is ~44–48 kHz, what RMS measures, why we use decibels instead of raw amplitude, and how we map dB to a 0–1 display value. If any of those are fuzzy, re-read that section — everything from here forward assumes this foundation.