We won't write any app code in this section. We're going to build a mental model of sound — from the physical vibrations in the air all the way to the single floating-point number our app will display on screen. If you've never worked with audio or signals before, this is the section that makes everything else click.
Every concept here maps directly to code we'll write later. By the end, when you see a buffer of 4096 floats arrive from the microphone, you'll know exactly what those numbers mean and what to do with them.
Sound is pressure waves in air. Something vibrates — a guitar string, a vocal cord, a speaker cone — and pushes air molecules back and forth. Those compressions and rarefactions travel outward at about 343 meters per second (at room temperature) and eventually hit your eardrum, which vibrates in sympathy.
Two properties of these waves matter most to us:

- **Frequency**: how many times the wave oscillates per second, measured in Hz. We perceive it as pitch.
- **Amplitude**: how far the pressure swings from its resting level. We perceive it as loudness.
In programmer terms: imagine an array of values oscillating between -1.0 and 1.0. The speed of oscillation is frequency. The distance from zero at the peaks is amplitude. That's sound.
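To make that array picture concrete, here's a small Swift sketch that builds such a buffer for a pure tone. The function and its parameters are illustrative, not part of the app we'll build:

```swift
import Foundation

// Illustrative sketch: one second of a pure 440 Hz tone at half amplitude,
// as an array of normalized samples.
func makeSineBuffer(frequency: Double, amplitude: Double,
                    sampleRate: Double, frames: Int) -> [Float] {
    (0..<frames).map { n in
        Float(amplitude * sin(2 * Double.pi * frequency * Double(n) / sampleRate))
    }
}

let buffer = makeSineBuffer(frequency: 440, amplitude: 0.5,
                            sampleRate: 48_000, frames: 48_000)
// 48,000 values oscillating between -0.5 and 0.5, 440 times per second
```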
Real-world sound is never a clean sine wave like this. When you speak, dozens of frequencies combine simultaneously — the fundamental frequency of your voice, its harmonics, the resonances of your mouth and throat. But the principle is the same: it's all waves superimposed on each other.
Your iPhone's microphone is a tiny membrane that vibrates with incoming sound. Behind it, a circuit converts that physical vibration into an electrical voltage that rises and falls in the same pattern. So far, we're still in the analog world — continuous, smooth, infinite detail.
To get this signal into the digital world where our code can work with it, we need an Analog-to-Digital Converter (ADC). The ADC does something deceptively simple: at regular intervals, it measures the current voltage and writes down a number.
That's it. That's sampling.
Each measurement is called a sample. The stream of samples is called PCM audio — Pulse Code Modulation. Don't let the name intimidate you. PCM just means "a sequence of numbers representing amplitude values at regular time intervals."
In our app, each sample will be a Float between -1.0 and 1.0, where 0.0 is silence (no pressure change), 1.0 is the maximum positive pressure the system can represent, and -1.0 is the maximum negative. This normalized range is standard in audio programming — it's the equivalent of working with percentages instead of raw pixel values in graphics.
If you're used to working with integer data types, the float-based range might seem odd. Historically, audio was stored as 16-bit integers (-32768 to 32767), and CD audio still works this way. But modern audio APIs — including Apple's Core Audio — use 32-bit floats internally. Floats give you more headroom, no clipping at exact boundaries, and simpler math. When we do sample * sample later, we don't have to worry about integer overflow.
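If you ever load 16-bit PCM yourself (from a WAV file, say), the conversion to the normalized range is a single divide. A hypothetical helper:

```swift
// Hypothetical helper: 16-bit integer PCM to normalized Float samples.
// Dividing by 32,768 maps -32768 to exactly -1.0 and 32767 to just under 1.0.
func normalize(_ samples: [Int16]) -> [Float] {
    samples.map { Float($0) / 32_768.0 }
}

let floats = normalize([0, 16_384, -32_768, 32_767])
// ≈ [0.0, 0.5, -1.0, 0.99997]
```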
How often does the ADC take a snapshot? That's the sample rate, measured in Hz (samples per second). Two numbers dominate:
| Sample Rate | Context | Samples Per Second |
|---|---|---|
| 44,100 Hz | CD quality, some older devices | 44,100 |
| 48,000 Hz | Modern iOS devices, video, professional audio | 48,000 |
Why these specific numbers? The answer comes from one of the most important theorems in signal processing.
In 1928, Harry Nyquist proved something that seems almost too clean to be true: to perfectly capture a frequency, you must sample at least twice per cycle.
Think about it intuitively. If a wave completes one cycle per second (1 Hz), and you only take one sample per second, you might always catch the wave at the same point — say, always at zero. You'd see a flat line. You'd completely miss the oscillation. But if you sample twice per cycle, you catch the peak and the trough, and that's enough information to reconstruct the wave.
Humans can hear frequencies from roughly 20 Hz (a deep rumble) to 20,000 Hz (a high-pitched whine that gets harder to hear with age). To capture 20,000 Hz, we need at least 40,000 samples per second. CD audio uses 44,100 Hz — a bit more than the theoretical minimum to leave room for the hardware anti-aliasing filter (more on that in a moment). 48,000 Hz is even more generous.
The maximum frequency that a given sample rate can represent is called the Nyquist frequency: half the sample rate. At 44,100 Hz, the Nyquist frequency is 22,050 Hz; at 48,000 Hz, it's 24,000 Hz.
Both are comfortably above 20 kHz, so both capture the full range of human hearing.
What happens if a frequency above the Nyquist limit sneaks into the signal? Something strange and bad: aliasing.
Imagine a wheel spinning at 23 revolutions per second, filmed at 24 frames per second. To your eyes (via the camera), the wheel appears to rotate slowly backwards at 1 revolution per second. The sampling rate (frame rate) wasn't high enough to capture the true motion, so the high frequency "folded" into a lower, phantom frequency.
The same thing happens with audio. A 25,000 Hz tone sampled at 44,100 Hz folds down to 19,100 Hz (the sample rate minus the original frequency): a ghost frequency that wasn't in the original signal and corrupts your audio. You'd hear a tone that wasn't there.
In practice, you never have to deal with this manually. The ADC hardware includes an anti-aliasing filter — an analog low-pass filter that removes frequencies above the Nyquist limit before sampling occurs. By the time the data reaches your app, aliasing has already been prevented.
Aliasing matters more when you're generating or resampling audio in software. If you ever downsample (convert 48 kHz to 22 kHz, for example), you need to apply a digital low-pass filter first, or you'll introduce aliasing yourself. For our spectrum analyzer, which only reads from the microphone, the hardware handles it for us.
Audio doesn't arrive one sample at a time. Imagine if your app received a callback 48,000 times per second, once per sample — the overhead would be catastrophic. Instead, samples arrive in chunks called buffers.
A buffer is just an array. A buffer of 1024 frames at 48,000 Hz contains 1024 consecutive samples and represents 1024 / 48,000 ≈ 21.3 milliseconds of audio.
In code, you'll get something like a [Float] with 1024 elements, delivered to your callback function roughly every 21 milliseconds. Process it, return, and wait for the next one.
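The arithmetic behind "roughly every 21 milliseconds" is one line; this helper (the name is just for illustration) makes it explicit:

```swift
// Illustrative helper: how long one buffer lasts, given the sample rate.
func bufferDuration(frames: Int, sampleRate: Double) -> Double {
    Double(frames) / sampleRate
}

let seconds = bufferDuration(frames: 1024, sampleRate: 48_000)
let callbacksPerSecond = 1.0 / seconds
// 1024 frames at 48 kHz last about 21.3 ms, so the callback fires
// roughly 47 times per second.
```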
This is one of the fundamental tradeoffs in audio programming, and it's worth understanding deeply:
| | Small Buffer (256 frames) | Large Buffer (4096 frames) |
|---|---|---|
| Duration | ~5 ms | ~85 ms |
| Latency | Very low — great for real-time effects | Higher — noticeable delay |
| CPU overhead | High — callback fires ~188 times/sec | Low — callback fires ~12 times/sec |
| Frequency resolution | Poor — hard to distinguish low frequencies | Good — can resolve fine frequency detail |
For a real-time audio effect (like a guitar pedal app), you need tiny buffers — the musician can't tolerate 85 ms of delay. For our visual spectrum analyzer, we don't care about latency in the same way. Nobody notices if the meter updates 85 milliseconds after the sound. What we do care about is having enough samples to calculate accurate frequency content.
We'll use 4096 frames per buffer. At 48,000 Hz, that's about 85 ms of audio per chunk — roughly 12 updates per second. Plenty smooth for a visual display, and enough samples for good frequency analysis when we get to the FFT in later sections.
If you're thinking "12 updates per second sounds jerky," you're right — for raw values. In practice, we'll use smoothing (exponential moving average) to interpolate between updates and animate at 60 fps. The data arrives at 12 Hz; the display updates at 60 Hz. This is the same trick game engines use when physics runs at a different rate than rendering.
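A sketch of what that smoothing might look like. The `Smoother` type and the 0.2 factor are assumptions for illustration, not the code from later sections:

```swift
// Exponential moving average: each display frame, move the shown value a
// fixed fraction of the way toward the latest measurement.
struct Smoother {
    var value: Float = 0
    let factor: Float   // 0...1; higher reacts faster, lower looks smoother

    mutating func update(target: Float) -> Float {
        value += (target - value) * factor
        return value
    }
}

var meter = Smoother(factor: 0.2)
// A new RMS reading arrives at ~12 Hz; update() runs on every 60 fps frame,
// so the bar glides toward each reading instead of jumping.
for _ in 0..<10 { _ = meter.update(target: 1.0) }
// after 10 frames, the value has covered about 89% of the gap (1 - 0.8^10)
```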
We have a buffer of 4096 floats, each between -1.0 and 1.0. How do we turn that into a single number that represents "how loud is it right now"?
Your first instinct might be to average all the samples. But that won't work. Audio oscillates symmetrically around zero — for every positive sample, there's a roughly equal negative sample. The average of a perfectly oscillating signal is zero, regardless of how loud it is.
You could average the absolute values. That works, sort of. But the standard approach in audio engineering is RMS: Root Mean Square. It's what VU meters and loudness meters have used for decades, and there's a good reason.
RMS has three steps, right there in the name (read it backwards):

1. **Square** every sample.
2. Take the **mean** of the squares.
3. Take the square **root** of the mean.
Why squaring? Two reasons. First, it eliminates the sign problem — negative samples become positive after squaring. Second, and more importantly, squaring weights larger values more heavily. A few loud peaks contribute more to the RMS than many quiet samples. This matches human perception — we perceive loudness based on the energy (power) of the signal, and power is proportional to amplitude squared.
If that's too abstract, think of it this way: RMS is the standard deviation of the signal, assuming a mean of zero (which audio has). If you've ever computed standard deviation in a data pipeline, you've already done RMS.
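The three steps, plus the failed naive average for contrast, might look like this in plain Swift. This is a sketch for clarity; in the real app you'd likely reach for Accelerate's vDSP instead:

```swift
// RMS in plain Swift: square, mean, root.
func rms(_ buffer: [Float]) -> Float {
    guard !buffer.isEmpty else { return 0 }
    let meanSquare = buffer.reduce(0) { $0 + $1 * $1 } / Float(buffer.count)
    return meanSquare.squareRoot()
}

// Why the naive average fails: this signal is clearly not silent,
// but its samples cancel to zero.
let wave: [Float] = [0.5, -0.5, 0.5, -0.5]
let naiveMean = wave.reduce(0, +) / Float(wave.count)   // 0.0, misleading
let level = rms(wave)                                   // 0.5, the actual level
```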
| Sound | Approximate RMS |
|---|---|
| Silence | 0.0 (or very close) |
| Quiet room ambiance | 0.001 – 0.005 |
| Normal speech | 0.01 – 0.1 |
| Loud music close to mic | 0.1 – 0.5 |
| Maximum (clipping) | ~0.707 (sine wave at full scale) |
Notice the enormous range. Quiet room to loud music spans two orders of magnitude (0.003 to 0.3 is a 100x difference). This makes RMS awkward to display on a linear scale — quiet sounds would be invisible. We need a better scale.
Human hearing is logarithmic. Doubling the perceived loudness doesn't require double the amplitude — it requires about ten times the power. Our ears compress an absolutely enormous dynamic range into a manageable perception of "quiet" to "loud."
To match this, audio engineers use decibels (dB), a logarithmic unit: dB = 20 × log10(amplitude), where amplitude is our normalized RMS value.
Let's plug in some values to build intuition:
| RMS Amplitude | Decibels | Meaning |
|---|---|---|
| 1.0 | 0 dB | Full scale — the loudest representable signal |
| 0.5 | -6 dB | Half amplitude |
| 0.1 | -20 dB | One-tenth amplitude |
| 0.01 | -40 dB | One-hundredth amplitude |
| 0.001 | -60 dB | One-thousandth amplitude |
| 0.0 | -∞ dB | Digital silence |
Two things to note. First, 0 dB is not silence — it's the maximum. This trips up everyone the first time. In digital audio, 0 dB means "the signal is using the full available range." All real signals sit at negative dB values. This is called dBFS (decibels relative to full scale).
Second, log10(0) is negative infinity. Silence = -∞ dB. In code, we'll need to handle this — you can't display infinity on screen. We'll clamp it to a practical floor.
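A sketch of that conversion with the floor built in. The function name and the -60 dB default are assumptions matching the display range described in the text:

```swift
import Foundation

// Amplitude (RMS) to dBFS, clamped at a floor so silence maps to a
// finite value.
func decibels(fromAmplitude amplitude: Float, floor floorDB: Float = -60) -> Float {
    guard amplitude > 0 else { return floorDB }   // avoid log10(0) = -infinity
    return max(20 * Float(log10(Double(amplitude))), floorDB)
}

// decibels(fromAmplitude: 1.0)  -> 0 (full scale)
// decibels(fromAmplitude: 0.1)  -> -20
// decibels(fromAmplitude: 0.0)  -> -60 (clamped, not -infinity)
```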
For our VU meter, we need to map a dB value to a position on screen (0.0 to 1.0). We'll define a range of -60 dB to 0 dB. Anything quieter than -60 dB is effectively silence for our purposes. The formula is: position = (dB + 60) / 60.
Clamp the result to [0, 1], and you have a value you can directly multiply by the meter's height in pixels. Zero means the bar is empty; 1.0 means the bar is full. Simple.
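As a tiny function (the name is illustrative), the mapping described above looks like this:

```swift
// Map -60...0 dB onto 0...1 for the meter, clamping out-of-range input.
func meterPosition(dB: Float, minDB: Float = -60) -> Float {
    let position = (dB - minDB) / (0 - minDB)
    return min(max(position, 0), 1)
}

// meterPosition(dB: -60) -> 0.0 (empty bar)
// meterPosition(dB: -30) -> 0.5
// meterPosition(dB: 0)   -> 1.0 (full bar)
```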
Let's put everything together. Here's the complete chain from air vibrations to a pixel height on your iPhone screen:

1. Something vibrates and creates pressure waves in the air.
2. The microphone membrane converts pressure into a voltage.
3. The ADC (behind its anti-aliasing filter) samples that voltage 48,000 times per second.
4. Samples arrive in your app as buffers of 4096 floats between -1.0 and 1.0.
5. RMS collapses each buffer into a single loudness value.
6. 20 × log10 converts that value to decibels.
7. (dB + 60) / 60 maps decibels to a 0–1 display value.
8. Multiply by the meter's height, and you have pixels.
In code, the whole thing is about ten lines of Swift (which we'll write in Section 4). But understanding why those ten lines work — that's what this section was for.
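As a preview, here is a sketch of that chain under this section's assumptions (normalized Float samples, a -60 dB display floor). It is not the exact code from Section 4, just the same steps end to end:

```swift
import Foundation

// buffer -> RMS -> dBFS -> 0...1 meter value
func meterValue(for buffer: [Float], minDB: Float = -60) -> Float {
    guard !buffer.isEmpty else { return 0 }
    // 1. RMS: square, mean, root
    let rms = (buffer.reduce(0) { $0 + $1 * $1 } / Float(buffer.count)).squareRoot()
    // 2. RMS to dBFS, guarding against log10(0)
    let dB = rms > 0 ? 20 * Float(log10(Double(rms))) : minDB
    // 3. Map minDB...0 onto 0...1 and clamp
    return min(max((dB - minDB) / (0 - minDB), 0), 1)
}

// A full-scale sine wave (RMS ~ 0.707, about -3 dB) reads roughly 0.95;
// digital silence reads 0.
```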
Let's walk through the math with real numbers. Suppose we have a buffer of 8 samples (in reality it's 4096, but the math is identical), say, values hovering around ±0.1:

[0.1, -0.12, 0.08, -0.1, 0.11, -0.09, 0.1, -0.1]

1. Square: [0.01, 0.0144, 0.0064, 0.01, 0.0121, 0.0081, 0.01, 0.01]
2. Mean: 0.081 / 8 ≈ 0.0101
3. Root: √0.0101 ≈ 0.1, so RMS ≈ 0.1
4. Decibels: 20 × log10(0.1) ≈ -20 dB
5. Display value: (-20 + 60) / 60 ≈ 0.67
That's a moderately loud signal — maybe someone talking at a normal volume close to the microphone. The meter would be about two-thirds full. Makes sense.
This section covered the concepts behind a volume meter — a single bar that goes up and down with loudness. That's what we'll build first (Sections 3–5). But the tutorial is called "Building a Spectrum Analyzer," and a spectrum analyzer does something much more interesting: it shows which frequencies are present in the sound.
To do that, we'll need the FFT (Fast Fourier Transform), which takes our buffer of time-domain samples and transforms it into frequency-domain data — a list of "how much energy is at each frequency." But that comes later. For now, the concepts in this section are the foundation everything else builds on:

- Samples and the normalized -1.0 to 1.0 float range
- Sample rate, the Nyquist limit, and aliasing
- Buffers and the latency/resolution tradeoff
- RMS as a measure of loudness
- Decibels and the -60 to 0 dB display mapping
In the next section, we'll go back to Xcode and capture real audio from the iPhone's microphone using Apple's AVAudioEngine API. The buffer of floats we've been talking about will become an actual variable in your code.
You should be able to explain, without looking back: what a sample is, why the sample rate is ~44–48 kHz, what RMS measures, why we use decibels instead of raw amplitude, and how we map dB to a 0–1 display value. If any of those are fuzzy, re-read the relevant part of this section — everything from here forward assumes this foundation.