130 Widgets

Semi-random thoughts and tales of tinkering

3. Capturing Audio on iOS

This is the section where things get real. We are going to build a working app — a VU meter that listens to the microphone and displays the volume level in real time. It will not do spectral analysis yet (that comes in Sections 5 and 6), but it establishes the entire audio pipeline: permissions, capture, processing, and display. Every piece we build here carries forward into the final spectrum analyzer.

iOS Audio Architecture

Apple provides several frameworks for working with audio. The one we care about is AVFoundation, specifically the AVAudioEngine class. Here is the mental model:

┌─────────────┐      ┌──────────────────┐      ┌──────────────┐
│ Microphone  │─────▶│  AVAudioEngine   │─────▶│   Output     │
│ (hardware)  │      │  (audio graph)   │      │   (speaker)  │
└─────────────┘      └──────────────────┘      └──────────────┘
                              │
                              │ installTap()
                              ▼
                     ┌──────────────────┐
                     │  Your callback   │
                     │  (process audio) │
                     └──────────────────┘

AVAudioEngine is a real-time audio graph — a pipeline of nodes that process audio. The input node represents the microphone. The output node represents the speaker. You can insert processing nodes between them, but for our purposes we just need to observe what the microphone is picking up.

The mechanism for observing is called a tap. You install a tap on a node, and the engine calls your closure every time a buffer of audio data is ready. That buffer is a chunk of floating-point samples — the raw PCM data we discussed in Section 2.

Coming from Windows

If you have used NAudio or WASAPI on Windows, AVAudioEngine is the equivalent. The difference is that Apple wraps everything in a higher-level graph API. There is no need to manually configure sample rates or buffer formats — the engine negotiates that with the hardware.

Microphone Permissions

iOS will not let your app touch the microphone without explicit user consent. This is a two-step process:

Step 1: Declare intent in Info.plist. Open your project's Info.plist (or the "Info" tab in your target settings) and add this key:

// Info.plist key (shown as raw key + value)
Key:   NSMicrophoneUsageDescription
Value: "VUMeter needs the microphone to measure audio levels"

This string is displayed to the user in the system permission dialog. Make it clear and honest. Apple reviews these strings and will reject your app if the description is vague or misleading.

Step 2: The runtime dialog. The first time your app calls engine.start() on a node connected to the microphone, iOS presents a system dialog: "VUMeter Would Like to Access the Microphone." The user taps Allow or Don't Allow. If they deny it, the audio engine still starts but delivers silence.
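If you would rather ask for permission up front instead of letting the first engine.start() trigger the dialog, you can request it explicitly. A minimal sketch using the AVAudioSession API (the completion handler arrives on an arbitrary queue, so hop to the main thread before touching UI state; on iOS 17+ Apple steers you toward the newer AVAudioApplication API instead):

```swift
import AVFoundation

// Ask for microphone access before starting the engine.
AVAudioSession.sharedInstance().requestRecordPermission { granted in
    DispatchQueue.main.async {
        if granted {
            print("Microphone access granted")
        } else {
            print("Microphone access denied — the engine will deliver silence")
        }
    }
}
```

If permission was already granted or denied earlier, the handler is called immediately with the stored answer; the system dialog is only ever shown once.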

Tip

During development, if you accidentally deny the permission and want to reset it, go to Settings → Privacy & Security → Microphone on the device (or Simulator), find your app, and toggle it back on. You can also reset all permissions in the Simulator via Device → Erase All Content and Settings.

Building AudioEngine.swift — Phase 1 (VU Meter Only)

Create a new Swift file in your project called AudioEngine.swift. This is the first version — it only computes volume level (RMS). We will extend it with FFT and spectrum data in later sections.

Here is the complete file:

import AVFoundation
import Observation

@Observable
class AudioEngine {
    var level: Float = 0.0        // 0.0 (silence) to 1.0 (loud)
    var isRunning = false

    private let engine = AVAudioEngine()

    func start() {
        // iOS only records when the shared audio session allows it; the
        // default category (.soloAmbient) provides no microphone input.
        // .measurement mode also turns off system input processing,
        // which is what we want for metering.
        let session = AVAudioSession.sharedInstance()
        try? session.setCategory(.playAndRecord, mode: .measurement)
        try? session.setActive(true)

        let input = engine.inputNode
        let format = input.outputFormat(forBus: 0)

        input.installTap(onBus: 0, bufferSize: 1024, format: format) { [weak self] buffer, _ in
            let rms = Self.computeRMS(buffer: buffer)
            let normalized = Self.normalize(rms)
            DispatchQueue.main.async {
                self?.level = normalized
            }
        }

        do {
            try engine.start()
            isRunning = true
        } catch {
            print("Audio engine failed to start: \(error)")
        }
    }

    func stop() {
        engine.inputNode.removeTap(onBus: 0)
        engine.stop()
        isRunning = false
        level = 0.0
    }

    // Root Mean Square — the standard way to measure perceived loudness
    private static func computeRMS(buffer: AVAudioPCMBuffer) -> Float {
        guard let channelData = buffer.floatChannelData?[0] else { return 0 }
        let frameCount = Int(buffer.frameLength)
        guard frameCount > 0 else { return 0 }

        var sum: Float = 0
        for i in 0..<frameCount {
            let sample = channelData[i]
            sum += sample * sample
        }
        return sqrt(sum / Float(frameCount))
    }

    // Map raw RMS to a 0–1 display range
    private static func normalize(_ rms: Float) -> Float {
        let db = 20 * log10(max(rms, 1e-6))
        let minDb: Float = -60
        let maxDb: Float = 0
        let clamped = max(minDb, min(maxDb, db))
        return (clamped - minDb) / (maxDb - minDb)
    }
}

That is about 50 lines of code that captures live audio from the microphone, computes a volume level, and exposes it as a reactive property. Let's walk through every important piece.

The @Observable Pattern

@Observable
class AudioEngine {
    var level: Float = 0.0
    var isRunning = false
    // ...
}

@Observable is Swift's equivalent of C#'s INotifyPropertyChanged. When you mark a class with @Observable, the Swift compiler automatically synthesizes change-tracking for every stored property. Any SwiftUI view that reads audio.level will automatically re-render when that value changes.

C# comparison

In WPF/MAUI, you would write a property with a backing field, call OnPropertyChanged() in the setter, and bind to it in XAML. In Swift, @Observable eliminates all of that boilerplate. You just write a normal property and it works. The framework tracks which views read which properties at runtime, not through string-based bindings.
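You can watch the synthesized change tracking work even outside SwiftUI, using the Observation framework directly. A small sketch — Counter is a made-up type for illustration:

```swift
import Observation

@Observable
class Counter {
    var value = 0
}

let counter = Counter()

// withObservationTracking records which properties the first closure reads,
// then fires onChange once, just before any of them changes.
withObservationTracking {
    _ = counter.value   // reading `value` registers interest in it
} onChange: {
    print("value is about to change")
}

counter.value = 1       // triggers the onChange closure
```

This is essentially what SwiftUI does for you: each view body runs inside a tracking scope, and the onChange side schedules a re-render.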

Installing a Tap

let input = engine.inputNode
let format = input.outputFormat(forBus: 0)

input.installTap(onBus: 0, bufferSize: 1024, format: format) { [weak self] buffer, _ in
    let rms = Self.computeRMS(buffer: buffer)
    let normalized = Self.normalize(rms)
    DispatchQueue.main.async {
        self?.level = normalized
    }
}

installTap(onBus:bufferSize:format:) registers a callback on the audio node. Here is what each parameter means:

  • onBus: 0 — which bus of the node to observe. The input node exposes the captured audio on its single output bus, bus 0.
  • bufferSize: 1024 — the requested number of frames per callback. The system treats this as a hint; the buffers you actually receive may be a different size.
  • format — the format you want the buffers delivered in. Passing the node's own output format avoids any conversion.

The trailing closure is your callback. It receives an AVAudioPCMBuffer (the raw samples) and an AVAudioTime (the timestamp, which we ignore with _). This callback fires on a real-time audio thread — not the main thread, not a background queue, but a high-priority thread managed by the audio system.

weak self and Retain Cycles

{ [weak self] buffer, _ in
    // ...
    self?.level = normalized
}

The [weak self] capture list is critical. Without it, here is what happens:

  1. The closure captures a strong reference to self (the AudioEngine instance).
  2. The AudioEngine owns the AVAudioEngine, which owns the tap, which owns the closure.
  3. The closure owns the AudioEngine. Circular reference. Memory leak.

[weak self] makes the closure's reference to self a weak reference — it does not prevent deallocation. If the AudioEngine is deallocated while the tap is still installed, self becomes nil, and the self?.level call safely does nothing.

C# comparison

This is the same concept as WeakReference<T> in C#. In C#, the garbage collector handles most cycles for you, but in Swift (which uses reference counting, not GC), you must break cycles manually. You will see [weak self] in nearly every closure that captures self and might outlive it. It becomes second nature.
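A tiny sketch of the difference — Ticker is a made-up class whose stored closure would keep it alive forever without the weak capture:

```swift
class Ticker {
    var onTick: (() -> Void)?

    func arm() {
        // Strong capture would create a cycle:
        // self owns onTick, onTick owns self.
        // onTick = { self.fire() }

        // Weak capture breaks the cycle. If the Ticker is deallocated,
        // self is nil and the call safely does nothing.
        onTick = { [weak self] in
            self?.fire()
        }
    }

    func fire() { print("tick") }

    deinit { print("Ticker deallocated") }  // never prints with the strong capture
}
```

Setting a Ticker variable to nil after arm() prints "Ticker deallocated" with the weak capture; with the commented-out strong capture, deinit never runs and the object leaks.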

Threading Model

DispatchQueue.main.async {
    self?.level = normalized
}

The audio tap callback runs on a real-time audio thread. SwiftUI views must be updated from the main thread. DispatchQueue.main.async dispatches a block of work to the main thread's run loop — it will execute on the next iteration.

C# comparison

This is exactly like Dispatcher.BeginInvoke() in WPF, or MainThread.BeginInvokeOnMainThread() in MAUI. Same concept, slightly different API. In Swift, Grand Central Dispatch (GCD) is the primary concurrency mechanism, and DispatchQueue.main is the queue bound to the main/UI thread.

RMS Computation

private static func computeRMS(buffer: AVAudioPCMBuffer) -> Float {
    guard let channelData = buffer.floatChannelData?[0] else { return 0 }
    let frameCount = Int(buffer.frameLength)
    guard frameCount > 0 else { return 0 }

    var sum: Float = 0
    for i in 0..<frameCount {
        let sample = channelData[i]
        sum += sample * sample
    }
    return sqrt(sum / Float(frameCount))
}

This is the theory from Section 2 turned into code. Let's trace through it:

  1. buffer.floatChannelData?[0] — The buffer stores audio as 32-bit floats, one array per channel. [0] grabs channel 0 (mono, or the left channel of stereo). The ? is Swift's optional chaining — if the buffer somehow has no float data, the entire expression is nil and the guard returns 0.
  2. The loop — For every sample in the buffer:
    • Read the sample (a float between roughly -1.0 and 1.0).
    • Square it. This does two things: makes negative values positive, and emphasizes louder samples (a sample at 0.5 contributes 0.25, but a sample at 1.0 contributes 1.0 — four times as much).
    • Add it to a running sum.
  3. sum / Float(frameCount) — The mean of the squared values.
  4. sqrt(...) — The root of the mean square. This brings the result back into the same scale as the original samples.

The result is a single float — the RMS level — typically between 0.0 (silence) and maybe 0.3 for normal speech. Clapping close to the mic might push it to 0.7. It will rarely hit 1.0 unless the input is distorted.

Why RMS and not peak?

You could just take the maximum absolute sample value (the "peak"). But peak is jumpy and does not correspond well to perceived loudness. A single spike sample could max out the peak while the audio sounds quiet. RMS averages over the entire buffer, giving a much more stable and perceptually meaningful reading. This is why professional VU meters use RMS.
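The difference is easy to demonstrate on synthetic data. A quick sketch — one full-scale spike in an otherwise silent buffer pegs the peak while the RMS barely moves:

```swift
import Foundation

// 1024 silent samples with a single full-scale spike
var samples = [Float](repeating: 0, count: 1024)
samples[100] = 1.0

let peak = samples.map(abs).max()!   // 1.0 — a peak meter would max out

let rms = sqrt(samples.map { $0 * $0 }.reduce(0, +) / Float(samples.count))
// sqrt(1 / 1024) = 0.03125 — the RMS meter stays near silence

print(peak, rms)
```

One sample out of 1024 contributes almost nothing to the mean square, which is exactly the stability we want in a meter.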

Normalization

private static func normalize(_ rms: Float) -> Float {
    let db = 20 * log10(max(rms, 1e-6))
    let minDb: Float = -60
    let maxDb: Float = 0
    let clamped = max(minDb, min(maxDb, db))
    return (clamped - minDb) / (maxDb - minDb)
}

Raw RMS values are not great for display. Normal speech might be 0.01 to 0.05 — you would barely see the meter move. We need to convert to a logarithmic (decibel) scale, then map to 0–1.

  1. 20 * log10(rms) — Converts amplitude to decibels. An RMS of 1.0 is 0 dB (maximum). An RMS of 0.01 is -40 dB. An RMS of 0.001 is -60 dB. The max(rms, 1e-6) guard prevents log10(0), which would be negative infinity.
  2. Clamping to [-60, 0] — We define our display range. Anything below -60 dB is effectively silence; anything above 0 dB is clipping. max(minDb, min(maxDb, db)) clamps the value into that range.
  3. Mapping to 0–1 — (clamped - minDb) / (maxDb - minDb) is a standard linear interpolation. -60 dB maps to 0.0, 0 dB maps to 1.0, -30 dB maps to 0.5. This is the value we hand to the UI.

The decibel scale and human hearing

Human hearing is logarithmic — we perceive the difference between -60 dB and -40 dB as roughly the same "jump" as -40 dB to -20 dB, even though in linear amplitude terms the second jump is 10x larger. By converting to decibels before mapping to the display, our meter moves in a way that matches how we actually hear sound. This is fundamental to every audio meter, VU or otherwise.
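Tracing normalize through a few values makes the mapping concrete (outputs rounded):

```swift
import Foundation

// Same mapping as AudioEngine.normalize, repeated here for a standalone trace.
func normalize(_ rms: Float) -> Float {
    let db = 20 * log10(max(rms, 1e-6))
    let clamped = max(-60, min(0, db))
    return (clamped + 60) / 60
}

print(normalize(1.0))    //  0 dB   → 1.0  (clipping)
print(normalize(0.05))   // ≈ -26 dB → ≈ 0.57 (loud speech)
print(normalize(0.001))  // -60 dB  → 0.0  (display floor)
```

Note how an RMS of 0.05 — which would be nearly invisible on a linear meter — lands comfortably past the halfway mark on the dB scale.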

The VU Meter UI

Now let's build the view. Open ContentView.swift (the file Xcode created for you) and replace its contents with this:

import SwiftUI

struct ContentView: View {
    @State private var audio = AudioEngine()

    var body: some View {
        VStack(spacing: 40) {
            Text("VU Meter")
                .font(.largeTitle)
                .bold()

            GeometryReader { geo in
                ZStack(alignment: .bottom) {
                    RoundedRectangle(cornerRadius: 12)
                        .fill(Color.gray.opacity(0.2))

                    RoundedRectangle(cornerRadius: 12)
                        .fill(meterColor(level: audio.level))
                        .frame(height: geo.size.height * CGFloat(audio.level))
                        .animation(.easeOut(duration: 0.05), value: audio.level)
                }
            }
            .frame(width: 80, height: 300)

            Text(dbLabel(level: audio.level))
                .font(.title2.monospacedDigit())
                .foregroundStyle(.secondary)

            Button(audio.isRunning ? "Stop" : "Start") {
                if audio.isRunning {
                    audio.stop()
                } else {
                    audio.start()
                }
            }
            .buttonStyle(.borderedProminent)
            .tint(audio.isRunning ? .red : .green)
        }
        .padding(40)
    }

    func meterColor(level: Float) -> Color {
        switch level {
        case 0..<0.6:  return .green
        case 0.6..<0.85: return .yellow
        default:       return .red
        }
    }

    func dbLabel(level: Float) -> String {
        let db = (level * 60) - 60
        if level < 0.001 { return "-∞ dB" }
        return String(format: "%.1f dB", db)
    }
}

This gives us a vertical bar that fills from bottom to top, changes color at threshold levels, displays the current dB reading, and has a start/stop button. Let's break down the key SwiftUI concepts.

@State and @Observable

@State private var audio = AudioEngine()

@State tells SwiftUI: "own this object's lifecycle." SwiftUI creates the AudioEngine instance when the view first appears and keeps it alive across re-renders. Combined with @Observable on the class, SwiftUI tracks exactly which properties the view's body reads. When audio.level changes, only the parts of the view that depend on level are recomputed — not the entire view tree.

C# comparison

Think of @State as the SwiftUI equivalent of a view model that is scoped to a particular view. In WPF, you would set DataContext = new AudioEngine() and bind properties in XAML. SwiftUI collapses binding, change notification, and lifecycle management into two keywords: @State on the view side, @Observable on the model side.

GeometryReader

GeometryReader { geo in
    ZStack(alignment: .bottom) {
        RoundedRectangle(cornerRadius: 12)
            .fill(Color.gray.opacity(0.2))

        RoundedRectangle(cornerRadius: 12)
            .fill(meterColor(level: audio.level))
            .frame(height: geo.size.height * CGFloat(audio.level))
    }
}
.frame(width: 80, height: 300)

GeometryReader is SwiftUI's way of giving you access to the parent container's size. The closure receives a GeometryProxy (here called geo) with properties like geo.size.width and geo.size.height.

We set the outer frame to 80x300 points. Inside, we stack two rounded rectangles with ZStack (a z-axis stack — layers on top of each other). The gray one is the background track. The colored one is the fill, whose height equals the container height times the audio level. When audio.level is 0.5, the colored bar is 150 points tall. When it is 1.0, it fills the whole container.

C# comparison

This is like measuring a WPF Grid or Canvas with ActualWidth/ActualHeight and using those values to size a child element. The difference is that in SwiftUI, GeometryReader is reactive — when the container resizes (say, on device rotation), the inner content automatically recomputes.

Implicit Animation

.animation(.easeOut(duration: 0.05), value: audio.level)

This single modifier makes the meter bar animate smoothly. Whenever audio.level changes, SwiftUI interpolates the frame height from the old value to the new value over 0.05 seconds with an ease-out curve. No animation state variables. No timers. No manual interpolation. You declare "animate this property" and the framework handles it.

The duration of 0.05 seconds (50 ms) is a deliberate choice. Our audio tap fires roughly every 23 ms (1024 samples at 44.1 kHz), so 50 ms gives the animation just enough time to smooth out jitter without adding noticeable lag. Try changing it to 0.3 and you will see the meter feel sluggish. Remove it entirely and the meter will stutter.

Pattern Matching

func meterColor(level: Float) -> Color {
    switch level {
    case 0..<0.6:  return .green
    case 0.6..<0.85: return .yellow
    default:       return .red
    }
}

Swift's switch can match against ranges — something a classic C# switch statement could not do, though modern C# (9.0+) gets close with relational patterns like case < 0.6f. The 0..<0.6 syntax creates a half-open range (includes 0, excludes 0.6). The result: the meter is green up to 60% of maximum, yellow from 60% to 85%, and red above 85%. This mimics the color scheme of classic analog VU meters.

Also notice that Swift's switch does not fall through by default (no break needed), and the compiler checks exhaustiveness — the default case is required because the compiler cannot prove our ranges cover all possible Float values.

Button and Conditional UI

Button(audio.isRunning ? "Stop" : "Start") {
    if audio.isRunning {
        audio.stop()
    } else {
        audio.start()
    }
}
.buttonStyle(.borderedProminent)
.tint(audio.isRunning ? .red : .green)

The button label and tint color both depend on audio.isRunning. When the engine is running, you see a red "Stop" button. When it is stopped, you see a green "Start" button. Because isRunning is a property on an @Observable object, the button automatically re-renders when it changes. No manual state synchronization.

The .buttonStyle(.borderedProminent) modifier gives us a filled, rounded button — the standard iOS "primary action" style. .tint sets the fill color.

First Runnable App

Checkpoint

You now have two files: AudioEngine.swift and ContentView.swift. Build and run (⌘R). When the app launches, tap Start. iOS will present the microphone permission dialog — tap Allow.

You should see the green bar responding to sound in real time. Try:

  • Speaking at normal volume — the bar should hover around 30-50%.
  • Clapping near the phone — the bar should spike into yellow or red.
  • Playing music from another device — you should see a steady, fluctuating level.
  • Silence — the bar should drop to zero (or near zero).

This is a working VU meter. It is a real audio application capturing live input, computing a meaningful metric, and displaying it with smooth animation. Everything from here builds on top of this foundation.

Experiments to Try

Before moving on, try these modifications to deepen your understanding:

  • Change the animation duration from 0.05 to 0.3 — the meter turns sluggish; remove the modifier entirely and it stutters.
  • Change bufferSize from 1024 to 256 or 4096 and watch how the meter's responsiveness changes.
  • Lower minDb from -60 to -80 — quieter sounds become visible, at the cost of a busier meter.
  • Replace computeRMS with a peak measurement (the maximum absolute sample) and see how much jumpier the display gets.

What's next

The VU meter computes a single number per buffer. The spectrum analyzer will compute hundreds of numbers — one for each frequency band. But first, in Section 4, we will optimize our math using Apple's Accelerate framework. The hand-written for loop in computeRMS works fine for 1024 samples, but the FFT will need to process thousands of samples many times per second. We need faster tools.