In Section 3, we wrote a for loop to compute RMS. It works. For 1024 samples, it is
fast enough that you would never notice a problem. But we are about to enter the world of the Fast
Fourier Transform, where we need to do thousands of multiplications and additions per audio buffer,
dozens of times per second. A naive loop will not cut it.
Fortunately, Apple ships a solution right in the OS.
Accelerate is Apple's SIMD-optimized math library. It has been part of macOS and iOS for over a decade, and it is fast — seriously fast. The sub-library we care about is called vDSP (vector digital signal processing), which provides optimized routines for exactly the kind of operations audio code needs: squaring arrays, computing sums, applying the FFT, and more.
┌──────────────────────────────────────────────────────┐
│ Accelerate Framework │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────────────┐ │
│ │ vDSP │ │ vImage │ │ BLAS / LAPACK │ │
│ │ signal │ │ image │ │ linear algebra │ │
│ │ process │ │ process │ │ │ │
│ └──────────┘ └──────────┘ └──────────────────┘ │
│ │
│ Runs on: SIMD units (NEON), Apple AMX coprocessor │
└──────────────────────────────────────────────────────┘
Under the hood, Accelerate uses NEON SIMD instructions on ARM (the CPU in every iPhone) to process 4 or 8 floats in a single instruction. On Apple Silicon Macs, it can also use the AMX coprocessor for even wider operations. You do not need to know any of this to use it — you just call the function and it picks the fastest path for the hardware.
The closest equivalents in the .NET world are System.Numerics.Vector&lt;T&gt;
(which uses hardware SIMD intrinsics) and libraries like MathNet.Numerics.
The key difference is that Accelerate is built into the OS, tuned specifically for Apple
hardware, and available on every device without adding a NuGet package. It is the standard
way to do fast math on Apple platforms — not a third-party optimization, but the expected
approach.
Let's see the transformation. Here is our original hand-written loop:
// Before: manual loop
var sum: Float = 0
for i in 0..<frameCount {
    let sample = channelData[i]
    sum += sample * sample
}
return sqrt(sum / Float(frameCount))
And here is the vDSP replacement:
// After: vDSP-accelerated
private static func computeRMS(samples: [Float]) -> Float {
    guard !samples.isEmpty else { return 0 }
    var sum: Float = 0
    vDSP_svesq(samples, 1, &sum, vDSP_Length(samples.count))
    return sqrt(sum / Float(samples.count))
}
One function call replaces the entire loop. Let's unpack vDSP_svesq: s = scalar result, v = vector input, e = elements, sq = squared. So: "scalar sum of vector elements squared." Apple's vDSP naming is terse, but there is a logic to it once you learn the pattern.

- samples — The input array. Swift automatically bridges a [Float] array to the UnsafePointer<Float> that the C function expects.
- 1 — The stride. Process every element. A stride of 2 would skip every other element (useful for interleaved stereo data where you want just one channel).
- &sum — An inout parameter. The function writes its result here. The & prefix is like C#'s ref keyword — it passes a reference to the variable, not a copy.
- vDSP_Length(samples.count) — The number of elements to process. vDSP_Length is just a type alias for UInt.

Internally, vDSP_svesq loads 4 floats at a time into a SIMD register, multiplies them all by themselves in parallel, and accumulates the results. For our 1024-sample buffer, that is roughly 256 SIMD operations instead of 1024 scalar ones. For the larger buffers we will use with the FFT (4096+ samples), the speedup is even more significant.
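The stride parameter deserves a quick demonstration. Here is a sketch (the helper name channelRMS and the interleaved layout are assumptions for illustration, not code from our app) that computes the RMS of one channel of interleaved stereo without de-interleaving first:

```swift
import Accelerate

// Hypothetical helper: RMS of one channel in an interleaved buffer
// (L R L R ...). The stride lets vDSP walk a single channel in place.
func channelRMS(interleaved: [Float], channel: Int, channelCount: Int = 2) -> Float {
    let frames = interleaved.count / channelCount
    guard frames > 0, channel < channelCount else { return 0 }
    var sum: Float = 0
    interleaved.withUnsafeBufferPointer { buf in
        // Start at the first sample of the chosen channel, then
        // step by channelCount so we only read that channel.
        vDSP_svesq(buf.baseAddress! + channel,
                   vDSP_Stride(channelCount),
                   &sum,
                   vDSP_Length(frames))
    }
    return sqrt(sum / Float(frames))
}

// channelRMS(interleaved: [1, 0, 1, 0], channel: 0) == 1 (left channel)
// channelRMS(interleaved: [1, 0, 1, 0], channel: 1) == 0 (right channel)
```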
Honestly, for a 1024-element RMS calculation, the difference is negligible — maybe microseconds. The Swift compiler might even auto-vectorize the manual loop. We are making this change now not because the VU meter needs it, but because it introduces the Accelerate pattern we will rely on heavily in Section 5 when we implement the FFT. Think of this as practice with a simple case before tackling a complex one.
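If you want to check that claim on your own hardware, a rough timing sketch looks like this (numbers vary by device and build configuration; a real benchmark would also need release-mode optimization and warm-up runs, so treat this as a sanity check rather than a measurement):

```swift
import Accelerate
import Foundation

let samples = (0..<1024).map { _ in Float.random(in: -1...1) }
let iterations = 10_000

// Time a closure by running it many times and averaging.
func measure(_ label: String, _ body: () -> Float) {
    var sink: Float = 0                    // keep results alive so nothing is optimized away
    let start = DispatchTime.now().uptimeNanoseconds
    for _ in 0..<iterations { sink += body() }
    let elapsed = DispatchTime.now().uptimeNanoseconds - start
    print("\(label): \(elapsed / UInt64(iterations)) ns/call (sink: \(sink))")
}

measure("manual loop") {
    var sum: Float = 0
    for s in samples { sum += s * s }
    return sqrt(sum / Float(samples.count))
}

measure("vDSP_svesq") {
    var sum: Float = 0
    vDSP_svesq(samples, 1, &sum, vDSP_Length(samples.count))
    return sqrt(sum / Float(samples.count))
}
```

Both paths compute the same value; only the time per call differs.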
You might have noticed the function signature changed. The old version took an
AVAudioPCMBuffer; the new one takes [Float]. That means we need to
extract the samples from the buffer first:
guard let channelData = buffer.floatChannelData?[0] else { return }
let samples = Array(UnsafeBufferPointer(start: channelData,
count: Int(buffer.frameLength)))
Here is what is happening:

- buffer.floatChannelData?[0] — Returns an UnsafeMutablePointer<Float>, which is a raw pointer to the first sample. This is the audio data sitting in a system-managed memory buffer.
- UnsafeBufferPointer(start:count:) — Wraps the raw pointer with a length, creating a type that knows its bounds. This is conceptually identical to Span<float> in C# — a safe-ish view over unmanaged memory that prevents you from accidentally reading past the end.
- Array(...) — Copies the data into a regular Swift [Float] array. Now we have a normal, safe, reference-counted array that we can pass around freely.

The flow is similar to working with native interop in C#: IntPtr → Span<float> → float[]. You start with an unsafe pointer from a native API, wrap it in something with bounds checking, then optionally copy to a managed array. In Swift, "unsafe" is right there in the type name (UnsafeBufferPointer) as a reminder that you are leaving safe territory.
Copying 1024 floats into a new array is 4 KB of data — trivial. Even at 4096 samples, that
is only 16 KB, which fits comfortably in L1 cache. For our use case, the copy is a non-issue.
In extreme performance scenarios, you could work directly with the
UnsafeBufferPointer and avoid the copy entirely, but that adds complexity with
no measurable benefit here.
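For completeness, the copy-free version might look like this: a variant of computeRMS that accepts the buffer view directly (a sketch of the idea, not code we are adding to the app; it is written as a free function so it stands alone):

```swift
import Accelerate

// Hypothetical zero-copy variant: computes RMS straight from the
// UnsafeBufferPointer, skipping the Array copy entirely.
func computeRMS(samples: UnsafeBufferPointer<Float>) -> Float {
    guard let base = samples.baseAddress, !samples.isEmpty else { return 0 }
    var sum: Float = 0
    vDSP_svesq(base, 1, &sum, vDSP_Length(samples.count))
    return sqrt(sum / Float(samples.count))
}

// In the tap callback you would then call it without building an Array:
//   let view = UnsafeBufferPointer(start: channelData,
//                                  count: Int(buffer.frameLength))
//   let rms = computeRMS(samples: view)
```

The trade-off: the pointer is only valid inside the tap callback, so you must not let it escape (for example, into the DispatchQueue.main.async closure), which is exactly the kind of footgun the Array copy avoids.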
The changes to AudioEngine.swift are minimal. Here is what is different:
import AVFoundation
import Observation
import Accelerate // ← NEW: import the framework
The tap callback now extracts samples before calling the processing function:
input.installTap(onBus: 0, bufferSize: 1024, format: format) { [weak self] buffer, _ in
    // Extract samples from the audio buffer
    guard let channelData = buffer.floatChannelData?[0] else { return }
    let samples = Array(UnsafeBufferPointer(start: channelData,
                                            count: Int(buffer.frameLength)))
    let rms = Self.computeRMS(samples: samples)
    let normalized = Self.normalize(rms)
    DispatchQueue.main.async {
        self?.level = normalized
    }
}
And the computeRMS function uses vDSP:
private static func computeRMS(samples: [Float]) -> Float {
    guard !samples.isEmpty else { return 0 }
    var sum: Float = 0
    vDSP_svesq(samples, 1, &sum, vDSP_Length(samples.count))
    return sqrt(sum / Float(samples.count))
}
Everything else stays the same. The normalize function, the start()
and stop() methods, the properties — all unchanged. The ContentView
does not need any modifications at all.
Notice how cleanly this refactor went. We changed the implementation of one function without touching the interface. The view does not know or care whether RMS is computed with a for loop or a SIMD-accelerated function call. This separation — the audio engine handles DSP, the view handles display — will serve us well as the processing gets more complex.
Build and run after making these changes. The app should behave identically to before —
same VU meter, same responsiveness. If something broke, double-check that you added
import Accelerate and that the computeRMS signature now takes
samples: [Float] instead of buffer: AVAudioPCMBuffer.
The RMS replacement was a warm-up. In Section 5, we tackle the Fast Fourier Transform, and that is where Accelerate becomes essential rather than optional.
Consider what the FFT involves: windowing the input samples, performing thousands of complex multiplications and additions, and converting the complex output into magnitudes. All of this must happen within the time between audio callbacks — about 23 ms at our buffer size. A naive, scalar implementation might take 5-10 ms. The Accelerate-powered version takes under 1 ms. That is the difference between a responsive app and one that drops audio frames.
Processing time per buffer (approximate)
Naive loops: ████████████████████░░░░░░░░░░ ~8ms
Accelerate/vDSP: ██░░░░░░░░░░░░░░░░░░░░░░░░░░░░ <1ms
Budget (23ms): ██████████████████████████████░ 23ms
Both fit within budget, but Accelerate leaves far more
headroom for the UI, animations, and other processing.
Accelerate also provides vDSP_fft_zrip — a single function call that performs the
entire FFT on a real-valued signal. We will use it in Section 5, along with several other vDSP
functions for windowing and magnitude calculation. The pattern will be the same as what we saw
here: replace loops with single function calls, pass arrays in, get results out.
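As a taste of what Section 5 will look like, here is a minimal sketch of the vDSP real-FFT pipeline. The function names are real vDSP API; the buffer sizes, variable names, and surrounding structure are illustrative, and Section 5 will walk through the details properly:

```swift
import Accelerate

// Sketch: forward FFT of a real signal with vDSP_fft_zrip.
let n = 1024                                   // must be a power of two
let log2n = vDSP_Length(log2(Float(n)))

// A 440 Hz test tone at a 44.1 kHz sample rate.
let samples = (0..<n).map { Float(sin(2 * .pi * 440 * Double($0) / 44_100)) }

// 1. One-time setup (reuse across buffers; creating it is expensive).
guard let setup = vDSP_create_fftsetup(log2n, FFTRadix(kFFTRadix2)) else {
    fatalError("FFT setup failed")
}

var real = [Float](repeating: 0, count: n / 2)
var imag = [Float](repeating: 0, count: n / 2)
var magnitudes = [Float](repeating: 0, count: n / 2)

real.withUnsafeMutableBufferPointer { realBuf in
    imag.withUnsafeMutableBufferPointer { imagBuf in
        var split = DSPSplitComplex(realp: realBuf.baseAddress!,
                                    imagp: imagBuf.baseAddress!)

        // 2. Pack the real signal into split-complex form (even samples
        //    to real, odd samples to imaginary), as vDSP_fft_zrip expects.
        samples.withUnsafeBytes { raw in
            vDSP_ctoz(raw.bindMemory(to: DSPComplex.self).baseAddress!,
                      2, &split, 1, vDSP_Length(n / 2))
        }

        // 3. In-place forward FFT, then squared magnitude per bin.
        vDSP_fft_zrip(setup, &split, 1, log2n, FFTDirection(FFT_FORWARD))
        vDSP_zvmags(&split, 1, &magnitudes, 1, vDSP_Length(n / 2))
    }
}
vDSP_destroy_fftsetup(setup)

// The loudest bin should sit near 440 Hz * 1024 / 44100, i.e. around bin 10.
let peakBin = magnitudes.indices.max { magnitudes[$0] < magnitudes[$1] }!
print("peak bin: \(peakBin)")
```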
vDSP function names look cryptic at first, but they follow a pattern:

- vDSP_svesq — scalar = vector elements squared
- vDSP_vmul — vector multiply
- vDSP_vsmul — vector scalar multiply
- vDSP_fft_zrip — fft, z = complex, r = real input, ip = in-place

Once you crack the code, reading vDSP function names becomes almost natural. We will encounter several more in the next two sections.