Building a Wireless PA System with Rust and WebRTC: What Nobody Tells You
$ aplay -D default audio_test.wav
# works fine on the bench
# completely falls apart at 12 meters

I have a real problem with “it works on my machine” moments. In software, that phrase is annoying. In a physical PA system, it means someone’s speech gets cut off mid-sentence in front of an audience.
So when I decided to build a wireless PA system — real speakers, real rooms, real latency constraints — I wanted to do it properly. No off-the-shelf audio streaming library. No Python glue. Just Rust, WebRTC, and a handful of Raspberry Pi boards acting as receivers.
This post is about what that actually looks like. Not the happy path. The real path.
Why WebRTC? Why Not Just UDP?
This is the question I get immediately whenever I explain this project. UDP multicast exists. RTSP exists. You can stream audio with FFmpeg in two commands. Why reach for WebRTC — a browser protocol — for a hardware PA system?
Because I’d tried the simpler approaches, and they had real failure modes.
Raw UDP multicast works great until your network has any congestion at all. When packets drop, audio drops. There’s no recovery. You hear a click or a gap, and it’s gone. For background music, maybe that’s acceptable. For live announcements, it’s a disaster.
RTSP is better, but it assumes a server-client pull model. Each receiver connects and pulls a stream. Under load, you get desync between receivers — one unit is 200ms behind another. In a room with multiple speakers, that echo is immediately noticeable and highly distracting.
WebRTC was designed for exactly the problems I face: real-time audio, jitter buffering, packet loss concealment, the Opus codec with built-in error resilience, and a transport layer (SRTP over ICE/DTLS) that actually adapts to network conditions. The fact that it’s usually described as a “browser technology” is a branding problem, not a technical limitation.
The only real technical limitation is that WebRTC is genuinely complex to implement from scratch.
Choosing the Rust Crate: str0m vs webrtc-rs
When you search for WebRTC in Rust, two crates dominate the landscape: webrtc-rs and str0m. I spent more time evaluating these than I’d like to admit.
webrtc-rs
webrtc-rs is a port of the Google WebRTC library’s logic into Rust. It’s comprehensive and covers almost everything the spec requires. However, it’s also clearly a port — the API feels like C++ thinking expressed in Rust syntax. Lifetimes show up in places that feel awkward, the async model is Tokio-centric but occasionally fights you, and the compile times are significant.
str0m
str0m takes a completely different approach. It’s written from scratch as a sans-I/O library. It owns no sockets, no threads, and no async runtime. You feed it bytes. It gives you back bytes and events. That’s it. You wire up the actual I/O yourself.
At first, that sounds like more work. In practice, it’s the exact opposite. Because str0m makes no assumptions about your runtime, it composes cleanly with any async executor. Because it owns no sockets, you control exactly how packets flow. For an embedded target like a Raspberry Pi — where you want a lean, deterministic binary — this architectural decision matters a lot.
I went with str0m. Here’s the basic shape of how it integrates:
use str0m::{Rtc, Input, Output, Event};
use str0m::net::Receive;
use std::net::UdpSocket;
fn run_receiver(socket: UdpSocket) -> anyhow::Result<()> {
    let mut rtc = Rtc::builder()
        .set_ice_lite(true) // receivers don't need full ICE
        .build();
    let mut buf = vec![0u8; 2048];
    loop {
        // Poll str0m for what to do next
        match rtc.poll_output()? {
            Output::Timeout(deadline) => {
                // Sleep until the deadline or until a packet arrives
                let now = wait_until(deadline, &socket, &mut buf)?;
                rtc.handle_input(Input::Timeout(now))?;
            }
            Output::Transmit(transmit) => {
                socket.send_to(&transmit.contents, transmit.destination)?;
            }
            Output::Event(event) => {
                handle_event(event)?;
            }
        }
    }
}

The poll_output / handle_input loop is the core pattern. str0m tells you what it needs — either a timeout tick, a packet to transmit, or it surfaces an event (ICE connected, media arrived, etc.). You respond. This is the sans-I/O model in action, and once it clicks, it’s elegant.
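The loop above calls a wait_until helper that the snippet never defines. Here is a minimal std-only sketch of one way to write it, assuming a blocking UdpSocket; in the full receiver, a datagram that arrives before the deadline would be handed to str0m as Input::Receive rather than ignored as it is here.

```rust
use std::io::ErrorKind;
use std::net::UdpSocket;
use std::time::Instant;

// Hypothetical helper: block until `deadline` or until a datagram arrives,
// then report the current time for the Input::Timeout tick.
fn wait_until(deadline: Instant, socket: &UdpSocket, buf: &mut [u8]) -> std::io::Result<Instant> {
    let now = Instant::now();
    let remaining = deadline.saturating_duration_since(now);
    if remaining.is_zero() {
        // Deadline has already passed; tick immediately
        return Ok(now);
    }
    // set_read_timeout rejects a zero Duration; the check above rules that out
    socket.set_read_timeout(Some(remaining))?;
    match socket.recv_from(buf) {
        // A datagram arrived early. The full receiver would feed it to str0m
        // as Input::Receive before handling the timeout.
        Ok((_len, _src)) => {}
        // Timeout expired with no traffic; fall through to the tick
        Err(e) if e.kind() == ErrorKind::WouldBlock || e.kind() == ErrorKind::TimedOut => {}
        Err(e) => return Err(e),
    }
    Ok(Instant::now())
}
```

One blocking thread per receiver is a perfectly reasonable design on a dedicated Pi; nothing here requires an async runtime, which is exactly the point of the sans-I/O model.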
The Audio Pipeline: Symphonia + Opus
WebRTC mandates Opus as the audio codec. That’s actually a great thing — Opus is excellent for voice and handles packet loss gracefully via its built-in PLC (packet loss concealment). The challenge is getting audio from your source format into Opus frames, and decoding them on the receiver side.
For decoding on the receiver (Raspberry Pi), I use the opus crate bindings directly:
use opus::{Channels, Decoder};

struct AudioReceiver {
    decoder: Decoder,
    // Output sample rate — Pi's ALSA usually wants 48000
    sample_rate: u32,
}

impl AudioReceiver {
    fn new() -> anyhow::Result<Self> {
        let decoder = Decoder::new(48_000, Channels::Stereo)?;
        Ok(Self { decoder, sample_rate: 48_000 })
    }

    fn decode_rtp_payload(&mut self, payload: &[u8]) -> anyhow::Result<Vec<i16>> {
        // Opus frames in WebRTC are 20ms at 48kHz = 960 samples per channel
        let max_samples = 960 * 2; // stereo
        let mut pcm = vec![0i16; max_samples];
        // decode() returns samples per channel; `false` means no FEC recovery
        let decoded = self.decoder.decode(payload, &mut pcm, false)?;
        pcm.truncate(decoded * 2);
        Ok(pcm)
    }
}

On the sender side — the server that captures microphone input and broadcasts it — I use Symphonia for decoding the source audio (if it comes from a file or encoded stream) and the opus crate for encoding before handing frames to str0m:
use opus::{Application, Channels, Encoder};
use str0m::media::{MediaData, MediaKind};

struct AudioSender {
    encoder: Encoder,
}

impl AudioSender {
    fn new() -> anyhow::Result<Self> {
        // Voip application mode — Opus will tune itself for speech
        // intelligibility over fidelity. For music, use Application::Audio.
        let mut encoder = Encoder::new(48_000, Channels::Stereo, Application::Voip)?;
        encoder.set_bitrate(opus::Bitrate::Bits(64_000))?;
        Ok(Self { encoder })
    }

    fn encode_frame(&mut self, pcm: &[i16]) -> anyhow::Result<Vec<u8>> {
        // 960 samples per channel = 20ms at 48kHz
        let mut output = vec![0u8; 4000];
        let len = self.encoder.encode(pcm, &mut output)?;
        output.truncate(len);
        Ok(output)
    }
}

The Application::Voip vs Application::Audio choice matters more than it sounds. Voip mode enables Opus’s voice activity detection and comfort noise generation, and prioritizes speech clarity at lower bitrates. If your PA system carries both announcements and background music, you’ll want to switch modes dynamically — or run two encoders.
What Symphonia Is Actually For
A note on where Symphonia fits, because it confused me initially. Symphonia is a pure-Rust audio decoding library. It handles reading MP3, FLAC, WAV, AAC, and other container/codec formats. It’s not involved in encoding or in the WebRTC pipeline directly.
Where it matters in this project: if your audio source isn’t raw PCM from a microphone, you need to demux and decode it before you can feed it to Opus. If you’re playing a stored MP3 announcement, Symphonia reads that file:
use symphonia::core::io::MediaSourceStream;
use symphonia::core::probe::Hint;
use symphonia::core::formats::FormatOptions;
use symphonia::core::meta::MetadataOptions;
use symphonia::core::audio::SampleBuffer;
use symphonia::core::codecs::DecoderOptions;
fn decode_audio_file(path: &str) -> anyhow::Result<Vec<i16>> {
    let src = std::fs::File::open(path)?;
    let mss = MediaSourceStream::new(Box::new(src), Default::default());
    let mut hint = Hint::new();
    hint.with_extension("mp3");
    let probed = symphonia::default::get_probe().format(
        &hint,
        mss,
        &FormatOptions::default(),
        &MetadataOptions::default(),
    )?;
    let mut format = probed.format;
    let track = format
        .default_track()
        .ok_or_else(|| anyhow::anyhow!("no default audio track"))?;
    let mut decoder = symphonia::default::get_codecs()
        .make(&track.codec_params, &DecoderOptions::default())?;
    let mut pcm_samples: Vec<i16> = Vec::new();
    loop {
        // next_packet() errors at end-of-stream; treat that as EOF
        let packet = match format.next_packet() {
            Ok(p) => p,
            Err(_) => break,
        };
        let decoded = decoder.decode(&packet)?;
        let spec = *decoded.spec();
        let mut sample_buf = SampleBuffer::<i16>::new(decoded.capacity() as u64, spec);
        sample_buf.copy_interleaved_ref(decoded);
        pcm_samples.extend_from_slice(sample_buf.samples());
    }
    Ok(pcm_samples)
}

Then that Vec<i16> gets chunked into 960-sample frames and fed to the Opus encoder. Symphonia → Opus → str0m → UDP → receiver → Opus decoder → ALSA. That’s the full chain.
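That chunking step is easy to get subtly wrong, so here is a minimal std-only sketch. frame_pcm is a hypothetical helper; zero-padding the final partial frame is my own choice, made because Opus only accepts complete frames:

```rust
const SAMPLE_RATE: usize = 48_000;
const CHANNELS: usize = 2;
const FRAME_MS: usize = 20;
// 960 samples per channel per 20ms frame
const SAMPLES_PER_FRAME: usize = SAMPLE_RATE / 1000 * FRAME_MS;
// 1920 interleaved i16 values per stereo frame
const INTERLEAVED_FRAME: usize = SAMPLES_PER_FRAME * CHANNELS;

// Hypothetical helper: split interleaved stereo PCM into complete
// 20ms frames, zero-padding the tail so the encoder never sees a short frame.
fn frame_pcm(pcm: &[i16]) -> Vec<Vec<i16>> {
    pcm.chunks(INTERLEAVED_FRAME)
        .map(|chunk| {
            let mut frame = chunk.to_vec();
            frame.resize(INTERLEAVED_FRAME, 0);
            frame
        })
        .collect()
}
```

Each returned frame then goes straight into encode_frame from the sender above; padding the tail with silence is simpler than carrying a partial frame over to the next file.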
Latency: The Real Constraint
This is the part that humbles you. Theoretical latency and perceived latency are different things.
Opus at 20ms frames introduces 20ms of algorithmic delay minimum. Then you have network jitter buffering — typically 60–120ms on a local WiFi network. Then ALSA’s hardware buffer on the receiver side adds another 20–50ms depending on how you configure it. You’re realistically looking at 100–200ms end-to-end on a good local network.
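Spelling that budget out as arithmetic (the figures are the ranges quoted above, not measurements):

```rust
// End-to-end latency budget in milliseconds, as (best case, worst case).
const OPUS_ALGORITHMIC_MS: (u32, u32) = (20, 20); // 20ms frame size
const JITTER_BUFFER_MS: (u32, u32) = (60, 120);   // typical on local WiFi
const ALSA_BUFFER_MS: (u32, u32) = (20, 50);      // depends on configuration

// Sum each stage to get the end-to-end range.
fn latency_budget() -> (u32, u32) {
    (
        OPUS_ALGORITHMIC_MS.0 + JITTER_BUFFER_MS.0 + ALSA_BUFFER_MS.0,
        OPUS_ALGORITHMIC_MS.1 + JITTER_BUFFER_MS.1 + ALSA_BUFFER_MS.1,
    )
}
```

Summing the stages gives 100ms to 190ms, which is where the 100–200ms figure comes from; note that encode and decode CPU time is small enough on this hardware that it disappears into those ranges.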
For a PA system in a single room, 100ms is noticeable but acceptable. For a multi-room system where someone can see the speaker and also hear the PA — it’s immediately obvious and disorienting.
Here are the knobs I found that actually move the needle:
- Reduce the jitter buffer target. str0m exposes jitter buffer configuration. The default is conservative. Lowering the target delay helps latency at the cost of more audible glitches on a noisy network. On a dedicated SSID (which is the right call for a production PA system), you can afford to go aggressive.
- Use smaller Opus frames. Opus supports 2.5ms, 5ms, 10ms, and 20ms frame sizes. Smaller frames reduce algorithmic delay. They also increase packet overhead and are harder on the encoder. I found 10ms to be a reasonable middle ground for a controlled environment.
- Minimize the ALSA buffer. On the Raspberry Pi receiver:
use alsa::pcm::{PCM, HwParams, Format, Access};
use alsa::Direction;

fn open_alsa_output() -> anyhow::Result<PCM> {
    let pcm = PCM::new("default", Direction::Playback, false)?;
    // Scope the hardware params so their borrow of `pcm` ends
    // before we return it
    {
        let hwp = HwParams::any(&pcm)?;
        hwp.set_channels(2)?;
        hwp.set_rate(48_000, alsa::ValueOr::Nearest)?;
        hwp.set_format(Format::s16())?;
        hwp.set_access(Access::RWInterleaved)?;
        // This is the critical one — smaller buffer = lower latency,
        // but if your processing can't keep up, you get underruns
        hwp.set_buffer_size(1024)?;
        hwp.set_period_size(256, alsa::ValueOr::Nearest)?;
        pcm.hw_params(&hwp)?;
    }
    Ok(pcm)
}

Setting buffer_size to 1024 samples at 48kHz gives you ~21ms of ALSA buffering. Go lower and you’ll get underruns unless your decoding is consistently fast enough — which, in Rust with a static binary, it generally is.
The Feedback Problem
Acoustic feedback — that piercing squeal — happens when a microphone picks up output from a nearby speaker and re-amplifies it in a loop. In a traditional PA system, you manage this with careful speaker placement and EQ notch filters. In a software-defined system, you have more options.
I haven’t solved this completely. I’ll be honest about that. But here’s where I’ve landed:
Approach 1: Topology Isolation
The simplest solution is architectural — put the microphone and the transmitter node in a room with no speakers, or point them away from all speaker outputs. The Raspberry Pi receivers act as the speakers, while the sender remains isolated. No feedback path exists because there’s physical separation. This works wonderfully and is the right answer in most installations.
Approach 2: AEC in Software
WebRTC actually ships an acoustic echo canceller in its audio processing module. Getting this into a Rust pipeline without pulling in the whole Google WebRTC C++ library is non-trivial. webrtc-audio-processing is a Rust crate that wraps the APM (Audio Processing Module) from the Chromium project. It’s linkage-heavy but functional:
// Conceptual — webrtc-audio-processing integration
use webrtc_audio_processing::{Processor, Config, InitializationConfig};

let config = Config {
    echo_cancellation: Some(Default::default()),
    ..Default::default()
};
let mut processor = Processor::new(&InitializationConfig {
    num_capture_channels: 1,
    num_render_channels: 2,
    ..Default::default()
})?;
processor.set_config(config);

// Feed render (speaker) audio as reference signal
processor.process_render_frame(&mut render_buf)?;
// Process capture (mic) audio with echo reference in context
processor.process_capture_frame(&mut capture_buf)?;
// capture_buf now has echo suppressed

This is the approach if you absolutely need AEC in software. The caveat is that it requires the render signal (what the speakers are playing) to be routed back to the AEC processor as a reference — which means your sender node needs to know exactly what the receivers are playing. That’s an architectural constraint worth designing for from the very start.
Raspberry Pi Deployment: Making It Lean
The whole point of doing this in Rust is that the final binary is incredibly small and self-contained. On the Raspberry Pi receiver side, the stripped binary is around 8MB. It runs flawlessly as a systemd service:
# /etc/systemd/system/pa-receiver.service
[Unit]
Description=PA System Receiver
After=network-online.target sound.target
Wants=network-online.target
[Service]
Type=simple
ExecStart=/usr/local/bin/pa-receiver --server 192.168.10.1:8443 --device default
Restart=always
RestartSec=3
User=audio
SupplementaryGroups=audio
[Install]
WantedBy=multi-user.target

Cross-compilation from a development machine to aarch64-unknown-linux-gnu works seamlessly with cross:
cross build --target aarch64-unknown-linux-gnu --release
scp target/aarch64-unknown-linux-gnu/release/pa-receiver pi@receiver-1:/usr/local/bin/

No Python runtime. No Node. No Docker containers. One statically linked binary, one systemd unit, and you’re done.
What’s Still Not Solved
I said this was the real path, not the happy path. Here’s what’s still on my list to fix:
- Reliable ICE establishment across subnets: On a controlled LAN this is fine — ICE lite on the receiver and a known server IP sidesteps most of the ICE complexity. But the moment you cross subnets, you need proper STUN/TURN infrastructure.
- Graceful reconnection: If a receiver loses network connectivity briefly and rejoins, the current implementation requires a full WebRTC re-negotiation. This takes 2–3 seconds, resulting in a noticeable gap in audio. A proper fix involves keeping the ICE credentials alive and attempting to restart the connection transparently.
- Dynamic speaker management UI: Right now, adding or removing receiver nodes requires restarting the sender. A proper production system needs a control plane — likely a small HTTP API on the sender side that manages active sessions dynamically.
The Bottom Line
Building a PA system in Rust with WebRTC is not the easy path. The easy path is FFmpeg piped into a UDP multicast group, which kind of works but comes with severe limitations.
But if you want a system that handles packet loss gracefully, decodes audio reliably on constrained hardware, ships as a lean binary with zero runtime dependencies, and gives you total control over every single layer of the audio pipeline — Rust is unequivocally the right tool for the job.
str0m’s sans-I/O model was the right choice. It composes cleanly, it doesn’t fight your async runtime, and it makes managing WebRTC complexity feel completely doable. Opus is genuinely world-class at handling real-time audio constraints so you don’t have to.
And a Raspberry Pi running on aarch64-unknown-linux-gnu with ALSA is a perfectly capable speaker endpoint. The entire system — one sender server plus multiple Pi receivers — costs less than a commercial wireless PA unit, sounds comparable, and is 100% under your control.
$ systemctl status pa-receiver
● pa-receiver.service - PA System Receiver
Loaded: loaded (/etc/systemd/system/pa-receiver.service; enabled)
Active: active (running) since ...

That’s the goal. And it’s mostly there.
Working on a similar hardware project or have questions about the str0m integration? Reach out — qcynaut@gmail.com