2.10. Audio
Audio is carried as one or more WebRTC media tracks (RTP / SRTP) negotiated in the same SDP exchange that creates the data channels described in Data Transfer. Each track carries Opus, with parameters published to the client in the AudioConfig block of SetupCommand.
In a multi-client (room) session the server acts as a Selective Forwarding Unit (SFU): each client’s microphone arrives at the server on one inbound track, and the server forwards a subset of those tracks to every other client as separate outbound tracks. Mapping of tracks to source client uids and admission decisions are signalled over the reliable data channel.
2.10.1. Codec and RTP parameters
Audio uses Opus (RFC 6716, payload format RFC 7587) with the following defaults, all of which may be changed via AudioConfig (inside SetupCommand):
| Parameter | Default | Notes |
|---|---|---|
| RTP payload type | 111 | Dynamic; advertised in the SDP |
| Sample rate | 48000 Hz | The Opus clock-rate. Decoder may resample for playback. |
| Channels | 1 | 2 (stereo) is permitted for music-grade sources. |
| Frame duration | 20 ms | Permitted: 10, 20, 40, 60 ms. |
| In-band FEC | on | Allows the decoder to reconstruct a lost packet from the next one. |
| DTX | on | Discontinuous transmission during silence. |
Implementations MUST advertise these in SDP (a=fmtp:111 useinbandfec=1;usedtx=1;…) and MUST honour the values received from the peer in the answered SDP.
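To honour the peer's answered values, a client first has to read them out of the `a=fmtp` line. The following is a minimal sketch of that step, not a definitive implementation: the function names `parse_fmtp` and `opus_flags` are hypothetical, and the fallback of treating an absent `useinbandfec` or `usedtx` as 0 follows the defaults in RFC 7587.

```python
def parse_fmtp(line: str) -> tuple[int, dict[str, str]]:
    """Split an SDP 'a=fmtp:<pt> k=v;k=v' line into (payload type, parameters)."""
    prefix = "a=fmtp:"
    if not line.startswith(prefix):
        raise ValueError("not an fmtp attribute: " + line)
    pt_text, _, params_text = line[len(prefix):].partition(" ")
    params: dict[str, str] = {}
    for item in params_text.split(";"):
        item = item.strip()
        if not item:
            continue  # tolerate a trailing ';'
        key, _, value = item.partition("=")
        params[key.strip()] = value.strip()
    return int(pt_text), params


def opus_flags(params: dict[str, str]) -> dict[str, bool]:
    """Read the FEC and DTX switches; RFC 7587 treats a missing value as 0."""
    return {
        "fec": params.get("useinbandfec", "0") == "1",
        "dtx": params.get("usedtx", "0") == "1",
    }
```

A client would apply `opus_flags` to the parameters from the answered SDP rather than to the values it offered.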
2.10.2. Topology
For a session with N participants the server provisions transceivers per peer as follows. P denotes the per-listener cap maxInboundStreams.
| Transceiver | Direction (server view) | Purpose |
|---|---|---|
| 1 × per peer | recvonly | The peer’s microphone arriving at the server. |
| | sendonly | One outbound voice per other peer that the SFU has selected for this listener. |
Tracks are identified end-to-end by their SDP mid attribute. The mapping mid → sourceClientUid is delivered on the reliable channel via the AudioSourceMapping command; clients MUST NOT rely on parsing a=msid or any other SDP attribute for source identification.
A client that does not provide microphone input still receives sendonly transceivers from the server (it is a listener); it may negotiate inactive on its own outbound m-line.
2.10.3. AudioConfig (inside SetupCommand)
A 17-byte block inside SetupCommand describing the audio configuration the server will use for this session. Clients MUST configure their decoder and microphone path to match.
| Bytes | Type | Description |
|---|---|---|
| 1 | uint8 | |
| 1 | uint8 | |
| 4 | uint32 | |
| 1 | uint8 | |
| 1 | uint8 | |
| 1 | uint8 | |
| 1 | uint8 | |
| 1 | uint8 | |
| 4 | float | |
| 2 | uint16 | |
If codec == 0 no audio media tracks are present in the SDP, no AudioSourceMapping or AudioParticipantStateChange commands will be sent, and any client microphone state is ignored.
SetupCommand.audio_input_enabled remains the gate on client-to-server microphone capture (the inbound transceiver on the server is set to inactive if it is zero).
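The field widths in the table sum to the stated 17 bytes, which can be checked with a short sketch. The byte order (little-endian, no padding) and the name `unpack_audio_config` are assumptions, and the fields are returned positionally, in table order, without names.

```python
import struct

# Field widths in table order: u8, u8, u32, u8, u8, u8, u8, u8, f32, u16.
# ASSUMPTION: "<" (little-endian, unpadded); the byte order is not stated here.
AUDIO_CONFIG = struct.Struct("<BBIBBBBBfH")


def unpack_audio_config(block: bytes) -> tuple:
    """Return the ten AudioConfig fields positionally, in table order."""
    if len(block) != AUDIO_CONFIG.size:
        raise ValueError(
            f"AudioConfig must be {AUDIO_CONFIG.size} bytes, got {len(block)}"
        )
    return AUDIO_CONFIG.unpack(block)
```

Rejecting a block of the wrong length up front keeps a truncated SetupCommand from being misread as valid configuration.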
2.10.4. Selection policy and caps
When the room has more potential speakers than maxInboundStreams, the SFU chooses which sources each listener hears according to selectionPolicy:
| Policy | Rule |
|---|---|
| | No selection: forward every other participant to every listener. Requires |
| | Forward the first |
| | Forward the |
| | Forward the |
| | Selection is performed by application code on the server. Clients treat the resulting AudioSourceMapping updates as authoritative. |
When the symmetric routing flag (AudioConfig.flags bit 2) is set, the SFU guarantees that if A is in B’s selected set then B is in A’s selected set; this may cause the actual forwarded count to exceed maxInboundStreams by at most one per pair affected.
The SFU MUST NOT forward a participant’s own microphone back to them (loopback suppression).
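The cap, loopback suppression and symmetric routing can be illustrated together. This is a sketch under stated assumptions rather than the server's actual algorithm: it assumes a proximity-style rule that keeps the nearest sources within a radius, 2D positions, and ties broken by uid. The names `select_sources` and `make_symmetric` are hypothetical.

```python
import math


def select_sources(listener, positions, cap, radius):
    """Pick up to `cap` nearest other participants within `radius` of the listener."""
    lx, ly = positions[listener]
    candidates = []
    for uid, (x, y) in positions.items():
        if uid == listener:
            continue  # loopback suppression: never select the listener's own mic
        distance = math.hypot(x - lx, y - ly)
        if distance <= radius:
            candidates.append((distance, uid))
    candidates.sort()  # nearest first; equal distances fall back to uid order
    return [uid for _, uid in candidates[:cap]]


def make_symmetric(selected):
    """If A is in B's set, also put B in A's set. Sets may grow past the cap."""
    result = {uid: set(sources) for uid, sources in selected.items()}
    for listener, sources in selected.items():
        for source in sources:
            result.setdefault(source, set()).add(listener)
    return result
```

Running `make_symmetric` after the per-listener selection is what allows a listener's forwarded count to exceed the cap, as the symmetric routing rule describes.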
Selection is recomputed on a server-defined cadence and on every join/leave. To avoid UI thrash on a peer hovering at the selection boundary, the server SHOULD apply the evictionGraceMs hysteresis before removing a transceiver that has just dropped out of the selected set.
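One way to picture the hysteresis is a per-source timer that starts when a source first drops out of the selected set. The sketch below assumes millisecond timestamps supplied by the caller; the class name `EvictionGrace` and its method are hypothetical, and real servers may implement the grace period differently.

```python
class EvictionGrace:
    """Keep a dropped source forwarded until it has been unselected for grace_ms."""

    def __init__(self, grace_ms: int):
        self.grace_ms = grace_ms
        self._dropped_at = {}  # uid -> time (ms) it first fell out of the selected set

    def forwarded(self, selected: set, previously_forwarded: set, now_ms: int) -> set:
        keep = set(selected)
        for uid in previously_forwarded - selected:
            since = self._dropped_at.setdefault(uid, now_ms)
            if now_ms - since < self.grace_ms:
                keep.add(uid)  # still inside the grace window
        # Drop timers for sources that were re-selected or have now been evicted.
        self._dropped_at = {
            uid: t
            for uid, t in self._dropped_at.items()
            if uid in keep and uid not in selected
        }
        return keep
```

A source that returns to the selected set within the window never loses its transceiver, which is the UI thrash the rule is meant to avoid.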
2.10.5. AudioSourceMapping command
Sent by the server to a client whenever the set of audio tracks delivered to that client changes (a peer joined, left, was admitted by the selection policy, or was evicted). Carried on the reliable channel as a standard server-to-client command.
| Bytes | Type | Description |
|---|---|---|
| 1 | CommandPayloadType | |
| 2 | uint16 | |
| 2 | uint16 | |
| variable | AddedEntry[A] | Each: |
| variable | RemovedEntry[R] | Each: |
A client MUST treat the mapping as cumulative state: an Added entry whose mid is already known replaces the existing sourceClientUid; a Removed entry whose mid is unknown is ignored. If a mapping arrives for a mid whose transceiver is not yet known locally (renegotiation race), the client MUST buffer it and apply it when the transceiver appears.
The very first AudioSourceMapping of a session may be sent with A == 0 and R == 0 to mean “the audio subsystem is ready; no peers are currently selected”.
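The cumulative-state and buffering rules above can be sketched as a small client-side table. The class `SourceMap` and its method names are hypothetical. The sketch assumes each Added entry yields a `(mid, sourceClientUid)` pair and each Removed entry yields a mid, and that Added entries are applied before Removed entries within one command, an ordering this section does not specify.

```python
class SourceMap:
    """Client-side mid -> sourceClientUid state built from AudioSourceMapping."""

    def __init__(self):
        self.by_mid = {}         # mappings for transceivers that exist locally
        self.pending = {}        # mappings buffered until their transceiver appears
        self.known_mids = set()  # mids of locally known transceivers

    def on_transceiver(self, mid: str) -> None:
        """Call when renegotiation surfaces a transceiver with this mid."""
        self.known_mids.add(mid)
        if mid in self.pending:
            self.by_mid[mid] = self.pending.pop(mid)

    def apply(self, added, removed) -> None:
        """Apply one command: `added` is (mid, uid) pairs, `removed` is mids."""
        for mid, uid in added:
            if mid in self.known_mids:
                self.by_mid[mid] = uid   # replaces any existing source for this mid
            else:
                self.pending[mid] = uid  # renegotiation race: buffer for later
        for mid in removed:
            self.by_mid.pop(mid, None)   # an unknown mid is silently ignored
            self.pending.pop(mid, None)
```

An empty command (`apply([], [])`) leaves the table unchanged, which matches the "audio subsystem is ready" case.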
2.10.6. AudioParticipantStateChange command
Sent by the server to inform a client of changes to the audio state of other participants whose presence is otherwise visible (i.e. they are nodes in the scene). This is distinct from the AudioSourceMapping command, which describes the transport-level set; AudioParticipantStateChange describes intent and is used to render UI (“out of range”, “muted”, “left”) without the user mis-attributing silence to a fault.
| Bytes | Type | Description |
|---|---|---|
| 1 | CommandPayloadType | |
| 2 | uint16 | |
| 10 × N | Update[N] | Each: |
state:
- 0 Streaming: audio is being forwarded to this listener.
- 1 Culled: known participant excluded by the selection policy.
- 2 Disabled: participant is in the room but their microphone is off / muted by application policy.
- 3 Left: participant has disconnected.
reason is informational and may be 0 (none). Defined non-zero values: 1 ProximityOut, 2 CapExceeded, 3 PolicyEvicted, 4 ServerMuted, 5 SelfMuted.
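The state and reason codes map naturally onto enumerations. In this sketch the numeric values come from the lists above, while the Python identifiers, the `describe` helper and its label format are illustrative assumptions. Because reason is informational, an unrecognised reason is treated as none rather than as an error.

```python
from enum import IntEnum


class AudioState(IntEnum):
    STREAMING = 0
    CULLED = 1
    DISABLED = 2
    LEFT = 3


class AudioReason(IntEnum):
    NONE = 0
    PROXIMITY_OUT = 1
    CAP_EXCEEDED = 2
    POLICY_EVICTED = 3
    SERVER_MUTED = 4
    SELF_MUTED = 5


def describe(state: int, reason: int) -> str:
    """Build a short UI label; raises ValueError for an undefined state."""
    s = AudioState(state)
    try:
        r = AudioReason(reason)
    except ValueError:
        r = AudioReason.NONE  # informational only: ignore unknown reasons
    label = s.name.capitalize()
    return label if r is AudioReason.NONE else f"{label} ({r.name})"
```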
2.10.7. Join and leave
When peer X joins a room that already contains peers Y1, …, Yk:
1. The server adds, on X’s PeerConnection: one recvonly transceiver for X’s microphone, plus up to maxInboundStreams sendonly transceivers carrying the SFU-selected subset of {Yi}.
2. For each Yi whose selection set now contains X, the server adds one sendonly transceiver on Yi’s PeerConnection and triggers renegotiation per Signaling.
3. The server sends AudioSourceMapping to X (listing all admitted Yi) and to every Yi whose set changed.
4. The server sends AudioParticipantStateChange to surface the user-visible state changes.
When peer X leaves, the reverse: outbound transceivers carrying X are stopped on every affected peer, AudioSourceMapping carries the removed mids, and AudioParticipantStateChange carries Left for X.
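The leave path amounts to finding, for each remaining listener, the mids that carried the departing peer. A minimal sketch of that bookkeeping follows; it assumes the server keeps a per-listener `mid -> source uid` table, and the function name `on_leave` is hypothetical.

```python
def on_leave(leaver, mappings):
    """Remove `leaver` everywhere; return {listener: [removed mids]}.

    `mappings` is {listener uid: {mid: source uid}} and is updated in place.
    """
    mappings.pop(leaver, None)  # the leaver's own PeerConnection goes away
    removed = {}
    for listener, by_mid in mappings.items():
        mids = [mid for mid, source in by_mid.items() if source == leaver]
        for mid in mids:
            del by_mid[mid]
        if mids:
            removed[listener] = mids  # goes into that listener's AudioSourceMapping
    return removed
```

Listeners that were not hearing the departing peer do not appear in the result, so they receive no AudioSourceMapping for the change.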
2.10.8. Example
A 3-peer room with codec=Opus, maxInboundStreams=2, selectionPolicy=Proximity, symmetric routing on:
| mid | Peer A's PeerConnection | Peer B's PeerConnection | Peer C's PeerConnection |
|---|---|---|---|
| 0 | recvonly (A's mic) | recvonly (B's mic) | recvonly (C's mic) |
| 1 | sendonly (← B) | sendonly (← A) | sendonly (← A) |
| 2 | sendonly (← C) | sendonly (← C) | sendonly (← B) |
AudioSourceMapping to A: added {mid=1→B.uid, mid=2→C.uid}
AudioSourceMapping to B: added {mid=1→A.uid, mid=2→C.uid}
AudioSourceMapping to C: added {mid=1→A.uid, mid=2→B.uid}
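The mappings in this example can be reproduced with a few lines. The sketch assumes mid 0 is reserved for the microphone and that outbound mids are numbered from 1 in peer order, as in the layout above. The helper `build_mappings` is hypothetical and stands in for the real selection policy, which is trivial here because every listener has only two other peers.

```python
def build_mappings(peers, cap):
    """Return {listener: {mid: source}} with outbound mids numbered from 1."""
    mappings = {}
    for listener in peers:
        others = [p for p in peers if p != listener][:cap]  # never the listener itself
        mappings[listener] = {str(i + 1): src for i, src in enumerate(others)}
    return mappings


example = build_mappings(["A", "B", "C"], cap=2)
```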
2.10.9. Lifecycle
Audio media tracks are negotiated as part of the initial SDP offer/answer described in Signaling. They become active as soon as DTLS-SRTP completes for that bundle; there is no separate StartAudio command. ShutdownCommand and any transport-level close end all audio tracks.
Mid-session reconfiguration of codec, sample rate or channel count is not supported: changing any of these AudioConfig fields requires a new SetupCommand in a fresh session. Changes to maxInboundStreams, selectionPolicy, proximityRadiusMetres and evictionGraceMs MAY be applied at runtime by issuing a fresh SetupCommand with the same session_id; in this case clients MUST re-apply the new policy parameters without dropping cached state.
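A client or server can classify a proposed configuration change before acting on it. The sketch below is illustrative only: the dictionary keys are hypothetical snake_case stand-ins for the fields named above, and `classify_change` is not part of the protocol.

```python
# Hypothetical key names standing in for the AudioConfig fields named in the text.
FIXED_FIELDS = ("codec", "sample_rate", "channels")
RUNTIME_FIELDS = (
    "max_inbound_streams",
    "selection_policy",
    "proximity_radius_metres",
    "eviction_grace_ms",
)


def classify_change(old: dict, new: dict) -> str:
    """Return 'new_session', 'runtime' or 'unchanged' for a config transition."""
    if any(old[key] != new[key] for key in FIXED_FIELDS):
        return "new_session"  # codec / sample rate / channels cannot change mid-session
    if any(old[key] != new[key] for key in RUNTIME_FIELDS):
        return "runtime"      # re-apply policy parameters, keep cached state
    return "unchanged"
```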