2.10. Audio

Audio is carried as one or more WebRTC media tracks (RTP / SRTP) negotiated in the same SDP exchange that creates the data channels described in Data Transfer. Each track carries Opus, with parameters published to the client in AudioConfig (inside SetupCommand).

In a multi-client (room) session the server acts as a Selective Forwarding Unit (SFU): each client’s microphone arrives at the server on one inbound track, and the server forwards a subset of those tracks to every other client as separate outbound tracks. Mapping of tracks to source client uids and admission decisions are signalled over the reliable data channel.

2.10.1. Codec and RTP parameters

Audio uses Opus (RFC 6716, payload format RFC 7587) with the following defaults, all of which may be changed via AudioConfig (inside SetupCommand):

Parameter	Default	Notes
RTP payload type	111	Dynamic; advertised in the SDP `a=rtpmap` attribute.
Sample rate	48000 Hz	The Opus clock-rate. Decoder may resample for playback.
Channels	1	2 (stereo) is permitted for music-grade sources.
Frame duration	20 ms	Permitted: 10, 20, 40, 60 ms.
In-band FEC	on	Allows the decoder to reconstruct a lost packet from the next one.
DTX	on	Discontinuous transmission during silence.

Implementations MUST advertise these in SDP (a=fmtp:111 useinbandfec=1;usedtx=1;…) and MUST honour the values received from the peer in the answered SDP.
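As an illustration of the attributes involved, the sketch below builds the SDP lines for one Opus m-section from the defaults in the table. The helper name `opus_sdp_lines` and its parameters are hypothetical, and using `a=ptime` to carry the frame duration is an assumption rather than something this section mandates. Per RFC 7587 the `rtpmap` is always `opus/48000/2`, and stereo is signalled through `fmtp` parameters.

```python
def opus_sdp_lines(payload_type=111, stereo=False, frame_ms=20, fec=True, dtx=True):
    """Build illustrative SDP attribute lines for one Opus m-section."""
    params = [f"useinbandfec={int(fec)}", f"usedtx={int(dtx)}"]
    if stereo:
        # RFC 7587 signals stereo via fmtp parameters, not via the rtpmap.
        params += ["stereo=1", "sprop-stereo=1"]
    return [
        # RFC 7587 fixes the rtpmap at 48000 Hz and 2 channels for every Opus stream.
        f"a=rtpmap:{payload_type} opus/48000/2",
        f"a=fmtp:{payload_type} " + ";".join(params),
        # Assumption: the frame duration is advertised as the packet time.
        f"a=ptime:{frame_ms}",
    ]


if __name__ == "__main__":
    print("\n".join(opus_sdp_lines()))
```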

2.10.2. Topology

For a session with N participants the server provisions transceivers per peer as follows. P denotes the per-listener cap maxInboundStreams; when P is 0 (no limit), the sendonly count is N-1.

Transceiver	Direction (server view)	Purpose
1 × per peer	`recvonly`	The peer’s microphone arriving at the server.
`min(P, N-1)` × per peer	`sendonly`	One outbound voice per other peer that the SFU has selected for this listener.

Tracks are identified end-to-end by their SDP mid attribute. The mapping mid → sourceClientUid is delivered on the reliable channel via the AudioSourceMapping command; clients MUST NOT rely on parsing a=msid or any other SDP attribute for source identification.

A client that does not provide microphone input still receives sendonly transceivers from the server (it is a listener); it may negotiate inactive on its own outbound m-line.
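The transceiver counts from the table above can be computed as in the following sketch. The function name `server_transceivers` is hypothetical. Treating `maxInboundStreams == 0` as "N-1 outbound transceivers" follows from the "no limit" meaning given in AudioConfig, and any extra slots caused by symmetric routing are not modelled here.

```python
def server_transceivers(n_participants: int, max_inbound_streams: int) -> dict:
    """Per-peer transceiver counts, seen from the server."""
    others = max(n_participants - 1, 0)
    if max_inbound_streams == 0:
        # 0 means "no limit": one outbound transceiver per other peer.
        sendonly = others
    else:
        sendonly = min(max_inbound_streams, others)
    # Exactly one inbound transceiver carries the peer's own microphone.
    return {"recvonly": 1, "sendonly": sendonly}
```

For example, a 10-peer room with a cap of 4 would give each peer one `recvonly` and four `sendonly` transceivers under these assumptions.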

2.10.3. `AudioConfig` (inside `SetupCommand`)

A 17-byte block inside SetupCommand describing the audio configuration the server will use for this session. Clients MUST configure their decoder and microphone path to match.

Table 2.61 AudioConfig
Bytes	Type	Description
1	uint8	`codec`. `0` = audio disabled (no media tracks will be negotiated); `1` = Opus. Other values reserved.
1	uint8	`rtpPayloadType` (0–127). Must match the value in SDP.
4	uint32	`sampleRateHz`. 48000 for Opus.
1	uint8	`channelCount`. 1 or 2.
1	uint8	`frameDurationMs`. 10, 20, 40 or 60.
1	uint8	`flags`. Bit 0: in-band FEC. Bit 1: DTX. Bit 2: symmetric routing (see Selection policy and caps). Other bits reserved, MUST be zero.
1	uint8	`maxInboundStreams`. Per-listener cap. `0` means “no limit”; otherwise the SFU will forward at most this many concurrent voices to this client.
1	uint8	`selectionPolicy`. `0` = `All` (no selection, requires `maxInboundStreams == 0`), `1` = `Fifo`, `2` = `Proximity`, `3` = `ActiveSpeaker`, `4` = `Custom` (server-side, opaque to client). See Selection policy and caps.
4	float	`proximityRadiusMetres`. Used only when `selectionPolicy == Proximity`; informational for other policies.
2	uint16	`evictionGraceMs`. Hysteresis applied by the SFU before evicting a peer that has fallen out of the selected set. `0` disables hysteresis.

If codec == 0 no audio media tracks are present in the SDP, no AudioSourceMapping or AudioParticipantStateChange commands will be sent, and any client microphone state is ignored.
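A minimal decoding sketch for the 17-byte block follows. It assumes little-endian byte order and fields packed in table order with no padding; this section does not state the byte order, so that choice should be checked against the rest of the specification. The names `AUDIO_CONFIG`, `AudioConfig` and `parse_audio_config` are illustrative only.

```python
import struct
from typing import NamedTuple

# Assumption: little-endian, packed in table order, no padding (17 bytes).
AUDIO_CONFIG = struct.Struct("<BBIBBBBBfH")


class AudioConfig(NamedTuple):
    codec: int
    rtp_payload_type: int
    sample_rate_hz: int
    channel_count: int
    frame_duration_ms: int
    flags: int
    max_inbound_streams: int
    selection_policy: int
    proximity_radius_metres: float
    eviction_grace_ms: int


def parse_audio_config(block: bytes) -> AudioConfig:
    """Decode and sanity-check an AudioConfig block."""
    if len(block) != AUDIO_CONFIG.size:
        raise ValueError(f"expected {AUDIO_CONFIG.size} bytes, got {len(block)}")
    cfg = AudioConfig(*AUDIO_CONFIG.unpack(block))
    if cfg.flags & ~0b111:
        # Only bits 0..2 are defined; the rest MUST be zero.
        raise ValueError("reserved flag bits are set")
    if cfg.selection_policy == 0 and cfg.max_inbound_streams != 0:
        raise ValueError("policy All requires maxInboundStreams == 0")
    return cfg
```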

SetupCommand.audio_input_enabled remains the gate on client-to-server microphone capture (the inbound transceiver on the server is set to inactive if it is zero).

2.10.4. Selection policy and caps

When the room has more potential speakers than maxInboundStreams, the SFU chooses which sources each listener hears according to selectionPolicy:

Policy	Rule
`All`	No selection: forward every other participant to every listener. Requires `maxInboundStreams == 0`.
`Fifo`	Forward the first `maxInboundStreams` peers (by join order) to every listener.
`Proximity`	Forward the `maxInboundStreams` peers whose avatars are closest to the listener’s avatar in world space, subject to `proximityRadiusMetres`.
`ActiveSpeaker`	Forward the `maxInboundStreams` peers with the highest recent audio energy.
`Custom`	Selection is performed by application code on the server. Clients treat the resulting AudioSourceMapping updates as authoritative.

When the symmetric routing flag (AudioConfig.flags bit 2) is set, the SFU guarantees that if A is in B’s selected set then B is in A’s selected set; this may cause the actual forwarded count to exceed maxInboundStreams by at most one per pair affected.

The SFU MUST NOT forward a participant’s own microphone back to them (loopback suppression).
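One possible reading of the `Proximity` rule combined with loopback suppression is sketched below. It assumes that "subject to `proximityRadiusMetres`" means peers beyond the radius are never selected, and that positions are plain coordinate tuples keyed by uid; both are assumptions, as is the function name `select_proximity`. Symmetric routing is not modelled.

```python
import math


def select_proximity(listener, positions, max_inbound_streams, radius):
    """Return the uids forwarded to `listener`, nearest first."""
    origin = positions[listener]
    in_range = []
    for uid, pos in positions.items():
        if uid == listener:
            continue  # loopback suppression: never forward a peer to itself
        distance = math.dist(origin, pos)
        if distance <= radius:
            in_range.append((distance, uid))
    in_range.sort()  # nearest first; ties broken by uid
    if max_inbound_streams:
        # A cap of 0 means "no limit", so only truncate for non-zero caps.
        in_range = in_range[:max_inbound_streams]
    return [uid for _, uid in in_range]
```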

Selection is recomputed on a server-defined cadence and on every join/leave. To avoid UI thrash on a peer hovering at the selection boundary, the server SHOULD apply the evictionGraceMs hysteresis before removing a transceiver that has just dropped out of the selected set.
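The hysteresis can be sketched as a small timer table kept per listener. This is one possible server-side shape, not a prescribed one: the class name `EvictionGrace`, the millisecond clock passed in by the caller, and the choice to keep forwarding a source during its grace window (which can temporarily exceed the cap) are all assumptions.

```python
class EvictionGrace:
    """Delays eviction of sources that have just left the selected set."""

    def __init__(self, grace_ms: int):
        self.grace_ms = grace_ms
        self.dropped_at = {}  # uid -> time (ms) it first fell out of selection

    def forwarded(self, previous: set, selected: set, now_ms: int) -> set:
        """Return the set to forward, given the last forwarded set."""
        keep = set(selected)
        for uid in previous - selected:
            since = self.dropped_at.setdefault(uid, now_ms)
            if now_ms - since < self.grace_ms:
                keep.add(uid)  # still inside the grace window
            else:
                del self.dropped_at[uid]  # grace expired: evict now
        for uid in selected:
            # Back in the selected set: forget any running timer.
            self.dropped_at.pop(uid, None)
        return keep
```

With `grace_ms == 0` the comparison never succeeds, so a source is evicted on the first recomputation after it drops out, which matches "`0` disables hysteresis".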

2.10.5. `AudioSourceMapping` command

Sent by the server to a client whenever the set of audio tracks delivered to that client changes (a peer joined, left, was admitted by the selection policy, or was evicted). Carried on the reliable channel as a standard server-to-client command.

Table 2.62 AudioSourceMapping
Bytes	Type	Description
1	CommandPayloadType	`AudioSourceMapping` (id assigned in Commands from Server to Client).
2	uint16	`addedCount` = A.
2	uint16	`removedCount` = R.
variable	AddedEntry[A]	Each: `uint8 midLength`, `midLength` UTF-8 bytes (SDP `mid`), `uint64 sourceClientUid`.
variable	RemovedEntry[R]	Each: `uint8 midLength`, `midLength` UTF-8 bytes.

A client MUST treat the mapping as cumulative state: an Added entry whose mid is already known replaces the existing sourceClientUid; a Removed entry whose mid is unknown is ignored. If a mapping arrives for a mid whose transceiver is not yet known locally (renegotiation race), the client MUST buffer it and apply it when the transceiver appears.

The very first AudioSourceMapping of a session may be sent with A == 0 and R == 0 to mean “the audio subsystem is ready; no peers are currently selected”.
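The wire layout and the cumulative-state rules above might be handled on the client roughly as follows. The sketch assumes little-endian integers and that the `CommandPayloadType` byte has already been consumed. It performs no bounds checking, and it applies Added entries before Removed entries, an ordering this section does not specify. `parse_audio_source_mapping` and `SourceMap` are hypothetical names.

```python
import struct


def parse_audio_source_mapping(body: bytes):
    """Parse the bytes that follow the CommandPayloadType byte."""
    added_count, removed_count = struct.unpack_from("<HH", body, 0)
    offset = 4

    def read_mid() -> str:
        nonlocal offset
        length = body[offset]
        mid = body[offset + 1 : offset + 1 + length].decode("utf-8")
        offset += 1 + length
        return mid

    added = {}
    for _ in range(added_count):
        mid = read_mid()
        (uid,) = struct.unpack_from("<Q", body, offset)
        offset += 8
        added[mid] = uid
    removed = [read_mid() for _ in range(removed_count)]
    return added, removed


class SourceMap:
    """Cumulative mid -> sourceClientUid state kept by a client."""

    def __init__(self):
        self.by_mid = {}         # mappings for transceivers that exist locally
        self.pending = {}        # mappings buffered until the transceiver appears
        self.known_mids = set()  # mids of locally known transceivers

    def apply(self, added: dict, removed: list) -> None:
        for mid, uid in added.items():
            # A repeated mid replaces the previous uid; unknown mids are buffered.
            target = self.by_mid if mid in self.known_mids else self.pending
            target[mid] = uid
        for mid in removed:
            # Removing an unknown mid is silently ignored.
            self.by_mid.pop(mid, None)
            self.pending.pop(mid, None)

    def on_transceiver(self, mid: str) -> None:
        """Call when renegotiation surfaces a transceiver with this mid."""
        self.known_mids.add(mid)
        if mid in self.pending:
            self.by_mid[mid] = self.pending.pop(mid)
```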

2.10.6. `AudioParticipantStateChange` command

Sent by the server to inform a client of changes to the audio state of other participants whose presence is otherwise visible (i.e. they are nodes in the scene). This is distinct from the AudioSourceMapping command, which describes the transport-level set; AudioParticipantStateChange describes intent and is used to render UI (“out of range”, “muted”, “left”) so that the user does not mistake silence for a fault.

Table 2.63 AudioParticipantStateChange
Bytes	Type	Description
1	CommandPayloadType	`AudioParticipantStateChange` (id assigned in Commands from Server to Client).
2	uint16	`updateCount` = N.
10 × N	Update[N]	Each: `uint64 sourceClientUid`, `uint8 state`, `uint8 reason`.

state:

0 Streaming — audio is being forwarded to this listener.
1 Culled — known participant excluded by the selection policy.
2 Disabled — participant is in the room but their microphone is off / muted by application policy.
3 Left — participant has disconnected.

reason is informational and may be 0 (none). Defined non-zero values: 1 ProximityOut, 2 CapExceeded, 3 PolicyEvicted, 4 ServerMuted, 5 SelfMuted.
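A decoding sketch for the fixed-size update records follows, again assuming little-endian integers and that the `CommandPayloadType` byte has already been stripped. `reason` is kept as a raw integer so that values not listed above are tolerated. The names `AudioState`, `UPDATE` and `parse_state_change` are illustrative.

```python
import struct
from enum import IntEnum


class AudioState(IntEnum):
    STREAMING = 0
    CULLED = 1
    DISABLED = 2
    LEFT = 3


# Assumption: little-endian; uint64 uid + uint8 state + uint8 reason = 10 bytes.
UPDATE = struct.Struct("<QBB")


def parse_state_change(body: bytes):
    """Return a list of (sourceClientUid, AudioState, reason) tuples."""
    (count,) = struct.unpack_from("<H", body, 0)
    if len(body) != 2 + count * UPDATE.size:
        raise ValueError("length does not match updateCount")
    updates = []
    for i in range(count):
        uid, state, reason = UPDATE.unpack_from(body, 2 + i * UPDATE.size)
        # Unknown state values raise ValueError; reason stays a plain int.
        updates.append((uid, AudioState(state), reason))
    return updates
```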

2.10.7. Join and leave

When peer X joins a room that already contains peers Y₁, …, Y_k:

1. The server adds, on X’s PeerConnection: one recvonly transceiver for X’s microphone, plus up to maxInboundStreams sendonly transceivers carrying the SFU-selected subset of {Y_i}.
2. For each Y_i whose selection set now contains X, the server adds one sendonly transceiver on Y_i’s PeerConnection and triggers renegotiation per Signaling.
3. The server sends AudioSourceMapping to X (listing all admitted Y_i) and to every Y_i whose set changed.
4. The server sends AudioParticipantStateChange to surface the user-visible state changes.

When peer X leaves, the reverse applies: outbound transceivers carrying X are stopped on every affected peer, AudioSourceMapping carries the removed mids, and AudioParticipantStateChange carries Left for X.
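For the leave path, the removed mids that each remaining listener should be told about can be derived from the server's per-listener mapping tables. The sketch below assumes the server keeps a dictionary of listener uid to `{mid: sourceClientUid}`; that bookkeeping structure and the name `removals_on_leave` are hypothetical.

```python
def removals_on_leave(mappings: dict, leaving_uid: int) -> dict:
    """Return {listener uid: [mids to list as Removed]} for a departing peer."""
    removed = {}
    for listener, by_mid in mappings.items():
        if listener == leaving_uid:
            continue  # the departing peer's own connection is being torn down
        mids = [mid for mid, uid in by_mid.items() if uid == leaving_uid]
        if mids:
            # Only listeners that were actually receiving the peer are notified.
            removed[listener] = mids
    return removed
```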

2.10.8. Example

A 3-peer room with codec=Opus, maxInboundStreams=2, selectionPolicy=Proximity, symmetric routing on:

Peer A's PeerConnection:        Peer B's PeerConnection:        Peer C's PeerConnection:
  mid=0  recvonly  (A's mic)      mid=0  recvonly  (B's mic)      mid=0  recvonly  (C's mic)
  mid=1  sendonly  (← B)          mid=1  sendonly  (← A)          mid=1  sendonly  (← A)
  mid=2  sendonly  (← C)          mid=2  sendonly  (← C)          mid=2  sendonly  (← B)

AudioSourceMapping to A: added {mid=1→B.uid, mid=2→C.uid}
AudioSourceMapping to B: added {mid=1→A.uid, mid=2→C.uid}
AudioSourceMapping to C: added {mid=1→A.uid, mid=2→B.uid}

2.10.9. Lifecycle

Audio media tracks are negotiated as part of the initial SDP offer/answer described in Signaling. They become active as soon as DTLS-SRTP completes for that bundle; there is no separate StartAudio command. ShutdownCommand and any transport-level close end all audio tracks.

Mid-session reconfiguration of codec, sample rate or channel count is not supported: changing any of these fields requires a fresh session with a new SetupCommand. Changes to maxInboundStreams, selectionPolicy, proximityRadiusMetres and evictionGraceMs MAY be applied at runtime by issuing a fresh SetupCommand with the same session_id; in this case clients MUST apply the new policy parameters without dropping cached state.
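A client could classify an incoming AudioConfig change with a check like the one below. The four runtime-mutable fields are taken from the paragraph above. Treating every other field (including `flags`, `frameDurationMs` and `rtpPayloadType`, which the text does not classify) as requiring a new session is a conservative assumption of this sketch, and `classify_reconfig` is a hypothetical helper.

```python
RUNTIME_MUTABLE = {
    "maxInboundStreams",
    "selectionPolicy",
    "proximityRadiusMetres",
    "evictionGraceMs",
}


def classify_reconfig(old: dict, new: dict) -> str:
    """Return 'unchanged', 'runtime' or 'new_session' for a config change."""
    changed = {key for key in old.keys() | new.keys() if old.get(key) != new.get(key)}
    if not changed:
        return "unchanged"
    if changed <= RUNTIME_MUTABLE:
        return "runtime"  # policy-only change: apply in place
    # Assumption: any other changed field needs a fresh session.
    return "new_session"
```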
