Summary
Calling `await room.sid` from inside an agent's `JobContext` entrypoint occasionally hangs forever, even though the LiveKit server has assigned a SID to the room (visible in Cloud Analytics). Other FFI-driven events for the same `Room` instance (`participant_connected`, `track_published`, `connection_quality_changed`, SIP `hangup` → `participant_disconnected`) continue to flow normally throughout the hang.
The caller experiences silent dial-tone — the agent never publishes audio — until they give up and hang up.
Environment
- `livekit==1.1.5`
- `livekit-agents==1.5.6`
- LiveKit Cloud, SIP-inbound voice agents, ~10 concurrent sessions per worker pod
Mechanism
`Room._first_sid_future` is created in `Room.__init__` and is only resolved by the `room_sid_changed` FFI event handler:
```python
# rtc/room.py:189
@property
async def sid(self) -> str:
    if self._info.sid:
        return self._info.sid
    # Suspends until room_sid_changed resolves the future
    return await self._first_sid_future

# rtc/room.py:770
elif which == "room_sid_changed":
    if not self._info.sid:
        # The only place the future is resolved
        self._first_sid_future.set_result(event.room_sid_changed.sid)
    self._info.sid = event.room_sid_changed.sid
```
In the failure case, `_first_sid_future` is never set. No exception, no log; the awaiter just suspends forever. Meanwhile the same `Room` instance continues to receive and dispatch all other FFI events.
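The behavior can be modeled without LiveKit at all. The sketch below is an illustration only: `FakeRoom` and its `dispatch` method are hypothetical stand-ins (not SDK code) that mirror the excerpt above. Delivering every event except `room_sid_changed` leaves the awaiter suspended while the other handlers keep running.

```python
import asyncio


class FakeRoom:
    """Hypothetical stand-in for the SDK's Room; models only the SID future."""

    def __init__(self):
        self._sid = ""
        self._first_sid_future = asyncio.get_running_loop().create_future()
        self.seen = []  # names of the other events that were dispatched

    @property
    async def sid(self):
        if self._sid:
            return self._sid
        return await self._first_sid_future

    def dispatch(self, which, sid=""):
        # Simplified model of the FFI event handler.
        if which == "room_sid_changed":
            if not self._sid:
                self._first_sid_future.set_result(sid)
            self._sid = sid
        else:
            self.seen.append(which)


async def demo():
    room = FakeRoom()
    waiter = asyncio.create_task(room.sid)
    # Every event arrives except room_sid_changed.
    for which in ("participant_connected", "track_published"):
        room.dispatch(which)
        await asyncio.sleep(0)  # let the waiter task run
    still_pending = not waiter.done()
    waiter.cancel()  # clean up so the demo can exit
    return still_pending, room.seen


still_pending, seen = asyncio.run(demo())
print(still_pending, seen)
```

The waiter stays pending even though both other events were recorded, which matches the symptom: nothing fails, the coroutine simply never resumes.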
Recurrence
- 21 confirmed occurrences over 30 days in one production deployment (~0.7/day)
- Rate is steady — not correlated with any deploy or service version
- Affects multiple pods, multiple service versions, multiple SIP callers
- Each occurrence costs one customer call (caller hangs up at ~60s SIP ringback timeout)
Evidence the SID exists server-side
For affected sessions, the LiveKit Cloud Analytics REST API (`GET /api/project//sessions/`) returns the room with the expected SID. The server assigned it; only the SDK's local future never resolves.
Suspected cause
Race or lost event in the FFI event queue / event-loop bridge specific to `room_sid_changed`. We haven't reproduced it in development yet — the rate is too low.
Workaround we're applying
```python
import asyncio

# Inside the agent entrypoint, where `room` is the connected Room
try:
    sid = await asyncio.wait_for(room.sid, timeout=5.0)
except asyncio.TimeoutError:
    # log + fall back; abort the job so the worker frees up
    ...
```
This converts the silent hang into a recoverable error.
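The same guard can be wrapped in a small helper so every entrypoint applies it consistently. This is a minimal sketch, not SDK code: `sid_or_none` and `StuckRoom` are hypothetical names, and `StuckRoom` only simulates a room whose SID future never resolves.

```python
import asyncio


async def sid_or_none(room, timeout=5.0):
    """Return the room SID, or None if it does not resolve within `timeout` seconds."""
    try:
        return await asyncio.wait_for(room.sid, timeout=timeout)
    except asyncio.TimeoutError:
        return None


class StuckRoom:
    """Hypothetical stand-in: awaiting `sid` never completes on its own."""

    @property
    async def sid(self):
        # A fresh future that nobody resolves, simulating the lost event.
        return await asyncio.get_running_loop().create_future()


async def main():
    # Short timeout so the demo finishes quickly.
    return await sid_or_none(StuckRoom(), timeout=0.05)


result = asyncio.run(main())
print(result)
```

One caveat: when `asyncio.wait_for` times out it cancels the awaited coroutine, and cancelling a coroutine that is suspended on a future cancels that future as well. If a late `room_sid_changed` should still be able to resolve the SID after a timeout, wrapping the awaitable in `asyncio.shield` is one option.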
What would help from upstream
- Confirmation of whether this is a known/possible failure mode
- An internal timeout on `_first_sid_future` after `connect()` succeeds, raising `RuntimeError` rather than hanging — so callers don't have to arm their own timeout
- Optionally, a way to detect "room connected but SID unresolved" without polling
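To make the second request concrete, here is a rough sketch of what an internal timeout could look like. Everything in it is an assumption for illustration: `RoomSketch`, `SID_TIMEOUT`, and the error message are hypothetical and do not reflect how the SDK is or would be implemented.

```python
import asyncio


class RoomSketch:
    """Hypothetical model of the requested behavior; not the SDK's Room."""

    SID_TIMEOUT = 0.05  # illustrative value; a real default would be seconds

    def __init__(self):
        self._sid = ""
        self._first_sid_future = asyncio.get_running_loop().create_future()

    @property
    async def sid(self):
        if self._sid:
            return self._sid
        try:
            # shield() keeps the timeout from cancelling the shared future,
            # so a late event could still resolve it.
            return await asyncio.wait_for(
                asyncio.shield(self._first_sid_future),
                timeout=self.SID_TIMEOUT,
            )
        except asyncio.TimeoutError:
            raise RuntimeError("room connected but SID was not received") from None


async def main():
    room = RoomSketch()
    try:
        await room.sid
    except RuntimeError as exc:
        return str(exc), room._first_sid_future.cancelled()


message, future_cancelled = asyncio.run(main())
print(message, future_cancelled)
```

With something along these lines, callers would get a `RuntimeError` instead of an indefinite suspension, and the underlying future would stay usable after the timeout.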
Happy to share anonymized trace fingerprints if useful.