fix(coderd/x/chatd): deliver out-of-order durable messages on subscribe by mafredri · Pull Request #25433 · coder/coder

mafredri · 2026-05-18T11:45:24Z

The merge goroutine in SubscribeAuthorized tracks delivered messages with a monotonic lastMessageID and keys both the durable cache lookup and the DB fallback off it. When publishMessage runs concurrently from PromoteQueued and the worker that resumes after the promote, the cache can be populated with the worker's higher-ID resumed-assistant message before the lower-ID promoted user message; PG NOTIFY commit reordering on a busy connection pool can also flip the order in which the subscriber sees notifies. Either way lastMessageID advances past the lower ID and the late notify rescans nothing, so the promoted user message is silently dropped from the live stream.

Look up the durable cache from min(notify.AfterMessageID, lastMessageID) and dedupe deliveries through a set seeded from the initial DB snapshot, so the gap range can be rescanned without re-emitting messages we already sent. The DB fallback now uses the same lookupAfter so it can also backfill messages missing from the cache.

Closes CODAGT-357.

Investigation notes

Root cause was located by reproducing TestPromoteQueuedWhileRequiresAction under stress on this workspace (N=48 parallel test binaries, 40 rounds each) and instrumenting cacheDurableMessage plus the merge goroutine's notify branch. Every failure in 14/1900 runs shows the same pattern:

cacheDurableMessage id=4 role=assistant cache_ids=[4]
cacheDurableMessage id=5 role=tool      cache_ids=[4 5]
cacheDurableMessage id=7 role=assistant cache_ids=[4 5 7]   <- worker's resumed assistant
notify after_message_id=6  lastMessageID=4
notify cache lookup        lastMessageID=4 cached_ids=[5 7]
delivering id=5 role=tool
delivering id=7 role=assistant
cacheDurableMessage id=6 role=user      cache_ids=[4 5 7 6]   <- promoted, too late
notify after_message_id=4  lastMessageID=7
notify cache lookup        lastMessageID=7 cached_ids=[]
notify after_message_id=5  lastMessageID=7
notify cache lookup        lastMessageID=7 cached_ids=[]

dlv exec traces against the unfixed and fixed binaries (deterministic dbmock reproducer) confirm goroutine-by-goroutine:

Unfixed: cache fills [4, 5, 7], merge delivers 5, 7, then 6 lands; the next lookup at lastMessageID=7 returns [], so 6 is never emitted.

Fixed: same cache fill, 5 and 7 are delivered with lookupAfter=4. On the late notify with AfterMessageID=5, lookupAfter=min(7,5)=5 rescans the cache, the delivered set skips the already-emitted 7, 6 is delivered.

Validation:

go test ./coderd/x/chatd and ./enterprise/coderd/x/chatd pass.
go test -race ./coderd/x/chatd -run TestSubscribe\|TestPromoteQueued passes.
Stress harness N=48 ROUNDS=40 (1920 runs): 0 failures, was ~7% before the change.
New deterministic test TestSubscribeDeliversOutOfOrderDurableMessage in coderd/x/chatd/subscribe_out_of_order_internal_test.go fails in ~2s without the fix and passes with it.

A producer-side alternative (per-chat publish mutex held across PromoteQueued's post-TX publishes) would also close the producer race, but a consumer-side fix is more general because PG NOTIFY commit reordering can produce out-of-order delivery even without producer overlap. Happy to layer both if you'd prefer.

🤖 This PR was created with the help of Coder Agents, and will be reviewed by a human. 🏂🏻

PromoteQueued (caches synth + promoted) and the resumed worker (caches its assistant message) race against each other on the same per-chat durableMessages slice. The PG NOTIFY for synth can block under load while the worker commits its assistant first, so the cache ends up appending in [synth, resumed_assistant, promoted] order. The merge goroutine in SubscribeAuthorized then delivers synth and resumed_assistant from cache, advances lastMessageID past the promoted user ID, and silently drops promoted both on the cache rescan and on the DB fallback (also keyed by lastMessageID). Use min(notify.AfterMessageID, lastMessageID) for both lookups and dedupe via a delivered-ID set seeded from the initial DB snapshot. This rescans the gap range without re-emitting messages we already delivered, and the DB fallback now uses the same lookupAfter so it backfills any messages missing from the cache too. Adds subscribe_out_of_order_internal_test.go which deterministically reproduces the failure via dbmock; without the fix it times out at the promoted user assertion, with the fix it passes. Closes CODAGT-357.

github-actions Bot assigned mafredri May 18, 2026

mafredri force-pushed the mathias/codagt-357-flake-testpromotequeuedwhilerequiresaction branch from 87e5059 to 6e1e5e0 Compare May 18, 2026 11:51

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(coderd/x/chatd): deliver out-of-order durable messages on subscribe#25433

fix(coderd/x/chatd): deliver out-of-order durable messages on subscribe#25433
mafredri wants to merge 1 commit into
mainfrom
mathias/codagt-357-flake-testpromotequeuedwhilerequiresaction

mafredri commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mafredri commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant