Skip to content

fix(coderd/x/chatd): deliver out-of-order durable messages on subscribe#25433

Draft
mafredri wants to merge 1 commit into
mainfrom
mathias/codagt-357-flake-testpromotequeuedwhilerequiresaction
Draft

fix(coderd/x/chatd): deliver out-of-order durable messages on subscribe#25433
mafredri wants to merge 1 commit into
mainfrom
mathias/codagt-357-flake-testpromotequeuedwhilerequiresaction

Conversation

@mafredri
Copy link
Copy Markdown
Member

The merge goroutine in SubscribeAuthorized tracks delivered messages with a monotonic lastMessageID and keys both the durable cache lookup and the DB fallback off it. When publishMessage runs concurrently from PromoteQueued and the worker that resumes after the promote, the cache can be populated with the worker's higher-ID resumed-assistant message before the lower-ID promoted user message; PG NOTIFY commit reordering on a busy connection pool can also flip the order in which the subscriber sees notifies. Either way lastMessageID advances past the lower ID and the late notify rescans nothing, so the promoted user message is silently dropped from the live stream.

Look up the durable cache from min(notify.AfterMessageID, lastMessageID) and dedupe deliveries through a set seeded from the initial DB snapshot, so the gap range can be rescanned without re-emitting messages we already sent. The DB fallback now uses the same lookupAfter so it can also backfill messages missing from the cache.

Closes CODAGT-357.

Investigation notes

Root cause was located by reproducing TestPromoteQueuedWhileRequiresAction under stress on this workspace (N=48 parallel test binaries, 40 rounds each) and instrumenting cacheDurableMessage plus the merge goroutine's notify branch. Every failure in 14/1900 runs shows the same pattern:

cacheDurableMessage id=4 role=assistant cache_ids=[4]
cacheDurableMessage id=5 role=tool      cache_ids=[4 5]
cacheDurableMessage id=7 role=assistant cache_ids=[4 5 7]   <- worker's resumed assistant
notify after_message_id=6  lastMessageID=4
notify cache lookup        lastMessageID=4 cached_ids=[5 7]
delivering id=5 role=tool
delivering id=7 role=assistant
cacheDurableMessage id=6 role=user      cache_ids=[4 5 7 6]   <- promoted, too late
notify after_message_id=4  lastMessageID=7
notify cache lookup        lastMessageID=7 cached_ids=[]
notify after_message_id=5  lastMessageID=7
notify cache lookup        lastMessageID=7 cached_ids=[]

dlv exec traces against the unfixed and fixed binaries (deterministic dbmock reproducer) confirm goroutine-by-goroutine:

Unfixed: cache fills [4, 5, 7], merge delivers 5, 7, then 6 lands; the next lookup at lastMessageID=7 returns [], so 6 is never emitted.

Fixed: same cache fill, 5 and 7 are delivered with lookupAfter=4. On the late notify with AfterMessageID=5, lookupAfter=min(7,5)=5 rescans the cache, the delivered set skips the already-emitted 7, 6 is delivered.

Validation:

  • go test ./coderd/x/chatd and ./enterprise/coderd/x/chatd pass.
  • go test -race ./coderd/x/chatd -run TestSubscribe\|TestPromoteQueued passes.
  • Stress harness N=48 ROUNDS=40 (1920 runs): 0 failures, was ~7% before the change.
  • New deterministic test TestSubscribeDeliversOutOfOrderDurableMessage in coderd/x/chatd/subscribe_out_of_order_internal_test.go fails in ~2s without the fix and passes with it.

A producer-side alternative (per-chat publish mutex held across PromoteQueued's post-TX publishes) would also close the producer race, but a consumer-side fix is more general because PG NOTIFY commit reordering can produce out-of-order delivery even without producer overlap. Happy to layer both if you'd prefer.

🤖 This PR was created with the help of Coder Agents, and will be reviewed by a human. 🏂🏻

PromoteQueued (caches synth + promoted) and the resumed worker
(caches its assistant message) race against each other on the
same per-chat durableMessages slice. The PG NOTIFY for synth can
block under load while the worker commits its assistant first, so
the cache ends up appending in [synth, resumed_assistant, promoted]
order. The merge goroutine in SubscribeAuthorized then delivers
synth and resumed_assistant from cache, advances lastMessageID
past the promoted user ID, and silently drops promoted both on
the cache rescan and on the DB fallback (also keyed by
lastMessageID).

Use min(notify.AfterMessageID, lastMessageID) for both lookups and
dedupe via a delivered-ID set seeded from the initial DB snapshot.
This rescans the gap range without re-emitting messages we already
delivered, and the DB fallback now uses the same lookupAfter so it
backfills any messages missing from the cache too.

Adds subscribe_out_of_order_internal_test.go which deterministically
reproduces the failure via dbmock; without the fix it times out at
the promoted user assertion, with the fix it passes.

Closes CODAGT-357.
@mafredri mafredri force-pushed the mathias/codagt-357-flake-testpromotequeuedwhilerequiresaction branch from 87e5059 to 6e1e5e0 Compare May 18, 2026 11:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant