fix(coderd/x/chatd): deliver out-of-order durable messages on subscribe#25433
Draft
mafredri wants to merge 1 commit into
Draft
fix(coderd/x/chatd): deliver out-of-order durable messages on subscribe#25433mafredri wants to merge 1 commit into
mafredri wants to merge 1 commit into
Conversation
PromoteQueued (caches synth + promoted) and the resumed worker (caches its assistant message) race against each other on the same per-chat durableMessages slice. The PG NOTIFY for synth can block under load while the worker commits its assistant first, so the cache ends up appending in [synth, resumed_assistant, promoted] order. The merge goroutine in SubscribeAuthorized then delivers synth and resumed_assistant from cache, advances lastMessageID past the promoted user ID, and silently drops promoted both on the cache rescan and on the DB fallback (also keyed by lastMessageID). Use min(notify.AfterMessageID, lastMessageID) for both lookups and dedupe via a delivered-ID set seeded from the initial DB snapshot. This rescans the gap range without re-emitting messages we already delivered, and the DB fallback now uses the same lookupAfter so it backfills any messages missing from the cache too. Adds subscribe_out_of_order_internal_test.go which deterministically reproduces the failure via dbmock; without the fix it times out at the promoted user assertion, with the fix it passes. Closes CODAGT-357.
87e5059 to
6e1e5e0
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
The merge goroutine in
SubscribeAuthorizedtracks delivered messages with a monotoniclastMessageIDand keys both the durable cache lookup and the DB fallback off it. WhenpublishMessageruns concurrently fromPromoteQueuedand the worker that resumes after the promote, the cache can be populated with the worker's higher-ID resumed-assistant message before the lower-ID promoted user message; PG NOTIFY commit reordering on a busy connection pool can also flip the order in which the subscriber sees notifies. Either waylastMessageIDadvances past the lower ID and the late notify rescans nothing, so the promoted user message is silently dropped from the live stream.Look up the durable cache from
min(notify.AfterMessageID, lastMessageID)and dedupe deliveries through a set seeded from the initial DB snapshot, so the gap range can be rescanned without re-emitting messages we already sent. The DB fallback now uses the samelookupAfterso it can also backfill messages missing from the cache.Closes CODAGT-357.
Investigation notes
Root cause was located by reproducing
TestPromoteQueuedWhileRequiresActionunder stress on this workspace (N=48parallel test binaries, 40 rounds each) and instrumentingcacheDurableMessageplus the merge goroutine's notify branch. Every failure in 14/1900 runs shows the same pattern:dlv exectraces against the unfixed and fixed binaries (deterministic dbmock reproducer) confirm goroutine-by-goroutine:Unfixed: cache fills
[4, 5, 7], merge delivers5, 7, then6lands; the next lookup atlastMessageID=7returns[], so6is never emitted.Fixed: same cache fill,
5and7are delivered withlookupAfter=4. On the late notify withAfterMessageID=5,lookupAfter=min(7,5)=5rescans the cache, the delivered set skips the already-emitted7,6is delivered.Validation:
go test ./coderd/x/chatdand./enterprise/coderd/x/chatdpass.go test -race ./coderd/x/chatd -run TestSubscribe\|TestPromoteQueuedpasses.N=48 ROUNDS=40(1920 runs): 0 failures, was ~7% before the change.TestSubscribeDeliversOutOfOrderDurableMessageincoderd/x/chatd/subscribe_out_of_order_internal_test.gofails in~2swithout the fix and passes with it.A producer-side alternative (per-chat publish mutex held across
PromoteQueued's post-TX publishes) would also close the producer race, but a consumer-side fix is more general because PG NOTIFY commit reordering can produce out-of-order delivery even without producer overlap. Happy to layer both if you'd prefer.