scrape: add dns_refresh_interval to force periodic DNS re-resolution for FQDN targets#18827
Draft
rajnish-jais wants to merge 2 commits into
Draft
scrape: add dns_refresh_interval to force periodic DNS re-resolution for FQDN targets#18827rajnish-jais wants to merge 2 commits into
rajnish-jais wants to merge 2 commits into
Conversation
When a scrape target is defined by an FQDN and the underlying A/CNAME record changes, Prometheus continues connecting to the stale IP because http.Transport reuses idle TCP connections indefinitely. DNS is only re-resolved when a new connection is opened. Add a dns_refresh_interval field to ScrapeConfig. When set, scrapePool starts a background goroutine that calls client.CloseIdleConnections() at that interval, forcing a fresh TCP dial (and thus DNS re-resolution) on the next scrape cycle. The feature is opt-in: the zero value (default) disables the behaviour and preserves the current connection-reuse semantics. Fixes prometheus#18387
Author
|
I, @rajnish-jais, have read the CLA Document and I hereby sign the DCO for all commits in this PR. |
Author
|
Fixed the CI should be green now. Pinging reviewers again in case the earlier failures caused the PR to be deprioritised: @bwplotka @roidelapluie — happy to address any design feedback, especially on the periodic-flush vs DNS-aware approach question raised in the description. |
9c14a64 to
b325124
Compare
b325124 to
f82e822
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Which issue(s) does the PR fix:
Fixes #18387
What does this PR do?
Adds a
dns_refresh_intervalfield toScrapeConfig. When set, a background goroutine periodically callsclient.CloseIdleConnections()on the scrape pool's HTTP client, forcing a fresh TCP dial — and therefore DNS re-resolution — on the next scrape cycle.This fixes the case where a target is addressed by FQDN (e.g. a CNAME that points to a changing A record) and Prometheus continues connecting to a stale IP because
http.Transportreuses idle TCP connections indefinitely;DialContext(and thus the OS resolver) is only invoked when a new connection is opened.Why this approach?
Root cause analysis (from the issue thread): the bug is in the scrape layer's connection reuse, not in Consul SD or any other SD mechanism. Any SD that provides an FQDN in
__address__is affected.The fix is opt-in: the zero value (default) disables the behaviour and preserves existing connection-reuse semantics. The approach mirrors
discovery/dns'srefresh_intervalpattern already present in the codebase.Changes
config/config.go: addsDNSRefreshInterval model.DurationtoScrapeConfigscrape/scrape.go: addsdnsRefreshIntervaltoscrapePool; wires it from config innewScrapePoolandreload; addsrunDNSRefresh()goroutinescrape/scrape_test.go: addsTestScrapePoolDNSRefreshverifying thatCloseIdleConnectionsis called at the configured intervalExample config
Design question for maintainers: periodic flush vs DNS-aware detection
@bwplotka's comment on the issue says "detecting DNS changes", which implies the ideal solution would only force reconnection when the resolved IP actually changes — DNS-aware detection. This implementation takes a simpler but broader approach: it blindly calls
CloseIdleConnections()on a fixed timer, which drops all idle connections for the scrape pool (not just those whose DNS record changed), regardless of whether the IP is still the same.Concretely, the two approaches differ in blast radius:
CloseIdleConnections()on all idle connectionsThe DNS-aware approach would require wrapping the
DialContextused by the scrape pool'shttp.Transportto cache the last-resolved IP per hostname and force a new dial when it changes. This is more surgical but more complex (and requires storing per-host resolver state).Question for reviewers: Is the periodic
CloseIdleConnections()approach acceptable here, or would you prefer the DNS-awareDialContextwrapper? Happy to implement the more targeted approach if that's the direction — wanted to surface this trade-off before investing further.Open questions
GlobalConfig(likescrape_intervalhas), or is zero/disabled the right default?dns_refresh_intervalthe right name, or would something likeconnection_refresh_intervalbe more accurate (sinceCloseIdleConnectionsaffects all idle connections, not just DNS-backed ones)?scrapePool(as implemented) vs a ticker insidescrapeLoop.run()— the pool approach callsCloseIdleConnectionsonce for all loops in a job, which seems preferable./cc @bwplotka @mrvarmazyar @roidelapluie