
scrape: add dns_refresh_interval to force periodic DNS re-resolution for FQDN targets #18827

Draft
rajnish-jais wants to merge 2 commits into prometheus:main from rajnish-jais:rajnish-jais/dns-refresh-interval

rajnish-jais commented May 31, 2026


Which issue(s) does the PR fix:

Fixes #18387

[FEATURE] Scrape: add `dns_refresh_interval` to `scrape_config` to periodically close idle HTTP connections and force DNS re-resolution for FQDN targets.

What does this PR do?

Adds a dns_refresh_interval field to ScrapeConfig. When it is set to a non-zero duration, a background goroutine periodically calls client.CloseIdleConnections() on the scrape pool's HTTP client. The next scrape then has to open a new TCP connection, which in turn triggers a fresh DNS lookup.

This fixes the case where a target is addressed by FQDN (e.g. a CNAME that points to a changing A record) and Prometheus keeps connecting to a stale IP. http.Transport reuses an idle keep-alive connection for as long as it remains usable, and DialContext (and thus the OS resolver) is invoked only when a new connection is opened.

Why this approach?

Root cause analysis (from the issue thread): the bug is in the scrape layer's connection reuse, not in Consul SD or any other SD mechanism. Any SD that provides an FQDN in __address__ is affected.

The fix is opt-in: the zero value (default) disables the behaviour and preserves existing connection-reuse semantics. The approach mirrors discovery/dns's refresh_interval pattern already present in the codebase.

Changes

  • config/config.go: adds DNSRefreshInterval model.Duration to ScrapeConfig
  • scrape/scrape.go: adds dnsRefreshInterval to scrapePool; wires it from config in newScrapePool and reload; adds runDNSRefresh() goroutine
  • scrape/scrape_test.go: adds TestScrapePoolDNSRefresh verifying that CloseIdleConnections is called at the configured interval

Example config

scrape_configs:
  - job_name: my-fqdn-targets
    dns_refresh_interval: 30s   # re-resolve DNS every 30 seconds
    consul_sd_configs:
      - server: consul.example.com

Design question for maintainers: periodic flush vs DNS-aware detection

I want to be upfront about a gap between this implementation and the language used in the issue thread.

@bwplotka's comment on the issue says "detecting DNS changes", which implies the ideal solution would only force reconnection when the resolved IP actually changes — DNS-aware detection. This implementation takes a simpler but broader approach: it blindly calls CloseIdleConnections() on a fixed timer, which drops all idle connections for the scrape pool (not just those whose DNS record changed), regardless of whether the IP is still the same.

Concretely, the two approaches differ in blast radius:

| Approach | How it works | Blast radius |
| --- | --- | --- |
| This PR (periodic flush) | Timer fires → CloseIdleConnections() on all idle connections | All idle connections for the job, even stable IPs |
| DNS-aware detection | Before/after each dial, resolve hostname → compare IP → break connection if changed | Only connections whose backing IP actually changed |

The DNS-aware approach would require wrapping the DialContext used by the scrape pool's http.Transport to cache the last-resolved IP per hostname and force a new dial when it changes. This is more surgical but more complex (and requires storing per-host resolver state).

Question for reviewers: Is the periodic CloseIdleConnections() approach acceptable here, or would you prefer the DNS-aware DialContext wrapper? Happy to implement the more targeted approach if that's the direction — wanted to surface this trade-off before investing further.

Open questions

  1. Should there be a global default in GlobalConfig (like scrape_interval has), or is zero/disabled the right default?
  2. Is dns_refresh_interval the right name, or would something like connection_refresh_interval be more accurate (since CloseIdleConnections affects all idle connections, not just DNS-backed ones)?
  3. Any preference on whether the goroutine lives on scrapePool (as implemented) vs a ticker inside scrapeLoop.run() — the pool approach calls CloseIdleConnections once for all loops in a job, which seems preferable.

/cc @bwplotka @mrvarmazyar @roidelapluie

When a scrape target is defined by an FQDN and the underlying A/CNAME record
changes, Prometheus continues connecting to the stale IP because http.Transport
reuses idle TCP connections indefinitely. DNS is only re-resolved when a new
connection is opened.

Add a dns_refresh_interval field to ScrapeConfig. When set, scrapePool starts
a background goroutine that calls client.CloseIdleConnections() at that
interval, forcing a fresh TCP dial (and thus DNS re-resolution) on the next
scrape cycle.

The feature is opt-in: the zero value (default) disables the behaviour and
preserves the current connection-reuse semantics.

Fixes prometheus#18387
@rajnish-jais

Author

I, @rajnish-jais, have read the CLA Document and I hereby sign the DCO for all commits in this PR.

@rajnish-jais

Author

Fixed the golangci-lint failure — revive flagged the named receiver ct on RoundTrip in scrape_test.go:164 as unused. Renamed it to _.

CI should be green now. Pinging reviewers again in case the earlier failures caused the PR to be deprioritised: @bwplotka @roidelapluie — happy to address any design feedback, especially on the periodic-flush vs DNS-aware approach question raised in the description.

rajnish-jais force-pushed the rajnish-jais/dns-refresh-interval branch from b325124 to f82e822, June 4, 2026 07:16

Development

Successfully merging this pull request may close these issues.

Support periodic DNS re-resolution for FQDN targets discovered via Consul SD
