Decouple metrics from hostNetwork using proxy DaemonSet #443
The bpfman-agent DaemonSet requires `hostNetwork: true` for eBPF operations such as loading XDP programs and accessing host network interfaces. It also exposes Prometheus metrics via TCP port 8443 on the host network.
In #437, the metrics service was updated to use TCP port 8443 by default. This aligned with controller-runtime's default for secure metrics endpoints and is configurable via the bpfman ConfigMap.
However, 8443 is a commonly used port and may plausibly be claimed by other host-level services or privileged containers. In clusters where `hostNetwork: true` is used, this increases the risk of port binding conflicts. The underlying issue is not the specific port but the need to bind to any TCP port on the host network.
Additionally, in PR #437, the metrics Service was not marked `clusterIP: None`, so it was not headless. As a result, Prometheus would scrape the Service's cluster IP, which performs round-robin load balancing across pods. This is not suitable for a DaemonSet, where metrics must be scraped from each pod individually (e.g., to collect per-node data). A headless Service is required to expose the full set of pod endpoints for proper per-pod scraping.
In cloud environments with restrictive security groups, the use of `hostNetwork: true` introduces additional operational complexity. Since hostNetwork pods use node IPs rather than pod IPs, Prometheus scraping across nodes fails unless explicit firewall rules or security group exceptions are configured. This creates cloud-provider-specific coupling, requires coordination with infrastructure teams, and increases deployment friction, particularly in environments like AWS where inter-node traffic to arbitrary ports is not allowed by default.
This PR proposes an architectural change that removes the need to bind a host port for metrics entirely.
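
To illustrate the headless Service point made above, here is a minimal sketch of what such a Service could look like. Only `clusterIP: None` and port 8443 come from this description; the Service name, namespace, selector label, and port name are assumptions chosen for illustration and may not match the actual manifests.

```yaml
# Sketch only: name, namespace, labels, and port name are hypothetical.
apiVersion: v1
kind: Service
metadata:
  name: bpfman-agent-metrics        # assumed name
  namespace: bpfman                 # assumed namespace
spec:
  clusterIP: None                   # headless: no cluster IP is allocated
  selector:
    app: bpfman-agent               # assumed pod label
  ports:
    - name: https-metrics           # assumed port name
      port: 8443                    # default metrics port mentioned above
      targetPort: 8443
      protocol: TCP
```

With `clusterIP: None`, no virtual IP is allocated for the Service, and a DNS lookup of the Service name returns the addresses of the individual backing pods.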
Options considered:
1. Do not expose metrics from bpfman-agent.
2. Leave metrics on hostNetwork and document the need to open firewall ports in environments that restrict inter-node traffic (e.g. cloud platforms using security groups). This is not typically required in libvirt or bare-metal environments.
3. Introduce a metrics-proxy DaemonSet (chosen).
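
As a rough sketch of the chosen option, a proxy DaemonSet along the following lines would run on the pod network (no `hostNetwork`) and reach the agent's metrics through a host-mounted Unix socket, which is why it needs `privileged: true` as noted under Cons. The image, names, labels, and socket directory are all hypothetical placeholders; this is not the manifest from this PR.

```yaml
# Sketch only: image, names, labels, and paths are hypothetical.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: bpfman-metrics-proxy                  # assumed name
  namespace: bpfman                           # assumed namespace
spec:
  selector:
    matchLabels:
      app: bpfman-metrics-proxy               # assumed label
  template:
    metadata:
      labels:
        app: bpfman-metrics-proxy
    spec:
      # hostNetwork is left unset (defaults to false), so each pod gets a pod IP.
      containers:
        - name: metrics-proxy
          image: example.com/metrics-proxy:latest   # placeholder image
          ports:
            - name: https-metrics             # assumed port name
              containerPort: 8443
          securityContext:
            privileged: true                  # needed to access the host-mounted socket
          volumeMounts:
            - name: metrics-socket
              mountPath: /var/run/bpfman-agent      # assumed mount path
      volumes:
        - name: metrics-socket
          hostPath:
            path: /var/run/bpfman-agent       # assumed host directory holding the socket
            type: Directory
```

Because the proxy pods have ordinary pod IPs, Prometheus can scrape them over the pod network without any host port being bound for metrics.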
Implementation:
Outcome:
Cons:
The metrics-proxy requires `privileged: true` to access the host-mounted Unix socket, but:
`privileged: true`