Decouple metrics from hostNetwork using proxy DaemonSet #443

frobware · 2025-05-27T17:38:22Z

The bpfman-agent DaemonSet requires hostNetwork: true for eBPF operations such as loading XDP programs and accessing host network interfaces. It also exposes Prometheus metrics via TCP port 8443 on the host network.

In #437, the metrics service was updated to use TCP port 8443 by default. This aligned with controller-runtime’s default for secure metrics endpoints and is configurable via the bpfman ConfigMap.

However, 8443 is a commonly used port and may plausibly be claimed by other host-level services or privileged containers. In clusters where hostNetwork: true is used, this increases the risk of port binding conflicts. The underlying issue is not the specific port but the need to bind to any TCP port on the host network.
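
As an illustration of why binding any host port is the underlying problem, here is a minimal Go sketch that is not part of this PR. It shows that a second listener cannot claim a TCP address that another process already holds. The loopback address, the ephemeral port, and the function name `secondBindConflicts` are assumptions made for the demonstration.

```go
package main

import (
	"fmt"
	"net"
)

// secondBindConflicts binds an ephemeral loopback port, then tries to bind
// the same address again while the first listener is still open. It reports
// whether the second bind was rejected.
func secondBindConflicts() (bool, error) {
	first, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, fmt.Errorf("first bind failed: %w", err)
	}
	defer first.Close()

	second, err := net.Listen("tcp", first.Addr().String())
	if err != nil {
		// The address is already claimed, so the second bind is refused.
		return true, nil
	}
	second.Close()
	return false, nil
}

func main() {
	conflict, err := secondBindConflicts()
	if err != nil {
		fmt.Println("setup error:", err)
		return
	}
	fmt.Println("second bind rejected:", conflict)
}
```

With hostNetwork: true, every process on the node shares this port namespace, so whichever port is configured can collide in the same way.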

Additionally, in PR #437, the metrics Service was not marked clusterIP: None, so it was not headless. As a result, Prometheus would scrape the Service’s cluster IP, which performs round-robin load balancing across pods. This is not suitable for a DaemonSet, where metrics must be scraped from each pod individually (e.g., to collect per-node data). A headless Service is required to expose the full set of pod endpoints for proper per-pod scraping.

In cloud environments with restrictive security groups, the use of hostNetwork: true introduces additional operational complexity. Because hostNetwork pods use node IPs rather than pod IPs, Prometheus scrapes across nodes fail unless explicit firewall rules or security group exceptions are configured. This creates cloud-provider-specific coupling, requires coordination with infrastructure teams, and increases deployment friction, particularly in environments like AWS where inter-node traffic to arbitrary ports is not allowed by default.

This PR proposes an architectural change to eliminate the requirement to bind a host port for metrics entirely.

Options considered:

1. Do not expose metrics from bpfman-agent.
   - Simple
   - But eliminates observability
2. Leave metrics on hostNetwork and document the need to open firewall ports in environments that restrict inter-node traffic (e.g. cloud platforms using security groups). This is not typically required in libvirt or bare-metal environments.
   - Minimal change (would still want a headless service, for example)
   - Still subject to TCP port conflicts and blocked metrics
3. Introduce a metrics-proxy DaemonSet (chosen).
   - Avoids host-level TCP port binding
   - Enables per-pod scraping over HTTPS
   - Works across clouds without firewall changes
   - Small resource overhead

Implementation:

- Add a metrics-proxy DaemonSet that runs in the container network.
- It mounts a volume shared with bpfman-agent to access a Unix domain socket.
- It proxies metrics over HTTPS using controller-runtime’s metrics server.
- bpfman-agent now serves metrics only over the socket and no longer binds TCP ports.
- The associated Service is marked clusterIP: None to enable per-pod scraping.
- The ServiceMonitor now targets /proxy-metrics.
- The proxy HTTP client timeout is set to 8 seconds so that it fails before Prometheus’s 10-second default scrape timeout.
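
The data path described above can be sketched roughly as follows. This is a simplified illustration using only the Go standard library, not the code in this PR: the real proxy uses controller-runtime’s metrics server, whereas this sketch substitutes `net/http` and an `httptest` TLS server. The socket location, the `demo_metric` payload, and the names `newProxyHandler` and `scrapeThroughProxy` are assumptions; the `/proxy-metrics` path and the 8-second client timeout come from the description.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"net/http/httptest"
	"os"
	"path/filepath"
	"time"
)

// newProxyHandler returns a handler that fetches /metrics from the Unix
// domain socket at socketPath and relays the response to the caller.
func newProxyHandler(socketPath string, timeout time.Duration) http.Handler {
	client := &http.Client{
		Timeout: timeout, // 8s in the PR, below Prometheus's 10s default
		Transport: &http.Transport{
			// Ignore the network and address derived from the URL and
			// always dial the socket instead.
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				var d net.Dialer
				return d.DialContext(ctx, "unix", socketPath)
			},
		},
	}
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		// The host in this URL is a placeholder; only the path matters.
		req, err := http.NewRequestWithContext(r.Context(), http.MethodGet, "http://agent/metrics", nil)
		if err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
			return
		}
		resp, err := client.Do(req)
		if err != nil {
			http.Error(w, "agent metrics unavailable", http.StatusBadGateway)
			return
		}
		defer resp.Body.Close()
		if ct := resp.Header.Get("Content-Type"); ct != "" {
			w.Header().Set("Content-Type", ct)
		}
		w.WriteHeader(resp.StatusCode)
		io.Copy(w, resp.Body)
	})
}

// scrapeThroughProxy starts a stand-in agent on a temporary socket, puts the
// proxy in front of it over HTTPS, and performs one scrape.
func scrapeThroughProxy() (int, string, error) {
	dir, err := os.MkdirTemp("", "mp")
	if err != nil {
		return 0, "", err
	}
	defer os.RemoveAll(dir)
	socketPath := filepath.Join(dir, "metrics.sock")

	ln, err := net.Listen("unix", socketPath)
	if err != nil {
		return 0, "", err
	}
	agentMux := http.NewServeMux()
	agentMux.HandleFunc("/metrics", func(w http.ResponseWriter, _ *http.Request) {
		fmt.Fprintln(w, "demo_metric 1")
	})
	agent := &http.Server{Handler: agentMux}
	go agent.Serve(ln)
	defer agent.Close()

	proxyMux := http.NewServeMux()
	proxyMux.Handle("/proxy-metrics", newProxyHandler(socketPath, 8*time.Second))
	proxy := httptest.NewTLSServer(proxyMux) // self-signed test certificate
	defer proxy.Close()

	resp, err := proxy.Client().Get(proxy.URL + "/proxy-metrics")
	if err != nil {
		return 0, "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return resp.StatusCode, string(body), err
}

func main() {
	status, body, err := scrapeThroughProxy()
	if err != nil {
		fmt.Println("scrape failed:", err)
		return
	}
	fmt.Printf("status=%d body=%q\n", status, body)
}
```

Neither side binds a TCP port on the host in this arrangement: the agent listens only on the socket file, and the proxy’s listener lives in its own network namespace when it runs on the container network.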

Outcome:

- Port conflicts are eliminated.
- Prometheus can scrape all pods without infrastructure changes.
- No need to modify AWS security groups or cloud firewall rules.
- Metrics are cleanly separated from core eBPF operations.

Cons:

- Adds another DaemonSet, which increases resource usage.
- The metrics-proxy requires privileged: true to access the host-mounted Unix socket, but:
  - This does not widen the security surface; the original bpfman-agent pod already ran with privileged: true
  - The metrics-proxy pod inherits only the minimum required permissions
  - The shared volume containing the Unix socket is mounted read-only
  - The proxy only reads metrics; it does not interact with eBPF or mutate state

The bpfman-agent DaemonSet runs with hostNetwork=true, which causes multiple issues for metrics collection:

1. Port conflicts when pods on the same node bind to the same port
2. Cloud security groups blocking inter-node scraping
3. Non-headless Service preventing per-pod discovery

This change decouples metrics collection from hostNetwork by introducing a dedicated metrics-proxy DaemonSet:

- Runs on the container network, avoiding host-level constraints
- Uses a headless Service for correct per-pod scraping
- Proxies bpfman-agent metrics via Unix domain socket
- Serves HTTPS on port 8443 with unified TLS handling:
  * OpenShift: uses Kubernetes service-serving certificates
  * Local/KIND: auto-generates self-signed certificates

The ServiceMonitor now targets /proxy-metrics on container-network pods. This removes the need to open cloud firewall ports while maintaining secure HTTPS-based scraping.

Signed-off-by: Andrew McDermott <[email protected]>
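
For the Local/KIND case, auto-generating a self-signed serving certificate could look roughly like the sketch below. This is an illustration under assumptions, not the code in this PR: the key type (ECDSA P-256), the validity period, the DNS name `metrics-proxy.example.svc`, and the function names `selfSignedCert` and `demoCertDNSNames` are all hypothetical.

```go
package main

import (
	"crypto/ecdsa"
	"crypto/elliptic"
	"crypto/rand"
	"crypto/tls"
	"crypto/x509"
	"crypto/x509/pkix"
	"encoding/pem"
	"fmt"
	"math/big"
	"time"
)

// selfSignedCert creates an in-memory, self-signed serving certificate for
// dnsName that is valid for validFor from now.
func selfSignedCert(dnsName string, validFor time.Duration) (tls.Certificate, error) {
	key, err := ecdsa.GenerateKey(elliptic.P256(), rand.Reader)
	if err != nil {
		return tls.Certificate{}, err
	}
	serial, err := rand.Int(rand.Reader, new(big.Int).Lsh(big.NewInt(1), 128))
	if err != nil {
		return tls.Certificate{}, err
	}
	now := time.Now()
	tmpl := &x509.Certificate{
		SerialNumber:          serial,
		Subject:               pkix.Name{CommonName: dnsName},
		DNSNames:              []string{dnsName},
		NotBefore:             now.Add(-time.Minute), // tolerate small clock skew
		NotAfter:              now.Add(validFor),
		KeyUsage:              x509.KeyUsageDigitalSignature,
		ExtKeyUsage:           []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth},
		BasicConstraintsValid: true,
	}
	// Passing the template as its own parent makes the certificate self-signed.
	der, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &key.PublicKey, key)
	if err != nil {
		return tls.Certificate{}, err
	}
	keyDER, err := x509.MarshalECPrivateKey(key)
	if err != nil {
		return tls.Certificate{}, err
	}
	certPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: der})
	keyPEM := pem.EncodeToMemory(&pem.Block{Type: "EC PRIVATE KEY", Bytes: keyDER})
	return tls.X509KeyPair(certPEM, keyPEM)
}

// demoCertDNSNames generates a certificate for a hypothetical service name
// and returns the DNS names recorded in it.
func demoCertDNSNames() ([]string, error) {
	cert, err := selfSignedCert("metrics-proxy.example.svc", 24*time.Hour)
	if err != nil {
		return nil, err
	}
	leaf, err := x509.ParseCertificate(cert.Certificate[0])
	if err != nil {
		return nil, err
	}
	return leaf.DNSNames, nil
}

func main() {
	names, err := demoCertDNSNames()
	if err != nil {
		fmt.Println("certificate generation failed:", err)
		return
	}
	fmt.Println("certificate DNS names:", names)
}
```

A scraper has no CA with which to verify a certificate generated this way, so it would need either to be given the certificate or to skip verification, which is generally acceptable only in local test clusters.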

frobware force-pushed the two-tier-metrics-agent-collection branch from 7c5612f to d8b3348 Compare May 27, 2025 17:38

frobware marked this pull request as draft May 27, 2025 17:39

frobware force-pushed the two-tier-metrics-agent-collection branch 2 times, most recently from 27f95d5 to 77139a1 Compare May 28, 2025 10:25

frobware force-pushed the two-tier-metrics-agent-collection branch from 77139a1 to 6d4810a Compare May 28, 2025 10:59

Regenerate the bundle

157fc35

$ make bundle

Signed-off-by: Andrew McDermott <[email protected]>

frobware force-pushed the two-tier-metrics-agent-collection branch from 6d4810a to 157fc35 Compare May 28, 2025 11:00

frobware changed the title ~~[WIP] Decouple metrics from hostNetwork via dedicated proxy~~ Decouple metrics from hostNetwork using proxy DaemonSet May 28, 2025
