Separate metrics collection in bpfman-agent to eliminate TCP port conflicts and cloud networking dependencies #442

frobware · 2025-05-22T18:40:50Z

bpfman-agent runs as a hostNetwork: true DaemonSet to perform
eBPF operations. This introduces two key issues for exposing Prometheus
metrics on a TCP port.

Local Port Conflicts

Using hostNetwork: true means metrics listeners bind directly to the
node’s interface. This causes:

Port conflicts with host services
Inability to run multiple containers using the same port
Errors like listen tcp :8443: bind: address already in use
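
The conflict can be reproduced without Kubernetes. Below is a minimal sketch, assuming only the Python standard library. The helper names try_bind and second_bind_errno are hypothetical and not part of bpfman. It binds an ephemeral port on loopback, then tries to bind the same port a second time, which is what happens when a hostNetwork listener meets a host service that already holds the port.

```python
import errno
import socket


def try_bind(port: int, host: str = "127.0.0.1") -> socket.socket:
    """Bind and listen on host:port, returning the listening socket."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        sock.bind((host, port))
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock


def second_bind_errno() -> int:
    """Bind a free port, try to bind it again, and return the resulting errno."""
    first = try_bind(0)  # port 0 asks the kernel for any free port
    port = first.getsockname()[1]
    try:
        second = try_bind(port)
    except OSError as exc:
        return exc.errno  # expected to be EADDRINUSE
    else:
        second.close()
        return 0  # no conflict observed
    finally:
        first.close()


if __name__ == "__main__":
    print(errno.errorcode.get(second_bind_errno(), "no error"))
```

In a normal pod each network namespace has its own port space, so this collision only arises between containers of the same pod. With hostNetwork: true the port space is the node's.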

Cross-Node Networking Breakage

With hostNetwork: true, the pod’s IP becomes the node IP (e.g.,
an EC2 instance IP). As a result, cross-node metrics collection becomes
dependent on infrastructure-level routing:

Normal: Prometheus → Pod IP → Metrics
HostNetwork: Prometheus → Node IP → AWS SG → Node → Pod

Symptoms include:

Only same-node pods are scraped successfully
Prometheus target list shows 1/N pods up
Application port unreachable, even when ICMP ping works
Scrapes time out, depending on listener configuration
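
To tell these symptoms apart from a scraper's point of view, a small TCP probe helps. This is a sketch under assumptions: the function name probe and its return labels are made up for illustration, and the interpretation is general TCP behaviour rather than anything specific to bpfman. A timeout usually means packets are being dropped along the path (for example by a security group), while a refusal means the node answered but nothing is listening on that port.

```python
import socket


def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Classify a TCP connect attempt: open, refused, timeout, or unreachable."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        return "refused"  # host replied, no listener on this port
    except (socket.timeout, TimeoutError):
        return "timeout"  # no reply at all, often a filtered path
    except OSError:
        return "unreachable"  # e.g. no route to host


if __name__ == "__main__":
    # 127.0.0.1 and 8443 are placeholders; substitute a node IP and metrics port.
    print(probe("127.0.0.1", 8443))
```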

Mitigation often involves infrastructure-specific workarounds, e.g.:

aws ec2 authorize-security-group-ingress \
  --group-id sg-node \
  --protocol tcp --port 8443 \
  --source-group sg-node

This is brittle and undesirable in multi-tenant or cloud-agnostic deployments.

Solution: Two-Tier Metrics Architecture

Introduce a metrics proxy DaemonSet:

bpfman-agent (hostNetwork)
↳ Unix socket: /var/run/bpfman-metrics/metrics.sock

metrics-proxy (pod network)
↳ Listens on TCP 8443 inside pod network
↳ Proxies requests to agent's Unix socket

metrics-proxy runs without hostNetwork: true and exposes metrics
over HTTPS on port 8443, preserving the current ServiceMonitor and
TLS setup.
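
The two tiers can be sketched in-process as follows. This is an illustration under stated assumptions, not the proposed implementation: it uses Python's standard library rather than the operator's own code, serves plain HTTP instead of HTTPS, uses a temporary socket path instead of /var/run/bpfman-metrics/metrics.sock, and every class and function name (AgentHandler, UnixHTTPConnection, make_proxy_handler, scrape_through_proxy) and the sample metric are hypothetical.

```python
import http.client
import http.server
import os
import socket
import socketserver
import tempfile
import threading

SAMPLE = b"bpfman_agent_up 1\n"  # hypothetical metric, for illustration only


class AgentHandler(http.server.BaseHTTPRequestHandler):
    """Agent side: serves metrics; only reachable through the Unix socket."""

    def do_GET(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(SAMPLE)))
        self.end_headers()
        self.wfile.write(SAMPLE)

    def log_message(self, format, *args):
        pass  # default logging expects a (host, port) client address


class UnixServer(socketserver.ThreadingMixIn, socketserver.UnixStreamServer):
    daemon_threads = True


class TCPServer(socketserver.ThreadingMixIn, socketserver.TCPServer):
    daemon_threads = True


class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTP client connection that dials a Unix socket instead of TCP."""

    def __init__(self, unix_path):
        super().__init__("localhost")
        self.unix_path = unix_path

    def connect(self):
        self.sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        self.sock.connect(self.unix_path)


def make_proxy_handler(unix_path):
    """Build the proxy-side handler bound to one agent socket path."""

    class ProxyHandler(http.server.BaseHTTPRequestHandler):
        def do_GET(self):
            upstream = UnixHTTPConnection(unix_path)
            try:
                upstream.request("GET", self.path)
                resp = upstream.getresponse()
                status, body = resp.status, resp.read()
                ctype = resp.getheader("Content-Type", "text/plain")
            except (OSError, http.client.HTTPException):
                self.send_error(502, "agent socket unavailable")
                return
            finally:
                upstream.close()
            # Relay the agent's answer unchanged to the TCP client.
            self.send_response(status)
            self.send_header("Content-Type", ctype)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)

        def log_message(self, format, *args):
            pass  # keep the demo quiet

    return ProxyHandler


def scrape_through_proxy() -> bytes:
    """Start agent and proxy in-process, scrape once over TCP, return the body."""
    with tempfile.TemporaryDirectory() as tmp:
        unix_path = os.path.join(tmp, "metrics.sock")
        agent = UnixServer(unix_path, AgentHandler)
        proxy = TCPServer(("127.0.0.1", 0), make_proxy_handler(unix_path))
        for srv in (agent, proxy):
            threading.Thread(target=srv.serve_forever, daemon=True).start()
        client = http.client.HTTPConnection(
            "127.0.0.1", proxy.server_address[1], timeout=5
        )
        try:
            client.request("GET", "/metrics")
            return client.getresponse().read()
        finally:
            client.close()
            for srv in (proxy, agent):
                srv.shutdown()
                srv.server_close()


if __name__ == "__main__":
    print(scrape_through_proxy().decode(), end="")
```

The point of the split is that the agent never opens a TCP port: the only TCP listener lives in the proxy, which in the proposal sits in the pod network. A real proxy would also terminate TLS on 8443 and would likely stream the response rather than buffer it.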

Operator Changes

Remove TCP listener from bpfman-agent
Add and manage metrics-proxy DaemonSet
Mount shared socket via hostPath (or projected volume)
Maintain compatibility with Prometheus configuration

Benefits

Avoids host-level port conflicts
No cloud-specific firewall/SG rules needed
Full cross-node scraping support without special infra
Cloud-agnostic; no AWS-only assumptions
Clean separation of eBPF operations and metrics serving

Trade-offs

One extra DaemonSet (~16–64Mi memory, 10–100m CPU)
metrics-proxy needs privileged: true or access to the host path /var/run/bpfman-metrics
Slight increase in complexity and maintenance surface

github-project-automation bot moved this to 🆕 New in bpfman May 22, 2025

github-project-automation bot added this to bpfman May 22, 2025

frobware added a commit to frobware/bpfman-operator that referenced this issue May 22, 2025

Raised issue: bpfman#442

c1b1384

frobware changed the title ~~Separate metrics collection to eliminate TCP port conflicts and cloud networking dependencies~~ Separate metrics collection in bpfman-agent to eliminate TCP port conflicts and cloud networking dependencies May 22, 2025
