Skip to content

[spire-agent] improve "outdated entries" detection logic in LRU cache #6011

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
chiragk25 opened this issue Apr 15, 2025 · 2 comments
Open
Labels
help wanted Issues with this label are ready to start work but are in need of someone to do it priority/backlog Issue is approved and in the backlog

Comments

@chiragk25
Copy link
Contributor

chiragk25 commented Apr 15, 2025

Problem

We're seeing issues with SPIRE agent requesting an updated SVID for the same entry multiple times in the span of few minutes, with logs showing "Renewing X.509 SVID" on agent.

Our SPIRE server deployment -

  • Running v1.9.6
  • HA with 5-7 SPIRE server instances
  • Event-based cache hydration is enabled

Our SPIRE agent deployment -

  • Running v1.9.6
  • LRU cache enabled
  • Agent access server instances in a round-robin fashion via DNS endpoint

Since we have event based cache hydration enabled, whenever an entry is updated, calls going from agent for GetAuthorizedEntries to different server instances end up return different revisions for the updated entry. Since agent receives toggled revision (old/new) for every call, this results in agent marking the entry as outdated because the currently the agent only checks for revision mismatch, instead of new revision - existingEntry.RevisionNumber != newEntry.RevisionNumber.

This outdated entry is then marked as stale as we have the same logic in the callback function. The stale entry is then force-renewed by the cache, resulting in multiple signings of the same SVID on the server. This continues to happen until all event-based caches reconcile to the new revision.


Proposed solution

To accommodate for split-cache on server, agent should expect a server instance to return an older revision for an entry, and only mark the current entry as outdated if it receives a higher revision. Updates needed in the following places, replacing the revision check from 'existing != new' to 'existing < new' -

@sorindumitru sorindumitru added the triage/in-progress Issue triage is in progress label Apr 18, 2025
@zmt
Copy link
Contributor

zmt commented Apr 22, 2025

We have preliminary results showing a patch with the existingEntry.RevisionNumber < newEntry.RevisionNumber check in pkg/agent/manager/cache/lru_cache.go and existingEntry.RevisionNumber < newEntry.RevisionNumber is very effective at reducing excessive X509SVID signings when servers have mismatching revisions cached for registration entries.

Why do we have the revision check in 2 places?

@sorindumitru
Copy link
Collaborator

There's also

func entryIsStale(entry *common.RegistrationEntry, revisionNumber, revisionCreatedAt int64) bool {
for the SyncAuthorizedEntries case. That one also checks for CreatedAt to cover the case of deleting an entry and recreating it with the same entry id.

@amartinezfayo amartinezfayo added priority/backlog Issue is approved and in the backlog help wanted Issues with this label are ready to start work but are in need of someone to do it and removed triage/in-progress Issue triage is in progress labels Apr 24, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
help wanted Issues with this label are ready to start work but are in need of someone to do it priority/backlog Issue is approved and in the backlog
Projects
None yet
Development

No branches or pull requests

4 participants