Can't fetch cached JWT-SVIDs when spire-server is unresponsive #5994

Open
sorindumitru opened this issue Apr 7, 2025 · 1 comment
Labels
priority/backlog Issue is approved and in the backlog

Comments

@sorindumitru
Collaborator

spire-agent maintains a cache of JWT-SVIDs (and X509-SVIDs, but those work differently) to serve repeated requests for the same JWT-SVID and to deal with spire-server unavailability.

When a cached JWT-SVID is past 50% of its TTL, the agent first tries to request a new one from the server before falling back to the cached SVID. All (or most) RPCs the agent makes have a global timeout of 30 seconds applied, so the request to fetch a JWT-SVID may take up to 30 seconds to time out if the server is unresponsive.

This is not ideal on its own, but I think there's a bigger issue here. If a client connects with a timeout smaller than 30 seconds, the whole request ends up being cancelled when the client disconnects, so the agent never gets a chance to respond to the client. A client connecting with a timeout of 5 seconds, for example, will just see all of its requests time out even though the cached SVID has enough lifetime left to be useful.

Some ways to deal with this:

  • Apply a smaller timeout to the NewJWTSVID request when a cached entry exists and is valid, for example 1 second. It won't help every client, but it should help a lot of them. (small change)
  • Always return the cached SVID and schedule the JWT-SVID for asynchronous refresh. (bigger change)
@sorindumitru sorindumitru added the triage/in-progress Issue triage is in progress label Apr 10, 2025
@amartinezfayo amartinezfayo added priority/backlog Issue is approved and in the backlog and removed triage/in-progress Issue triage is in progress labels Apr 10, 2025
@amartinezfayo
Member

Thank you @sorindumitru for raising this.

> Apply a smaller timeout to the NewJWTSVID request when a cached entry exists and is valid, for example 1 second. It won't help every client, but it should help a lot of them. (small change)

I agree this could be a good improvement. I personally think that a 1-second timeout might be a bit too aggressive in some environments. I'd suggest something in the range of 3–5 seconds instead.
