spire-agent maintains a cache of JWT-SVIDs (and X509-SVIDs, but those work differently) to serve repeated requests for the same JWT-SVID and to deal with spire-server unavailability.
When a cached JWT-SVID is past 50% of its TTL, the agent first tries to request a new one from the server before falling back to the cached SVID. All (or most) RPCs the agent makes have a global timeout of 30 seconds applied, so the request to fetch a JWT-SVID may take up to 30 seconds to time out if the server is unresponsive.
This is not ideal, but I think there's a bigger issue here. If a client connects with a timeout smaller than 30 seconds, the whole request ends up being cancelled when the client disconnects, so the agent never gets a chance to respond to the client with the cached SVID. A client connecting with a timeout of 5 seconds, for example, will just see all of its requests time out even though the cached SVID has enough lifetime left to be useful.
Some ways to deal with this:
1. Apply a smaller timeout to the NewJWTSVID request when a cached entry exists and is valid, for example 1 second. It won't help every client, but it should help a lot of them. (small change)
2. Always return the cached SVID and schedule the JWT-SVID for asynchronous refresh. (bigger change)
> Apply a smaller timeout to the NewJWTSVID request when a cached entry exists and is valid, for example 1 second. It won't help every client, but it should help a lot of them. (small change)
I agree this could be a good improvement. I personally think that a 1-second timeout might be a bit too aggressive in some environments. I'd suggest something in the range of 3–5 seconds instead.