
Trust bundle missing active signing keys on some clusters #6083


Open
IvMdlc opened this issue May 22, 2025 · 3 comments · May be fixed by #6090
Labels
priority/urgent Issue is approved and is must be completed in the assigned milestone

Comments

IvMdlc commented May 22, 2025

Hi, coming from this thread https://spiffe.slack.com/archives/C7XDP01HB/p1747864066874729 where @evan2645 and @sorindumitru asked me to raise an issue.

We have a root cluster and 8 regional clusters, all running on EC2 instances with AWS Postgres as the datastore. A few days ago we terminated and recreated all the EC2 instances in a rolling fashion, one node at a time, as we have done before. The DBs are never touched. We don't persist the /data directory, so it is gone when the servers start up; they can't find their keys and create new ones.

We've noticed that some clusters are missing signing keys:

on root cluster:
$ spire-server bundle show -format spiffe | grep kid | wc -l
58

on 5 out of 8 regional clusters:
$ spire-server bundle show -format spiffe | grep kid | wc -l
58

and on the other 3:
$ spire-server bundle show -format spiffe | grep kid | wc -l
53
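
Counting `kid` lines shows that keys are missing but not which ones. A minimal sketch for diffing two saved bundles follows. It assumes the output of `spire-server bundle show -format spiffe` is a JWKS-style JSON document with a top-level `keys` array in which JWT authority entries carry a `kid` field. The function names are hypothetical and not part of SPIRE.

```python
import json


def jwt_key_ids(bundle_json: str) -> set[str]:
    """Return the kid values found in a JWKS-style bundle document.

    Entries without a kid (assumed here to be X.509 authorities) are skipped.
    """
    doc = json.loads(bundle_json)
    return {key["kid"] for key in doc.get("keys", []) if "kid" in key}


def missing_key_ids(reference_json: str, candidate_json: str) -> list[str]:
    """Return key IDs present in the reference bundle but absent from the candidate."""
    return sorted(jwt_key_ids(reference_json) - jwt_key_ids(candidate_json))
```

For example, save the bundle from the root cluster and from one of the affected regional clusters to two files, read both, and pass their contents to `missing_key_ids(root, regional)` to list the key IDs that the regional cluster lacks.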

We are running 1.10.4

Thank you.


IvMdlc commented May 25, 2025

For completeness: persisting the /data directory makes no difference. I tested in DEV and hit the same issue whenever the server is restarted, even when the keys are found on startup.

time="2025-05-25T10:17:27+01:00" level=debug msg="Loaded server id" external=false plugin_name=aws_kms plugin_type=KeyManager server_id=537f035e-7b72-49a2-8754-92c8fed0fe59 subsystem_name=catalog
time="2025-05-25T10:17:27+01:00" level=debug msg="Fetching key aliases from KMS" external=false plugin_name=aws_kms plugin_type=KeyManager subsystem_name=catalog
time="2025-05-25T10:17:27+01:00" level=debug msg="Found aliases" external=false num_aliases=100 plugin_name=aws_kms plugin_type=KeyManager subsystem_name=catalog
time="2025-05-25T10:17:27+01:00" level=debug msg="Found aliases" external=false num_aliases=100 plugin_name=aws_kms plugin_type=KeyManager subsystem_name=catalog
time="2025-05-25T10:17:27+01:00" level=debug msg="Found aliases" external=false num_aliases=46 plugin_name=aws_kms plugin_type=KeyManager subsystem_name=catalog
time="2025-05-25T10:17:27+01:00" level=debug msg="Key loaded" alias_name=alias/SPIRE_SERVER/<trust-domain>/537f035e-7b72-49a2-8754-92c8fed0fe59/JWT-Signer-A external=false key_arn="arn:aws:kms:eu-west-2:<accountid>:key/df8f98dd-3df1-4825-8430-b8028231320b" plugin_name=aws_kms plugin_type=KeyManager subsystem_name=catalog
time="2025-05-25T10:17:27+01:00" level=debug msg="Key loaded" alias_name=alias/SPIRE_SERVER/<trust-domain>/537f035e-7b72-49a2-8754-92c8fed0fe59/x509-CA-B external=false key_arn="arn:aws:kms:eu-west-2:<accountid>:key/a02a814c-8241-4a49-bdce-5004497cc168" plugin_name=aws_kms plugin_type=KeyManager subsystem_name=catalog
time="2025-05-25T10:17:27+01:00" level=debug msg="Key loaded" alias_name=alias/SPIRE_SERVER/<trust-domain>/537f035e-7b72-49a2-8754-92c8fed0fe59/x509-CA-A external=false key_arn="arn:aws:kms:eu-west-2:<accountid>:key/1ccf2f7f-503b-4c01-ac6a-080769ebf149" plugin_name=aws_kms plugin_type=KeyManager subsystem_name=catalog
time="2025-05-25T10:17:27+01:00" level=debug msg="Key loaded" alias_name=alias/SPIRE_SERVER/<trust-domain>/537f035e-7b72-49a2-8754-92c8fed0fe59/JWT-Signer-B external=false key_arn="arn:aws:kms:eu-west-2:<accountid>:key/8b4f05f5-2efb-48fd-a564-cfac0a1ac380" plugin_name=aws_kms plugin_type=KeyManager subsystem_name=catalog
…
time="2025-05-25T10:17:27+01:00" level=debug msg="Loading journal from datastore" subsystem_name=ca_manager
time="2025-05-25T10:17:27+01:00" level=debug msg="Found a CA journal record that matches with a local X509 authority ID" ca_journal_id=7 local_authority_id=131774f28fc1505a0d5d431be60b69e5fc0d477e subsystem_name=ca_manager
time="2025-05-25T10:17:27+01:00" level=info msg="Journal loaded" jwt_keys=10 subsystem_name=ca_manager x509_cas=10
time="2025-05-25T10:17:27+01:00" level=info msg="X509 CA activated" expiration="2025-05-25 12:13:36 +0100 BST" issued_at="2025-05-25 06:13:36 +0100 BST" local_authority_id=131774f28fc1505a0d5d431be60b69e5fc0d477e slot=A subsystem_name=ca_manager upstream_authority_id=98640cc3e8326e41e1fcba329860809d1aff1e0a
time="2025-05-25T10:17:27+01:00" level=debug msg="Successfully stored CA journal entry in datastore" ca_journal_id=7 local_authority_id=131774f28fc1505a0d5d431be60b69e5fc0d477e subsystem_name=ca_manager
time="2025-05-25T10:17:27+01:00" level=debug msg="Successfully rotated X.509 CA" subsystem_name=ca_manager trust_domain_id="spiffe://<trust-domain>" ttl=6968.537959288
time="2025-05-25T10:17:27+01:00" level=info msg="JWT key activated" expiration="2025-05-25 12:20:56 +0100 BST" issued_at="2025-05-25 06:20:56 +0100 BST" local_authority_id=IM8fyFCEcaTeaXX3wL6eFqpsv8Robrib slot=A subsystem_name=ca_manager
time="2025-05-25T10:17:27+01:00" level=debug msg="Successfully stored CA journal entry in datastore" ca_journal_id=7 local_authority_id=131774f28fc1505a0d5d431be60b69e5fc0d477e subsystem_name=ca_manager
time="2025-05-25T10:17:27+01:00" level=debug msg="Rotating server SVID" subsystem_name=svid_rotator
time="2025-05-25T10:17:28+01:00" level=debug msg="Signed X509 SVID" expiration="2025-05-25T10:17:28Z" spiffe_id="spiffe://<trust-domain>/spire/server" subsystem_name=svid_rotator
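
When comparing several restarts, it can help to pull fields such as `jwt_keys` and `x509_cas` out of these logs mechanically. The sketch below assumes the `key=value` text format shown above, with double-quoted values where needed. The helper names are hypothetical, and a line with unbalanced quotes would raise `ValueError`.

```python
import shlex


def parse_log_line(line: str) -> dict[str, str]:
    """Split a key=value log line into a dict, honouring double-quoted values."""
    fields = {}
    for token in shlex.split(line):
        key, sep, value = token.partition("=")
        if sep:  # ignore stray tokens that contain no '='
            fields[key] = value
    return fields


def journal_counts(lines):
    """Yield (time, jwt_keys, x509_cas) for every "Journal loaded" entry."""
    for line in lines:
        fields = parse_log_line(line)
        if fields.get("msg") == "Journal loaded":
            yield fields["time"], int(fields["jwt_keys"]), int(fields["x509_cas"])
```

Feeding the log lines above to `journal_counts` gives one tuple per startup, which can then be set against the `kid` count of the bundle on the same server.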

The correct bundle contains 29 keys. On any server that was not restarted:
$ spire-server bundle show -format spiffe | grep kid | wc -l
29

The restarted server is missing one:
$ spire-server bundle show -format spiffe | grep kid | wc -l
28

As @sorindumitru and @evan2645 suspected, the bundle remains broken until the server prepares the first JWT key.

@sorindumitru suggested upgrading to v1.11 and, on the server that is missing keys, running:
$ spire-server localauthority jwt prepare

I can confirm this fixes it.

However, it might not be a valid workaround for our use case. That command injects the prepared key into the bundle, making the bundle larger on each restart. The problem is that we use SPIRE for OIDC federation with AWS:

https://spiffe.io/docs/latest/keyless/oidc-federation-aws/
https://docs.aws.amazon.com/IAM/latest/UserGuide/id_roles_providers_create_oidc.html

and the /keys endpoint can't return more than 100 keys, otherwise AWS won't be able to verify the token. From https://repost.aws/knowledge-center/iam-sts-invalididentitytoken

STS only supports up to 100 keys in a JWKS. If your JWKS has more than 100 keys, then STS can't verify the tokens signed with your keys.

Is there a way to reduce the number of keys in the JWKS?
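
Until there is a way to trim the JWKS, one option is to monitor how close it is to the limit. A minimal sketch, assuming the JSON served by the `/keys` endpoint has been saved as text and has a top-level `keys` array; the function and constant names are hypothetical, and the limit of 100 is taken from the AWS article quoted above.

```python
import json

# Limit quoted from the AWS knowledge-center article referenced above.
STS_JWKS_KEY_LIMIT = 100


def jwks_headroom(jwks_json: str, limit: int = STS_JWKS_KEY_LIMIT) -> int:
    """Return how many more keys the JWKS can hold before exceeding the limit.

    Zero means the JWKS is exactly at the limit; a negative value means it
    already holds more keys than the limit allows.
    """
    keys = json.loads(jwks_json).get("keys", [])
    return limit - len(keys)
```

A periodic check could alert once `jwks_headroom(...)` drops below the number of restarts expected before the oldest keys expire.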

amartinezfayo removed the triage/in-progress label (Issue triage is in progress) on May 27, 2025
amartinezfayo added the priority/urgent label (Issue is approved and is must be completed in the assigned milestone) on May 27, 2025
amartinezfayo added this to the 1.12.3 milestone on May 27, 2025
sorindumitru (Collaborator) commented

Thanks @IvMdlc for confirming restarts trigger the issue. I'm working on a fix for this and will open a PR to address it.

I'm afraid there's no good way to make sure you don't end up with more than 100 keys in the bundle. As you mentioned on Slack, you may be able to do that by revoking old keys. Just make sure you wait some time after you prepare a new key so that the bundle update propagates to workloads before you revoke the old one.


IvMdlc commented May 28, 2025

Thanks @sorindumitru for working on a fix.

While experimenting with the localauthority, I observed that when I taint and revoke a key on a spire-server, the key is immediately removed from the bundle visible to the cluster hosting the spire-server. However, other clusters continue to see a bundle that includes the key, and it is only removed some hours after the key expires. Not an issue for us, although I can raise a separate issue if you think it's worth investigating.

sorindumitru linked a pull request on May 28, 2025 that will close this issue