Race condition between signalling pid file and certificate reading #250


Open
ringerc opened this issue Feb 2, 2025 · 5 comments · May be fixed by #252

Comments

@ringerc
Contributor

ringerc commented Feb 2, 2025

There's a race between the process in pid_file_name being restarted and spiffe-helper signalling it to reload its certificates.

The problem

If the controlled process is restarted and reads its certs before its pid file is updated, spiffe-helper might replace the certs after the new process has already read the old copies. spiffe-helper might then fail to signal the process to reload the certificates, because the pid file hasn't been updated yet and still points at the old pid.

This is likely to occur when the pid file is managed by a separate supervisor rather than by the process itself, for example a shell script that manages both spiffe-helper and the process that uses the certs:

#!/bin/bash
set -e -u

# spiffe_helper.hcl also contains "pid_file_name=mycmd.pid"
pid_file_name="mycmd.pid"

spiffe-helper -c spiffe_helper.hcl &
spiffe_helper_pid=$!

while :
do
  mycmd &
  mycmd_pid=$!
  echo "${mycmd_pid}" >"${pid_file_name}"
  wait "${mycmd_pid}" || mycmd_exit_status=$?
done

Here, if mycmd exits just as spiffe-helper renews a cert, the new mycmd might read the old cert, and spiffe-helper might signal the old pid in pid_file_name before the script echoes the new pid into mycmd.pid.

This wouldn't be an issue if mycmd did its own pid-file management and wrote its pid file before reading its certs, but neither can be broadly guaranteed for all commands.

Potential solutions

This race can be almost completely closed by having spiffe-helper retry if signalling the pid in pid_file_name fails due to the file being missing or the pid it points to not existing.

That would also help with the "normal errors" on startup described in #251.

It's not a completely perfect fix: in theory the old pid in pid_file_name could be re-used by a new process after the old one exits, before the pid file contents are rewritten to point to the new pid. Signalling it would then not cause an error, but would deliver the signal to the wrong process. That's exceedingly unlikely, to the point of not being worth caring about unless it could somehow be exploited, and I don't see how it could be, so IMO it can be ignored.
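The proposed retry could be sketched roughly as follows, in shell for illustration only (spiffe-helper itself would implement this in Go; the function name, arguments, and backoff schedule here are all hypothetical):

```shell
#!/bin/bash
# Hypothetical sketch of retry-with-backoff signalling. Re-reading the pid
# file on every attempt is what closes the race: a pid written late is
# still picked up on a later attempt.
signal_with_retry() {
  local pid_file=$1 sig=$2 max_retries=$3 delay=${4:-1}
  local i pid
  for ((i = 0; i <= max_retries; i++)); do
    # Treat a missing/unreadable pid file and a dead pid the same way: retry.
    if pid=$(cat "${pid_file}" 2>/dev/null) && kill -s "${sig}" "${pid}" 2>/dev/null; then
      return 0
    fi
    sleep "${delay}"
    delay=$((delay * 2))   # exponential backoff between attempts
  done
  echo "error: failed to signal process in ${pid_file}" >&2
  return 1
}
```

Per the proposal, the intermediate failures would be logged at Info level and only the final exhaustion at Error level, so a transiently stale pid file doesn't look like a misconfiguration.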

ringerc added a commit to ringerc/spiffe-helper-PRs that referenced this issue Feb 2, 2025
There is a race where the process pointed to by pid_file_name could be
restarted and read the old certificates but not have its new pid written
into pid_file_name yet. In this case, we would try to signal the old
process and fail, then give up, so the new process would never get its
certs refreshed.

On initial startup, unavoidable errors may also be logged when the
process to be launched has not yet been started (because its certs don't
exist yet) so pid_file_name is missing or invalid. These shouldn't be
logged as errors, but we don't really want to suppress all errors for
all pid file signalling since that would hide misconfigurations.

Handle both issues by adding a retry-backoff loop for signalling
pid_file_name. On failure, messages are logged at Info level until
the retry limit is exceeded, then an Error is logged.

During startup the controlling process is expected to be watching for
the creation of the certificates, and to launch the controlled process
and create the pid file in a timely manner once the certs appear. So it
becomes possible to start up without "normal" errors appearing.

For the restart-race case, we will initially fail to signal the stale
pid in the pid file, so we will retry until the new pid is written, so
the new process that loaded stale certs will be properly signalled to
reload.

Fixes spiffe#250, spiffe#251
ringerc added a commit to ringerc/spiffe-helper-PRs that referenced this issue Feb 5, 2025
There is a race where the process pointed to by pid_file_name could be
restarted and read the old certificates but not have its new pid written
into pid_file_name yet. In this case, we would try to signal the old
process and fail, then give up, so the new process would never get its
certs refreshed.

On initial startup, unavoidable errors may also be logged when the
process to be launched has not yet been started (because its certs don't
exist yet) so pid_file_name is missing or invalid. These shouldn't be
logged as errors, but we don't really want to suppress all errors for
all pid file signalling since that would hide misconfigurations.

Handle both issues by adding a retry-backoff loop for signalling
pid_file_name. On failure, messages are logged at Info level until
the retry limit is exceeded, then an Error is logged.

During startup the controlling process is expected to be watching for
the creation of the certificates, and to launch the controlled process
and create the pid file in a timely manner once the certs appear. So it
becomes possible to start up without "normal" errors appearing.

For the restart-race case, we will initially fail to signal the stale
pid in the pid file, so we will retry until the new pid is written, so
the new process that loaded stale certs will be properly signalled to
reload.

To enable better testing for this, the retry count is exposed in the
test channel for process signalling.

Fixes spiffe#250, spiffe#251

Signed-off-by: Craig Ringer <[email protected]>
@kfox1111
Contributor

I think you can flip your script around and get there without changes to spiffe-helper.

Fork off a subshell at the beginning of the loop, write that bash process's pid to the pid file (variable $$), then exec the real command so the process has the same pid as the bash subshell used to execute it.

@ringerc
Contributor Author

ringerc commented Feb 16, 2025

Good point @kfox1111 , since exec re-uses the pid that'd work.

Sensible workaround.

For others, here's what they suggested doing:

#!/bin/bash
set -e -u

# spiffe_helper.hcl also contains "pid_file_name=mycmd.pid"
pid_file_name="mycmd.pid"

spiffe-helper -c spiffe_helper.hcl &
spiffe_helper_pid=$!

while :
do
  # Spawn a subshell, write the subshell's pid to the pid file, then exec the
  # real process to control so it re-uses the subshell's pid. Note $BASHPID,
  # not $$: inside a subshell, $$ still expands to the parent shell's pid.
  ( echo "$BASHPID" >"${pid_file_name}" && exec mycmd ) &
  mycmd_pid=$!
  wait "${mycmd_pid}" || mycmd_exit_status=$?
done

... or the non-sh equivalent where your wrapper would fork() without exec(), write its pid file, and only then exec() the real process.
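One bash subtlety worth noting for this workaround: inside a ( ... ) subshell, $$ still expands to the top-level shell's pid, not the subshell's. $BASHPID gives the subshell's own pid, which is the pid the exec'd replacement process will keep, so it's the one that belongs in the pid file. A quick demonstration:

```shell
#!/bin/bash
# $$ is the top-level shell's pid even inside a subshell; $BASHPID is the
# subshell's own pid. The two differ inside ( ... ).
( echo "parent=$$ subshell=$BASHPID" )
```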

@ringerc
Contributor Author

ringerc commented Feb 25, 2025

#252 proposes to fix this, once the dependent PRs are merged and I can rebase it

@kfox1111
Contributor

Does reordering the script eliminate the need for signal retries?

@ringerc
Contributor Author

ringerc commented Mar 2, 2025

@kfox1111 Using the modified script with a subshell will avoid the need for retrying signal delivery, yes. It's not exactly user-friendly, though, and most users will probably just have race-prone code without realising it, so I'd still suggest adopting retries per the PR I raised. But yes, your suggestion provides a usable workaround, thank you.
