Skip to content

roachtest: move task termination #147206

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

herkolategan
Copy link
Collaborator

Previously, the task manager was terminated during test teardown. The teardown
would happen directly after a test timeout as well. At this point the test code
could still be running, and new tasks could be initiated. This could result in
undefined behavior. This change moves the task termination to after the test has
returned.

Even though it's still possible to start new tasks after a test has timed out,
these tasks should be short-lived and should not cause any issues. When the test
code returns, the task manager is terminated and any stray tasks are cleaned up.

Fixes: #143973

Epic: None
Release note: None

Expose cancel on the task manager to allow the test framework to cancel tasks
without having to terminate the manager.

Epic: None
Release note: None
@herkolategan herkolategan requested a review from a team as a code owner May 23, 2025 09:49
@herkolategan herkolategan requested review from golgeek and DarrylWong and removed request for a team May 23, 2025 09:49
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@srosenberg
Copy link
Member

Even though it's still possible to start new tasks after a test has timed out, these tasks should be short-lived and should not cause any issues. When the test code returns, the task manager is terminated and any stray tasks are cleaned up.

If it's a timeout, the return escapes; i.e., when Run finally returns, defer will close testReturnedCh, but taskManager.Terminate isn't invoked.

@herkolategan
Copy link
Collaborator Author

herkolategan commented May 26, 2025

Even though it's still possible to start new tasks after a test has timed out, these tasks should be short-lived and should not cause any issues. When the test code returns, the task manager is terminated and any stray tasks are cleaned up.

If it's a timeout, the return escapes; i.e., when Run finally returns, defer will close testReturnedCh, but taskManager.Terminate isn't invoked.

Argh, you're right, good catch. Tempted to move it to the defer that closes the channel. Will have a look again tomorrow to make sure this executes after test code.

@herkolategan herkolategan force-pushed the hbl/raochtest-manager-termination-fix branch from 14fbb06 to 86f7837 Compare May 28, 2025 10:27
@@ -2296,6 +2296,15 @@ func monitorTasks(ctx context.Context, taskManager task.Manager, t test.Test, l
}
}
}()

return func() {
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved task termination to be returned as a function that is deferred to invoke after the test returns.

@herkolategan herkolategan force-pushed the hbl/raochtest-manager-termination-fix branch from 86f7837 to e540a26 Compare May 28, 2025 13:55
@@ -1402,6 +1404,10 @@ func (r *testRunner) runTest(
// We suppress other failures from being surfaced to the top as the timeout is always going
// to be the main error and subsequent errors (i.e. context cancelled) add noise.
t.suppressFailures()

// Cancel tasks to ensure that any stray tasks are cleaned up.
t.taskManager.Cancel()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Above we add the timeout failure intentionally without cancelling the context so why don't we do something similar here? i.e. why not call t.taskManager.Cancel() in teardownTest in the timeout case:

// We previously added a timeout failure without cancellation, so we cancel here.
if t.mu.cancel != nil {
t.mu.cancel()
}

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I'll move it.

Previously, the task manager was terminated during test teardown. The teardown
would happen directly after a test timeout as well. At this point the test code
could still be running, and new tasks could be initiated. This could result in
undefined behavior. This change moves the task termination to after the test has
returned.

Even though it's still possible to start new tasks after a test has timed out,
these tasks should be short-lived and should not cause any issues. When the test
code returns, the task manager is terminated and any stray tasks are cleaned up.

Fixes: cockroachdb#143973

Epic: None
Release note: None
@herkolategan herkolategan force-pushed the hbl/raochtest-manager-termination-fix branch from e540a26 to 972732a Compare May 29, 2025 09:38
Copy link
Contributor

@DarrylWong DarrylWong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

cmd/roachtest: TestRunnerTestTimeout failed
4 participants