roachtest: move task termination #147206

herkolategan · 2025-05-23T09:49:53Z

Previously, the task manager was terminated during test teardown. The teardown
would happen directly after a test timeout as well. At this point the test code
could still be running, and new tasks could be initiated. This could result in
undefined behavior. This change moves the task termination to after the test has
returned.

Even though it's still possible to start new tasks after a test has timed out,
these tasks should be short-lived and should not cause any issues. When the test
code returns, the task manager is terminated and any stray tasks are cleaned up.

Fixes: #143973

Epic: None
Release note: None

Expose cancel on the task manager to allow the test framework to cancel tasks
without having to terminate the manager.

Epic: None
Release note: None

srosenberg · 2025-05-24T02:16:27Z

> Even though it's still possible to start new tasks after a test has timed out, these tasks should be short-lived and should not cause any issues. When the test code returns, the task manager is terminated and any stray tasks are cleaned up.

If it's a timeout, the return escapes; i.e., when Run finally returns, defer will close testReturnedCh, but taskManager.Terminate isn't invoked.

herkolategan · 2025-05-26T22:34:54Z

> > Even though it's still possible to start new tasks after a test has timed out, these tasks should be short-lived and should not cause any issues. When the test code returns, the task manager is terminated and any stray tasks are cleaned up.
>
> If it's a timeout, the return escapes; i.e., when Run finally returns, defer will close testReturnedCh, but taskManager.Terminate isn't invoked.

Argh, you're right, good catch. Tempted to move it to the defer that closes the channel. Will have a look again tomorrow to make sure this executes after test code.
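A minimal sketch of the idea discussed here, under stated assumptions: `runWithTimeout`, `cleanup`, and `done` are hypothetical names, not the roachtest runner's code. Placing the cleanup in a defer inside the goroutine that runs the test means it still executes when the test eventually returns, even if the caller stopped waiting at the timeout.

```go
package main

import (
	"fmt"
	"time"
)

// runWithTimeout runs testFn in a goroutine and waits for either its
// return or the timeout. cleanup is deferred inside the goroutine, so it
// runs once testFn returns, even if the caller already gave up waiting.
func runWithTimeout(testFn func(), timeout time.Duration, cleanup func()) (bool, <-chan struct{}) {
	done := make(chan struct{})
	go func() {
		defer close(done) // deferred first, so it runs last
		defer cleanup()   // deferred second, so it runs right after testFn
		testFn()
	}()
	select {
	case <-done:
		return false, done
	case <-time.After(timeout):
		return true, done
	}
}

// demo simulates a test that outlives its timeout and reports whether
// the timeout fired and whether cleanup ran before done was closed.
func demo() (bool, bool) {
	release := make(chan struct{})
	cleaned := false
	timedOut, done := runWithTimeout(
		func() { <-release }, // the "test" blocks until released
		10*time.Millisecond,
		func() { cleaned = true },
	)
	close(release) // let the straggling test return
	<-done         // cleanup has finished once done is closed
	return timedOut, cleaned
}

func main() {
	timedOut, cleaned := demo()
	fmt.Println(timedOut, cleaned) // prints "true true"
}
```

Because deferred calls run in last-in, first-out order, `cleanup` completes before `done` is closed, so anything waiting on `done` observes a cleaned-up state.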

herkolategan · 2025-05-28T10:28:50Z

pkg/cmd/roachtest/test_runner.go

@@ -2296,6 +2296,15 @@ func monitorTasks(ctx context.Context, taskManager task.Manager, t test.Test, l
 			}
 		}
 	}()
+
+	return func() {


Moved task termination into a function returned by monitorTasks; the caller defers it so that it runs after the test returns.
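The general shape of this pattern can be sketched as follows. `startMonitor`, `runTest`, and the `events` slice are hypothetical names used only for illustration; the real `monitorTasks` is truncated in the hunk above and is not reproduced here.

```go
package main

import (
	"fmt"
	"sync"
)

// startMonitor is a hypothetical helper: it starts a background goroutine
// and returns a function that stops it and waits for it to exit.
func startMonitor(events *[]string) func() {
	quit := make(chan struct{})
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		<-quit // stand-in for watching task events
	}()
	return func() {
		close(quit)
		wg.Wait()
		*events = append(*events, "monitor stopped")
	}
}

// runTest defers the returned stop function, so the monitor is torn down
// only after the test body has returned.
func runTest(events *[]string) {
	stop := startMonitor(events)
	defer stop()
	*events = append(*events, "test body")
}

func main() {
	var events []string
	runTest(&events)
	fmt.Println(events) // prints "[test body monitor stopped]"
}
```

Returning the cleanup as a closure lets the caller decide when it runs, which is what makes it possible to tie termination to the test's return rather than to teardown.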

DarrylWong · 2025-05-28T15:27:14Z

pkg/cmd/roachtest/test_runner.go

@@ -1402,6 +1404,10 @@ func (r *testRunner) runTest(
 		// We suppress other failures from being surfaced to the top as the timeout is always going
 		// to be the main error and subsequent errors (i.e. context cancelled) add noise.
 		t.suppressFailures()
+
+		// Cancel tasks to ensure that any stray tasks are cleaned up.
+		t.taskManager.Cancel()


Above, we intentionally add the timeout failure without cancelling the context, so why not do something similar here? That is, why not call t.taskManager.Cancel() in teardownTest in the timeout case:

cockroach/pkg/cmd/roachtest/test_runner.go, lines 1631 to 1634 in e540a26:

	// We previously added a timeout failure without cancellation, so we cancel here.
	if t.mu.cancel != nil {
		t.mu.cancel()
	}

herkolategan · 2025-05-28

Good point, I'll move it.

DarrylWong

LGTM

roachtest: expose cancel on manager

f4a12d1

herkolategan requested a review from a team as a code owner May 23, 2025 09:49

herkolategan requested review from golgeek and DarrylWong and removed request for a team May 23, 2025 09:49

herkolategan force-pushed the hbl/raochtest-manager-termination-fix branch from 14fbb06 to 86f7837 Compare May 28, 2025 10:27

herkolategan commented May 28, 2025

herkolategan force-pushed the hbl/raochtest-manager-termination-fix branch from 86f7837 to e540a26 Compare May 28, 2025 13:55

DarrylWong reviewed May 28, 2025

herkolategan force-pushed the hbl/raochtest-manager-termination-fix branch from e540a26 to 972732a Compare May 29, 2025 09:38

DarrylWong approved these changes May 29, 2025