
roachtest: sql-stats/mixed-version failed #146699

Open
cockroach-teamcity opened this issue May 14, 2025 · 3 comments · May be fixed by #147379
Assignees
Labels
branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 C-test-failure Broken test (automatically or manually discovered). O-roachtest O-robot Originated from a bot. T-observability

Comments

@cockroach-teamcity

cockroach-teamcity commented May 14, 2025

roachtest.sql-stats/mixed-version failed with artifacts on release-24.3 @ c0f83306a8557f08d060eb776186b47b92c18615:

(mixedversion.go:791).Run: mixed-version test failure while running step 41 (run "request stmts stats"): failed to request stats for empty interval: Get "https://18.221.226.58:26258/_status/combinedstmts?fetch_mode.stats_type=0&start=0&end=100&limit=10": context deadline exceeded (Client.Timeout exceeded while awaiting headers)
test artifacts and logs in: /artifacts/sql-stats/mixed-version/cpu_arch=arm64/run_1

Parameters:

  • arch=arm64
  • cloud=aws
  • coverageBuild=false
  • cpu=4
  • encrypted=false
  • fs=ext4
  • localSSD=true
  • mvtDeploymentMode=shared-process
  • mvtVersions=v23.2.10 → v24.1.14 → release-24.3
  • runtimeAssertionsBuild=false
  • ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

Grafana is not yet available for aws clusters

/cc @cockroachdb/obs-prs

This test on roachdash | Improve this report!

Jira issue: CRDB-50661

@cockroach-teamcity cockroach-teamcity added C-test-failure Broken test (automatically or manually discovered). O-robot Originated from a bot. O-roachtest release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. T-observability branch-release-24.3 Used to mark GA and release blockers, technical advisories, and bugs for 24.3 labels May 14, 2025
@alyshanjahani-crl

Removing release blocker as we've seen this flake before.

@alyshanjahani-crl alyshanjahani-crl removed the release-blocker Indicates a release-blocker. Use with branch-release-2x.x label to denote which branch is blocked. label May 22, 2025
@dhartunian

Seems like node 1 was draining/shutting down while this happened.

Here's when we're making our request:

[mixed-version-test/41_run-request-stmts-stats] 2025/05/14 06:41:53 cluster.go:2565: running cmd `./cockroach auth-session lo...` on nodes [:1]; details in run_064153.304049280_n1_cockroach-authsessio.log
[mixed-version-test/41_run-request-stmts-stats] 2025/05/14 06:42:09 mixed_version_sql_stats.go:208: error requesting stats from url: https://18.221.226.58:26258/_status/combinedstmts?fetch_mode.stats_type=0&start=0&end=100&limit=10
[mixed-version-test/41_run-request-stmts-stats] 2025/05/14 06:42:09 runner.go:381: mixed-version test failure while running step 41 (run "request stmts stats"): failed to request stats for empty interval: Get "https://18.221.226.58:26258/_status/combinedstmts?fetch_mode.stats_type=0&start=0&end=100&limit=10": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

The combinedstmts endpoint fans out requests to all the in-memory containers in the cluster, so a single node that is slow to respond can hold up the whole request.

Here's node 1 starting shutdown prior to our request:

teamcity-19685427-1747200609-41-n5cpu4-0001> I250514 06:34:44.395786 1 1@cli/start.go:997 ⋮ [T1,Vsystem,n1] 521 initiating graceful shutdown of server

Perhaps we should use longer timeouts in the test to account for this possibility during mixed-version testing. Alternatively, we could reduce the timeouts in our own server code so that it fails faster. That would be the code here: https://github.com/cockroachdb/cockroach/blob/master/pkg/server/statements.go#L77-L94

It's pretty suspicious that we set the noTimeout option here. We should reconsider that.

dhartunian added a commit to dhartunian/cockroach that referenced this issue May 27, 2025
Previously, we had some HTTP requests to SQL Stats which did not
retry. Retries have been added there.

Additionally, we will retry on HTTP errors as well to deal with
timeout errors and cluster network issues. This test is bringing nodes
up and down during upgrades and a retry is sometimes necessary to make
sure we don't get stuck waiting on a draining node.

Resolves: cockroachdb#146699

Release note: None
@dhartunian

Ended up adding retries to the test, not the server. I attempted adding them to the server but realized that it is best to push this down to the client and have the server not mask the issue. Otherwise, we end up with more conditional behavior on the server and need to cross-reference it with timeout settings. It is simpler for the client to expect to retry.
