docker-ssh-agent builds consistently timeout on ci.jenkins.io #4557

Closed
MarkEWaite opened this issue Feb 24, 2025 · 8 comments

Comments

@MarkEWaite

MarkEWaite commented Feb 24, 2025

Service(s)

ci.jenkins.io

Summary

The builds for remoting and docker-ssh-agent fail on ci.jenkins.io due to a timeout. The timeout failures first became visible after the ci.jenkins.io transition from Azure to AWS.

The failures due to timeout may have very different causes, since the remoting timeout seems to always be on Windows agents while the docker-ssh-agent timeout seems to be on Linux.

The remoting timeout is resolved by extending the timeout from 15 minutes to 25 minutes. The docker-ssh-agent timeout is not yet resolved. Both can be confirmed at:

Remoting
Docker SSH agent

Reproduction steps

  1. Open the docker-ssh-agent job to see the failures due to timeout
@MarkEWaite MarkEWaite added the triage Incoming issues that need review label Feb 24, 2025
@MarkEWaite
Author

MarkEWaite commented Feb 24, 2025

I've replayed the following remoting builds after removing the timeout argument from the call to buildPlugin:

master branch

If those builds are successful, then we only need to adjust the timeout in the remoting repository as a simple short term fix.

MarkEWaite added a commit to MarkEWaite/remoting that referenced this issue Feb 24, 2025
The Windows build time has increased from 7 minutes to 17 minutes.
While we're working to reduce the Windows build time, let's unblock this
build by allowing it to run longer.

jenkins-infra/helpdesk#4557 is the Jenkins
infra help desk that records the issue with ci.jenkins.io Windows build
performance.

Jobs that passed in less than 20 minutes (but more than 15 minutes):

* https://ci.jenkins.io/job/Core/job/remoting/job/master/796/
* https://ci.jenkins.io/job/Core/job/remoting/view/change-requests/job/PR-781/
* https://ci.jenkins.io/job/Core/job/remoting/view/change-requests/job/PR-782/
@MarkEWaite
Author

MarkEWaite commented Feb 24, 2025

Increase of the remoting timeout from 15 minutes to 25 minutes has been merged from:

MarkEWaite added a commit to jenkinsci/remoting that referenced this issue Feb 24, 2025
The Windows build time has increased from 7 minutes to 17 minutes.
While we're working to reduce the Windows build time, let's unblock this
build by allowing it to run longer.

jenkins-infra/helpdesk#4557 is the Jenkins
infra help desk that records the issue with ci.jenkins.io Windows build
performance.

Jobs that passed in less than 20 minutes (but more than 15 minutes):

* https://ci.jenkins.io/job/Core/job/remoting/job/master/796/
* https://ci.jenkins.io/job/Core/job/remoting/view/change-requests/job/PR-781/
* https://ci.jenkins.io/job/Core/job/remoting/view/change-requests/job/PR-782/
@smerle33 smerle33 added this to the infra-team-sync-2025-02-25 milestone Feb 24, 2025
@MarkEWaite MarkEWaite changed the title from "Remoting and docker-ssh-agent builds consistently timeout on ci.jenkins.io" to "docker-ssh-agent builds consistently timeout on ci.jenkins.io" Feb 24, 2025
@MarkEWaite MarkEWaite removed the triage Incoming issues that need review label Feb 25, 2025
@smerle33
Contributor

It's not due to a lack of memory or CPU: the metrics are OK. It seems more like a problem with Docker on Windows.
Still investigating.

@smerle33
Contributor

I ran the tests on a Windows VM created from the agent template and they all passed:

Describing [jenkins/ssh-agent:windowsservercore-ltsc2019-jdk17] image has setup-sshd.ps1 in the correct location
  [+] has setup-sshd.ps1 in C:/ProgramData/Jenkins 2.59s (2.58s|9ms)

Describing [jenkins/ssh-agent:windowsservercore-ltsc2019-jdk17] checking image metadata
  [+] has correct volumes 66ms (59ms|7ms)
  [+] has the source GitHub URL in docker metadata 65ms (62ms|3ms)

Describing [jenkins/ssh-agent:windowsservercore-ltsc2019-jdk17] image has correct version of java and git-lfs installed and in the PATH
  [+] has java installed and in the path 4.62s (4.61s|10ms)
  [+] has git-lfs (and thus git) installed and in the path 1.93s (1.92s|5ms)

Describing [jenkins/ssh-agent:windowsservercore-ltsc2019-jdk17] create agent container with pubkey as argument
  [+] runs commands via ssh 5.1s (5.09s|8ms)

Describing [jenkins/ssh-agent:windowsservercore-ltsc2019-jdk17] create agent container with pubkey as envvar
  [+] runs commands via ssh 15.12s (15.11s|8ms)

Describing [jenkins/ssh-agent:windowsservercore-ltsc2019-jdk17] create agent container like docker-plugin with '/usr/sbin/sshd -D -p 22' as argument
  [+] runs commands via ssh 5.11s (5.1s|10ms)

Describing [jenkins/ssh-agent:windowsservercore-ltsc2019-jdk17] build args
  [+] uses build args correctly 65.68s (65.67s|9ms)
Tests completed in 259.5s
Tests Passed: 10, Failed: 0, Skipped: 0 NotRun: 0
There were 10 passed tests in jenkins/ssh-agent:windowsservercore-ltsc2019-jdk17

So we need to test on a currently running agent, which would require some network changes to allow an RDS connection.

@lemeurherveCB

Here is a PR that disables the 3 failing tests until this infra issue is resolved, so that new releases can go out in the meantime. It is ready for review and merging:

dduportal added a commit to jenkinsci/docker-ssh-agent that referenced this issue Mar 4, 2025
…g-test-for-now

chore: disable failing test(s) until jenkins-infra/helpdesk#4557 infra issue is resolved
dduportal added a commit to jenkinsci/docker-ssh-agent that referenced this issue Mar 7, 2025
@dduportal
Contributor

dduportal commented Mar 7, 2025

Update: the "longpath" for git is now enabled as per #4574 (comment) and we were able to release a new version.

Let's resume work to understand what the problem is when running the last tests on an EC2 agent: jenkinsci/docker-ssh-agent#496

@dduportal
Contributor

Note: the jenkinsci/docker-agent repository also seems to have the same kind of issue as reported and described by @lemeurherveCB

Example in jenkinsci/docker-agent#949 for instance.

Symptoms are close (stuck tests in Pester harness, not on Linux), but there might be different root causes. @lemeurherveCB is expected to try without the --interactive flag for Docker (unneeded in any case), and to use the same kind of technique as I did in jenkinsci/docker-ssh-agent#496 to detect which instruction is stuck
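
The technique used in jenkinsci/docker-ssh-agent#496 is not described in this thread. As a generic illustration only, one way to find which instruction is stuck is to run each instruction as its own subprocess with a per-step timeout and log how long it took. The sketch below is in Python rather than the Pester/PowerShell harness the tests actually use. The `run_steps` helper and the `demo_steps` commands are hypothetical stand-ins, not the real test instructions.

```python
import subprocess
import sys
import time


def run_steps(steps, timeout_s):
    """Run (name, argv) steps in order and stop at the first one that hangs.

    Hypothetical helper for illustration; not taken from the docker-ssh-agent
    test harness.
    """
    results = []
    for name, argv in steps:
        started = time.monotonic()
        try:
            # subprocess.run kills the child and raises TimeoutExpired
            # when the timeout elapses.
            proc = subprocess.run(
                argv, capture_output=True, text=True, timeout=timeout_s
            )
            status = "ok" if proc.returncode == 0 else f"exit {proc.returncode}"
        except subprocess.TimeoutExpired:
            status = "stuck"
        elapsed = time.monotonic() - started
        results.append((name, status, round(elapsed, 2)))
        # Flush so the log shows the last completed step even if the job is killed.
        print(f"{name}: {status} after {elapsed:.2f}s", flush=True)
        if status == "stuck":
            break
    return results


# Hypothetical stand-ins for the real docker/ssh instructions.
demo_steps = [
    ("quick step", [sys.executable, "-c", "print('hello')"]),
    ("hanging step", [sys.executable, "-c", "import time; time.sleep(30)"]),
    ("never reached", [sys.executable, "-c", "print('unreachable')"]),
]
```

Calling `run_steps(demo_steps, timeout_s=5)` should report the first step as `ok` and the second as `stuck`, and never start the third. In a real investigation the steps would be the individual `docker` commands issued by the failing test.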

dduportal added a commit to jenkinsci/docker-ssh-agent that referenced this issue Apr 10, 2025
…-failing-test-for-now

Revert "chore: disable failing test(s) until jenkins-infra/helpdesk#4557 infra issue is resolved"
@dduportal
Contributor

Closing as per jenkinsci/docker-ssh-agent#496 (tests are back)
