[OKD-SCOS v4.16] UPI install with agent installer or assisted installer fails for 4.16.0-0.okd-scos-2024-08-01-132038 #2015
-
I've just checked: the agent-based installer is not covered on docs.okd.io; you can only find it on docs.openshift.com. Maybe this method is not supported by OKD!
-
Follow-up
Also fails with the Assisted Installer: The installation fails with the Assisted Installer in the same way. Not a big surprise, as the assisted installer uses the agent installer under the hood.
Workaround for a success with the Assisted Installer: The installation works when replacing the embedded FCOS bootstrap OS with RHCOS:
Workaround for a success with the Agent Installer (ABI): Overriding the bootstrap OS image with an RHCOS image also makes the installation succeed when installing OKD via ABI, by using the following when building the install image:
I did not choose a random bootstrap OS image; this is the one for v4.16 used for an OCP installation via the ABI, as specified here: https://github.com/openshift/assisted-service/blob/d3324b06a7c7772f4619c3ab13dd8c0706e55fd9/deploy/podman/configmap.yml#L25 Q:
-
Hi, I have encountered the exact same issue. Overriding the bootstrap OS image with an RHCOS image as proposed does not work for me, because my lab hardware uses an LSI controller (Dell PERC H310 with IT firmware) that is not recognized by the RHCOS image (the mpt3sas driver on RHCOS has support for this hardware disabled). The other alternative, using openshift-install-linux-4.15.0-0.okd-scos-2024-01-18-223523, does not work for me either: the rendezvous host fails to start the bootstrap with the error 'pull secret for new cluster is invalid: pull secret must contain auth for "registry.ci.openshift.org"'. @titou10titou10, shouldn't you open an issue about this problem? Best Regards, Alain
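For anyone hitting the same pull-secret error, here is a minimal sketch for checking a pull secret locally before building the image. It assumes the usual pull-secret layout (a JSON object with a top-level `auths` map keyed by registry host); the function name `has_registry_auth` and the placeholder credential are hypothetical, and this is not the installer's own validation logic, only an approximation of what the error message describes.

```python
import json


def has_registry_auth(pull_secret_text: str,
                      registry: str = "registry.ci.openshift.org") -> bool:
    """Return True if the pull secret JSON has a non-empty auth for `registry`.

    Hypothetical helper: it only inspects the JSON structure and does not
    contact the registry or verify that the credential is valid.
    """
    try:
        data = json.loads(pull_secret_text)
    except json.JSONDecodeError:
        return False
    # Assumed layout: {"auths": {"<registry host>": {"auth": "<base64>"}}}
    auths = data.get("auths") if isinstance(data, dict) else None
    if not isinstance(auths, dict):
        return False
    entry = auths.get(registry)
    return isinstance(entry, dict) and bool(entry.get("auth"))


if __name__ == "__main__":
    # Placeholder secret for illustration only, not a real credential.
    sample = '{"auths": {"quay.io": {"auth": "cGxhY2Vob2xkZXI="}}}'
    print(has_registry_auth(sample))
```

With the sample above, the check reports that no entry for registry.ci.openshift.org is present, which is the situation the error message complains about.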
-
I reproduced this problem with the 4.16.0-0.okd-scos-2024-09-27-110344 image on a Dell PowerEdge R740. I also verified the proposed RHCOS workaround, which worked fine for me.
-
We have OKD releases now based on SCOS bootimages, so this should not be a problem anymore. @titou10titou10 could you give 4.19.0-okd-scos.ec.5 a try - for SNO as well as assisted/agent? Thanks!
-
Test install of a SNO with ABI of v4.19.0-okd-scos.ec.5 without overriding the bootstrap image with an RH image:
Last remark. I have to re-verify this, but during the first boot, the boot process stops with a screen asking to configure the network, offering 2 choices: either "quit" or "configure". ... and I like the new 4.19 console visuals and new features (favorites, helm, ...), but that's another story lol
-
Thanks for the quick feedback:
This is because the "machine-os-content" image was still pulling in the RHCOS ISO; this has been fixed recently. It should be addressed in my next promotion of an ec image. Meanwhile you can use the latest image in the 4.19.0-0.okd-scos stream, but you need a pull secret for that.
Seems like other components are hitting this issue as well. See: https://issues.redhat.com//browse/OCPBUGS-54175. There is an ongoing fix for this in the MCO. For the issue with the assisted-service-db, could you please collect the journal logs for that unit so we can take a look?
-
OK. That makes sense. A similar fix is being discussed to add
-
@Prashanth684 If I understand correctly what happened, the just-merged fix will be in v4.20.0+, right? Will it be backported to v4.19 or v4.18?
-
The latest release nightly: 4.19.0-0.okd-scos-2025-03-28-055146 has the fix. I am going to promote this to an ec. Would appreciate if you could quickly test it though.
-
Test install of a SNO with ABI of v4.19.0-okd-scos.ec.6 without overriding the bootstrap image with an RH image:
Alert: 100% of the console/console targets in openshift-console namespace have been unreachable for more than 15 minutes. This may be a symptom of network connectivity issues, down nodes, or failures within these components. Assess the health of the infrastructure and nodes running these targets and then contact support.
Message: E0330 13:32:38.406673 1 middleware.go:51] TOKEN_REVIEW: 'GET /metrics' unauthorized, invalid user token, [invalid bearer token, token lookup failed]
The experience is quite smooth now. Thanks
-
I tested a full 3+3 node cluster with v4.19.0-okd-scos.ec.6. It works well.
-
@Prashanth684 the problem with the postgresql startup reappeared in v4.19.0-okd-scos.ec.9!
-
I have to redo an install
You fixed this kind of message in previous versions... The final/running image is:
-
I was finally able to test it again...and it works. Everything seems ok... I don't know what to think...
-
Hi @Prashanth684! Thanks for fixing the issue with postgres startup in the assisted-service-db service. I can confirm that agent-based installation is working in recent releases (I have tested on 4.19.0-okd-scos.ec.8 and 4.19.0-okd-scos.ec.9). However, the problem still occurs in 4.16 releases after 4.16.0-0.okd-scos-2024-08-01-132038, as well as all 4.17 and 4.18 releases. Would it be possible to backport assisted-service#7458 at least to OKD 4.18 so that there is a stable release containing the fix? FWIW, it appears that the underlying problem is that, starting with 4.16.0-0.okd-scos-2024-09-24-151747, the
Compared to the prior working version...
-
Has anyone tested installing 4.19.0-okd-scos.ec.8 or 4.19.0-okd-scos.ec.9 with the Agent or Assisted Installer in an air-gapped environment? Does it work or not?
-
Context
Trying to install a SNO (Single Node) cluster:
It is important to note that the install works perfectly well with the exact same agent and install config files for
I also tried a multi-node cluster; it fails the same way.
Summary
It fails with the following error from the "release-image-pivot" service:
Details
install-config.yaml:
agent-config.yaml
The install fails after a few minutes: the node boots, but the process then fails, looping forever.
"kubelet" service:
"release-image-pivot" service:
So does the "release-image-pivot" service fail to start because of this problem?:
Other (pertinent?) info:
approve-csr.service
podman
Message from the installer: