
Failing "odo link" tests are blocking CI system #4301

Closed
dharmit opened this issue Dec 10, 2020 · 16 comments · Fixed by #4345 or #4428
Assignees
Labels
area/binding: Issues or PRs related to `odo add/delete binding *` commands or Service Binding Operator
flake: Categorizes issue or PR as related to a flaky test.
priority/High: Important issue; should be worked on before any other issues (except priority/Critical issue(s)).

Comments

@dharmit
Member

dharmit commented Dec 10, 2020

/kind bug
/area linking

odo doesn't fail if, at the time of running odo link <cr-name>/<cr-instance-name>, the Secret fails to get created on the cluster. Creating this Secret is a task performed by the Service Binding Operator. But if odo doesn't fail when such a Secret doesn't get created, it can cause user confusion as well as CI issues like the ones we're seeing in our environment.

Originally posted by @dharmit in #3256 (comment)
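
For illustration, here is a minimal sketch of the kind of fail-fast check being asked for, assuming odo knows the name of the Secret it expects the Service Binding Operator to create. The package, function names, and timings are hypothetical; this is not odo's actual code.

package link

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForLinkSecret polls until the Secret that the Service Binding Operator is
// expected to create shows up in the namespace, or the timeout expires.
func waitForLinkSecret(client kubernetes.Interface, ns, secretName string, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		_, err := client.CoreV1().Secrets(ns).Get(context.TODO(), secretName, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return false, nil // not created yet; keep polling
		}
		if err != nil {
			return false, err // unexpected error; stop polling
		}
		return true, nil
	})
}

// ensureLinked turns a missing Secret into a hard failure for the link operation
// instead of a silent success.
func ensureLinked(client kubernetes.Interface, ns, secretName string) error {
	if err := waitForLinkSecret(client, ns, secretName, 2*time.Minute); err != nil {
		return fmt.Errorf("secret %q was not created by the Service Binding Operator: %w", secretName, err)
	}
	return nil
}

The point is only that a missing Secret should surface as an error from odo link or odo push rather than as a later, harder-to-debug pod failure.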

@openshift-ci-robot added the kind/bug and area/binding labels on Dec 10, 2020
@dharmit added the flake label on Dec 10, 2020
@dharmit self-assigned this on Dec 28, 2020
@dharmit
Member Author

dharmit commented Dec 28, 2020

Picking this up for Sprint 195 since I'm observing these failures show up often in periodic jobs.

Failures related to this issue in the past two days:

  1. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-odo-master-v4.5-integration-e2e-periodic/1342893098916646912
  2. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-odo-master-v4.3-integration-e2e-periodic/1342893097700298752
  3. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-odo-master-v4.2-integration-e2e-periodic/1342893097691910144
  4. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-odo-master-v4.5-integration-e2e-periodic/1342983737163386880
  5. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-odo-master-v4.6-integration-e2e-periodic/1343074297324769280
  6. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-odo-master-v4.5-integration-e2e-periodic/1343164894966452224
  7. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-odo-master-v4.5-integration-e2e-periodic/1343255491089797120
  8. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-odo-master-v4.2-integration-e2e-periodic/1343255489730842624
  9. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-odo-master-v4.4-integration-e2e-periodic/1343346142175301632
  10. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-odo-master-v4.5-integration-e2e-periodic/1343436689812492288

That's 10 out of 15 total failures in the past two days attributed to this issue. 😞

@dharmit added the triage/needs-information label on Dec 28, 2020
@dharmit
Member Author

dharmit commented Dec 28, 2020

What's interesting is that this test hasn't failed nearly as often in the integration tests as it has in the periodic e2e tests. Before I make any change in odo code w.r.t. this issue, I'd prefer we have more information about the cause of such erratic behaviour.

@prietyc123 @mohammedzee1000 can you folks shed any light on why we're seeing this mostly in the periodic tests?

@prietyc123
Contributor

prietyc123 commented Dec 30, 2020

What's interesting is that this test hasn't failed nearly as often in the integration tests as it has in the periodic e2e tests.

This is really weird behaviour and, more interestingly, we are hitting it on PSI as well: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_odo/4063/pull-ci-openshift-odo-master-v4.6-integration-e2e/1344158957064687616#1:build-log.txt%3A1371 . It's kind of a blocker for the POC PR #4063.

Before I make any change in odo code w.r.t. this issue, I'd prefer we have more information about the cause of such erratic behaviour.

I think if we want to get more information on this, the only way would be to add more debug logs.

/high-priority
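
As a small, hypothetical example of the extra debug logging suggested above, assuming klog (which appears to back odo's -v verbosity); the function and messages below are placeholders:

package link

import "k8s.io/klog"

// requestLink is a placeholder showing where extra V(4) debug lines could go so
// that failures in periodic jobs leave more of a trail in the -v 4 logs.
func requestLink(ns, service string) {
	klog.V(4).Infof("requesting link to service %q in namespace %q", service, ns)
	// ... existing linking logic ...
	klog.V(4).Infof("waiting for the Service Binding Operator to create the Secret for %q", service)
}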

@prietyc123 added the priority/High label on Dec 30, 2020
@dharmit
Member Author

dharmit commented Dec 30, 2020

I think if we want to get more info on this then the only way would be adding more debug logs.

Can we have access to the system that's hosting the tests when we hit this failure? That would make it much easier for me to troubleshoot the problem. cc @mohammedzee1000 (for PSI).

@dharmit changed the title from 'odo should throw error if a Secret is not created during "odo link"' to 'Failing "odo link" tests are blocking CI system' on Jan 5, 2021
@dharmit
Member Author

dharmit commented Jan 5, 2021

odo doesn't fail if, at the time of running odo link <cr-name>/<cr-instance-name>, the Secret fails to get created on the cluster. Creating this Secret is a task performed by the Service Binding Operator. But if odo doesn't fail when such a Secret doesn't get created, it can cause user confusion as well as CI issues like the ones we're seeing in our environment.

The above explanation (in the issue description) is based on an initial observation. After logging into a cluster where we are seeing this issue repeatedly as part of PR #4063, I observed that the test spec is filled with a lot of checks. https://github.com/openshift/odo/blob/e6be2586f5824bade91599f24686def26b53f1ee/tests/integration/operatorhub/cmd_service_test.go#L410

The check where things are failing is testing an edge case that could be put in a separate spec of its own. https://github.com/openshift/odo/blob/e6be2586f5824bade91599f24686def26b53f1ee/tests/integration/operatorhub/cmd_service_test.go#L448-L453
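
As a rough, hypothetical sketch of what pulling that check into its own spec could look like; the scenario, the expected error text, and the helper.CmdShouldFail call are placeholders modelled on the suite's conventions, not the actual change:

package operatorhub

import (
	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"

	"github.com/openshift/odo/tests/helper"
)

// The edge case gets its own It block, so a failure here no longer takes down the
// larger linking spec. The service name and error text are placeholders.
var _ = Describe("odo link edge cases", func() {
	It("should fail cleanly when linking to a service that does not exist", func() {
		stdErr := helper.CmdShouldFail("odo", "link", "EtcdCluster/doesnotexist")
		Expect(stdErr).To(ContainSubstring("not found"))
	})
})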

@dharmit
Member Author

dharmit commented Jan 5, 2021

@mohammedzee1000 @prietyc123 I'm going to open a PR that moves the edge-case check mentioned in #4301 (comment) into a spec of its own. I think it should help fix the issue.

@dharmit
Member Author

dharmit commented Jan 12, 2021

@prietyc123 @mohammedzee1000 PTAL #4338 (comment).

I won't be surprised if the problem we were tracking in this issue is still a troublemaker for #4063 and other places (like periodic jobs).

@prietyc123
Contributor

We are also hitting the issue more on CI. Recently observed: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_odo/4317/pull-ci-openshift-odo-master-v4.6-integration-e2e/1349188058104205312#1:build-log.txt%3A1091
I think the expectation of this issue has not been met, so I am reopening it.

@prietyc123 reopened this on Jan 13, 2021
@dharmit removed the kind/bug and triage/needs-information labels on Jan 21, 2021
@anandrkskd
Contributor

@dharmit
Member Author

dharmit commented Feb 10, 2021

This issue is happening because of a failure to find a pod that belongs to the component. The exact place where it's failing is https://github.com/openshift/odo/blob/ede98da442cdfdd8691a9b7ba4b3421f5507bb33/pkg/devfile/adapters/kubernetes/component/adapter.go#L118
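
For context, a generic sketch of what that pod lookup amounts to; this is not the adapter's actual implementation, and the label selector and function name are assumptions:

package component

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForComponentPod lists pods matching the component's label selector and waits
// until one of them is Running.
func waitForComponentPod(client kubernetes.Interface, ns, componentName string, timeout time.Duration) (*corev1.Pod, error) {
	var pod *corev1.Pod
	selector := fmt.Sprintf("component=%s", componentName) // placeholder selector
	err := wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		pods, err := client.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{LabelSelector: selector})
		if err != nil {
			return false, err
		}
		for i := range pods.Items {
			if pods.Items[i].Status.Phase == corev1.PodRunning {
				pod = &pods.Items[i]
				return true, nil
			}
		}
		return false, nil // no running pod yet; keep waiting
	})
	return pod, err
}

A pod stuck in CreateContainerConfigError, as shown in the next comment, never reaches Running, so a lookup like this eventually times out.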

@dharmit
Member Author

dharmit commented Feb 10, 2021

When I looked into the system, the pod had failed to come up:

$ kubectl get pods
NAME                      READY   STATUS                       RESTARTS   AGE
example-95glrr7px7        1/1     Running                      0          129m
example-c4n2564mcj        1/1     Running                      0          129m
example-f42jp26w7m        1/1     Running                      0          129m
ydpohr-69b6b8cfb5-gl4km   0/1     CreateContainerConfigError   0          129m

The reason for the failure, from the Events:

Events:
  Type     Reason          Age                     From                         Message
  ----     ------          ----                    ----                         -------
  Normal   Scheduled       129m                    default-scheduler            Successfully assigned vqzqvocxjo/ydpohr-69b6b8cfb5-gl4km to crc-lf65c-master-0
  Normal   AddedInterface  129m                    multus                       Add eth0 [10.116.0.116/23]
  Normal   Pulled          129m                    kubelet, crc-lf65c-master-0  Container image "registry.access.redhat.com/ocp-tools-4/odo-init-container-rhel8:1.1.10" already present on machine
  Normal   Created         129m                    kubelet, crc-lf65c-master-0  Created container copy-supervisord
  Normal   Started         129m                    kubelet, crc-lf65c-master-0  Started container copy-supervisord
  Normal   Pulled          129m                    kubelet, crc-lf65c-master-0  Successfully pulled image "registry.access.redhat.com/ubi8/nodejs-12:1-36" in 1.09965611s
  Warning  Failed          129m                    kubelet, crc-lf65c-master-0  Error: ErrImagePull
  Warning  Failed          129m                    kubelet, crc-lf65c-master-0  Failed to pull image "registry.access.redhat.com/ubi8/nodejs-12:1-36": rpc error: code = Unknown desc = can't talk to a V1 docker registry
  Normal   BackOff         129m                    kubelet, crc-lf65c-master-0  Back-off pulling image "registry.access.redhat.com/ubi8/nodejs-12:1-36"
  Warning  Failed          129m                    kubelet, crc-lf65c-master-0  Error: ImagePullBackOff
  Normal   Pulled          129m                    kubelet, crc-lf65c-master-0  Successfully pulled image "registry.access.redhat.com/ubi8/nodejs-12:1-36" in 1.260864191s
  Normal   Pulled          128m                    kubelet, crc-lf65c-master-0  Successfully pulled image "registry.access.redhat.com/ubi8/nodejs-12:1-36" in 1.804881284s
  Normal   Pulled          128m                    kubelet, crc-lf65c-master-0  Successfully pulled image "registry.access.redhat.com/ubi8/nodejs-12:1-36" in 1.412216534s
  Normal   Pulled          128m                    kubelet, crc-lf65c-master-0  Successfully pulled image "registry.access.redhat.com/ubi8/nodejs-12:1-36" in 1.426678755s
  Warning  Failed          74m (x204 over 129m)    kubelet, crc-lf65c-master-0  Error: secret "ydpohr-etcdcluster-example" not found
  Normal   Pulling         9m39s (x457 over 129m)  kubelet, crc-lf65c-master-0  Pulling image "registry.access.redhat.com/ubi8/nodejs-12:1-36"
  Normal   Pulled          4m39s (x462 over 127m)  kubelet, crc-lf65c-master-0  (combined from similar events): Successfully pulled image "registry.access.redhat.com/ubi8/nodejs-12:1-36" in 1.319918792s

@dharmit
Member Author

dharmit commented Feb 10, 2021

My understanding right now is as follows.

One thing I'm thinking of trying out is adding some timeout after the odo push that follows an odo link. My guess is that the pod takes some time to come up after doing odo link and odo push. By the time it comes up, odo unlink has already deleted the secret that odo push is looking for.
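
As a hypothetical test-side sketch of that idea: rather than a bare sleep, poll until the SBO-created secret is visible before moving on. The helper.CmdShouldPass call mirrors the suite's existing helpers and the secret name is a placeholder, so treat this as a sketch of the approach rather than the actual test code.

package operatorhub

import (
	"time"

	. "github.com/onsi/gomega"

	"github.com/openshift/odo/tests/helper"
)

// pollForLinkSecret waits until the secret created by the Service Binding Operator
// shows up in the current namespace, failing the spec if it never does.
func pollForLinkSecret(secretName string) {
	Eventually(func() string {
		// "oc get secrets -o name" succeeds whether or not the secret exists yet,
		// so the retry is driven purely by the matcher below.
		return helper.CmdShouldPass("oc", "get", "secrets", "-o", "name")
	}, 2*time.Minute, 5*time.Second).Should(ContainSubstring(secretName))
}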

@dharmit
Member Author

dharmit commented Feb 11, 2021

One thing I'm thinking of trying out is adding some timeout after the odo push that follows an odo link.

FWIW, adding some sleep did help. I've opened #4428, which should hopefully fix this! 🤞

@prietyc123
Contributor

@dharmit I can still observe the same failure 🙁 on periodic jobs.

✗  Waiting for component to start [5m] [WARNING x7: Failed]
[odo]  ✗  Failed to start component with name zgiiqh. Error: Failed to create the component: error while waiting for deployment rollout: timeout while waiting for zgiiqh deployment roll out\nFor more information to help determine the cause of the error, re-run with '-v'.
[odo] See below for a list of failed events that occured more than 5 times during deployment:
[odo] 
[odo]  NAME                                      COUNT  REASON  MESSAGE                        
[odo] 
[odo]  zgiiqh-659c8b9987-lbh6g.1670fd23c61ae2bd  7      Failed  Error: secret                  
[odo]                                                           "zgiiqh-etcdcluster-example"   
[odo]                                                           not found                      
[odo] 
[odo] 

Log details: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-odo-master-v4.6-integration-e2e-periodic/1376685658797510656#1:build-log.txt%3A1593

Note: It's not OCP version specific, as I can also observe it on 4.7.

@prietyc123 reopened this on Mar 30, 2021
@dharmit
Member Author

dharmit commented Mar 30, 2021

@prietyc123 thanks for the info! I expect this to get fixed with #4554, wherein we stop looking for an assumed secret name. The assumption was valid until SBO changed the nomenclature for the Secret.
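
A purely illustrative sketch of that direction: discover the secret on the cluster instead of deriving its name from a naming convention. The label selector shown is a placeholder; the actual selection logic in #4554 may well differ.

package link

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// findLinkSecret looks up the linking secret by a label selector instead of
// assuming a particular name, so a change in SBO's nomenclature doesn't break it.
func findLinkSecret(client kubernetes.Interface, ns, component string) (*corev1.Secret, error) {
	selector := fmt.Sprintf("app.kubernetes.io/instance=%s", component) // placeholder selector
	list, err := client.CoreV1().Secrets(ns).List(context.TODO(), metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return nil, err
	}
	if len(list.Items) == 0 {
		return nil, fmt.Errorf("no linking secret found for component %q", component)
	}
	return &list.Items[0], nil
}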

@prietyc123
Contributor

Closed via #4554
