
Failing "odo link" tests are blocking CI system #4301

Closed
dharmit opened this issue Dec 10, 2020 · 16 comments · Fixed by #4345 or #4428
Assignees
Labels
area/binding: Issues or PRs related to `odo add/delete binding *` commands or Service Binding Operator
flake: Categorizes issue or PR as related to a flaky test.
priority/High: Important issue; should be worked on before any other issues (except priority/Critical issue(s)).

Comments

@dharmit
Member

dharmit commented Dec 10, 2020

/kind bug
/area linking

odo doesn't fail if, at the time of running odo link <cr-name>/<cr-instance-name>, the Secret fails to get created on the cluster. Creating this Secret is a task performed by the Service Binding Operator. But if odo doesn't fail when such a Secret doesn't get created, it can cause user confusion as well as CI issues like the ones we're seeing in our environment.

Originally posted by @dharmit in #3256 (comment)
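
For illustration, here is a minimal sketch of the kind of fail-fast check being asked for, assuming odo knows the name of the Secret it expects the Service Binding Operator to create. The package, function names, and timings are hypothetical; this is not odo's actual code.

package link

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForLinkSecret polls until the Secret that the Service Binding Operator is
// expected to create shows up in the namespace, or the timeout expires.
func waitForLinkSecret(client kubernetes.Interface, ns, secretName string, timeout time.Duration) error {
	return wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		_, err := client.CoreV1().Secrets(ns).Get(context.TODO(), secretName, metav1.GetOptions{})
		if apierrors.IsNotFound(err) {
			return false, nil // not created yet; keep polling
		}
		if err != nil {
			return false, err // unexpected error; stop polling
		}
		return true, nil
	})
}

// ensureLinked turns a missing Secret into a hard failure for the link operation
// instead of a silent success.
func ensureLinked(client kubernetes.Interface, ns, secretName string) error {
	if err := waitForLinkSecret(client, ns, secretName, 2*time.Minute); err != nil {
		return fmt.Errorf("secret %q was not created by the Service Binding Operator: %w", secretName, err)
	}
	return nil
}

The point is only that a missing Secret should surface as an error from odo link or odo push rather than as a later, harder-to-debug pod failure.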

@openshift-ci-robot added the kind/bug and area/binding labels on Dec 10, 2020
@dharmit added the flake label on Dec 10, 2020
@dharmit self-assigned this on Dec 28, 2020
@dharmit
Member Author

dharmit commented Dec 28, 2020

Picking this up for Sprint 195 since I'm observing these failures show up often in periodic jobs.

Failures related to this issue in the past two days:

  1. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-odo-master-v4.5-integration-e2e-periodic/1342893098916646912
  2. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-odo-master-v4.3-integration-e2e-periodic/1342893097700298752
  3. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-odo-master-v4.2-integration-e2e-periodic/1342893097691910144
  4. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-odo-master-v4.5-integration-e2e-periodic/1342983737163386880
  5. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-odo-master-v4.6-integration-e2e-periodic/1343074297324769280
  6. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-odo-master-v4.5-integration-e2e-periodic/1343164894966452224
  7. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-odo-master-v4.5-integration-e2e-periodic/1343255491089797120
  8. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-odo-master-v4.2-integration-e2e-periodic/1343255489730842624
  9. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-odo-master-v4.4-integration-e2e-periodic/1343346142175301632
  10. https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-odo-master-v4.5-integration-e2e-periodic/1343436689812492288

That's 10 out of 15 total failures in the past two days attributed to this issue. 😞

@dharmit added the triage/needs-information label on Dec 28, 2020
@dharmit
Member Author

dharmit commented Dec 28, 2020

What's interesting is that this test hasn't failed nearly as often in the integration tests as it has in the periodic e2e tests. Before I make any change in odo code w.r.t. this issue, I'd prefer we have more information about the cause of such erratic behaviour.

@prietyc123 @mohammedzee1000 can you folks shed any light on why we're seeing this mostly in the periodic tests?

@prietyc123
Contributor

prietyc123 commented Dec 30, 2020

What's interesting is that this test hasn't failed nearly as often in the integration tests as it has in the periodic e2e tests.

This is really weird behaviour and, more interestingly, we are hitting it on PSI as well: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_odo/4063/pull-ci-openshift-odo-master-v4.6-integration-e2e/1344158957064687616#1:build-log.txt%3A1371 . It's kind of a blocker for the POC PR #4063.

Before I make any change in odo code w.r.t. this issue, I'd prefer we have more information about the cause of such erratic behaviour.

I think if we want to get more information on this, the only way would be to add more debug logs.

/high-priority
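
As a small, hypothetical example of the extra debug logging suggested above, assuming klog (which appears to back odo's -v verbosity); the function and messages below are placeholders:

package link

import "k8s.io/klog"

// requestLink is a placeholder showing where extra V(4) debug lines could go so
// that failures in periodic jobs leave more of a trail in the -v 4 logs.
func requestLink(ns, service string) {
	klog.V(4).Infof("requesting link to service %q in namespace %q", service, ns)
	// ... existing linking logic ...
	klog.V(4).Infof("waiting for the Service Binding Operator to create the Secret for %q", service)
}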

@prietyc123 added the priority/High label on Dec 30, 2020
@dharmit
Member Author

dharmit commented Dec 30, 2020

I think if we want to get more info on this then the only way would be adding more debug logs.

Can we have access to the system that's hosting the tests when we hit this failure? That would make it much easier for me to troubleshoot the problem. cc @mohammedzee1000 (for PSI).

@dharmit changed the title from 'odo should throw error if a Secret is not created during "odo link"' to 'Failing "odo link" tests are blocking CI system' on Jan 5, 2021
@dharmit
Member Author

dharmit commented Jan 5, 2021

odo doesn't fail if, at the time of running odo link <cr-name>/<cr-instance-name>, the Secret fails to get created on the cluster. Creating this Secret is a task performed by the Service Binding Operator. But if odo doesn't fail when such a Secret doesn't get created, it can cause user confusion as well as CI issues like the ones we're seeing in our environment.

The above explanation (in the issue description) is based on an initial observation. After logging into a cluster where we are seeing this issue repeatedly as part of PR #4063, I observed that the test spec is filled with a lot of checks. https://github.com/openshift/odo/blob/e6be2586f5824bade91599f24686def26b53f1ee/tests/integration/operatorhub/cmd_service_test.go#L410

The check where things are failing is testing an edge case that could be put in a separate spec of its own. https://github.com/openshift/odo/blob/e6be2586f5824bade91599f24686def26b53f1ee/tests/integration/operatorhub/cmd_service_test.go#L448-L453
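
As a rough, hypothetical sketch of what pulling that check into its own spec could look like; the scenario, the expected error text, and the helper.CmdShouldFail call are placeholders modelled on the suite's conventions, not the actual change:

package operatorhub

import (
	. "github.com/onsi/ginkgo"
	. "github.com/onsi/gomega"

	"github.com/openshift/odo/tests/helper"
)

// The edge case gets its own It block, so a failure here no longer takes down the
// larger linking spec. The service name and error text are placeholders.
var _ = Describe("odo link edge cases", func() {
	It("should fail cleanly when linking to a service that does not exist", func() {
		stdErr := helper.CmdShouldFail("odo", "link", "EtcdCluster/doesnotexist")
		Expect(stdErr).To(ContainSubstring("not found"))
	})
})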

@dharmit
Member Author

dharmit commented Jan 5, 2021

@mohammedzee1000 @prietyc123 I'm going to open a PR that moves the edge-case check mentioned in #4301 (comment) into a spec of its own. I think it should help fix the issue.

@dharmit
Member Author

dharmit commented Jan 12, 2021

@prietyc123 @mohammedzee1000 PTAL #4338 (comment).

I won't be surprised if the problem we were tracking in this issue is still a troublemaker for #4063 and other places (like periodic jobs).

@prietyc123
Contributor

We are also hitting the issue more on CI. Recently observed: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/openshift_odo/4317/pull-ci-openshift-odo-master-v4.6-integration-e2e/1349188058104205312#1:build-log.txt%3A1091
I think the expectation of this issue has not been met, so I am reopening it.

@prietyc123 reopened this on Jan 13, 2021
@dharmit removed the kind/bug and triage/needs-information labels on Jan 21, 2021
@anandrkskd
Contributor

@dharmit
Member Author

dharmit commented Feb 10, 2021

This issue is happening because of a failure to find a pod that belongs to the component. The exact place where it's failing is https://github.com/openshift/odo/blob/ede98da442cdfdd8691a9b7ba4b3421f5507bb33/pkg/devfile/adapters/kubernetes/component/adapter.go#L118
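
For context, a generic sketch of what that pod lookup amounts to; this is not the adapter's actual implementation, and the label selector and function name are assumptions:

package component

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

// waitForComponentPod lists pods matching the component's label selector and waits
// until one of them is Running.
func waitForComponentPod(client kubernetes.Interface, ns, componentName string, timeout time.Duration) (*corev1.Pod, error) {
	var pod *corev1.Pod
	selector := fmt.Sprintf("component=%s", componentName) // placeholder selector
	err := wait.PollImmediate(2*time.Second, timeout, func() (bool, error) {
		pods, err := client.CoreV1().Pods(ns).List(context.TODO(), metav1.ListOptions{LabelSelector: selector})
		if err != nil {
			return false, err
		}
		for i := range pods.Items {
			if pods.Items[i].Status.Phase == corev1.PodRunning {
				pod = &pods.Items[i]
				return true, nil
			}
		}
		return false, nil // no running pod yet; keep waiting
	})
	return pod, err
}

A pod stuck in CreateContainerConfigError, as shown in the next comment, never reaches Running, so a lookup like this eventually times out.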

@dharmit
Member Author

dharmit commented Feb 10, 2021

When I looked into the system, the pod had failed to come up:

$ kubectl get pods
NAME                      READY   STATUS                       RESTARTS   AGE
example-95glrr7px7        1/1     Running                      0          129m
example-c4n2564mcj        1/1     Running                      0          129m
example-f42jp26w7m        1/1     Running                      0          129m
ydpohr-69b6b8cfb5-gl4km   0/1     CreateContainerConfigError   0          129m

The reason for the failure, from the Events:

Events:
  Type     Reason          Age                     From                         Message
  ----     ------          ----                    ----                         -------
  Normal   Scheduled       129m                    default-scheduler            Successfully assigned vqzqvocxjo/ydpohr-69b6b8cfb5-gl4km to crc-lf65c-master-0
  Normal   AddedInterface  129m                    multus                       Add eth0 [10.116.0.116/23]
  Normal   Pulled          129m                    kubelet, crc-lf65c-master-0  Container image "registry.access.redhat.com/ocp-tools-4/odo-init-container-rhel8:1.1.10" already present on machine
  Normal   Created         129m                    kubelet, crc-lf65c-master-0  Created container copy-supervisord
  Normal   Started         129m                    kubelet, crc-lf65c-master-0  Started container copy-supervisord
  Normal   Pulled          129m                    kubelet, crc-lf65c-master-0  Successfully pulled image "registry.access.redhat.com/ubi8/nodejs-12:1-36" in 1.09965611s
  Warning  Failed          129m                    kubelet, crc-lf65c-master-0  Error: ErrImagePull
  Warning  Failed          129m                    kubelet, crc-lf65c-master-0  Failed to pull image "registry.access.redhat.com/ubi8/nodejs-12:1-36": rpc error: code = Unknown desc = can't talk to a V1 docker registry
  Normal   BackOff         129m                    kubelet, crc-lf65c-master-0  Back-off pulling image "registry.access.redhat.com/ubi8/nodejs-12:1-36"
  Warning  Failed          129m                    kubelet, crc-lf65c-master-0  Error: ImagePullBackOff
  Normal   Pulled          129m                    kubelet, crc-lf65c-master-0  Successfully pulled image "registry.access.redhat.com/ubi8/nodejs-12:1-36" in 1.260864191s
  Normal   Pulled          128m                    kubelet, crc-lf65c-master-0  Successfully pulled image "registry.access.redhat.com/ubi8/nodejs-12:1-36" in 1.804881284s
  Normal   Pulled          128m                    kubelet, crc-lf65c-master-0  Successfully pulled image "registry.access.redhat.com/ubi8/nodejs-12:1-36" in 1.412216534s
  Normal   Pulled          128m                    kubelet, crc-lf65c-master-0  Successfully pulled image "registry.access.redhat.com/ubi8/nodejs-12:1-36" in 1.426678755s
  Warning  Failed          74m (x204 over 129m)    kubelet, crc-lf65c-master-0  Error: secret "ydpohr-etcdcluster-example" not found
  Normal   Pulling         9m39s (x457 over 129m)  kubelet, crc-lf65c-master-0  Pulling image "registry.access.redhat.com/ubi8/nodejs-12:1-36"
  Normal   Pulled          4m39s (x462 over 127m)  kubelet, crc-lf65c-master-0  (combined from similar events): Successfully pulled image "registry.access.redhat.com/ubi8/nodejs-12:1-36" in 1.319918792s

@dharmit
Member Author

dharmit commented Feb 10, 2021

My understanding right now is as follows.

One thing I'm thinking of trying out is adding some timeout after the odo push that follows an odo link. My guess is that the pod takes some time to come up after doing odo link and odo push. By the time it comes up, odo unlink has already deleted the secret that odo push is looking for.
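
As a hypothetical test-side sketch of that idea: rather than a bare sleep, poll until the SBO-created secret is visible before moving on. The helper.CmdShouldPass call mirrors the suite's existing helpers and the secret name is a placeholder, so treat this as a sketch of the approach rather than the actual test code.

package operatorhub

import (
	"time"

	. "github.com/onsi/gomega"

	"github.com/openshift/odo/tests/helper"
)

// pollForLinkSecret waits until the secret created by the Service Binding Operator
// shows up in the current namespace, failing the spec if it never does.
func pollForLinkSecret(secretName string) {
	Eventually(func() string {
		// "oc get secrets -o name" succeeds whether or not the secret exists yet,
		// so the retry is driven purely by the matcher below.
		return helper.CmdShouldPass("oc", "get", "secrets", "-o", "name")
	}, 2*time.Minute, 5*time.Second).Should(ContainSubstring(secretName))
}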

@dharmit
Member Author

dharmit commented Feb 11, 2021

One thing I'm thinking of trying out is adding some timeout after the odo push that follows an odo link.

FWIW, adding some sleep did help. I've opened #4428, which should hopefully fix this! 🤞

@prietyc123
Contributor

@dharmit I can still observe the same failure 🙁 on periodic jobs.

✗  Waiting for component to start [5m] [WARNING x7: Failed]
[odo]  ✗  Failed to start component with name zgiiqh. Error: Failed to create the component: error while waiting for deployment rollout: timeout while waiting for zgiiqh deployment roll out\nFor more information to help determine the cause of the error, re-run with '-v'.
[odo] See below for a list of failed events that occured more than 5 times during deployment:
[odo] 
[odo]  NAME                                      COUNT  REASON  MESSAGE                        
[odo] 
[odo]  zgiiqh-659c8b9987-lbh6g.1670fd23c61ae2bd  7      Failed  Error: secret                  
[odo]                                                           "zgiiqh-etcdcluster-example"   
[odo]                                                           not found                      
[odo] 
[odo] 

Log details: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/periodic-ci-openshift-odo-master-v4.6-integration-e2e-periodic/1376685658797510656#1:build-log.txt%3A1593

Note: It's not OCP version specific, as I can also observe it on 4.7.

@prietyc123 reopened this on Mar 30, 2021
@dharmit
Member Author

dharmit commented Mar 30, 2021

@prietyc123 thanks for the info! I expect this to get fixed with #4554, wherein we stop looking for an assumed secret name. The assumption was valid until SBO changed the nomenclature for the Secret.
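
A purely illustrative sketch of that direction: discover the secret on the cluster instead of deriving its name from a naming convention. The label selector shown is a placeholder; the actual selection logic in #4554 may well differ.

package link

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// findLinkSecret looks up the linking secret by a label selector instead of
// assuming a particular name, so a change in SBO's nomenclature doesn't break it.
func findLinkSecret(client kubernetes.Interface, ns, component string) (*corev1.Secret, error) {
	selector := fmt.Sprintf("app.kubernetes.io/instance=%s", component) // placeholder selector
	list, err := client.CoreV1().Secrets(ns).List(context.TODO(), metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return nil, err
	}
	if len(list.Items) == 0 {
		return nil, fmt.Errorf("no linking secret found for component %q", component)
	}
	return &list.Items[0], nil
}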

@prietyc123
Contributor

Closed via #4554
