This repository was archived by the owner on Jul 25, 2019. It is now read-only.

Fix SDN get and watch resource workflow #241

Merged · 12 commits · Apr 11, 2016

Conversation

@pravisankar (Author)
Tested on multi-node dev cluster

@danwinship (Contributor)

So I guess the overall idea is that NewListWatchFromClient() does what watchAndGetResource() was trying to do, except without the bug?

The patches seem good, though I want to look through some parts of it again (and other people looking at them would still be great).
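For readers following along: the list-then-watch pattern that NewListWatchFromClient() encapsulates is to list once to get a consistent snapshot plus its resourceVersion, then open a watch starting from that version so no intervening events are missed or replayed twice. A minimal, self-contained sketch of the idea (the Event and fakeServer types here are illustrative stand-ins, not the actual Kubernetes client API):

```go
package main

import "fmt"

// Event models a single change notification from the API server.
type Event struct {
	Version int
	Name    string
}

// fakeServer stands in for the API server: a current snapshot plus a
// stream of change events, each tagged with a resource version.
type fakeServer struct {
	snapshot []Event
	stream   []Event
}

// List returns the current objects and the resourceVersion of the snapshot.
func (s *fakeServer) List() ([]Event, int) {
	max := 0
	for _, e := range s.snapshot {
		if e.Version > max {
			max = e.Version
		}
	}
	return s.snapshot, max
}

// Watch replays only events newer than fromVersion, so nothing that
// happened between the List and the Watch is lost or duplicated.
func (s *fakeServer) Watch(fromVersion int) []Event {
	var out []Event
	for _, e := range s.stream {
		if e.Version > fromVersion {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	srv := &fakeServer{
		snapshot: []Event{{1, "svc-a"}, {2, "svc-b"}},
		stream:   []Event{{2, "svc-b"}, {3, "svc-c"}},
	}
	items, rv := srv.List()
	fmt.Println("listed", len(items), "items at version", rv)
	for _, e := range srv.Watch(rv) {
		fmt.Println("event:", e.Name, "at version", e.Version)
	}
}
```

The key invariant is that the watch resumes exactly at the snapshot's version, which is what the buggy hand-rolled watchAndGetResource() had to get right manually.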

@dcbw (Contributor) commented Jan 12, 2016

Looks like a good cleanup to me, with the exception of the 5 second wait for service watching. That I wish we could make more reliable.

@@ -399,18 +303,18 @@ func (registry *Registry) DeleteNetNamespace(name string) error {
 }

 func (registry *Registry) GetServicesForNamespace(namespace string) ([]osdnapi.Service, error) {
-	services, _, err := registry.getServices(namespace)
+	services, err := registry.getServices(namespace)
 	return services, err
Contributor

with the removal of the startVersion argument this can just be return registry.getServices(namespace), but then also, you could just move the code from getServices() into this function and have GetServices() call GetServicesForNamespace(kapi.NamespaceAll)
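The suggested refactor could look roughly like this (Service, Registry, and the in-memory storage are simplified stand-ins for the real osdnapi/registry types, which talk to the API server):

```go
package main

import "fmt"

// Service is a simplified stand-in for osdnapi.Service.
type Service struct {
	Namespace, Name string
}

// NamespaceAll mirrors kapi.NamespaceAll: the empty string selects
// every namespace.
const NamespaceAll = ""

// Registry is a stand-in holding services in memory for illustration.
type Registry struct {
	services []Service
}

// getServices is the lower-level helper; with the startVersion return
// value gone, it returns just the slice and an error.
func (r *Registry) getServices(namespace string) ([]Service, error) {
	if namespace == NamespaceAll {
		return r.services, nil
	}
	var out []Service
	for _, s := range r.services {
		if s.Namespace == namespace {
			out = append(out, s)
		}
	}
	return out, nil
}

// GetServicesForNamespace can now forward the call directly...
func (r *Registry) GetServicesForNamespace(namespace string) ([]Service, error) {
	return r.getServices(namespace)
}

// ...and GetServices becomes the all-namespaces special case, as the
// reviewer suggests.
func (r *Registry) GetServices() ([]Service, error) {
	return r.GetServicesForNamespace(NamespaceAll)
}

func main() {
	r := &Registry{services: []Service{{"default", "kubernetes"}, {"demo", "web"}}}
	all, _ := r.GetServices()
	one, _ := r.GetServicesForNamespace("demo")
	fmt.Println(len(all), len(one))
}
```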

@danwinship (Contributor)

Assuming it passes test/extended/networking.sh and fixes bug 1275904, LGTM.

@pravisankar force-pushed the fix-watch-get-resources branch from e9eafe7 to 47df54d on January 14, 2016
@pravisankar force-pushed the fix-watch-get-resources branch from 47df54d to 8debfa8 on April 6, 2016
@pravisankar (Author)

Rebased and patched on top of the master branch. Added some more changes to ditch the ugly 5 second wait for service watching. Ready for review/merge.
@openshift/networking PTAL

@pravisankar (Author)

Tested and passed extended networking tests (test/extended/networking.sh)

@dcbw (Contributor) commented Apr 6, 2016

Good cleanups...

Should we log something in the Watch* functions when the eventQueue fails, as long as the failure isn't "I'm done"? Otherwise transient errors could quietly kill the event queues and we'll be none the wiser from the logs. Also, I assume that's how we know when to terminate: when eventQueue.Pop() returns some "done" error?

Next, outside the registry, how do subnets.go::watchSubnets() or watchNodes() know when to break out of the for() loop now that 'oc.stop' is gone?

One behavioral change (though I'm not sure if it matters?) is that now on startup, the "get all the things" calls will no longer block the main goroutine but instead are now done from a goroutine. That might cause race issues, since I think most of the stuff is currently done synchronously on startup. Not sure though?

@pravisankar force-pushed the fix-watch-get-resources branch from 8debfa8 to bd7cf1d on April 6, 2016
@danwinship (Contributor)

> One behavioral change (though I'm not sure if it matters?) is that now on startup, the "get all the things" calls will no longer block the main goroutine but instead are now done from a goroutine. That might cause race issues, since I think most of the stuff is currently done synchronously on startup. Not sure though?

I think this will cause problems with endpoint filtering: the first call to OnEndpointsUpdate() might happen before WatchPods() has completely filled in registry.namespaceOfPodIP, causing warnings and incorrect filtering. It will eventually recover (since it gets asked to filter the entire list of endpoints every time any service changes), but it might cause problems at startup.

@danwinship (Contributor)

Other than that though, LGTM

@pravisankar (Author)

On Wed, Apr 6, 2016 at 1:56 PM, Dan Williams wrote:

> Good cleanups...
>
> Should we log something in the Watch* functions when the eventQueue fails, as long as the failure isn't "I'm done"? Otherwise transient errors could quietly kill the event queues and we'll be none the wiser from the logs. Also, I assume that's how we know when to terminate, when eventQueue.Pop() returns some "done" error?

eventQueue.Pop() is a blocking call and will wait if there are no events in the queue, but yes, it could fail due to various transient errors. Logging the error will definitely help us when investigating an issue. We could also log the error and restart the event queue to overcome any transient failures.

> Next, outside the registry, how do subnets.go::watchSubnets() or watchNodes() know when to break out of the for() loop now that 'oc.stop' is gone?

oc.Stop() was never called before, and currently there is no good way to call this method. These goroutines will be terminated/killed when the openshift service is stopped (the current behavior).

> One behavioral change (though I'm not sure if it matters?) is that now on startup, the "get all the things" calls will no longer block the main goroutine but instead are now done from a goroutine. That might cause race issues, since I think most of the stuff is currently done synchronously on startup. Not sure though?

Yes, some of the synchronous stuff is done asynchronously now. We need to synchronize where async goroutines depend on each other. populateVNIDMap() was added to address the dependency between watchNetNamespaces and watchServices. As Winship pointed out in the next comment, I missed populating namespaceOfPodIP, which is needed by the proxy. I will fix this issue.

A more challenging issue is synchronizing the SDN master and nodes. I have seen two bugs (https://bugzilla.redhat.com/show_bug.cgi?id=1323279 and https://bugzilla.redhat.com/show_bug.cgi?id=1322130) where the node has yet to receive the NetNamespace event, but the service or pod setup checks for the VNID and fails. This can happen when the master is overloaded or too many projects are created in a short period.



@pravisankar (Author)

Updated; fixed endpoint filtering during startup.
I'd prefer to tackle the other suggestions in a separate PR, mainly making the watch routines resilient to transient errors and adding more logging. Winship did some logging work in #286.
@openshift/networking PTAL

@danwinship (Contributor)

LGTM now I think

@pravisankar force-pushed the fix-watch-get-resources branch from 1868d12 to a7c0b12 on April 8, 2016
@pravisankar (Author)

Rebased and resolved merge conflicts.
@openshift/networking please review/merge
