This repository was archived by the owner on Jul 25, 2019. It is now read-only.

Fix SDN get and watch resource workflow #241

Merged · 12 commits · Apr 11, 2016

Conversation

@pravisankar (Author)
Tested on multi-node dev cluster

@danwinship (Contributor)

So I guess the overall idea is that NewListWatchFromClient() does what watchAndGetResource() was trying to do, except without the bug?

The patches seem good, though I want to look through some parts of it again (and other people looking at them would still be great).
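For readers following along: the list-then-watch pattern that NewListWatchFromClient() encapsulates is to list once to get a consistent snapshot plus its resourceVersion, then open a watch starting from that version so no intervening events are missed or replayed twice. A minimal, self-contained sketch of the idea (the Event and fakeServer types here are illustrative stand-ins, not the actual Kubernetes client API):

```go
package main

import "fmt"

// Event models a single change notification from the API server.
type Event struct {
	Version int
	Name    string
}

// fakeServer stands in for the API server: a current snapshot plus a
// stream of change events, each tagged with a resource version.
type fakeServer struct {
	snapshot []Event
	stream   []Event
}

// List returns the current objects and the resourceVersion of the snapshot.
func (s *fakeServer) List() ([]Event, int) {
	max := 0
	for _, e := range s.snapshot {
		if e.Version > max {
			max = e.Version
		}
	}
	return s.snapshot, max
}

// Watch replays only events newer than fromVersion, so nothing that
// happened between the List and the Watch is lost or duplicated.
func (s *fakeServer) Watch(fromVersion int) []Event {
	var out []Event
	for _, e := range s.stream {
		if e.Version > fromVersion {
			out = append(out, e)
		}
	}
	return out
}

func main() {
	srv := &fakeServer{
		snapshot: []Event{{1, "svc-a"}, {2, "svc-b"}},
		stream:   []Event{{2, "svc-b"}, {3, "svc-c"}},
	}
	items, rv := srv.List()
	fmt.Println("listed", len(items), "items at version", rv)
	for _, e := range srv.Watch(rv) {
		fmt.Println("event:", e.Name, "at version", e.Version)
	}
}
```

The key invariant is that the watch resumes exactly at the snapshot's version, which is what the buggy hand-rolled watchAndGetResource() had to get right manually.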

@dcbw (Contributor) commented Jan 12, 2016

Looks like a good cleanup to me, with the exception of the 5 second wait for service watching. That I wish we could make more reliable.

@@ -399,18 +303,18 @@ func (registry *Registry) DeleteNetNamespace(name string) error {
 }

 func (registry *Registry) GetServicesForNamespace(namespace string) ([]osdnapi.Service, error) {
-	services, _, err := registry.getServices(namespace)
+	services, err := registry.getServices(namespace)
 	return services, err
Contributor

with the removal of the startVersion argument this can just be return registry.getServices(namespace), but then also, you could just move the code from getServices() into this function and have GetServices() call GetServicesForNamespace(kapi.NamespaceAll)
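The suggested refactor could look roughly like this (Service, Registry, and the in-memory storage are simplified stand-ins for the real osdnapi/registry types, which talk to the API server):

```go
package main

import "fmt"

// Service is a simplified stand-in for osdnapi.Service.
type Service struct {
	Namespace, Name string
}

// NamespaceAll mirrors kapi.NamespaceAll: the empty string selects
// every namespace.
const NamespaceAll = ""

// Registry is a stand-in holding services in memory for illustration.
type Registry struct {
	services []Service
}

// getServices is the lower-level helper; with the startVersion return
// value gone, it returns just the slice and an error.
func (r *Registry) getServices(namespace string) ([]Service, error) {
	if namespace == NamespaceAll {
		return r.services, nil
	}
	var out []Service
	for _, s := range r.services {
		if s.Namespace == namespace {
			out = append(out, s)
		}
	}
	return out, nil
}

// GetServicesForNamespace can now forward the call directly...
func (r *Registry) GetServicesForNamespace(namespace string) ([]Service, error) {
	return r.getServices(namespace)
}

// ...and GetServices becomes the all-namespaces special case, as the
// reviewer suggests.
func (r *Registry) GetServices() ([]Service, error) {
	return r.GetServicesForNamespace(NamespaceAll)
}

func main() {
	r := &Registry{services: []Service{{"default", "kubernetes"}, {"demo", "web"}}}
	all, _ := r.GetServices()
	one, _ := r.GetServicesForNamespace("demo")
	fmt.Println(len(all), len(one))
}
```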

@danwinship (Contributor)

Assuming it passes test/extended/networking.sh and fixes bug 1275904, LGTM.

@pravisankar force-pushed the fix-watch-get-resources branch from e9eafe7 to 47df54d on January 14, 2016
@pravisankar force-pushed the fix-watch-get-resources branch from 47df54d to 8debfa8 on April 6, 2016
@pravisankar (Author)

Rebased and patched on top of the master branch. Added some more changes to ditch the ugly 5 second wait for service watching. Ready for review/merge.
@openshift/networking PTAL

@pravisankar (Author)

Tested and passed extended networking tests (test/extended/networking.sh)

@dcbw (Contributor) commented Apr 6, 2016

Good cleanups...

Should we log something in the Watch* functions when the eventQueue fails, as long as the failure isn't "I'm done"? Otherwise transient errors could quietly kill the event queues and we'll be none the wiser from the logs. Also, I assume that's how we know when to terminate: when eventQueue.Pop() returns some "done" error?

Next, outside the registry, how do subnets.go::watchSubnets() or watchNodes() know when to break out of the for() loop now that 'oc.stop' is gone?

One behavioral change (though I'm not sure if it matters?) is that now on startup, the "get all the things" calls will no longer block the main goroutine but instead are now done from a goroutine. That might cause race issues, since I think most of the stuff is currently done synchronously on startup. Not sure though?

@pravisankar force-pushed the fix-watch-get-resources branch from 8debfa8 to bd7cf1d on April 6, 2016
@danwinship (Contributor)

> One behavioral change (though I'm not sure if it matters?) is that now on startup, the "get all the things" calls will no longer block the main goroutine but instead are now done from a goroutine. That might cause race issues, since I think most of the stuff is currently done synchronously on startup. Not sure though?

I think this will cause problems with endpoint filtering: the first call to OnEndpointsUpdate() might happen before WatchPods() has completely filled in registry.namespaceOfPodIP, causing warnings and incorrect filtering. It will eventually recover (since it gets asked to filter the entire list of endpoints every time any service changes), but it might cause problems at startup.

@danwinship (Contributor)

Other than that though, LGTM

@pravisankar (Author)

On Wed, Apr 6, 2016 at 1:56 PM, Dan Williams wrote:

> Good cleanups...
>
> Should we log something in the Watch* functions when the eventQueue fails, as long as the failure isn't "I'm done"? Otherwise transient errors could quietly kill the event queues and we'll be none the wiser from the logs. Also, I assume that's how we know when to terminate, when eventQueue.Pop() returns some "done" error?

eventQueue.Pop() is a blocking call and will wait if there are no events in the queue, but yes, it could fail due to various transient errors. Logging the error will definitely help us when investigating an issue. We could also log the error and restart the event queue to overcome any transient failures.

> Next, outside the registry, how do subnets.go::watchSubnets() or watchNodes() know when to break out of the for() loop now that 'oc.stop' is gone?

oc.Stop() was never called before, and currently there is no good way to call this method. These goroutines will be terminated/killed when the openshift service is stopped (the current behavior).

> One behavioral change (though I'm not sure if it matters?) is that now on startup, the "get all the things" calls will no longer block the main goroutine but instead are now done from a goroutine. That might cause race issues, since I think most of the stuff is currently done synchronously on startup. Not sure though?

Yes, some of the synchronous stuff is done asynchronously now. We need to synchronize where async goroutines depend on each other. populateVNIDMap() was added to address the dependency between watchNetNamespaces and watchServices. As Winship pointed out in the next comment, I missed populating namespaceOfPodIP, which is needed by the proxy. I will fix this issue.

A more challenging issue is synchronizing the SDN master and nodes. I have seen two bugs (https://bugzilla.redhat.com/show_bug.cgi?id=1323279 and https://bugzilla.redhat.com/show_bug.cgi?id=1322130) where the node has yet to receive the NetNamespace event, but the service or pod setup checks for the VNID and fails. This can happen when the master is overloaded or too many projects are created in a short period.



@pravisankar (Author)

Updated; fixed endpoint filtering during startup.
I'd prefer to tackle the other suggestions in a separate PR, mainly making the watch routines resilient to transient errors and adding more logging. Winship did some logging work in #286.
@openshift/networking PTAL

@danwinship (Contributor)

LGTM now I think

@pravisankar force-pushed the fix-watch-get-resources branch from 1868d12 to a7c0b12 on April 8, 2016
@pravisankar (Author)

Rebased and resolved merge conflicts.
@openshift/networking please review/merge
