flake: process_linux.go:291: setting cgroup config for ready process caused #13241

Closed
smarterclayton opened this issue Mar 6, 2017 · 15 comments
Labels
component/networking kind/test-flake Categorizes issue or PR as related to test flakes. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. priority/P2

Comments

@smarterclayton
Contributor

https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin_networking_future/840/consoleFull#201915086156cbb9a5e4b02b88ae8c2f77

18:30:44 Mar  5 18:30:44.675: INFO: At 2017-03-05 18:26:22 -0500 EST - event for weighted-router: {kubelet nettest-node-2} Pulling: pulling image "openshift/origin-haproxy-router"
18:30:44 Mar  5 18:30:44.675: INFO: At 2017-03-05 18:26:22.638375077 -0500 EST - event for execpod: {kubelet nettest-node-1} FailedSync: Error syncing pod, skipping: failed to "StartContainer" for "POD" with RunContainerError: "runContainer: Error response from daemon: {\"message\":\"invalid header field value \\\"oci runtime error: container_linux.go:247: starting container process caused \\\\\\\"process_linux.go:291: setting cgroup config for ready process caused \\\\\\\\\\\\\\\"failed to write a *:* rwm to devices.deny: open /sys/fs/cgroup/devices/system.slice/docker-cca42a09659036290590778ad2440762a9f488265f345d58a56d79cc381ada15.scope/system.slice/docker-cca42a09659036290590778ad2440762a9f488265f345d58a56d79cc381ada15.scope/system.slice/docker-ce71481e3d94f3471a7a48b72a661a4eec507ae799d29f49c2fcda2912fa2224.scope/devices.deny: no such file or directory\\\\\\\\\\\\\\\"\\\\\\\"\\\\n\\\"\"}"
18:30:44 
@smarterclayton smarterclayton added the kind/test-flake Categorizes issue or PR as related to test flakes. label Mar 6, 2017
@smarterclayton
Contributor Author

@derekwaynecarr @mrunalp @sjenning @openshift/networking

Have we seen something like that before? This could just be the network tests and DIND, but I'd like to know for sure.

@knobunc
Contributor

knobunc commented Mar 7, 2017

@dcbw: Any thoughts? Thanks.

@danwinship
Contributor

There's some sort of dind/cgroup problem. These errors often show up with weird nested cgroup paths like /sys/fs/cgroup/devices/system.slice/docker-871fc6f89264c02a8812d6ebfe72dd6b1215652b59d7f521db3fd01d116e0d7b.scope/system.slice/docker-871fc6f89264c02a8812d6ebfe72dd6b1215652b59d7f521db3fd01d116e0d7b.scope/system.slice/docker-3c93353d77d95c6b6114de44c0ca765d0252df9a56e0321f87c80aa027133d07.scope/devices.deny.

Almost every time you start a pod under dind you get one or two of these:

Feb 06 17:35:20 nettest-node-1 openshift[284]: E0206 17:35:20.419986     284 remote_runtime.go:92] RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to start sandbox container for pod "isolation-webserver": Error response from daemon: invalid header field value "oci runtime error: container_linux.go:247: starting container process caused \"process_linux.go:327: setting cgroup config for procHooks process caused \\\"failed to write a *:* rwm to devices.deny: open /sys/fs/cgroup/devices/system.slice/docker-871fc6f89264c02a8812d6ebfe72dd6b1215652b59d7f521db3fd01d116e0d7b.scope/system.slice/docker-871fc6f89264c02a8812d6ebfe72dd6b1215652b59d7f521db3fd01d116e0d7b.scope/system.slice/docker-3c93353d77d95c6b6114de44c0ca765d0252df9a56e0321f87c80aa027133d07.scope/devices.deny: no such file or directory\\\"\"\n"

But it usually succeeds eventually; sometimes it doesn't. The bug never seems to happen in non-dind clusters.
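
For illustration, here's a small standalone sketch (a hypothetical diagnostic helper, not code from this repo or from runc) that flags the doubled system.slice/docker-<id>.scope segment in paths like the ones quoted above:

package main

import (
	"fmt"
	"strings"
)

// hasDoubledScope reports whether a cgroup path contains the same
// "<slice>/<something>.scope" pair twice in a row, which is the pattern
// seen in the failing dind paths above.
func hasDoubledScope(path string) bool {
	parts := strings.Split(strings.Trim(path, "/"), "/")
	for i := 0; i+3 < len(parts); i++ {
		if strings.HasSuffix(parts[i+1], ".scope") &&
			parts[i] == parts[i+2] && parts[i+1] == parts[i+3] {
			return true
		}
	}
	return false
}

func main() {
	// Placeholder container ids; the real paths use full 64-character docker ids.
	bad := "/system.slice/docker-aaa.scope/system.slice/docker-aaa.scope/system.slice/docker-bbb.scope"
	good := "/system.slice/docker-ccc.scope"
	fmt.Println(hasDoubledScope(bad), hasDoubledScope(good)) // true false
}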

I'm guessing either

  1. the bug only happens when you run kubelet in a non-toplevel cgroup (i.e., dind), OR
  2. the bug only happens when you specify --cgroups-per-qos=false (which we do in dind because otherwise it fails at startup).

I tried to figure out why dind requires --cgroups-per-qos=false, but I don't really understand any of the cgroup code.

@marun?

Probably we should just kill off the extended_networking_minimal test, and instead make the conformance_gce test run the (minimal) extended networking tests, since it sets up a multi-node environment running the sdn.

@danwinship
Contributor

Probably we should just kill off the extended_networking_minimal test, and instead make the conformance_gce test run the (minimal) extended networking tests, since it sets up a multi-node environment running the sdn.

#18540 fixes up the tests so we'd be able to do that, if you'd like to review it. (Right now the default focus/skips assume that conformance-gce runs ovs-subnet, but that PR figures out the skips at runtime based on whichever plugin is selected, so we can then change conformance-gce to run multitenant and kill networking-minimal.)

@dcbw
Contributor

dcbw commented Feb 28, 2018

Investigated this for a while... I believe the issue is due to the /proc/1/cgroup paths that a container using systemd as PID 1 has. For a DIND container we get:

4:cpu,cpuacct:/system.slice/docker-10cdf0bdb77cc401a9ee9faba70ce36fc712cfd7399ea55439752a49bfe9d427.scope/system.slice/docker-10cdf0bdb77cc401a9ee9faba70ce36fc712cfd7399ea55439752a49bfe9d427.scope/init.scope

Note how the same path is duplicated. Running 'docker run -it nginx bash' results in:

4:cpu,cpuacct:/system.slice/docker-df9ecf01a95338243adfb587373ad7781d5610e248e827287257463fda555c4c.scope

and in fact, patching runc (i.e., the copy vendored into docker) to de-duplicate the cgroup path, and using that patched docker inside the "node" container, appears to prevent the problem from occurring.

diff -up docker-4402c09586c72e0c32b90d72bd24304f609e2b7a/runc-1c91122c1d992cf1dc971ff14f78eddbf6fb06f5/libcontainer/cgroups/systemd/apply_systemd.go.foo docker-4402c09586c72e0c32b90d72bd24304f609e2b7a/runc-1c91122c1d992cf1dc971ff14f78eddbf6fb06f5/libcontainer/cgroups/systemd/apply_systemd.go
--- docker-4402c09586c72e0c32b90d72bd24304f609e2b7a/runc-1c91122c1d992cf1dc971ff14f78eddbf6fb06f5/libcontainer/cgroups/systemd/apply_systemd.go.foo	2018-02-28 09:44:39.060985054 -0600
+++ docker-4402c09586c72e0c32b90d72bd24304f609e2b7a/runc-1c91122c1d992cf1dc971ff14f78eddbf6fb06f5/libcontainer/cgroups/systemd/apply_systemd.go	2018-02-28 09:48:44.224528821 -0600
@@ -327,6 +327,38 @@ func ExpandSlice(slice string) (string,
 	return path, nil
 }
 
+func reduce(a string) string {
+	a = strings.TrimSuffix(a, "/")
+	alen := len(a)
+	if alen % 2 != 0 {
+		return a
+	}
+	if a[0:alen/2] == a[alen/2:] {
+		return a[alen/2:]
+	}
+	return a
+}
+
 func getSubsystemPath(c *configs.Cgroup, subsystem string) (string, error) {
 	mountpoint, err := cgroups.FindCgroupMountpoint(subsystem)
 	if err != nil {
@@ -340,6 +371,8 @@ func getSubsystemPath(c *configs.Cgroup,
 	// if pid 1 is systemd 226 or later, it will be in init.scope, not the root
 	initPath = strings.TrimSuffix(filepath.Clean(initPath), "init.scope")
 
+	initPath = reduce(initPath)
+
 	slice := "system.slice"
 	if c.Parent != "" {
 		slice = c.Parent

So the next step is tracking down why systemd-based PID 1 containers end up with these odd cgroup paths.
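
To make the patched behavior concrete, here's a minimal runnable demo of the same halving-based de-duplication (just the reduce logic from the patch above, extracted into a standalone program with a placeholder container id):

package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// reduce is the de-duplication from the patch above: if the trimmed path is
// the same string repeated twice, keep a single copy.
func reduce(a string) string {
	a = strings.TrimSuffix(a, "/")
	alen := len(a)
	if alen%2 != 0 {
		return a
	}
	if a[:alen/2] == a[alen/2:] {
		return a[alen/2:]
	}
	return a
}

func main() {
	// /proc/1/cgroup path of the dind "node" container, with a placeholder id.
	initPath := "/system.slice/docker-0123456789abcdef.scope/system.slice/docker-0123456789abcdef.scope/init.scope"
	initPath = strings.TrimSuffix(filepath.Clean(initPath), "init.scope")
	fmt.Println(reduce(initPath)) // /system.slice/docker-0123456789abcdef.scope
}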

@smarterclayton
Contributor Author

smarterclayton commented Feb 28, 2018 via email

@smarterclayton
Contributor Author

smarterclayton commented Feb 28, 2018 via email

@openshift-bot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 30, 2018
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 29, 2018
@dcbw dcbw removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Jul 27, 2018
@dcbw
Contributor

dcbw commented Jul 27, 2018

Update: things are fine in the container before pid1/systemd migrates from the initial cgroup to the "init.scope" cgroup. That's when the duplicate paths appear.
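
A rough way to watch for that transition from inside the node container is sketched below (hedged: it assumes the cgroup v1 "N:controller:/path" format shown earlier in this thread, and the polling interval is arbitrary):

package main

import (
	"fmt"
	"os"
	"strings"
	"time"
)

// Poll /proc/1/cgroup and report once any controller path lists the same
// docker-<id>.scope component more than once, i.e. the duplication that
// shows up after systemd moves PID 1 into init.scope.
func main() {
	for {
		data, err := os.ReadFile("/proc/1/cgroup")
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			return
		}
		for _, line := range strings.Split(strings.TrimSpace(string(data)), "\n") {
			fields := strings.SplitN(line, ":", 3)
			if len(fields) != 3 {
				continue
			}
			seen := map[string]int{}
			for _, comp := range strings.Split(strings.Trim(fields[2], "/"), "/") {
				if strings.HasPrefix(comp, "docker-") && strings.HasSuffix(comp, ".scope") {
					seen[comp]++
					if seen[comp] > 1 {
						fmt.Printf("duplicated %s in %s path %s\n", comp, fields[1], fields[2])
						return
					}
				}
			}
		}
		time.Sleep(time.Second)
	}
}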

@openshift-merge-robot
Contributor

Issues go stale after 90d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle stale

@openshift-ci-robot openshift-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 26, 2018
@openshift-bot
Contributor

Stale issues rot after 30d of inactivity.

Mark the issue as fresh by commenting /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.
Exclude this issue from closing by commenting /lifecycle frozen.

If this issue is safe to close now please do so with /close.

/lifecycle rotten
/remove-lifecycle stale

@openshift-ci-robot openshift-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Nov 25, 2018
@openshift-bot
Contributor

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

@openshift-ci-robot

@openshift-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.

Reopen the issue by commenting /reopen.
Mark the issue as fresh by commenting /remove-lifecycle rotten.
Exclude this issue from closing again by commenting /lifecycle frozen.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
