Possible criu v4.1 regression in Fedora #2650

Closed
kolyshkin opened this issue Apr 18, 2025 · 15 comments · Fixed by #2653

Comments

@kolyshkin
Contributor

When criu v4.1 is used in runc CI tests in Fedora 41, it fails like this:

not ok 39 checkpoint and restore in external network namespace
# (in test file tests/integration/checkpoint.bats, line 304)
#   `[ "$status" -eq 0 ]' failed
# runc spec (status=0):
#
# external_net_ns is supported
# runc run -d --console-socket /tmp/bats-run-Z4hgub/runc.XrBtvk/tty/sock test_busybox (status=0):
#
# runc state test_busybox (status=0):
# {
#   "ociVersion": "1.2.1",
#   "id": "test_busybox",
#   "pid": 26103,
#   "status": "running",
#   "bundle": "/tmp/bats-run-Z4hgub/runc.XrBtvk/bundle",
#   "rootfs": "/tmp/bats-run-Z4hgub/runc.XrBtvk/bundle/rootfs",
#   "created": "2025-04-18T00:43:29.896678866Z",
#   "owner": ""
# }
# runc checkpoint --work-path ./work-dir test_busybox (status=0):
#
# runc state test_busybox (status=1):
# time="2025-04-18T00:43:30Z" level=error msg="container does not exist"
# runc restore -d --work-path ./work-dir --console-socket /tmp/bats-run-Z4hgub/runc.XrBtvk/tty/sock test_busybox (status=1):
# time="2025-04-18T00:43:30Z" level=warning msg="--- Quoting \"work-dir/restore.log\""
# time="2025-04-18T00:43:30Z" level=warning msg="382:(00.036384)      1: mnt-v2: Move mount 97 from /tmp/.criu.mntns.IhNw5B/mnt-0000000097 to /tmp/.criu.mntns.IhNw5B/13-0000000000/dev/console"
# time="2025-04-18T00:43:30Z" level=warning msg="383:(00.036402)      1: mnt-v2: Move mount 376 from /tmp/.criu.mntns.IhNw5B/mnt-0000000376 to /tmp/.criu.mntns.IhNw5B/13-0000000000/sys"
# time="2025-04-18T00:43:30Z" level=warning msg="384:(00.036420)      1: mnt-v2: Move mount 380 from /tmp/.criu.mntns.IhNw5B/mnt-0000000380 to /tmp/.criu.mntns.IhNw5B/13-0000000000/sys/fs/cgroup"
# time="2025-04-18T00:43:30Z" level=warning msg="385:(00.036439)      1: mnt-v2: Move mount 113 from /tmp/.criu.mntns.IhNw5B/mnt-0000000113 to /tmp/.criu.mntns.IhNw5B/13-0000000000/sys/firmware"
# time="2025-04-18T00:43:30Z" level=warning msg="386:(00.036451)      1: mnt: Move the root to /tmp/.criu.mntns.IhNw5B/13-0000000000"
# time="2025-04-18T00:43:30Z" level=warning msg="387:Error: Could not process rule: File exists"
# time="2025-04-18T00:43:30Z" level=warning msg="388:create table inet CRIU-9de6ed64-f981-441a-9e"
# time="2025-04-18T00:43:30Z" level=warning msg="389:                  ^^^^^^^^^^^^^^^^^^^^^^^^^^"
# time="2025-04-18T00:43:30Z" level=warning msg="390:(00.045260) Error (criu/net.c:3256): net: Locking network failed using nftables"
# time="2025-04-18T00:43:30Z" level=warning msg="391:(00.045808) mnt: Switching to new ns to clean ghosts"
# time="2025-04-18T00:43:30Z" level=warning msg="392:(00.046057) Error (criu/cr-restore.c:2320): Restoring FAILED."
# time="2025-04-18T00:43:30Z" level=warning msg="393:(00.047228) Error (criu/cgroup.c:1998): cg: cgroupd: recv req error: No such file or directory"
# time="2025-04-18T00:43:30Z" level=warning msg=---
# time="2025-04-18T00:43:30Z" level=error msg="criu failed: type RESTORE errno 0"
# --- teardown ---

The test case source is here: https://github.com/opencontainers/runc/blob/e55fe63aed22520d565d1a3490e1655e839068eb/tests/integration/checkpoint.bats#L271

This is the first time I am seeing this, so I presume it's a regression in criu v4.1. Might be a hash collision but I very much doubt it.

@kolyshkin
Contributor Author

The failure appears to be consistent, and I can easily reproduce it locally. Here are the repro steps on a Fedora 41 system:

sudo dnf -y --enablerepo=updates-testing update criu
git clone https://github.com/opencontainers/runc
cd runc
make runc test-binaries
sudo bats tests/integration/checkpoint.bats

Note it does not fail this way on Ubuntu 24.04.

@avagin
Member

avagin commented Apr 19, 2025

It can be a pagemap_scan issue. Could you compile CRIU without pagemap_scan and try it out? You can change the code here:
https://github.com/checkpoint-restore/criu/blob/criu-dev/criu/kerndat.c#L88

Another way is to add --fault 135 to a config file.

@kolyshkin
Contributor Author

git-bisect points to commit 867c773.

Indeed, if I compile criu v4.1 without adding NETWORK_LOCK_DEFAULT=NETWORK_LOCK_NFTABLES, the tests pass.

@kolyshkin
Contributor Author

It can be a pagemap_scan issue. Could you compile CRIU without pagemap_scan and try it out? You can change the code here: https://github.com/checkpoint-restore/criu/blob/criu-dev/criu/kerndat.c#L88

Another way is to add --fault 135 to a config file.

@avagin did you mean to add this comment to #2551?

@kolyshkin
Contributor Author

git-bisect points to commit 867c773.

Indeed, if I compile criu v4.1 without adding NETWORK_LOCK_DEFAULT=NETWORK_LOCK_NFTABLES, the tests pass.

@adrianreber PTAL

@ricardobranco777

Same in openSUSE Tumbleweed.

@adrianreber
Member

This is indeed new functionality. I'm not sure we should have switched from iptables-based locking to nftables-based locking in the middle of Fedora 41 (@rst0git), but not switching would only have delayed this report until Fedora 42 is in use.

I am able to see it locally and I think I understand what is going on. CRIU is locking the network with the same ID it was locked with during checkpointing. I can see that the network locking is still active by inserting the following line before the restore:
ip netns exec "$ns_name" nft list tables
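
To narrow that listing down to CRIU's lock tables, the output can be filtered by name. A minimal sketch, assuming `nft list tables` prints one `table <family> <name>` entry per line and that CRIU's tables carry the `CRIU-` prefix seen in the restore log above; `criu_lock_tables` is a hypothetical helper, not part of CRIU or runc, and the `firewalld` entry in the sample input is invented for illustration:

```shell
# Hypothetical helper: read `nft list tables` output on stdin and print
# "<family> <name>" for every table whose name starts with "CRIU-".
criu_lock_tables() {
    awk '$1 == "table" && $3 ~ /^CRIU-/ { print $2, $3 }'
}

# Sample input standing in for real output (assumption); the CRIU table
# name is the one quoted in the restore log above.
sample='table inet firewalld
table inet CRIU-9de6ed64-f981-441a-9e'

leftover=$(printf '%s\n' "$sample" | criu_lock_tables)
echo "$leftover"
```

On a real system the helper would be fed from the command above, run as root: ip netns exec "$ns_name" nft list tables | criu_lock_tables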

The idea of the change was that we create a UUID and, during checkpointing, lock the network with an nft table named after that UUID. During restore we check whether the checkpoint image has a UUID and use that UUID again for unlocking. It seems I missed the possibility that the network is also locked during restore; I thought it was only unlocked during restore.

I have not looked at the code yet, so it is not clear to me why CRIU locks the network during restore. As mentioned, I was only expecting unlocking. So we either need to use another name for the network locking during restore, or re-use the existing nft table instead of creating a new one.
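
The collision described above can be reproduced in a toy model. This is a sketch only, not CRIU code: the namespace is a plain shell variable, `create_table` imitates the fail-if-exists behaviour of `nft create table` seen in the log, and all function names and the lock ID are made up for illustration. The last step models just one of the two options mentioned (re-using the existing table), which is not necessarily what the eventual fix does.

```shell
# Toy model: the tables of one network namespace, as a space-separated list.
ns_tables=""

has_table() {
    case " $ns_tables " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# Like `nft create table`: fails if the table already exists.
create_table() {
    if has_table "$1"; then
        echo "Error: Could not process rule: File exists" >&2
        return 1
    fi
    ns_tables="$ns_tables $1"
}

# Alternative lock step: re-use a table that is already there.
lock_reusing_table() {
    has_table "$1" || create_table "$1"
}

lock_id="CRIU-example-uuid"   # hypothetical ID stored in the checkpoint image

# Checkpoint: lock the network with a table named after the ID.
create_table "$lock_id"

# Restore into the same namespace: creating the same table again fails.
if create_table "$lock_id" 2>/dev/null; then naive_rc=0; else naive_rc=1; fi

# A restore that tolerates the existing table succeeds.
if lock_reusing_table "$lock_id"; then reuse_rc=0; else reuse_rc=1; fi

echo "naive restore lock: $naive_rc, re-using table: $reuse_rc"
```

The model only captures the naming conflict; it says nothing about whether the rules inside a re-used table would still be the right ones for the restore.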

@avagin any recommendations from your side? Do you know if we can re-use the existing table to lock during restore or should we create a new table for network locking during restore?

I will look at the code in a couple of days, but it should be fixable.

A quick workaround could be a configuration file with network-lock skip or network-lock iptables.
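
Such a workaround file could be generated as follows. A sketch under assumptions: CRIU configuration files take one option per line without the leading `--`, and the file has to live at a path CRIU actually reads (commonly `/etc/criu/default.conf`, but check the documentation for your version). `write_criu_lock_conf` is a hypothetical helper, and the sketch writes to a scratch path so it is safe to run unprivileged.

```shell
# Hypothetical helper: write a one-line CRIU config file that selects one of
# the two workaround lock methods mentioned above.
write_criu_lock_conf() {
    method="$1"
    conf_path="$2"
    case "$method" in
        skip|iptables) ;;
        *) echo "unsupported method: $method" >&2; return 1 ;;
    esac
    mkdir -p "$(dirname "$conf_path")" || return 1
    printf 'network-lock %s\n' "$method" > "$conf_path"
}

# Scratch location (assumption); replace with the path CRIU reads on your system.
conf_file="$(mktemp -d)/default.conf"
write_criu_lock_conf iptables "$conf_file"
cat "$conf_file"
```

Choosing `iptables` keeps the network locked during checkpoint and restore, whereas `skip` disables locking entirely, so `iptables` is probably the safer of the two for CI.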

avagin added a commit to avagin/criu that referenced this issue Apr 21, 2025
CRIU attempts to lock the network during restore in an "empty" network
namespace. However, "empty" in this context means CRIU isn't restoring
the namespace. This network namespace can be the same namespace where
processes have been dumped and so the network is already locked in it.

Fixes checkpoint-restore#2650

Signed-off-by: Andrei Vagin <[email protected]>
avagin added a commit to avagin/criu that referenced this issue Apr 21, 2025
CRIU locks the network during restore in an "empty" network namespace.
However, "empty" in this context means CRIU isn't restoring the
namespace. This network namespace can be the same namespace where
processes have been dumped and so the network is already locked in it.

Fixes checkpoint-restore#2650

Signed-off-by: Andrei Vagin <[email protected]>
@adrianreber
Member

@avagin opened a PR with a possible fix. Once it is merged we can update the Fedora packages with it.

@ricardobranco777 can you bring the patch to the openSUSE packages once it is merged?

@ricardobranco777

ricardobranco777 commented Apr 21, 2025

@ricardobranco777 can you bring the patch to the openSUSE packages once it is merged?

Sure. Thanks!

@ricardobranco777

@ricardobranco777 can you bring the patch to the openSUSE packages once it is merged?

Tumbleweed is a rolling release and tries not to ship downstream patches if possible, but can pick up new versions. Will you make a new release?

@adrianreber
Member

@ricardobranco777 can you bring the patch to the openSUSE packages once it is merged?

Tumbleweed is a rolling release and tries not to ship downstream patches if possible, but can pick up new versions. Will you make a new release?

I wouldn't expect a new release for this small change. For Fedora I have no problem just applying a patch. In the past CRIU didn't release a new version for minor changes like this.

It would not really be a downstream-only patch, as it is in the upstream repository.

@adrianreber
Member

@kolyshkin Updated Fedora packages are heading towards the testing repository https://bodhi.fedoraproject.org/updates/FEDORA-2025-d374d8ce17

@kolyshkin
Contributor Author

@kolyshkin Updated Fedora packages are heading towards the testing repository https://bodhi.fedoraproject.org/updates/FEDORA-2025-d374d8ce17

Thank you! In the meantime criu-4.1 got promoted to updates and so runc CI is busted again 😢

kolyshkin added a commit to kolyshkin/runc that referenced this issue Apr 21, 2025
This version has a known bug [1] which is going to be fixed in the
upcoming criu release [2]. So, let's skip criu testing on Fedora
until a newer criu rpm is available.

[1]: checkpoint-restore/criu#2650
[2]: https://bodhi.fedoraproject.org/updates/FEDORA-2025-d374d8ce17

Signed-off-by: Kir Kolyshkin <[email protected]>
kolyshkin added a commit to kolyshkin/runc that referenced this issue Apr 22, 2025
This version has a known bug [1] which is going to be fixed in the
upcoming criu release [2]. So, let's skip criu testing on Fedora
until a newer criu rpm is available.

[1]: checkpoint-restore/criu#2650
[2]: https://bodhi.fedoraproject.org/updates/FEDORA-2025-d374d8ce17

Signed-off-by: Kir Kolyshkin <[email protected]>
kolyshkin added a commit to kolyshkin/runc that referenced this issue Apr 22, 2025
Package criu-4.1-1 has a known bug [1] which is fixed in criu-4.1-2 [2],
which is currently only available in updates-testing. Add a kludge to
install newer criu if necessary to fix CI.

This will not be needed in ~2 weeks once the new package is promoted to
updates.

[1]: checkpoint-restore/criu#2650
[2]: https://bodhi.fedoraproject.org/updates/FEDORA-2025-d374d8ce17

Signed-off-by: Kir Kolyshkin <[email protected]>
kolyshkin added a commit to kolyshkin/runc that referenced this issue Apr 22, 2025
Package criu-4.1-1 has a known bug [1] which is fixed in criu-4.1-2 [2],
which is currently only available in updates-testing. Add a kludge to
install newer criu if necessary to fix CI.

This will not be needed in ~2 weeks once the new package is promoted to
updates.

[1]: checkpoint-restore/criu#2650
[2]: https://bodhi.fedoraproject.org/updates/FEDORA-2025-d374d8ce17

Signed-off-by: Kir Kolyshkin <[email protected]>
(cherry picked from commit 281e7dc)
Signed-off-by: Kir Kolyshkin <[email protected]>
kolyshkin added a commit to kolyshkin/runc that referenced this issue Apr 23, 2025
Package criu-4.1-1 has a known bug [1] which is fixed in criu-4.1-2 [2],
which is currently only available in updates-testing. Add a kludge to
install newer criu if necessary to fix CI.

This will not be needed in ~2 weeks once the new package is promoted to
updates.

[1]: checkpoint-restore/criu#2650
[2]: https://bodhi.fedoraproject.org/updates/FEDORA-2025-d374d8ce17

Signed-off-by: Kir Kolyshkin <[email protected]>
kolyshkin added a commit to kolyshkin/runc that referenced this issue Apr 23, 2025
Package criu-4.1-1 has a known bug [1] which is fixed in criu-4.1-2 [2],
which is currently only available in updates-testing. Add a kludge to
install newer criu if necessary to fix CI.

This will not be needed in ~2 weeks once the new package is promoted to
updates.

[1]: checkpoint-restore/criu#2650
[2]: https://bodhi.fedoraproject.org/updates/FEDORA-2025-d374d8ce17

Signed-off-by: Kir Kolyshkin <[email protected]>
(cherry picked from commit 3e3e048)
Signed-off-by: Kir Kolyshkin <[email protected]>
@ricardobranco777

Tracking upstream https://bugzilla.suse.com/show_bug.cgi?id=1241515

@kolyshkin
Contributor Author

Tracking upstream https://bugzilla.suse.com/show_bug.cgi?id=1241515

Yes, this is fixed and you might want to add a patch from #2653 into your build. Similar fix in Fedora: https://src.fedoraproject.org/rpms/criu/c/323d01daa05d3d402d05114c21904b645ad755ba
