Add initialDelay #303

Merged: 1 commit, Feb 28, 2023

Conversation

@Vincent056 (Contributor) commented Feb 15, 2023

We added an initialDelay option to the FileIntegrity CRD to allow users to specify the initial delay before the first scan is run. This is useful for environments where the operator is deployed before the cluster is fully ready.
Launch of the aide daemonset will be delayed according to the value of initialDelay set in the FileIntegrity object.

To launch a FileIntegrity check for worker nodes with a delay of 100s, you can create the following object:

apiVersion: fileintegrity.openshift.io/v1alpha1
kind: FileIntegrity
metadata:
  name: worker-fileintegrity
  namespace: openshift-file-integrity
spec:
  nodeSelector:
      node-role.kubernetes.io/worker: ""
  config:
      initialDelay: 100
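
For reference, here is a minimal Go sketch of how such a field might be declared on the FileIntegrity config type (only the initialDelay name and its seconds semantics come from this PR; the struct shape, field types, and kubebuilder markers are assumptions for illustration):

// FileIntegrityConfig (sketch): only InitialDelay is introduced by this PR;
// GracePeriod is shown for context, and the markers are assumptions.
type FileIntegrityConfig struct {
	// InitialDelay is the number of seconds to wait before running the first AIDE scan.
	// +optional
	InitialDelay int `json:"initialDelay,omitempty"`

	// GracePeriod is the number of seconds between individual AIDE scans.
	// +optional
	GracePeriod int `json:"gracePeriod,omitempty"`
}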

@openshift-ci openshift-ci bot requested review from jhrozek and rhmdnd February 15, 2023 23:02
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 15, 2023
@rhmdnd (Contributor) previously requested changes and left a comment on Feb 16, 2023

Thanks for getting this proposed!

A few suggestions inline along with a recommendation to make the feature non-blocking.

@@ -44,6 +44,11 @@ spec:
        default: 900
        description: Time between individual aide scans
        type: integer
      initialDelay:
        description: InitalDelaySeconds is the number of seconds to wait

nit: InitialDelaySeconds

"first scan. It"

} else {
	// sleep for the initial delay
	reqLogger.Info("InitialDelaySeconds set, sleeping", "InitialDelaySeconds", instance.Spec.Config.InitialDelay)
	time.Sleep(time.Duration(instance.Spec.Config.InitialDelay) * time.Second)

Based on my testing, this is a blocking operation and could lead to confusion with multiple file integrity resources.

I tested this by creating two different scans:

apiVersion: fileintegrity.openshift.io/v1alpha1
kind: FileIntegrity
metadata:
  name: master-fileintegrity
  namespace: openshift-file-integrity
spec:
  nodeSelector:
      node-role.kubernetes.io/master: ""
  config:
      name: master-aide-conf
      namespace: openshift-file-integrity
      initialDelay: 120
$ cat worker-fio.yaml
apiVersion: fileintegrity.openshift.io/v1alpha1
kind: FileIntegrity
metadata:
  name: worker-fileintegrity
  namespace: openshift-file-integrity
spec:
  nodeSelector:
      node-role.kubernetes.io/worker: ""
  config:
      name: worker-aide-conf
      namespace: openshift-file-integrity
      initialDelay: 120

I created them at the same time, so I expected the daemonSets for each to become available at the same time (since they use the same initial delay).

What I observed is that the timer for the second file integrity starts after the first is created because the reconcile loop is sleeping.

NAME                                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
daemonset.apps/aide-master-fileintegrity   3         3         3       3            3           node-role.kubernetes.io/master=   2m6s
daemonset.apps/aide-worker-fileintegrity   2         2         2       2            2           node-role.kubernetes.io/worker=   6s

Is it possible to re-queue the request and determine this based on the request timestamp?


I used the following to make this a non-blocking operation.

 $ git d pkg/controller/
diff --git a/pkg/controller/fileintegrity/fileintegrity_controller.go b/pkg/controller/fileintegrity/fileintegrity_controller.go
index c18a49d8..37b7cb5a 100644
--- a/pkg/controller/fileintegrity/fileintegrity_controller.go
+++ b/pkg/controller/fileintegrity/fileintegrity_controller.go
@@ -440,13 +440,16 @@ func (r *FileIntegrityReconciler) FileIntegrityControllerReconcile(request recon
                        return reconcile.Result{}, legacyDeleteErr
                }

-               // check if we have initialDelay set
-               if instance.Spec.Config.InitialDelay == 0 {
-                       reqLogger.Info("InitialDelaySeconds not set, creating deamonset now")
-               } else {
-                       // sleep for the initial delay
-                       reqLogger.Info("InitialDelaySeconds set, sleeping", "InitialDelaySeconds", instance.Spec.Config.InitialDelay)
-                       time.Sleep(time.Duration(instance.Spec.Config.InitialDelay) * time.Second)
+               // Check if we're past the initial delay timer by evaluating
+               // the time since creation.
+               n := time.Now()
+               d := n.Sub(instance.GetCreationTimestamp().Time)
+
+               if d.Seconds() < float64(instance.Spec.Config.InitialDelay) {
+                       s := fmt.Sprintf("Re-queuing request for %s because elapsed time since creation (%f seconds) hasn't exceeded InitialDelay of %d seconds", instance.Name, d.Seconds(), instance.Spec.Config.InitialDelay)
+                       reqLogger.Info(s)
+                       // requeue the request
+                       return reconcile.Result{Requeue: true}, nil
                }

                reqLogger.Info("Creating daemonSet", "DaemonSet", daemonSetName)

@Vincent056 (Contributor, Author) replied:

That's a good catch, thanks! This works now.

@Vincent056 (Contributor, Author) commented:
> Thanks for getting this proposed!
>
> A few suggestions inline along with a recommendation to make the feature non-blocking.

Thanks for the thorough review; let me address those issues.

@Vincent056 Vincent056 force-pushed the initial branch 3 times, most recently from 3f7207d to ec0406c Compare February 17, 2023 07:24
// the time since creation.
d := time.Since(instance.CreationTimestamp.Time)

if d.Seconds() < float64(instance.Spec.Config.InitialDelay) {

I think it would be more idiomatic to use the time.Before() or time.After() methods.
Something like:

shouldScheduleAt := instance.CreationTimestamp.Time.Add(time.Duration(instance.Spec.Config.InitialDelay) * time.Second)
if time.Now().Before(shouldScheduleAt) {
	// initial delay has not elapsed yet; requeue instead of creating the daemonSet
}

@Vincent056 (Contributor, Author) replied:

Thanks for the review, this makes sense, just updated the code

@Vincent056 (Contributor, Author) commented Feb 17, 2023

Had a successful e2e run on my local cluster; it looks like we have some issues with must-gather.

@Vincent056 Vincent056 requested a review from rhmdnd February 17, 2023 09:16
@rhmdnd (Contributor) left a comment:

Code changes look good. Just have to figure out the e2e issues.

@rhmdnd rhmdnd self-requested a review February 17, 2023 14:53
@rhmdnd rhmdnd dismissed their stale review February 17, 2023 14:54

My comments were addressed.

@Vincent056 (Contributor, Author):
/retest

1 similar comment
@Vincent056 (Contributor, Author):

/retest

@Vincent056 Vincent056 force-pushed the initial branch 3 times, most recently from 5381fe2 to 662a92b Compare February 23, 2023 03:08
We added an initialDelay option to the FileIntegrity CRD to allow users to specify the initial delay before the first scan is run. This is useful for environments where the operator is deployed before the cluster is fully ready.
@Vincent056 (Contributor, Author):

/retest

@xiaojiey commented Feb 23, 2023

Verification passed with 4.13.0-0.nightly-2023-02-23-000625 and the code in the PR:

1. Deploy FIO with the code in the PR
2. Create a FileIntegrity with initialDelay set to 300, and check the results:
$ oc create cm worker-aide-conf --from-file=aide-conf=aide.conf.rhel8
configmap/worker-aide-conf created
$ oc create cm master-aide-conf --from-file=aide-conf=aide.conf.rhel8.1
configmap/master-aide-conf created
$ oc apply -f -<<EOF
---
apiVersion: fileintegrity.openshift.io/v1alpha1
kind: FileIntegrity
metadata:
  name: worker-fileintegrity
  namespace: openshift-file-integrity
spec:
  nodeSelector:
      node-role.kubernetes.io/worker: ""
  config:
      name: worker-aide-conf
      namespace: openshift-file-integrity
      initialDelay: 300
      gracePeriod: 300
---
apiVersion: fileintegrity.openshift.io/v1alpha1
kind: FileIntegrity
metadata:
  name: master-fileintegrity
  namespace: openshift-file-integrity
spec:
  nodeSelector:
      node-role.kubernetes.io/master: ""
  config:
      name: master-aide-conf
      namespace: openshift-file-integrity
      initialDelay: 300
      gracePeriod: 300
EOF
fileintegrity.fileintegrity.openshift.io/worker-fileintegrity created
fileintegrity.fileintegrity.openshift.io/master-fileintegrity created
$ oc get daemonsets.apps  -w
^C
$ oc get fileintegrity
NAME                   AGE
master-fileintegrity   4m42s
worker-fileintegrity   4m43s
$ oc get daemonsets.apps  -w
NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
aide-worker-fileintegrity   0         0         0       0            0           node-role.kubernetes.io/worker=   0s
aide-worker-fileintegrity   3         3         0       3            0           node-role.kubernetes.io/worker=   0s
aide-master-fileintegrity   0         0         0       0            0           node-role.kubernetes.io/master=   0s
aide-master-fileintegrity   3         3         0       3            0           node-role.kubernetes.io/master=   0s
^C
$ oc get fileintegrity
NAME                   AGE
master-fileintegrity   5m4s
worker-fileintegrity   5m5s
$ oc get daemonsets.apps  -w
NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
aide-master-fileintegrity   3         3         1       3            1           node-role.kubernetes.io/master=   8s
aide-worker-fileintegrity   3         3         0       3            0           node-role.kubernetes.io/worker=   9s
aide-worker-fileintegrity   3         3         1       3            1           node-role.kubernetes.io/worker=   17s
aide-worker-fileintegrity   3         3         2       3            2           node-role.kubernetes.io/worker=   19s
aide-worker-fileintegrity   3         3         3       3            3           node-role.kubernetes.io/worker=   19s
aide-master-fileintegrity   3         3         2       3            2           node-role.kubernetes.io/master=   20s
aide-master-fileintegrity   3         3         3       3            3           node-role.kubernetes.io/master=   21s
^C
$ oc get pod
NAME                                      READY   STATUS    RESTARTS      AGE
aide-master-fileintegrity-8d5n6           1/1     Running   0             42s
aide-master-fileintegrity-qhqsp           1/1     Running   0             42s
aide-master-fileintegrity-tdrxn           1/1     Running   0             42s
aide-worker-fileintegrity-8vt5t           1/1     Running   0             43s
aide-worker-fileintegrity-rbzpp           1/1     Running   0             43s
aide-worker-fileintegrity-wgrcs           1/1     Running   0             43s
file-integrity-operator-9dd874f46-888hn   1/1     Running   1 (34m ago)   35m
$ oc logs pod/aide-master-fileintegrity-8d5n6 --all-containers 
2023-02-23T14:38:41Z: Starting the AIDE runner daemon
W0223 14:38:41.781706       1 client_config.go:615] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2023-02-23T14:38:41Z: running aide check
2023-02-23T14:39:12Z: aide check returned status 0
$ oc get fileintegritynodestatus
NAME                                                                             NODE                                                        STATUS
master-fileintegrity-xiyuan-23-pr-2f4zq-master-0.c.openshift-qe.internal         xiyuan-23-pr-2f4zq-master-0.c.openshift-qe.internal         Succeeded
master-fileintegrity-xiyuan-23-pr-2f4zq-master-1.c.openshift-qe.internal         xiyuan-23-pr-2f4zq-master-1.c.openshift-qe.internal         Succeeded
master-fileintegrity-xiyuan-23-pr-2f4zq-master-2.c.openshift-qe.internal         xiyuan-23-pr-2f4zq-master-2.c.openshift-qe.internal         Succeeded
worker-fileintegrity-xiyuan-23-pr-2f4zq-worker-a-ncwxc.c.openshift-qe.internal   xiyuan-23-pr-2f4zq-worker-a-ncwxc.c.openshift-qe.internal   Succeeded
worker-fileintegrity-xiyuan-23-pr-2f4zq-worker-b-dc7q4.c.openshift-qe.internal   xiyuan-23-pr-2f4zq-worker-b-dc7q4.c.openshift-qe.internal   Succeeded
worker-fileintegrity-xiyuan-23-pr-2f4zq-worker-c-vc7hc.c.openshift-qe.internal   xiyuan-23-pr-2f4zq-worker-c-vc7hc.c.openshift-qe.internal   Succeeded

@xiaojiey:
/label qe-app-approved

openshift-ci bot commented Feb 23, 2023

@xiaojiey: The label(s) /label qe-app-approved cannot be applied. These labels are supported: platform/aws, platform/azure, platform/baremetal, platform/google, platform/libvirt, platform/openstack, ga, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, px-approved, docs-approved, qe-approved, downstream-change-needed, approved, backport-risk-assessed, bugzilla/valid-bug, cherry-pick-approved, jira/valid-bug, staff-eng-approved. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

/label qe-app-approved


openshift-ci bot commented Feb 23, 2023

@Vincent056: all tests passed!

Full PR test history. Your PR dashboard.


@rhmdnd (Contributor) left a comment:

/lgtm

Thanks, Vincent!

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 28, 2023
openshift-ci bot commented Feb 28, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rhmdnd, Vincent056

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit eccbb07 into openshift:master Feb 28, 2023