Add initialDelay #303

Merged: 1 commit, Feb 28, 2023

Conversation

@Vincent056 (Contributor) commented Feb 15, 2023

We added an initialDelay option to the FileIntegrity CRD to allow users to specify the initial delay before the first scan is run. This is useful for environments where the operator is deployed before the cluster is fully ready.
Launch of the aide daemonset will be delayed according to the value of initialDelay set in the FileIntegrity object.

To launch a FileIntegrity check for worker nodes with a delay of 100s, you can create the following object:

apiVersion: fileintegrity.openshift.io/v1alpha1
kind: FileIntegrity
metadata:
  name: worker-fileintegrity
  namespace: openshift-file-integrity
spec:
  nodeSelector:
      node-role.kubernetes.io/worker: ""
  config:
      initialDelay: 100
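
For reference, here is a minimal Go sketch of how such a field might be declared on the FileIntegrity config type (only the initialDelay name and its seconds semantics come from this PR; the struct shape, field types, and kubebuilder markers are assumptions for illustration):

// FileIntegrityConfig (sketch): only InitialDelay is introduced by this PR;
// GracePeriod is shown for context, and the markers are assumptions.
type FileIntegrityConfig struct {
	// InitialDelay is the number of seconds to wait before running the first AIDE scan.
	// +optional
	InitialDelay int `json:"initialDelay,omitempty"`

	// GracePeriod is the number of seconds between individual AIDE scans.
	// +optional
	GracePeriod int `json:"gracePeriod,omitempty"`
}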

@openshift-ci openshift-ci bot requested review from jhrozek and rhmdnd February 15, 2023 23:02
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 15, 2023
@rhmdnd (Contributor) previously requested changes and left a comment on Feb 16, 2023

Thanks for getting this proposed!

A few suggestions inline along with a recommendation to make the feature non-blocking.

@@ -44,6 +44,11 @@ spec:
        default: 900
        description: Time between individual aide scans
        type: integer
      initialDelay:
        description: InitalDelaySeconds is the number of seconds to wait

nit: InitialDelaySeconds

"first scan. It"

} else {
	// sleep for the initial delay
	reqLogger.Info("InitialDelaySeconds set, sleeping", "InitialDelaySeconds", instance.Spec.Config.InitialDelay)
	time.Sleep(time.Duration(instance.Spec.Config.InitialDelay) * time.Second)

Based on my testing, this is a blocking operation and could lead to confusion with multiple file integrity resources.

I tested this by creating two different scans:

apiVersion: fileintegrity.openshift.io/v1alpha1
kind: FileIntegrity
metadata:
  name: master-fileintegrity
  namespace: openshift-file-integrity
spec:
  nodeSelector:
      node-role.kubernetes.io/master: ""
  config:
      name: master-aide-conf
      namespace: openshift-file-integrity
      initialDelay: 120
$ cat worker-fio.yaml
apiVersion: fileintegrity.openshift.io/v1alpha1
kind: FileIntegrity
metadata:
  name: worker-fileintegrity
  namespace: openshift-file-integrity
spec:
  nodeSelector:
      node-role.kubernetes.io/worker: ""
  config:
      name: worker-aide-conf
      namespace: openshift-file-integrity
      initialDelay: 120

I created them at the same time, so I expected the daemonSets for each to become available at the same time (since they use the same initial delay).

What I observed is that the timer for the second file integrity starts after the first is created because the reconcile loop is sleeping.

NAME                                       DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
daemonset.apps/aide-master-fileintegrity   3         3         3       3            3           node-role.kubernetes.io/master=   2m6s
daemonset.apps/aide-worker-fileintegrity   2         2         2       2            2           node-role.kubernetes.io/worker=   6s

Is it possible to re-queue the request and determine this based on the request timestamp?


I used the following to make this a non-blocking operation.

 $ git d pkg/controller/
diff --git a/pkg/controller/fileintegrity/fileintegrity_controller.go b/pkg/controller/fileintegrity/fileintegrity_controller.go
index c18a49d8..37b7cb5a 100644
--- a/pkg/controller/fileintegrity/fileintegrity_controller.go
+++ b/pkg/controller/fileintegrity/fileintegrity_controller.go
@@ -440,13 +440,16 @@ func (r *FileIntegrityReconciler) FileIntegrityControllerReconcile(request recon
                        return reconcile.Result{}, legacyDeleteErr
                }

-               // check if we have initialDelay set
-               if instance.Spec.Config.InitialDelay == 0 {
-                       reqLogger.Info("InitialDelaySeconds not set, creating deamonset now")
-               } else {
-                       // sleep for the initial delay
-                       reqLogger.Info("InitialDelaySeconds set, sleeping", "InitialDelaySeconds", instance.Spec.Config.InitialDelay)
-                       time.Sleep(time.Duration(instance.Spec.Config.InitialDelay) * time.Second)
+               // Check if we're past the initial delay timer by evaluating
+               // the time since creation.
+               n := time.Now()
+               d := n.Sub(instance.GetCreationTimestamp().Time)
+
+               if d.Seconds() < float64(instance.Spec.Config.InitialDelay) {
+                       s := fmt.Sprintf("Re-queuing request for %s because elapsed time since creation (%f seconds) hasn't exceeded InitialDelay of %d seconds", instance.Name, d.Seconds(), instance.Spec.Config.InitialDelay)
+                       reqLogger.Info(s)
+                       // requeue the request
+                       return reconcile.Result{Requeue: true}, nil
                }

                reqLogger.Info("Creating daemonSet", "DaemonSet", daemonSetName)

@Vincent056 (Contributor, Author) replied:

That's a good catch, thanks! This works now.

@Vincent056 (Contributor, Author) commented:
> Thanks for getting this proposed!
>
> A few suggestions inline along with a recommendation to make the feature non-blocking.

Thanks for the thorough review; let me address those issues.

@Vincent056 Vincent056 force-pushed the initial branch 3 times, most recently from 3f7207d to ec0406c Compare February 17, 2023 07:24
// the time since creation.
d := time.Since(instance.CreationTimestamp.Time)

if d.Seconds() < float64(instance.Spec.Config.InitialDelay) {

I think it would be more idiomatic to use the time.Before() or time.After() methods.
Something like:

shouldScheduleAt := instance.CreationTimestamp.Time.Add(time.Duration(instance.Spec.Config.InitialDelay) * time.Second)
if time.Now().Before(shouldScheduleAt) {
	// initial delay has not elapsed yet; requeue instead of creating the daemonSet
}

@Vincent056 (Contributor, Author) replied:

Thanks for the review, this makes sense, just updated the code

@Vincent056 (Contributor, Author) commented Feb 17, 2023

Had a successful e2e run on my local cluster; it looks like we have some issues with must-gather.

@Vincent056 Vincent056 requested a review from rhmdnd February 17, 2023 09:16
@rhmdnd (Contributor) left a comment:

Code changes look good. Just have to figure out the e2e issues.

@rhmdnd rhmdnd self-requested a review February 17, 2023 14:53
@rhmdnd rhmdnd dismissed their stale review February 17, 2023 14:54

My comments were addressed.

@Vincent056 (Contributor, Author):
/retest

1 similar comment
@Vincent056 (Contributor, Author):

/retest

@Vincent056 Vincent056 force-pushed the initial branch 3 times, most recently from 5381fe2 to 662a92b Compare February 23, 2023 03:08
We added an initialDelay option to the FileIntegrity CRD to allow users to specify the initial delay before the first scan is run. This is useful for environments where the operator is deployed before the cluster is fully ready.
@Vincent056 (Contributor, Author):

/retest

@xiaojiey commented Feb 23, 2023

Verification passed with 4.13.0-0.nightly-2023-02-23-000625 and the code in the PR:

1. Deploy FIO with the code in the PR
2. Create a FileIntegrity with initialDelay set to 300, and check the results:
$ oc create cm worker-aide-conf --from-file=aide-conf=aide.conf.rhel8
configmap/worker-aide-conf created
$ oc create cm master-aide-conf --from-file=aide-conf=aide.conf.rhel8.1
configmap/master-aide-conf created
$ oc apply -f -<<EOF
---
apiVersion: fileintegrity.openshift.io/v1alpha1
kind: FileIntegrity
metadata:
  name: worker-fileintegrity
  namespace: openshift-file-integrity
spec:
  nodeSelector:
      node-role.kubernetes.io/worker: ""
  config:
      name: worker-aide-conf
      namespace: openshift-file-integrity
      initialDelay: 300
      gracePeriod: 300
---
apiVersion: fileintegrity.openshift.io/v1alpha1
kind: FileIntegrity
metadata:
  name: master-fileintegrity
  namespace: openshift-file-integrity
spec:
  nodeSelector:
      node-role.kubernetes.io/master: ""
  config:
      name: master-aide-conf
      namespace: openshift-file-integrity
      initialDelay: 300
      gracePeriod: 300
EOF
fileintegrity.fileintegrity.openshift.io/worker-fileintegrity created
fileintegrity.fileintegrity.openshift.io/master-fileintegrity created
$ oc get daemonsets.apps  -w
^C
$ oc get fileintegrity
NAME                   AGE
master-fileintegrity   4m42s
worker-fileintegrity   4m43s
$ oc get daemonsets.apps  -w
NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
aide-worker-fileintegrity   0         0         0       0            0           node-role.kubernetes.io/worker=   0s
aide-worker-fileintegrity   3         3         0       3            0           node-role.kubernetes.io/worker=   0s
aide-master-fileintegrity   0         0         0       0            0           node-role.kubernetes.io/master=   0s
aide-master-fileintegrity   3         3         0       3            0           node-role.kubernetes.io/master=   0s
^C
$ oc get fileintegrity
NAME                   AGE
master-fileintegrity   5m4s
worker-fileintegrity   5m5s
$ oc get daemonsets.apps  -w
NAME                        DESIRED   CURRENT   READY   UP-TO-DATE   AVAILABLE   NODE SELECTOR                     AGE
aide-master-fileintegrity   3         3         1       3            1           node-role.kubernetes.io/master=   8s
aide-worker-fileintegrity   3         3         0       3            0           node-role.kubernetes.io/worker=   9s
aide-worker-fileintegrity   3         3         1       3            1           node-role.kubernetes.io/worker=   17s
aide-worker-fileintegrity   3         3         2       3            2           node-role.kubernetes.io/worker=   19s
aide-worker-fileintegrity   3         3         3       3            3           node-role.kubernetes.io/worker=   19s
aide-master-fileintegrity   3         3         2       3            2           node-role.kubernetes.io/master=   20s
aide-master-fileintegrity   3         3         3       3            3           node-role.kubernetes.io/master=   21s
^C
$ oc get pod
NAME                                      READY   STATUS    RESTARTS      AGE
aide-master-fileintegrity-8d5n6           1/1     Running   0             42s
aide-master-fileintegrity-qhqsp           1/1     Running   0             42s
aide-master-fileintegrity-tdrxn           1/1     Running   0             42s
aide-worker-fileintegrity-8vt5t           1/1     Running   0             43s
aide-worker-fileintegrity-rbzpp           1/1     Running   0             43s
aide-worker-fileintegrity-wgrcs           1/1     Running   0             43s
file-integrity-operator-9dd874f46-888hn   1/1     Running   1 (34m ago)   35m
$ oc logs pod/aide-master-fileintegrity-8d5n6 --all-containers 
2023-02-23T14:38:41Z: Starting the AIDE runner daemon
W0223 14:38:41.781706       1 client_config.go:615] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2023-02-23T14:38:41Z: running aide check
2023-02-23T14:39:12Z: aide check returned status 0
$ oc get fileintegritynodestatus
NAME                                                                             NODE                                                        STATUS
master-fileintegrity-xiyuan-23-pr-2f4zq-master-0.c.openshift-qe.internal         xiyuan-23-pr-2f4zq-master-0.c.openshift-qe.internal         Succeeded
master-fileintegrity-xiyuan-23-pr-2f4zq-master-1.c.openshift-qe.internal         xiyuan-23-pr-2f4zq-master-1.c.openshift-qe.internal         Succeeded
master-fileintegrity-xiyuan-23-pr-2f4zq-master-2.c.openshift-qe.internal         xiyuan-23-pr-2f4zq-master-2.c.openshift-qe.internal         Succeeded
worker-fileintegrity-xiyuan-23-pr-2f4zq-worker-a-ncwxc.c.openshift-qe.internal   xiyuan-23-pr-2f4zq-worker-a-ncwxc.c.openshift-qe.internal   Succeeded
worker-fileintegrity-xiyuan-23-pr-2f4zq-worker-b-dc7q4.c.openshift-qe.internal   xiyuan-23-pr-2f4zq-worker-b-dc7q4.c.openshift-qe.internal   Succeeded
worker-fileintegrity-xiyuan-23-pr-2f4zq-worker-c-vc7hc.c.openshift-qe.internal   xiyuan-23-pr-2f4zq-worker-c-vc7hc.c.openshift-qe.internal   Succeeded

@xiaojiey:
/label qe-app-approved

openshift-ci bot commented Feb 23, 2023

@xiaojiey: The label(s) /label qe-app-approved cannot be applied. These labels are supported: platform/aws, platform/azure, platform/baremetal, platform/google, platform/libvirt, platform/openstack, ga, tide/merge-method-merge, tide/merge-method-rebase, tide/merge-method-squash, px-approved, docs-approved, qe-approved, downstream-change-needed, approved, backport-risk-assessed, bugzilla/valid-bug, cherry-pick-approved, jira/valid-bug, staff-eng-approved. Is this label configured under labels -> additional_labels or labels -> restricted_labels in plugin.yaml?

In response to this:

/label qe-app-approved


openshift-ci bot commented Feb 23, 2023

@Vincent056: all tests passed!

Full PR test history. Your PR dashboard.


@rhmdnd (Contributor) left a comment:

/lgtm

Thanks, Vincent!

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 28, 2023
openshift-ci bot commented Feb 28, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rhmdnd, Vincent056

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-robot openshift-merge-robot merged commit eccbb07 into openshift:master Feb 28, 2023