Migrate FlowNode storage to BulkFlowNodeStorage upon execution completion to improve read performance #807

Merged

Conversation

@dwnusbaum dwnusbaum commented Oct 19, 2023

SimpleXStreamFlowNode storage optimizes write performance (important when using MAX_SURVIVABILITY) at the cost of read performance (in particular network file systems do not deal well with having to read hundreds of tiny little FlowNode XML files). For complex Pipeline builds running on controllers using network file systems, loading flow nodes after a Jenkins restart can take a significant amount of time (I observed a case where it was taking 15+ seconds), which can impact performance in various ways, most obviously when opening visualizations like the "Pipeline steps" view.

This PR explores the possibility of migrating all completed executions to BulkFlowNodeStorage to increase read performance. We don't have to worry about write performance for completed executions, so at least in theory there are no downsides after the migration is complete.

### Testing done

### Submitter checklist
- [x] Make sure you are opening from a **topic/feature/bugfix branch** (right side) and not your main branch!
- [x] Ensure that the pull request title represents the desired changelog entry
- [x] Please describe what you did
- [ ] Link to relevant issues in GitHub or Jira
- [ ] Link to relevant pull requests, esp. upstream and downstream changes
- [x] Ensure you have provided tests that demonstrate the feature works or the issue is fixed

Comment on lines 542 to 545
// TODO: A more conservative option could be to instead keep using the current storage until after
// a restart. In that case we'd have to defer deletion of the old storage dir and handle it in
// initializeStorage.
this.storage = new TimingFlowNodeStorage(newStorage);
Member Author

Not sure if this is safe or not.

Member Author

I think there is some potential for race conditions here. Methods like CpsFlowExecution.getNode are not synchronized, so something could call getNode while storage still points to SimpleXStreamFlowNodeStorage; if that node is not in the cache, it will be read from disk. In the meantime, optimizeStorage could be deleting workflow/ after swapping this.storage, leading to getNode temporarily returning null, the workflow/ deletion failing, etc.

I can think of a few options:

  1. Only delete workflow/ in initializeStorage if this.storageDir is workflow-completed. Not ideal if the build never actually gets looked at again, because workflow/ will contain redundant data, but simple.
  2. Use Cleaner in optimizeStorage to register an action to delete the old storage directory once the old this.storage.delegate is phantom reachable, instead of doing it immediately (see the sketch after this list). Should work fine in the happy path, but if Jenkins crashes or something, workflow/ might not get cleaned up.
  3. Add some kind of locking around this.storage. Has potential for bugs and general performance issues. Perhaps a ReadWriteLock would perform well enough.
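For concreteness, option 2 could look roughly like the sketch below. This is only an illustration under assumed names (`OldStorageCleanup`, `deleteWhenUnreachable` do not exist in the codebase); the real change would live in optimizeStorage and pass in the old storage delegate and its directory.

```java
import java.io.File;
import java.io.IOException;
import java.lang.ref.Cleaner;

import org.apache.commons.io.FileUtils;

class OldStorageCleanup {
    private static final Cleaner CLEANER = Cleaner.create();

    /** Registers deletion of {@code oldStorageDir} for when {@code oldStorage} becomes unreachable. */
    static void deleteWhenUnreachable(Object oldStorage, File oldStorageDir) {
        // The action must not capture oldStorage itself, or it would never become phantom reachable.
        CLEANER.register(oldStorage, () -> {
            try {
                FileUtils.deleteDirectory(oldStorageDir);
            } catch (IOException x) {
                // Best effort only; a crash before this runs would leave workflow/ behind,
                // which is the drawback noted in option 2 above.
            }
        });
    }
}
```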

Member

The third option seems most straightforward.

Member Author

Yes, I thought so as well, but it got quite messy; see 4b89fed. It is awkward because although many methods that consume storage probably can never run concurrently with onProgramEnd, and so do not need any locking, that is not immediately obvious, so I am not comfortable with skipping the locks selectively. Adding a new lock also introduces the possibility of deadlock if something attempts to acquire the monitor for CpsFlowExecution while holding storageLock and the build completes concurrently (nothing does this today as far as I can see).

The Cleaner approach is quite a bit simpler in terms of code changes, for what that's worth. Perhaps we could combine that approach with deletion of workflow/ in initializeStorage to guarantee deletion in most cases without having to introduce the new lock and change all associated code.

Member Author

Hmm, actually maybe I can reuse the lock inside TimingFlowNodeStorage instead to simplify things.

@dwnusbaum (Member Author) Oct 25, 2023

Yes I think 62119b8 is safe and it is far simpler than introducing another lock. I wish I had noticed it yesterday...
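For illustration, reusing an existing read/write lock to guard the delegate swap looks roughly like this; the field and method names are placeholders rather than TimingFlowNodeStorage's actual API (the real change is in 62119b8).

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class DelegateSwapSketch {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private Object delegate; // stands in for the wrapped FlowNodeStorage

    Object read() {
        lock.readLock().lock();
        try {
            return delegate; // reads (e.g. getNode) hold the read lock
        } finally {
            lock.readLock().unlock();
        }
    }

    void swapDelegate(Object newDelegate) {
        lock.writeLock().lock();
        try {
            // No reader can observe the old delegate once the write lock is held,
            // so the old storage directory can be deleted safely afterwards.
            delegate = newDelegate;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```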

*/
private void optimizeStorage(FlowNode flowEndNode) {
if (storage.delegate instanceof SimpleXStreamFlowNodeStorage) {
String newStorageDir = (this.storageDir != null) ? this.storageDir + "-completed" : "workflow-completed";
Member Author

Similar to what we do with workflow-fallback in createPlaceholderNodes, but of course other approaches could be used.

Comment on lines +533 to +534
// The hope is that by doing this right when the execution completes, most of the nodes will already be
// cached in memory, saving us the cost of having to read them all from disk.
@dwnusbaum (Member Author) Oct 19, 2023

It would be simpler to do the migration when first loading the execution, but then we would guarantee the worst case of having to read all the individual FlowNode XML files just to write them again. In practice, I think that approach would only show performance benefits when navigating to a build after two restarts in a row.

@jglick (Member) left a comment

Looks right. It may be possible to write a JenkinsSessionRule test verifying access to flow nodes after build completion in the same session, and then, in the next session, asserting somehow that the optimized storage is in use.
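A rough sketch of such a JenkinsSessionRule test might look like the following; the assertion on the `workflow-completed` directory is an assumption based on the storage-directory naming in this PR, and the class and test names are illustrative only.

```java
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import java.io.File;
import org.jenkinsci.plugins.workflow.cps.CpsFlowDefinition;
import org.jenkinsci.plugins.workflow.graphanalysis.DepthFirstScanner;
import org.jenkinsci.plugins.workflow.job.WorkflowJob;
import org.jenkinsci.plugins.workflow.job.WorkflowRun;
import org.junit.Rule;
import org.junit.Test;
import org.jvnet.hudson.test.JenkinsSessionRule;

public class StorageMigrationTest {

    @Rule public JenkinsSessionRule sessions = new JenkinsSessionRule();

    @Test public void optimizedStorageUsedAfterRestart() throws Throwable {
        sessions.then(r -> {
            WorkflowJob p = r.jenkins.createProject(WorkflowJob.class, "p");
            p.setDefinition(new CpsFlowDefinition("echo 'hello'", true));
            WorkflowRun b = r.buildAndAssertSuccess(p);
            // Flow nodes remain readable in the same session after completion.
            assertFalse(new DepthFirstScanner().allNodes(b.getExecution()).isEmpty());
        });
        sessions.then(r -> {
            WorkflowRun b = r.jenkins.getItemByFullName("p", WorkflowJob.class).getBuildByNumber(1);
            // Crude check that the migrated storage is what gets used after a restart,
            // assuming the completed storage lands in a workflow-completed directory.
            assertTrue(new File(b.getRootDir(), "workflow-completed").isDirectory());
            assertFalse(new DepthFirstScanner().allNodes(b.getExecution()).isEmpty());
        });
    }
}
```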

// TODO: A more conservative option could be to instead keep using the current storage until after
// a restart. In that case we'd have to defer deletion of the old storage dir and handle it in
// initializeStorage.
this.storage = new TimingFlowNodeStorage(newStorage);
Member

Does this require a build.xml save? Or is that done by something else anyway?

@dwnusbaum (Member Author) Oct 20, 2023

The two places that call onProgramEnd (CpsFlowExecution.croak and CpsThreadGroup.run) both call CpsFlowExecution.saveOwner right after the call, so I think things will be ok, but I will check in more detail later.

Member Author

One change I will make is to expand the outer catch clause in optimizeStorage to catch Exception, in case DepthFirstScanner throws an NPE or similar runtime exception in unusual cases.

Other than that, I think things should be ok unless we ever see a Throwable that is not an Exception in onProgramEnd (e.g. StackOverflowError, OutOfMemoryError), but that is the same with or without this PR. I think we could perhaps catch Throwable or adapt existing callers to use try/finally to be more robust against thrown Errors but I am not sure that it would make a difference in practice.
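Structurally, the widened catch would look something like this minimal sketch (illustrative names only, not the actual optimizeStorage code):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class OptimizeStorageSketch {
    private static final Logger LOGGER = Logger.getLogger(OptimizeStorageSketch.class.getName());

    void optimizeStorage(Runnable migration) {
        try {
            migration.run(); // copy nodes into the new bulk storage and delete the old directory
        } catch (Exception x) { // widened from IOException so optimization failures stay non-fatal
            LOGGER.log(Level.WARNING, "Unable to migrate flow node storage", x);
        }
    }
}
```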

@dwnusbaum (Member Author)

As far as performance, I did some very basic local testing (M1 Mac, 32 GB RAM, SSD) by running a Pipeline that just loops over `withEnv([]) { sleep 0 }` thousands of times, waiting for the build to complete, then restarting Jenkins, opening the script console, and timing how long it takes to use DepthFirstScanner to scan all nodes. BulkFlowNodeStorage was between 3 and 6 times faster in my trials (roughly 0.5 seconds vs 2.5 seconds to scan all nodes).
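For context, the measurement was essentially a DepthFirstScanner walk over every node. A Java sketch of that timing (not the exact script console snippet used):

```java
import java.util.List;
import org.jenkinsci.plugins.workflow.flow.FlowExecution;
import org.jenkinsci.plugins.workflow.graph.FlowNode;
import org.jenkinsci.plugins.workflow.graphanalysis.DepthFirstScanner;

class ScanTimingSketch {
    /** Walks every node of the given execution and reports how long the scan took. */
    static long timeScan(FlowExecution execution) {
        long start = System.nanoTime();
        List<FlowNode> nodes = new DepthFirstScanner().allNodes(execution);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Scanned " + nodes.size() + " nodes in " + elapsedMillis + " ms");
        return elapsedMillis;
    }
}
```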

I also added some FINE logging to optimizeStorage just to see how long the migration took in the best case, where the build runs to completion with the flow node cache never being invalidated: with around 20k flow nodes in a build there was at most a 1-second gap between starting the migration and finishing deletion of workflow/ (the standard logger doesn't include milliseconds, so that's all the granularity I have for now). In the worst case, where you restart right before the final step completes, after the restart optimizeStorage has to actually load all the nodes from disk, which makes it substantially slower (roughly 2.5 seconds for 20k flow nodes for me).

@dwnusbaum (Member Author)

One thing I noticed while looking into all of this is that BulkFlowNodeStorage.getOrLoadNodes has no form of internal synchronization and may be called indirectly by TimingFlowNodeStorage methods that only acquire a read lock, so I think in theory after a restart many threads could all load the same file concurrently. As far as I can tell, the only problem would be wasted effort, though.
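If that duplicated work ever became worth avoiding, ConcurrentHashMap.computeIfAbsent is one standard way to guarantee a single load per key; a generic sketch, not BulkFlowNodeStorage's actual fields or methods:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

class LoadOnceCache<K, V> {
    private final ConcurrentMap<K, V> cache = new ConcurrentHashMap<>();

    /** Loads each key at most once, even when many threads miss the cache concurrently. */
    V getOrLoad(K key, Function<K, V> loader) {
        // computeIfAbsent applies the loader atomically per key, so concurrent readers
        // holding only a read lock could not duplicate the file parse described above.
        return cache.computeIfAbsent(key, loader);
    }
}
```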

@dwnusbaum changed the title from "Migrate FlowNode storage to BulkFlowNodeStorage upon execution completion" to "Migrate FlowNode storage to BulkFlowNodeStorage upon execution completion to improve read performance" on Oct 25, 2023
@dwnusbaum marked this pull request as ready for review on October 25, 2023 18:49
@dwnusbaum requested a review from a team as a code owner on October 25, 2023 18:49
@jglick enabled auto-merge (squash) on October 25, 2023 18:57
@jglick merged commit a3a6988 into jenkinsci:master on Oct 25, 2023
@basil (Member) commented Oct 26, 2023

java.lang.AssertionError
	at org.junit.Assert.fail(Assert.java:87)
	at org.junit.Assert.assertTrue(Assert.java:42)
	at org.junit.Assert.assertNotNull(Assert.java:713)
	at org.junit.Assert.assertNotNull(Assert.java:723)
	at com.cloudbees.jenkins.support.actions.SupportAbstractItemActionTest.generatePipelineBundleDefaultsAndCheckContent(SupportAbstractItemActionTest.java:119)
java.lang.AssertionError
	at org.junit.Assert.fail(Assert.java:87)
	at org.junit.Assert.assertTrue(Assert.java:42)
	at org.junit.Assert.assertTrue(Assert.java:53)
	at com.cloudbees.jenkins.support.impl.AbstractItemComponentTest.addContentsFromPipeline(AbstractItemComponentTest.java:167)

@jglick (Member) commented Oct 26, 2023

@dwnusbaum (Member Author)
Yes
