Migrate FlowNode storage to BulkFlowNodeStorage upon execution completion to improve read performance #807

Merged

Conversation

@dwnusbaum dwnusbaum commented Oct 19, 2023

SimpleXStreamFlowNode storage optimizes write performance (important when using MAX_SURVIVABILITY) at the cost of read performance (in particular network file systems do not deal well with having to read hundreds of tiny little FlowNode XML files). For complex Pipeline builds running on controllers using network file systems, loading flow nodes after a Jenkins restart can take a significant amount of time (I observed a case where it was taking 15+ seconds), which can impact performance in various ways, most obviously when opening visualizations like the "Pipeline steps" view.

This PR explores the possibility of migrating all completed executions to BulkFlowNodeStorage to increase read performance. We don't have to worry about write performance for completed executions, so at least in theory there are no downsides after the migration is complete.

### Testing done

### Submitter checklist
- [x] Make sure you are opening from a **topic/feature/bugfix branch** (right side) and not your main branch!
- [x] Ensure that the pull request title represents the desired changelog entry
- [x] Please describe what you did
- [ ] Link to relevant issues in GitHub or Jira
- [ ] Link to relevant pull requests, esp. upstream and downstream changes
- [x] Ensure you have provided tests that demonstrate the feature works or the issue is fixed

Comment on lines 542 to 545
// TODO: A more conservative option could be to instead keep using the current storage until after
// a restart. In that case we'd have to defer deletion of the old storage dir and handle it in
// initializeStorage.
this.storage = new TimingFlowNodeStorage(newStorage);
Member Author

Not sure if this is safe or not.

Member Author

I think there is some potential for race conditions here. Methods like CpsFlowExecution.getNode are not synchronized, so something could call getNode while storage still points to SimpleXStreamFlowNodeStorage; if that node is not in the cache, it will be read from disk. In the meantime, optimizeStorage could be deleting workflow/ after swapping this.storage, leading to getNode temporarily returning null, the workflow/ deletion failing, etc.

I can think of a few options:

  1. Only delete workflow/ in initializeStorage if this.storageDir is workflow-completed. Not ideal if the build never actually gets looked at again, because workflow/ will contain redundant data, but simple.
  2. Use Cleaner in optimizeStorage to register an action to delete the old storage directory once the old this.storage.delegate is phantom reachable, instead of doing it immediately (see the sketch after this list). Should work fine in the happy path, but if Jenkins crashes or something, workflow/ might not get cleaned up.
  3. Add some kind of locking around this.storage. Has potential for bugs and general performance issues. Perhaps a ReadWriteLock would perform well enough.
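For concreteness, option 2 could look roughly like the sketch below. This is only an illustration under assumed names (`OldStorageCleanup`, `deleteWhenUnreachable` do not exist in the codebase); the real change would live in optimizeStorage and pass in the old storage delegate and its directory.

```java
import java.io.File;
import java.io.IOException;
import java.lang.ref.Cleaner;

import org.apache.commons.io.FileUtils;

class OldStorageCleanup {
    private static final Cleaner CLEANER = Cleaner.create();

    /** Registers deletion of {@code oldStorageDir} for when {@code oldStorage} becomes unreachable. */
    static void deleteWhenUnreachable(Object oldStorage, File oldStorageDir) {
        // The action must not capture oldStorage itself, or it would never become phantom reachable.
        CLEANER.register(oldStorage, () -> {
            try {
                FileUtils.deleteDirectory(oldStorageDir);
            } catch (IOException x) {
                // Best effort only; a crash before this runs would leave workflow/ behind,
                // which is the drawback noted in option 2 above.
            }
        });
    }
}
```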

Member

The third option seems most straightforward.

Member Author

Yes, I thought so as well, but it got quite messy; see 4b89fed. It is awkward because although many methods that consume storage probably can never run concurrently with onProgramEnd, and so do not need any locking, that is not immediately obvious, so I am not comfortable with skipping the locks selectively. Adding a new lock also introduces the possibility of deadlock if something attempts to acquire the monitor for CpsFlowExecution while holding storageLock and the build completes concurrently (nothing does this today as far as I can see).

The Cleaner approach is quite a bit simpler in terms of code changes, for what that's worth. Perhaps we could combine that approach with deletion of workflow/ in initializeStorage to guarantee deletion in most cases without having to introduce the new lock and change all associated code.

Member Author

Hmm, actually maybe I can reuse the lock inside TimingFlowNodeStorage instead to simplify things.

@dwnusbaum (Member Author) Oct 25, 2023

Yes I think 62119b8 is safe and it is far simpler than introducing another lock. I wish I had noticed it yesterday...
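For illustration, reusing an existing read/write lock to guard the delegate swap looks roughly like this; the field and method names are placeholders rather than TimingFlowNodeStorage's actual API (the real change is in 62119b8).

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class DelegateSwapSketch {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private Object delegate; // stands in for the wrapped FlowNodeStorage

    Object read() {
        lock.readLock().lock();
        try {
            return delegate; // reads (e.g. getNode) hold the read lock
        } finally {
            lock.readLock().unlock();
        }
    }

    void swapDelegate(Object newDelegate) {
        lock.writeLock().lock();
        try {
            // No reader can observe the old delegate once the write lock is held,
            // so the old storage directory can be deleted safely afterwards.
            delegate = newDelegate;
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```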

*/
private void optimizeStorage(FlowNode flowEndNode) {
if (storage.delegate instanceof SimpleXStreamFlowNodeStorage) {
String newStorageDir = (this.storageDir != null) ? this.storageDir + "-completed" : "workflow-completed";
Member Author

Similar to what we do with workflow-fallback in createPlaceholderNodes, but of course other approaches could be used.

Comment on lines +533 to +534
// The hope is that by doing this right when the execution completes, most of the nodes will already be
// cached in memory, saving us the cost of having to read them all from disk.
@dwnusbaum (Member Author) Oct 19, 2023

It would be simpler to do the migration when first loading the execution, but then we would guarantee the worst case of having to read all the individual FlowNode XML files just to write them again. In practice, I think that approach would only show performance benefits when navigating to a build after two restarts in a row.

@jglick (Member) left a comment

Looks right. It may be possible to write a JenkinsSessionRule test verifying access to flow nodes after build completion in the same session, and then, in the next session, asserting somehow that the optimized storage is in use.
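A rough sketch of such a JenkinsSessionRule test might look like the following; the assertion on the `workflow-completed` directory is an assumption based on the storage-directory naming in this PR, and the class and test names are illustrative only.

```java
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import java.io.File;
import org.jenkinsci.plugins.workflow.cps.CpsFlowDefinition;
import org.jenkinsci.plugins.workflow.graphanalysis.DepthFirstScanner;
import org.jenkinsci.plugins.workflow.job.WorkflowJob;
import org.jenkinsci.plugins.workflow.job.WorkflowRun;
import org.junit.Rule;
import org.junit.Test;
import org.jvnet.hudson.test.JenkinsSessionRule;

public class StorageMigrationTest {

    @Rule public JenkinsSessionRule sessions = new JenkinsSessionRule();

    @Test public void optimizedStorageUsedAfterRestart() throws Throwable {
        sessions.then(r -> {
            WorkflowJob p = r.jenkins.createProject(WorkflowJob.class, "p");
            p.setDefinition(new CpsFlowDefinition("echo 'hello'", true));
            WorkflowRun b = r.buildAndAssertSuccess(p);
            // Flow nodes remain readable in the same session after completion.
            assertFalse(new DepthFirstScanner().allNodes(b.getExecution()).isEmpty());
        });
        sessions.then(r -> {
            WorkflowRun b = r.jenkins.getItemByFullName("p", WorkflowJob.class).getBuildByNumber(1);
            // Crude check that the migrated storage is what gets used after a restart,
            // assuming the completed storage lands in a workflow-completed directory.
            assertTrue(new File(b.getRootDir(), "workflow-completed").isDirectory());
            assertFalse(new DepthFirstScanner().allNodes(b.getExecution()).isEmpty());
        });
    }
}
```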

// TODO: A more conservative option could be to instead keep using the current storage until after
// a restart. In that case we'd have to defer deletion of the old storage dir and handle it in
// initializeStorage.
this.storage = new TimingFlowNodeStorage(newStorage);
Member

Does this require a build.xml save? Or is that done by something else anyway?

@dwnusbaum (Member Author) Oct 20, 2023

The two places that call onProgramEnd (CpsFlowExecution.croak and CpsThreadGroup.run) both call CpsFlowExecution.saveOwner right after the call, so I think things will be ok, but I will check in more detail later.

Member Author

One change I will make is to expand the outer catch clause in optimizeStorage to catch Exception, in case DepthFirstScanner throws an NPE or similar runtime exception in unusual cases.

Other than that, I think things should be ok unless we ever see a Throwable that is not an Exception in onProgramEnd (e.g. StackOverflowError, OutOfMemoryError), but that is the same with or without this PR. I think we could perhaps catch Throwable or adapt existing callers to use try/finally to be more robust against thrown Errors but I am not sure that it would make a difference in practice.
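Structurally, the widened catch would look something like this minimal sketch (illustrative names only, not the actual optimizeStorage code):

```java
import java.util.logging.Level;
import java.util.logging.Logger;

class OptimizeStorageSketch {
    private static final Logger LOGGER = Logger.getLogger(OptimizeStorageSketch.class.getName());

    void optimizeStorage(Runnable migration) {
        try {
            migration.run(); // copy nodes into the new bulk storage and delete the old directory
        } catch (Exception x) { // widened from IOException so optimization failures stay non-fatal
            LOGGER.log(Level.WARNING, "Unable to migrate flow node storage", x);
        }
    }
}
```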

@dwnusbaum (Member Author)

As far as performance, I did some very basic local testing (M1 Mac, 32 GB RAM, SSD) by running a Pipeline that just loops over `withEnv([]) { sleep 0 }` thousands of times, waiting for the build to complete, then restarting Jenkins, opening the script console, and timing how long it takes to use DepthFirstScanner to scan all nodes. BulkFlowNodeStorage was between 3 and 6 times faster in my trials (roughly 0.5 seconds vs 2.5 seconds to scan all nodes).
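For context, the measurement was essentially a DepthFirstScanner walk over every node. A Java sketch of that timing (not the exact script console snippet used):

```java
import java.util.List;
import org.jenkinsci.plugins.workflow.flow.FlowExecution;
import org.jenkinsci.plugins.workflow.graph.FlowNode;
import org.jenkinsci.plugins.workflow.graphanalysis.DepthFirstScanner;

class ScanTimingSketch {
    /** Walks every node of the given execution and reports how long the scan took. */
    static long timeScan(FlowExecution execution) {
        long start = System.nanoTime();
        List<FlowNode> nodes = new DepthFirstScanner().allNodes(execution);
        long elapsedMillis = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Scanned " + nodes.size() + " nodes in " + elapsedMillis + " ms");
        return elapsedMillis;
    }
}
```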

I also added some FINE logging to optimizeStorage just to see how long the migration took in the best case, where the build runs to completion with the flow node cache never being invalidated: with around 20k flow nodes in a build there was at most a 1-second gap between starting the migration and finishing deletion of workflow/ (the standard logger doesn't include milliseconds, so that's all the granularity I have for now). In the worst case, where you restart right before the final step completes, after the restart optimizeStorage has to actually load all the nodes from disk, which makes it substantially slower (roughly 2.5 seconds for 20k flow nodes for me).

@dwnusbaum (Member Author)

One thing I noticed while looking into all of this is that BulkFlowNodeStorage.getOrLoadNodes has no form of internal synchronization and may be called indirectly by TimingFlowNodeStorage methods that only acquire a read lock, so I think in theory after a restart many threads could all load the same file concurrently. As far as I can tell, the only problem would be wasted effort, though.
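If that duplicated work ever became worth avoiding, ConcurrentHashMap.computeIfAbsent is one standard way to guarantee a single load per key; a generic sketch, not BulkFlowNodeStorage's actual fields or methods:

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.function.Function;

class LoadOnceCache<K, V> {
    private final ConcurrentMap<K, V> cache = new ConcurrentHashMap<>();

    /** Loads each key at most once, even when many threads miss the cache concurrently. */
    V getOrLoad(K key, Function<K, V> loader) {
        // computeIfAbsent applies the loader atomically per key, so concurrent readers
        // holding only a read lock could not duplicate the file parse described above.
        return cache.computeIfAbsent(key, loader);
    }
}
```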

@dwnusbaum changed the title from "Migrate FlowNode storage to BulkFlowNodeStorage upon execution completion" to "Migrate FlowNode storage to BulkFlowNodeStorage upon execution completion to improve read performance" on Oct 25, 2023
@dwnusbaum marked this pull request as ready for review on October 25, 2023 18:49
@dwnusbaum requested a review from a team as a code owner on October 25, 2023 18:49
@jglick enabled auto-merge (squash) on October 25, 2023 18:57
@jglick merged commit a3a6988 into jenkinsci:master on Oct 25, 2023
@basil (Member) commented Oct 26, 2023

java.lang.AssertionError
	at org.junit.Assert.fail(Assert.java:87)
	at org.junit.Assert.assertTrue(Assert.java:42)
	at org.junit.Assert.assertNotNull(Assert.java:713)
	at org.junit.Assert.assertNotNull(Assert.java:723)
	at com.cloudbees.jenkins.support.actions.SupportAbstractItemActionTest.generatePipelineBundleDefaultsAndCheckContent(SupportAbstractItemActionTest.java:119)
java.lang.AssertionError
	at org.junit.Assert.fail(Assert.java:87)
	at org.junit.Assert.assertTrue(Assert.java:42)
	at org.junit.Assert.assertTrue(Assert.java:53)
	at com.cloudbees.jenkins.support.impl.AbstractItemComponentTest.addContentsFromPipeline(AbstractItemComponentTest.java:167)

@jglick (Member) commented Oct 26, 2023

@dwnusbaum (Member Author)
Yes
