Migrate FlowNode storage to BulkFlowNodeStorage upon execution completion to improve read performance #807
Conversation
…execution completion
// TODO: A more conservative option could be to instead keep using the current storage until after
// a restart. In that case we'd have to defer deletion of the old storage dir and handle it in
// initializeStorage.
this.storage = new TimingFlowNodeStorage(newStorage);
Not sure if this is safe or not.
I think there is some potential for race conditions here. Methods like CpsFlowExecution.getNode are not synchronized, so something could call getNode while storage still points to SimpleXStreamFlowNodeStorage, and then if that node is not in the cache, it will be read from disk. But in the meantime optimizeStorage could be deleting workflow/ after swapping this.storage, leading to getNode temporarily returning null, or the workflow/ deletion failing, etc.
I can think of a few options:
- Only delete workflow/ in initializeStorage if this.storageDir is workflow-completed. Not ideal if the build never actually gets looked at again, because workflow/ will contain redundant data, but simple.
- Use Cleaner in optimizeStorage to register an action to delete the old storage directory once the old this.storage.delegate is phantom reachable, instead of doing it immediately. Should work fine in the happy path, but if Jenkins crashes or something, workflow/ might not get cleaned up.
- Add some kind of locking around this.storage. Has potential for bugs and general performance issues. Perhaps a ReadWriteLock would perform well enough.
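For reference, the third option is essentially the classic reader/writer pattern; a minimal, Jenkins-free sketch of the idea (the class and field names here are illustrative stand-ins, not the plugin's actual API):

```java
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical stand-in for the CpsFlowExecution storage field.
class StorageHolder {
    private final ReadWriteLock lock = new ReentrantReadWriteLock();
    private String storage = "SimpleXStreamFlowNodeStorage";

    // Readers (analogous to getNode) take the read lock, so they can never
    // observe the field mid-swap or read from a directory being deleted.
    String read() {
        lock.readLock().lock();
        try {
            return storage;
        } finally {
            lock.readLock().unlock();
        }
    }

    // The swap (analogous to optimizeStorage) takes the write lock, which
    // excludes all readers until both the swap and the deletion are done.
    void swap(String newStorage) {
        lock.writeLock().lock();
        try {
            storage = newStorage;
            // the old storage directory would be deleted here, while readers
            // are still excluded
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```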
The third option seems most straightforward.
Yes I thought so as well, but it got quite messy. See 4b89fed. It is awkward because although many methods that consume storage probably cannot ever run concurrently with onProgramEnd and so do not need any locking, it is not immediately obvious, and so I am not really comfortable with skipping the locks selectively. Adding a new lock also introduces the possibility of deadlock if something attempts to acquire the monitor for CpsFlowExecution while holding storageLock and the build completes concurrently (nothing does this today as far as I can see).
The Cleaner approach is quite a bit simpler in terms of code changes, for what that's worth. Perhaps we could combine that approach with deletion of workflow/ in initializeStorage to guarantee deletion in most cases without having to introduce the new lock and change all associated code.
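To make the Cleaner idea concrete, java.lang.ref.Cleaner can register an action that runs once the old delegate becomes phantom reachable. A minimal sketch, with the deletion modeled as a flag rather than actual directory removal; note the action must not capture the tracked object itself, or it would never become unreachable:

```java
import java.lang.ref.Cleaner;
import java.util.concurrent.atomic.AtomicBoolean;

public class StorageCleanup {
    private static final Cleaner CLEANER = Cleaner.create();

    // Registers cleanup for the old storage delegate. The Runnable captures
    // only the flag (standing in for the old directory path), never
    // oldDelegate itself.
    static Cleaner.Cleanable registerDeletion(Object oldDelegate, AtomicBoolean deleted) {
        return CLEANER.register(oldDelegate, () -> deleted.set(true));
    }
}
```

As noted above, if the JVM exits before the delegate is collected the action never runs, which is why deletion would still need a fallback in initializeStorage. Cleanable.clean() can also run the action eagerly, at most once.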
Hmm, maybe actually I can reuse the lock inside of TimingFlowNodeStorage instead to simplify things.
Yes I think 62119b8 is safe and it is far simpler than introducing another lock. I wish I had noticed it yesterday...
plugin/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java
 */
private void optimizeStorage(FlowNode flowEndNode) {
    if (storage.delegate instanceof SimpleXStreamFlowNodeStorage) {
        String newStorageDir = (this.storageDir != null) ? this.storageDir + "-completed" : "workflow-completed";
Similar to what we do with workflow-fallback in createPlaceholderNodes, but of course other approaches could be used.
// The hope is that by doing this right when the execution completes, most of the nodes will already be
// cached in memory, saving us the cost of having to read them all from disk.
It would be simpler to do the migration when first loading the execution, but then we guarantee the worst case of having to read all the individual FlowNode XML files just to write them again, which I think in practice would mean that you would only see performance benefits with that approach when navigating to a build after two restarts in a row.
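Either way, the migration itself boils down to copying every node from the per-node store into the bulk store. A minimal sketch with hypothetical stand-in types (the real code works with FlowNodeStorage and walks nodes via DepthFirstScanner):

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class StorageMigration {
    // Stand-in for FlowNodeStorage: node id -> serialized node.
    interface NodeStore {
        Map<String, String> nodes();
        void store(String id, String data);
    }

    static class MapStore implements NodeStore {
        final Map<String, String> map = new LinkedHashMap<>();
        public Map<String, String> nodes() { return map; }
        public void store(String id, String data) { map.put(id, data); }
    }

    // Copy everything from the old store into the new one. If nodes are
    // already cached in memory (as right after execution completes), this
    // avoids re-reading each tiny XML file from disk.
    static void migrate(NodeStore from, NodeStore to) {
        for (Map.Entry<String, String> e : from.nodes().entrySet()) {
            to.store(e.getKey(), e.getValue());
        }
    }
}
```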
plugin/src/test/java/org/jenkinsci/plugins/workflow/cps/PersistenceProblemsTest.java
Looks right. May be possible to write a JenkinsSessionRule test verifying accesses to flow nodes after build completion in the same session, and then in the next session, asserting somehow that the optimized storage is in use.
// TODO: A more conservative option could be to instead keep using the current storage until after
// a restart. In that case we'd have to defer deletion of the old storage dir and handle it in
// initializeStorage.
this.storage = new TimingFlowNodeStorage(newStorage);
Does this require a build.xml save? Or is that done by something else anyway?
The two places that call onProgramEnd (CpsFlowExecution.croak and CpsThreadGroup.run) both call CpsFlowExecution.saveOwner right after the call, so I think things will be ok, but I will check in more detail later.
One change I will make is to expand the outer catch clause in optimizeStorage to catch Exception, in case DepthFirstScanner can throw an NPE or similar runtime exceptions in unusual cases.
Other than that, I think things should be ok unless we ever see a Throwable that is not an Exception in onProgramEnd (e.g. StackOverflowError, OutOfMemoryError), but that is the same with or without this PR. I think we could perhaps catch Throwable or adapt existing callers to use try/finally to be more robust against thrown Errors, but I am not sure that it would make a difference in practice.
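The try/finally idea for callers can be sketched like this (a simulated example, not the plugin's actual classes; the method names mirror the ones discussed above):

```java
public class ProgramEndDemo {
    static boolean saved = false;

    // Simulates onProgramEnd dying with a Throwable that is not an Exception.
    static void onProgramEnd() {
        throw new StackOverflowError("simulated");
    }

    // Stands in for CpsFlowExecution.saveOwner persisting build.xml.
    static void saveOwner() {
        saved = true;
    }

    // A caller like CpsThreadGroup.run could save in a finally block so the
    // owner is persisted even when an Error propagates.
    static void finish() {
        try {
            onProgramEnd();
        } finally {
            saveOwner();
        }
    }
}
```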
plugin/src/test/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecutionTest.java
…tions and log overall errors at WARNING level
As far as performance, I did some very basic local testing (M1 Mac, 32GB RAM, SSD) by running a Pipeline that just loops over […]. I also added some […]
…torage" This reverts commit 4b89fed.
…old storage dir and replacing storage.delegate
One thing I noticed while looking into all of this is that […]
Changed the title: "FlowNode storage to BulkFlowNodeStorage upon execution completion" → "FlowNode storage to BulkFlowNodeStorage upon execution completion to improve read performance"
Looks like we just need assertion tweaks in https://github.com/jenkinsci/support-core-plugin/blob/fb93051e5ad707a27ec61ebdbca23b8e66f76358/src/test/java/com/cloudbees/jenkins/support/impl/AbstractItemComponentTest.java#L171 etc.; @dwnusbaum are you on it already?
Yes
SimpleXStreamFlowNodeStorage optimizes write performance (important when using MAX_SURVIVABILITY) at the cost of read performance (in particular, network file systems do not deal well with having to read hundreds of tiny little FlowNode XML files). For complex Pipeline builds running on controllers using network file systems, loading flow nodes after a Jenkins restart can take a significant amount of time (I observed a case where it was taking 15+ seconds), which can impact performance in various ways, most obviously when opening visualizations like the "Pipeline steps" view.

This PR explores the possibility of migrating all completed executions to BulkFlowNodeStorage to increase read performance. We don't have to worry about write performance for completed executions, so at least in theory there are no downsides after the migration is complete.

Testing done