Performance: pre-filter document list in scheduled workflow checks #10031
+94
−0
Proposed change
Currently, scheduled workflow checks run `document_matches_workflow` against every document that passes the date filter. This works, but it can be very slow (see the linked thread). Instead, we can pre-filter the documents to reduce the number that get the full check. The only downside is some redundancy, but I think that is also an advantage: the pre-filter is just a first pass, and the later `document_matches_workflow` check still runs on whatever remains. In particular, the filename check, which later runs with fnmatch, is simplified to a regex in the pre-filter. The worst case is that regex is not available in sqlite, in which case the filter would silently fail, but the later check would still run.
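As a rough illustration of the pre-filter idea (not the code in this PR), the sketch below uses plain Python lists instead of a database query. The names `glob_to_simple_regex`, `full_match` and `matching_documents` are hypothetical, and `full_match` only stands in for the filename part of `document_matches_workflow`. The point is that the cheap regex pass may let extra documents through but should never drop one the full check would accept.

```python
import re
from fnmatch import fnmatchcase


def glob_to_simple_regex(pattern):
    """Hypothetical helper: turn a glob into a loose regex, or None if unsure."""
    if "[" in pattern:
        # Character classes are not translated; leave them to the full check.
        return None
    parts = []
    for ch in pattern.lower():
        if ch == "*":
            parts.append(".*")
        elif ch == "?":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return "^" + "".join(parts) + "$"


def full_match(filename, pattern):
    """Stand-in for the authoritative (slower) per-document check."""
    return fnmatchcase(filename.lower(), pattern.lower())


def matching_documents(filenames, pattern):
    """Cheap regex pass first, then the full check on the survivors."""
    regex = glob_to_simple_regex(pattern)
    if regex is None:
        # No usable pre-filter: fall back to checking everything.
        candidates = list(filenames)
    else:
        compiled = re.compile(regex, re.DOTALL)
        candidates = [f for f in filenames if compiled.search(f.lower())]
    return [f for f in candidates if full_match(f, pattern)]


if __name__ == "__main__":
    names = ["Invoice_2024.pdf", "notes.txt", "invoice.PDF", "scan[1].pdf"]
    print(matching_documents(names, "invoice*.pdf"))
```

Because the full check always runs last, a pre-filter that is skipped or too permissive costs time but not correctness.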
Hopefully I didn't miss anything here; feedback is welcome, of course.
See #10012 (8k documents: initially > 30 min, now ~10 seconds)
Type of change
Checklist:
`pre-commit` hooks, see documentation.