
Implement parallel ItemBlock processing via backup_controller goroutines #8659


Merged: 1 commit into vmware-tanzu:main on Feb 14, 2025

Conversation

@sseago (Collaborator) commented Jan 29, 2025:

Thank you for contributing to Velero!

Please add a summary of your change

Parallel ItemBlock processing via backup_controller goroutines as described in phase 2 of https://github.com/vmware-tanzu/velero/blob/main/design/backup-performance-improvements.md

Does your change fix a particular issue?

Fixes #8334
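As context for reviewers, the shape of the change can be sketched as a bounded pool of goroutines consuming ItemBlocks from a channel. The following is a hypothetical, self-contained illustration: the type and function names are stand-ins, and the real implementation (wired up by backup_controller) lives in pkg/backup and pkg/controller.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// ItemBlock stands in for a group of related items backed up together.
type ItemBlock struct {
	ID int
}

// ItemBlockInput pairs a block with a channel for reporting its result.
type ItemBlockInput struct {
	block      ItemBlock
	returnChan chan error
}

// ItemBlockWorkerPool runs a fixed number of goroutines that each pull
// ItemBlocks off a shared channel until the pool is stopped.
type ItemBlockWorkerPool struct {
	inputChan  chan ItemBlockInput
	cancelFunc context.CancelFunc
	wg         *sync.WaitGroup
}

// StartItemBlockWorkerPool launches `workers` goroutines processing blocks
// with the supplied function. Illustrative signature, not the PR's API.
func StartItemBlockWorkerPool(workers int, process func(ItemBlock) error) *ItemBlockWorkerPool {
	ctx, cancel := context.WithCancel(context.Background())
	pool := &ItemBlockWorkerPool{
		inputChan:  make(chan ItemBlockInput),
		cancelFunc: cancel,
		wg:         &sync.WaitGroup{},
	}
	for i := 0; i < workers; i++ {
		pool.wg.Add(1)
		go func() {
			defer pool.wg.Done()
			for {
				select {
				case <-ctx.Done():
					return // pool stopped
				case in := <-pool.inputChan:
					in.returnChan <- process(in.block)
				}
			}
		}()
	}
	return pool
}

// Stop cancels the workers and waits for them to exit.
func (p *ItemBlockWorkerPool) Stop() {
	p.cancelFunc()
	p.wg.Wait()
}

func main() {
	results := make(chan error, 4)
	pool := StartItemBlockWorkerPool(2, func(b ItemBlock) error {
		fmt.Println("processed block", b.ID)
		return nil
	})
	for i := 0; i < 4; i++ {
		pool.inputChan <- ItemBlockInput{block: ItemBlock{ID: i}, returnChan: results}
	}
	for i := 0; i < 4; i++ {
		<-results
	}
	pool.Stop()
}
```

With 2 workers, two blocks can be in flight at once; the backup goroutine blocks on the return channels to collect results, which is the basic mechanism that lets a larger worker count shrink total backup time.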


@sseago marked this pull request as draft January 29, 2025 21:22
@sseago (Collaborator, Author) commented Jan 29, 2025:

Currently in draft, as I have not yet tested the changes in a cluster environment.

codecov bot commented Jan 29, 2025:

Codecov Report

Attention: Patch coverage is 91.78082% with 12 lines in your changes missing coverage. Please review.

Project coverage is 59.47%. Comparing base (79707aa) to head (fcfb2fd).
Report is 6 commits behind head on main.

Files with missing lines              Patch %   Lines
pkg/backup/backup.go                  87.01%    9 Missing, 1 partial ⚠️
pkg/backup/item_collector.go          66.66%    1 Missing ⚠️
pkg/controller/backup_controller.go   50.00%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8659      +/-   ##
==========================================
+ Coverage   59.39%   59.47%   +0.08%     
==========================================
  Files         370      371       +1     
  Lines       39988    40107     +119     
==========================================
+ Hits        23749    23853     +104     
- Misses      14746    14760      +14     
- Partials     1493     1494       +1     


@shawn-hurley (Contributor) left a comment:

I think I have just one concern; otherwise nothing stands out!

@kaovilai (Collaborator) left a comment:

Some typos; other than that, no objections at the moment.

Note: I am not as familiar with select-case channel statements.

@sseago force-pushed the parallel-itemblocks branch from 5cee460 to d08bb75 on February 5, 2025 00:36
@sseago (Collaborator, Author) commented Feb 5, 2025:

Updated in response to PR comments. Will test in cluster tomorrow.

@sseago force-pushed the parallel-itemblocks branch 2 times, most recently from dad8770 to 097941e on February 6, 2025 00:29
@sseago marked this pull request as ready for review February 6, 2025 00:30
@sseago (Collaborator, Author) commented Feb 6, 2025:

I've tested this in an AWS cluster: two namespaces, 4 pods, 2 volumes, and additional related resources. Using vmware-tanzu/velero-plugin-example#75 to force each item backup to take 5 seconds allowed me to verify parallel processing. Starting with 1 configured worker and scaling up to 16, each increase dropped the total backup time, and I was able to verify from the logs that the 5-second BIA actions were running in parallel in the expected numbers.

A summary of backup times/results is below. Note that the differing item counts on subsequent backups were mainly due to the inclusion of events; if those had been excluded, the time drop from one backup to the next would be more significant. Also note that with a regular real-world backup of this size, the time differences would be much smaller, since I've introduced an artificial per-item delay here.

The real-world use case where parallel backup will be most pronounced is a backup with a large number of small PVCs using CSI snapshots or the data mover.

mysql-persistent in two namespaces. 5 second BIA delay per item

1 worker: mysql-1worker-01
Started:    2025-02-05 15:08:57 +0000 UTC
Completed:  2025-02-05 15:15:58 +0000 UTC
Elapsed: 07:01
Items backed up:              77

2 workers: mmysql-2worker-01
Started:    2025-02-05 16:04:21 +0000 UTC
Completed:  2025-02-05 16:08:32 +0000 UTC
Elapsed:  04:01
Items backed up:              83

4 workers: mmysql-4worker-01
Started:    2025-02-05 16:33:26 +0000 UTC
Completed:  2025-02-05 16:36:01 +0000 UTC
Elapsed: 02:35
Items backed up:              89

8 workers: mmysql-8worker-01
Started:    2025-02-05 16:48:15 +0000 UTC
Completed:  2025-02-05 16:50:01 +0000 UTC
Elapsed: 01:46
Total items to be backed up:  96

16 workers: mmysql-16worker-01
Started:    2025-02-05 16:58:00 +0000 UTC
Completed:  2025-02-05 16:59:37 +0000 UTC
Elapsed:  01:37
Items backed up:              102
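As a rough sanity check on these numbers: with an artificial 5-second delay per item and w workers, the ideal elapsed time is about ceil(items / w) × 5s, ignoring items without the delay and fixed per-backup overhead. A quick sketch, assuming that simplified model (the comparisons in the comments use the observed times from the table above):

```go
package main

import "fmt"

// idealSeconds computes the best-case elapsed time under the simplified
// model: every item takes delaySec seconds, spread evenly over workers.
func idealSeconds(items, workers, delaySec int) int {
	blocks := (items + workers - 1) / workers // ceiling division
	return blocks * delaySec
}

func main() {
	fmt.Println(idealSeconds(77, 1, 5))   // 385s ≈ 6m25s vs observed 7:01
	fmt.Println(idealSeconds(83, 2, 5))   // 210s = 3m30s vs observed 4:01
	fmt.Println(idealSeconds(102, 16, 5)) // 35s vs observed 1:37
}
```

The observed times sit somewhat above the ideal in every case, and the gap widens at higher worker counts, consistent with fixed per-backup overhead (item collection, status updates, upload) dominating once the delayed work is spread thin.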

@sseago force-pushed the parallel-itemblocks branch 3 times, most recently from e1ab7c0 to fd2ee86 on February 7, 2025 05:54
@sseago (Collaborator, Author) commented Feb 7, 2025:

Rebased after #8664 merged.

func (p *ItemBlockWorkerPool) Stop() {
	p.cancelFunc()
	p.logger.Info("ItemBlock worker stopping")
	p.wg.Wait()
}

A Contributor commented on this hunk:

Add another log to indicate the worker pool has stopped, including how long the wait took.

@sseago force-pushed the parallel-itemblocks branch 3 times, most recently from 57ea842 to 849b896 on February 11, 2025 23:23
@Lyndon-Li (Contributor) commented Feb 12, 2025:

@sseago
For the CI failure, only this PR has the problem.
It looks like the file pkg/cmd/cli/backup/bk-to-be-download-data.tar.gz was added in this PR by mistake, which may be the cause of the problem.

@sseago (Collaborator, Author) commented Feb 12, 2025:

> @sseago For the CI failure, only this PR has the problem. It looks like the file pkg/cmd/cli/backup/bk-to-be-download-data.tar.gz was added in this PR by mistake, which may be the cause of the problem.

Oh. Yes, that looks like it must have been inadvertently added. That file is sometimes left behind if you cancel in the middle of running unit tests. I'll get rid of that and re-push.

@sseago force-pushed the parallel-itemblocks branch from 849b896 to acb6df0 on February 12, 2025 16:58
@sseago force-pushed the parallel-itemblocks branch from acb6df0 to fcfb2fd on February 12, 2025 17:03
@kaovilai (Collaborator) commented:

> That file is sometimes left behind if you cancel in the middle of running unit tests. I'll get rid of that and re-push.

Perhaps a .gitignore candidate.
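If adopted, the ignore rule might look like the following sketch, assuming the artifact path from the CI discussion above (not part of the merged change):

```gitignore
# test artifact left behind when unit tests are canceled mid-run
pkg/cmd/cli/backup/bk-to-be-download-data.tar.gz
```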

@Lyndon-Li (Contributor) commented:

LGTM. Please wait for @ywk253100 to have another look, since issue #8516 depends on this PR.

@ywk253100 ywk253100 merged commit e3a6406 into vmware-tanzu:main Feb 14, 2025
42 checks passed
Successfully merging this pull request may close these issues.

Phase 2 implementation for backup performance improvements
6 participants