Benchmark stdlib compression code #392

Open

emmatyping opened this issue May 22, 2025 · 4 comments

@emmatyping
Member

At PyCon US, I was chatting with @gpshead about adding compression benchmarks. While a lot of the "heavy lifting" of compression happens in the libraries CPython binds (zlib, liblzma, etc.), the handling of output buffers in CPython has a significant impact on performance, and it is an area we don't have much visibility into.

One of the better-known cross-algorithm compression benchmarks I'm aware of is lzbench, which measures compression performance across many algorithms on the Silesia compression corpus. I figure running compression benchmarks at varied settings on Silesia would provide a good starting point for benchmarking the output buffer and other binding code.
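A rough sketch of what such a benchmark could look like, assuming pyperf as the harness (the `SILESIA_DIR` path and the chosen levels/presets are placeholders, not a settled design):

```python
import functools
import lzma
import pathlib
import zlib

import pyperf

# Assumed location of the extracted Silesia corpus files.
SILESIA_DIR = pathlib.Path("silesia")

runner = pyperf.Runner()
for path in sorted(SILESIA_DIR.iterdir()):
    data = path.read_bytes()
    for level in (1, 6):
        runner.bench_func(f"zlib_compress_{path.name}_level{level}",
                          zlib.compress, data, level)
    for preset in (1, 6):
        runner.bench_func(f"lzma_compress_{path.name}_preset{preset}",
                          functools.partial(lzma.compress, data, preset=preset))
```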

@gpshead
Member

gpshead commented May 22, 2025

We shouldn't care about the underlying compression library's own performance, but running them all at their lowest-compression/fastest modes would give us an idea of our own overhead.

Some may have a "0" mode for no compression, while in others "0" means "default" or is an error IIRC. But they all have a concept of "fast", so universally using a level of "1" regardless of algorithm is probably sufficient; we shouldn't overthink it.
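For illustration, a level/preset of 1 is accepted by each of the stdlib compressors; the payload below is just a stand-in:

```python
import bz2
import lzma
import zlib

data = b"example payload " * 4096  # arbitrary stand-in data

# Using "1" everywhere is a uniform "fast" setting, which keeps the C
# library's share of the work small relative to CPython's binding and
# buffer-handling overhead.
zlib.compress(data, 1)
bz2.compress(data, 1)
lzma.compress(data, preset=1)
```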

@hauntsaninja
Contributor

indygreg's zstandard exposes a richer API than pyzstd, so it could be interesting to see what perf you can get out of it compared to compression.zstd (maybe a bit of a grey area between CPython overhead and underlying library performance)
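A rough comparison sketch, assuming both the third-party `zstandard` package and the stdlib `compression.zstd` module (Python 3.14+) are importable; the payload and level are arbitrary:

```python
import zstandard  # indygreg's bindings
from compression import zstd  # stdlib module

data = b"example payload " * 4096  # arbitrary stand-in data

stdlib_frame = zstd.compress(data, level=3)
thirdparty_frame = zstandard.ZstdCompressor(level=3).compress(data)

# The frames need not be byte-identical, but both should round-trip;
# timing the two calls side by side gives a rough sense of where the
# binding overhead differs.
assert zstd.decompress(stdlib_frame) == data
assert zstandard.ZstdDecompressor().decompress(thirdparty_frame) == data
```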

@emmatyping
Member Author

> We shouldn't care about the underlying compression library's own performance

Absolutely agree. I imagine we'd want to keep the underlying compression library versions as consistent as possible so we aren't benchmarking the libraries themselves; then we can compare across CPython changes, which I think is the main benefit. One area I'd like to make sure doesn't regress (and potentially look at tweaking/improving) is the output buffer code: https://github.com/python/cpython/blob/main/Include/internal/pycore_blocks_output_buffer.h, which can have a significant impact on performance.
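The kind of case that stresses that code is, I think, decompression where the output is much larger than the input, since the buffer has to grow through many blocks. A minimal pyperf sketch, with the payload size chosen arbitrarily:

```python
import zlib

import pyperf

# A small, highly compressible input that expands to ~64 MiB of output, so
# the decompressor has to grow its output buffer block by block on the
# CPython side.
compressed = zlib.compress(b"\x00" * (64 * 1024 * 1024), 1)

runner = pyperf.Runner()
runner.bench_func("zlib_decompress_expanding_output",
                  zlib.decompress, compressed)
```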

> running them all at their lowest-compression/fastest modes would give us an idea of our own overhead.

I expect this may not be a representative benchmark, because the amount of output matters here; choosing very low compression levels will probably exaggerate our overhead. It also would not properly exercise the output buffer code, as the buffer sizes could be significantly larger than in real-world scenarios.

What I'd like to see is benchmarking where the library version stays the same and we vary the size of the data, the compression level, and potentially some compression flags. The point is not to compare these configurations against each other, but rather to compare changes in stdlib code across several different usage scenarios.
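A sketch of that parameter matrix for zlib alone, with hypothetical sizes, levels, and a wbits "flag" thrown in; the intent is that the same benchmark names get compared before and after a stdlib change:

```python
import zlib

import pyperf


def compress_with(data, level, wbits):
    # compressobj also lets us vary "flags" such as the window size (wbits).
    c = zlib.compressobj(level, zlib.DEFLATED, wbits)
    return c.compress(data) + c.flush()


payload = bytes(range(256)) * 4096  # ~1 MiB of moderately compressible data

runner = pyperf.Runner()
for size in (16 * 1024, 256 * 1024, len(payload)):
    data = payload[:size]
    for level in (1, 6, 9):
        for wbits in (zlib.MAX_WBITS, -zlib.MAX_WBITS):  # zlib wrapper vs. raw deflate
            runner.bench_func(
                f"zlib_compress_{size}B_level{level}_wbits{wbits}",
                compress_with, data, level, wbits,
            )
```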

> indygreg's zstandard exposes a richer API than pyzstd, so it could be interesting to see what perf you can get out of it compared to compression.zstd (maybe a bit of a grey area between CPython overhead and underlying library performance)

I think that could also be interesting, but it uses unstable libzstd APIs which I do not want the stdlib to use, and builds against the latest version of libzstd. I expect comparisons there might be tricky.

@emmatyping
Member Author

emmatyping commented May 23, 2025

Also, I just re-read my original message and realized it could be read to mean that I want to run lzbench (or tests like it) to check the performance of the underlying compression libraries. That's not what I want! Sorry for any confusion. I was merely calling it out as prior art/inspiration, and noting that the Silesia corpus is probably a good dataset to use when we write our own benchmarks.
