Benchmark stdlib compression code #392

Open

emmatyping opened this issue May 22, 2025 · 4 comments

@emmatyping
Member

At PyCon US, I was chatting with @gpshead about adding compression benchmarks. While a lot of the "heavy lifting" of compression happens in the libraries CPython binds (zlib, liblzma, etc.), the handling of output buffers in CPython has a significant impact on performance, and it is an area we don't have much visibility into.

One of the better-known cross-algorithm compression benchmarks I'm aware of is lzbench, which measures compression performance across many algorithms on the Silesia compression corpus. I figure running compression benchmarks at varied settings on Silesia would provide a good starting point for benchmarking the output buffer and other binding code.
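A rough sketch of what such a benchmark could look like, assuming pyperf as the harness (the `SILESIA_DIR` path and the chosen levels/presets are placeholders, not a settled design):

```python
import functools
import lzma
import pathlib
import zlib

import pyperf

# Assumed location of the extracted Silesia corpus files.
SILESIA_DIR = pathlib.Path("silesia")

runner = pyperf.Runner()
for path in sorted(SILESIA_DIR.iterdir()):
    data = path.read_bytes()
    for level in (1, 6):
        runner.bench_func(f"zlib_compress_{path.name}_level{level}",
                          zlib.compress, data, level)
    for preset in (1, 6):
        runner.bench_func(f"lzma_compress_{path.name}_preset{preset}",
                          functools.partial(lzma.compress, data, preset=preset))
```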

@gpshead
Member

gpshead commented May 22, 2025

We shouldn't care about the underlying compression library's own performance, but running them all at their lowest-compression/fastest modes would give us an idea of our own overhead.

Some may have a "0" mode for no compression, while in others "0" means "default" or is an error IIRC. But they all have a concept of "fast", so universally using a level of "1" regardless of algorithm is probably sufficient; we shouldn't overthink it.
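For illustration, a level/preset of 1 is accepted by each of the stdlib compressors; the payload below is just a stand-in:

```python
import bz2
import lzma
import zlib

data = b"example payload " * 4096  # arbitrary stand-in data

# Using "1" everywhere is a uniform "fast" setting, which keeps the C
# library's share of the work small relative to CPython's binding and
# buffer-handling overhead.
zlib.compress(data, 1)
bz2.compress(data, 1)
lzma.compress(data, preset=1)
```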

@hauntsaninja
Contributor

indygreg's zstandard exposes a richer API than pyzstd, so it could be interesting to see what perf you can get out of it compared to compression.zstd (maybe a bit of a grey area between CPython overhead and underlying library performance)
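A rough comparison sketch, assuming both the third-party `zstandard` package and the stdlib `compression.zstd` module (Python 3.14+) are importable; the payload and level are arbitrary:

```python
import zstandard  # indygreg's bindings
from compression import zstd  # stdlib module

data = b"example payload " * 4096  # arbitrary stand-in data

stdlib_frame = zstd.compress(data, level=3)
thirdparty_frame = zstandard.ZstdCompressor(level=3).compress(data)

# The frames need not be byte-identical, but both should round-trip;
# timing the two calls side by side gives a rough sense of where the
# binding overhead differs.
assert zstd.decompress(stdlib_frame) == data
assert zstandard.ZstdDecompressor().decompress(thirdparty_frame) == data
```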

@emmatyping
Member Author

> We shouldn't care about the underlying compression library's own performance

Absolutely agree. I imagine we'd want to keep the underlying compression library versions as consistent as possible so we aren't benchmarking the libraries themselves; then we can compare across CPython changes, which I think is the main benefit. One area I'd like to make sure doesn't regress (and potentially look at tweaking/improving) is the output buffer code: https://github.com/python/cpython/blob/main/Include/internal/pycore_blocks_output_buffer.h, which can have a significant impact on performance.
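The kind of case that stresses that code is, I think, decompression where the output is much larger than the input, since the buffer has to grow through many blocks. A minimal pyperf sketch, with the payload size chosen arbitrarily:

```python
import zlib

import pyperf

# A small, highly compressible input that expands to ~64 MiB of output, so
# the decompressor has to grow its output buffer block by block on the
# CPython side.
compressed = zlib.compress(b"\x00" * (64 * 1024 * 1024), 1)

runner = pyperf.Runner()
runner.bench_func("zlib_decompress_expanding_output",
                  zlib.decompress, compressed)
```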

> running them all at their lowest-compression/fastest modes would give us an idea of our own overhead.

I expect this may not be a representative benchmark, because the amount of output matters here; choosing very low compression levels will probably exaggerate our overhead. It also would not properly exercise the output buffer code, as the buffer sizes could be significantly larger than in real-world scenarios.

What I'd like to see is benchmarking where the library version stays the same and we vary the size of the data, the compression level, and potentially some compression flags. The point is not to compare these configurations against each other, but rather to compare changes in stdlib code across several different usage scenarios.
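A sketch of that parameter matrix for zlib alone, with hypothetical sizes, levels, and a wbits "flag" thrown in; the intent is that the same benchmark names get compared before and after a stdlib change:

```python
import zlib

import pyperf


def compress_with(data, level, wbits):
    # compressobj also lets us vary "flags" such as the window size (wbits).
    c = zlib.compressobj(level, zlib.DEFLATED, wbits)
    return c.compress(data) + c.flush()


payload = bytes(range(256)) * 4096  # ~1 MiB of moderately compressible data

runner = pyperf.Runner()
for size in (16 * 1024, 256 * 1024, len(payload)):
    data = payload[:size]
    for level in (1, 6, 9):
        for wbits in (zlib.MAX_WBITS, -zlib.MAX_WBITS):  # zlib wrapper vs. raw deflate
            runner.bench_func(
                f"zlib_compress_{size}B_level{level}_wbits{wbits}",
                compress_with, data, level, wbits,
            )
```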

> indygreg's zstandard exposes a richer API than pyzstd, so it could be interesting to see what perf you can get out of it compared to compression.zstd (maybe a bit of a grey area between CPython overhead and underlying library performance)

I think that could also be interesting, but it uses unstable libzstd APIs which I do not want the stdlib to use, and builds against the latest version of libzstd. I expect comparisons there might be tricky.

@emmatyping
Member Author

emmatyping commented May 23, 2025

Also, I just re-read my original message and realized it could be read to mean that I want to run lzbench (or tests like it) to check the performance of the underlying compression libraries. That's not what I want! Sorry for any confusion. I was merely calling it out as prior art/inspiration, and noting that the Silesia corpus is probably a good dataset to use when we write our own benchmarks.
