Benchmark stdlib compression code #392
We shouldn't care about the underlying compression library's own performance, but running them all at their least compression/fastest modes would give us an idea of our own overhead. Some may have a "0" mode for no compression - in others "0" means "default" or is an error IIRC - but otherwise they all have a concept of fast, so just universally using a level of "1" regardless of algorithm is probably sufficient; we shouldn't overthink it.
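A minimal sketch of what that could look like, assuming a synthetic payload and plain wall-clock timing (a real benchmark would presumably use pyperf and corpus data instead):

```python
# Rough sketch, not a proposed benchmark: time "fastest" compression across the
# stdlib modules so the C libraries do as little work as possible and our own
# binding/buffer overhead dominates the measurement.
import bz2
import lzma
import time
import zlib

DATA = b"example payload for overhead measurement " * (1 << 16)  # placeholder data

def bench(name, compress):
    start = time.perf_counter()
    for _ in range(20):
        compress(DATA)
    print(f"{name:5s}: {time.perf_counter() - start:.3f}s")

bench("zlib", lambda d: zlib.compress(d, 1))         # level 0 would mean "no compression"
bench("bz2", lambda d: bz2.compress(d, 1))           # 0 is invalid here; levels are 1-9
bench("lzma", lambda d: lzma.compress(d, preset=1))  # lzma calls its level a "preset"
```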
indygreg's zstandard exposes a richer API than pyzstd, so it could be interesting to see what perf you can get out of it compared to the pyzstd-based stdlib bindings.
Absolutely agree. I imagine we'd want to keep the underlying compression library versions as consistent as possible to avoid benchmarking the libraries themselves; then we can compare across stdlib changes, which I think is the main benefit. One area I'd like to make sure doesn't regress (and potentially look at tweaking/improving) is the output buffer code: https://github.com/python/cpython/blob/main/Include/internal/pycore_blocks_output_buffer.h, which can have a significant impact on performance.
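As a rough illustration of where that buffer code gets exercised (just a sketch; the use of zlib and the sizes are arbitrary assumptions): decompressing a small input that expands to a much larger output of unknown final size forces the blocks output buffer to grow repeatedly rather than allocate once.

```python
# Sketch only: a decompression whose output is far larger than its input and
# whose final size isn't known up front, which is the case the blocks output
# buffer handles by growing in chunks.  Sizes here are arbitrary.
import time
import zlib

# ~64 MiB of zeros compresses down to a few tens of KiB.
compressed = zlib.compress(b"\x00" * (64 * 1024 * 1024), 9)

start = time.perf_counter()
out = zlib.decompress(compressed)
print(f"{len(compressed)} -> {len(out)} bytes in {time.perf_counter() - start:.3f}s")
```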
I expect that this may not be a representative benchmark, because the amount of output matters here, so choosing very low compression levels will probably exaggerate our overhead. It also would not properly exercise the output buffer code, as the buffer sizes could be significantly larger than in real-world scenarios. What I'd like to see is benchmarking where the library version stays the same and we vary the size of the data, the compression level, and potentially some compression flags. The point is not to compare among these configurations, but to compare between changes in stdlib code across several different usage scenarios.
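Something along these lines, perhaps (a rough sketch rather than an agreed design; the sizes, levels, synthetic payload, and restriction to zlib are all placeholders): keep the library fixed and parametrize data size and compression level, so the resulting numbers are only ever compared across stdlib changes.

```python
# Hypothetical pyperf-based sweep: same library, varying data size and level.
# The payload generator stands in for real corpus data (e.g. Silesia slices).
import zlib

import pyperf

SIZES = (16 * 1024, 1024 * 1024, 16 * 1024 * 1024)  # placeholder sizes
LEVELS = (1, 6, 9)                                   # placeholder levels

def make_payload(size):
    chunk = b"pyperformance compression benchmark "
    return (chunk * (size // len(chunk) + 1))[:size]

runner = pyperf.Runner()
for size in SIZES:
    payload = make_payload(size)
    for level in LEVELS:
        runner.bench_func(f"zlib_compress_{size}b_level{level}",
                          zlib.compress, payload, level)
```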
I think that could also be interesting, but it uses unstable libzstd APIs which I do not want the stdlib to use, and builds against the latest version of libzstd. I expect comparisons there might be tricky.
Also, I just re-read my original message and realized it could be read to mean that I want to run lzbench or tests like it to check the performance of the underlying compression libraries. That's not what I want! Sorry for any confusion. I was merely calling it out as prior art/inspiration, and noting that the Silesia corpus is probably a good dataset to use when we write our own benchmarks.
At PyCon US, I was chatting with @gpshead about adding compression benchmarks. While a lot of the "heavy lifting" of compression happens in the libraries CPython binds (zlib, liblzma, etc.), the handling of output buffers in CPython has a significant impact on performance, and it's something we currently don't have much visibility into.
One of the better-known cross-algorithm compression benchmarks I'm aware of is lzbench, which tests compression performance across many algorithms on the Silesia compression corpus. I figure running compression benchmarks at varied settings on Silesia would provide a good starting point for benchmarking the output buffer and other binding code.
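A very rough sketch of what that starting point might look like, assuming the corpus has already been downloaded and unpacked into a local silesia/ directory (the path, the level choices, and the use of pyperf and zlib alone are assumptions, not a settled design):

```python
# Illustrative only: run compress and decompress over each Silesia file at a
# few levels, so both input handling and the output-buffer code are exercised
# on realistic data.  CORPUS_DIR is a placeholder path.
import pathlib
import zlib

import pyperf

CORPUS_DIR = pathlib.Path("silesia")

runner = pyperf.Runner()
for path in sorted(CORPUS_DIR.iterdir()):
    data = path.read_bytes()
    for level in (1, 6, 9):
        runner.bench_func(f"zlib_compress_{path.name}_level{level}",
                          zlib.compress, data, level)
    compressed = zlib.compress(data, 6)
    runner.bench_func(f"zlib_decompress_{path.name}",
                      zlib.decompress, compressed)
```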