Bringing OxiPNG to Squoosh

It took only 7.5 months 😅 (j/k, there were reasons) since opening the PR, but Squoosh.app now utilises OxiPNG instead of OptiPNG for PNG compression!

First attempt

OxiPNG is a Rust alternative to OptiPNG - a popular PNG compressor.

The main benefit of OxiPNG over OptiPNG is that it utilises multi-threading on platforms that support it, and we wanted to provide a path to leverage that on WebAssembly.

Unfortunately, during the initial attempt, we found out that OxiPNG always resulted in worse compression 😞

Squoosh logo: 33.9k vs 34.3k (1.2% bigger)

https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/Su25-kompo-vers2.svg/2880px-Su25-kompo-vers2.svg.png 472k vs 573k (21% bigger)

https://upload.wikimedia.org/wikipedia/commons/thumb/2/2f/Google_2015_logo.svg/1000px-Google_2015_logo.svg.png No difference.

The above image, but reduced to 53 colours: 10.3k vs 11k (6.8% bigger).

https://upload.wikimedia.org/wikipedia/commons/thumb/3/3b/Android_new_logo_2019.svg/1000px-Android_new_logo_2019.svg.png also reduced to 53 colours. 7.18k vs 7.66k (6.7% bigger).

https://upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/1024px-Flag_of_the_United_States.svg.png also reduced to 53 colours. 12.2k vs 12.7k (4% bigger).

https://upload.wikimedia.org/wikipedia/en/thumb/a/ae/Flag_of_the_United_Kingdom.svg/1024px-Flag_of_the_United_Kingdom.svg.png reduced to 53 colours. 3.48k vs 4.1k (17.8% bigger).

On its own, this could be written off as just an unfortunate difference between libraries, and, as such, a dead end...

What was surprising, though, is that these regressions didn't reproduce with OxiPNG outside of our integration 🤨

I wonder if we're holding it wrong. Using ImageOptim, in "extreme" mode:

Squoosh logo:
Our optipng: 33.9k
Our oxipng: 34.3k
ImageOptim oxi: 33.9k

https://upload.wikimedia.org/wikipedia/commons/thumb/1/1a/Su25-kompo-vers2.svg/2880px-Su25-kompo-vers2.svg.png
Our opti: 472k
Our oxi: 573k
ImageOptim oxi: 574k

https://upload.wikimedia.org/wikipedia/commons/thumb/2/2f/Google_2015_logo.svg/1000px-Google_2015_logo.svg.png reduced to 53 colours
Our opti: 10.3k
Our oxi: 11k
ImageOptim oxi: 10.3k

https://upload.wikimedia.org/wikipedia/en/thumb/a/a4/Flag_of_the_United_States.svg/1024px-Flag_of_the_United_States.svg.png also reduced to 53 colours
Our opti: 12.2k
Our oxi: 12.7k
ImageOptim oxi: 12.2k.

https://upload.wikimedia.org/wikipedia/en/thumb/a/ae/Flag_of_the_United_Kingdom.svg/1024px-Flag_of_the_United_Kingdom.svg.png reduced to 53 colours.
Our opti: 3.48k
Our oxi: 4.1k
ImageOptim oxi: 3.5k

ImageOptim's version/integration of oxi is better than ours. Although there's the odd output size regression, the only major one is on that large airplane image.

There is no way the same code compiled to Wasm could result in worse compression.

Indeed, upon further inspection, it turned out that OxiPNG utilises two different DEFLATE libraries to compress PNG image data, depending on target platform support.

One is a Rust wrapper for Cloudflare's fork of the common zlib library; the fork provides fantastic compression speed improvements, but works only on x64 and ARM64 targets.

The other is miniz_oxide - a Rust port of the miniz library - which OxiPNG uses on all other platforms, including our WebAssembly target. miniz is designed to be a fast & tiny drop-in replacement for zlib, but doesn't compress as well, which explains the results above.
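
Conceptually, the backend selection works roughly like the sketch below (an illustration of the pattern, not OxiPNG's actual code):

```rust
// Illustrative sketch of per-target backend selection, not OxiPNG's real code.
// On x64/ARM64 the fast backend would wrap cloudflare-zlib; everywhere else
// (including wasm32) it would fall back to miniz_oxide.
#[cfg(any(target_arch = "x86_64", target_arch = "aarch64"))]
mod deflate_backend {
    pub fn compress(data: &[u8]) -> Vec<u8> {
        // ... call into the cloudflare-zlib wrapper here ...
        data.to_vec() // placeholder so the sketch compiles
    }
}

#[cfg(not(any(target_arch = "x86_64", target_arch = "aarch64")))]
mod deflate_backend {
    pub fn compress(data: &[u8]) -> Vec<u8> {
        // ... call into miniz_oxide here ...
        data.to_vec() // placeholder so the sketch compiles
    }
}

fn main() {
    println!("{} bytes", deflate_backend::compress(b"raw IDAT data").len());
}
```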

Once we found this out, there were a few possible paths to explore.

cloudflare-zlib (C fork)

First, cloudflare-zlib could add support for Wasm. Unfortunately, SIMD in WebAssembly is not yet stable, and while it can be experimented with, it wouldn't solve our immediate problem for most users yet.

cloudflare-zlib could also add fallbacks to the regular implementations for unsupported targets, but that doesn't align with the goals of the fork and would likely result in maintenance complications, so it wasn't brought up.

cloudflare-zlib (Rust wrapper)

Second, the Rust wrapper could fall back to regular zlib on unsupported targets. We talked a bit about this with the author of the wrapper in DMs, but it could be tricky to do in the general case, because the wrapper wants to detect SIMD support at runtime, and we can't link to both libraries statically due to symbol conflicts.

It might still work if we decide to use regular zlib at least on platforms that are definitely not supported by cloudflare-zlib, but meanwhile I decided to look into other approaches.

flate2-rs

The third option was to make OxiPNG use another wrapper that already abstracts over these libraries at compile time.

Luckily, the popular flate2-rs library already supported miniz_oxide as well as zlib, and someone had recently contributed cloudflare-zlib support too, which made it a perfect match!.. Well, almost.

One problem is that OxiPNG, like OptiPNG, works by iterating over various combinations of PNG filters as well as low-level zlib options and essentially brute-forcing its way to the one that works best for the given image.

flate2 exposes these options in its low-level zlib bindings, but not yet in the high-level API. It wouldn't be hard to propagate them, but it still requires some extra work.
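
To make the idea concrete, here is a rough sketch of such a brute-force loop (my illustration, not OxiPNG's actual code) using flate2's high-level API - which is also why only the compression level varies here, and not strategy or window size:

```rust
use flate2::{write::ZlibEncoder, Compression};
use std::io::Write;

// Try several candidate encodings of the (already filtered) scanlines and
// keep the smallest output. OxiPNG additionally varies PNG filters and
// low-level zlib knobs (strategy, window bits, memory level), which the
// high-level flate2 API doesn't expose yet.
fn smallest_idat(filtered_candidates: &[Vec<u8>]) -> std::io::Result<Vec<u8>> {
    let mut best: Option<Vec<u8>> = None;
    for data in filtered_candidates {
        for level in [6, 8, 9] {
            let mut encoder = ZlibEncoder::new(Vec::new(), Compression::new(level));
            encoder.write_all(data)?;
            let out = encoder.finish()?;
            if best.as_ref().map_or(true, |b| out.len() < b.len()) {
                best = Some(out);
            }
        }
    }
    Ok(best.unwrap_or_default())
}
```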

zlib (raw Rust bindings)

Another problem is that, by default, zlib depends on some libc functions to take care of allocation, deallocation and copying memory around.

Unfortunately, Clang ships with a completely bare-bones wasm32-unknown-unknown target, which doesn't provide any functions or headers, even platform-independent ones. It expects you to bring your own sysroot.

Some projects that provide their own sysroot are Emscripten and WASI SDK, but it feels like overkill to bring in an entire new toolchain to build a library that doesn't even need any platform-specific APIs.

Moreover, Rust provides its own allocator for the wasm32-unknown-unknown target and we would want to reuse it rather than bring an extra libc one (I wonder if two allocators even work on the same Wasm memory).

As it turned out, zlib already took care of this back in 2011 - not for WebAssembly, I presume 😀 - by introducing a separate "solo" compilation mode that avoids any library dependencies, at the cost of losing some high-level utilities as well as the helpers that operate on files, and requires embedders to always supply custom alloc / free functions via options.

Luckily, flate2 already checks all these boxes - it doesn't use any of those helpers anyway and already passes custom allocation functions to reuse the Rust allocator - so it can be made to work with this solo mode quite easily.
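
For illustration, allocation hooks in the shape zlib expects (alloc_func / free_func) could look roughly like the toy sketch below, backed by the Rust allocator. This is my simplified sketch, not flate2's actual implementation:

```rust
use std::alloc::{alloc, dealloc, Layout};
use std::os::raw::{c_uint, c_void};

// Each block is prefixed with a small header remembering its size, so it can
// be freed correctly later. Real implementations differ in the details.
const HEADER: usize = 16; // also keeps the returned pointer 16-byte aligned

unsafe extern "C" fn zalloc(_opaque: *mut c_void, items: c_uint, size: c_uint) -> *mut c_void {
    let bytes = items as usize * size as usize;
    let layout = Layout::from_size_align(HEADER + bytes, 16).unwrap();
    let ptr = alloc(layout);
    if ptr.is_null() {
        return std::ptr::null_mut();
    }
    (ptr as *mut usize).write(bytes); // stash the size for zfree
    ptr.add(HEADER) as *mut c_void
}

unsafe extern "C" fn zfree(_opaque: *mut c_void, address: *mut c_void) {
    let ptr = (address as *mut u8).sub(HEADER);
    let bytes = (ptr as *const usize).read();
    dealloc(ptr, Layout::from_size_align(HEADER + bytes, 16).unwrap());
}
```

These would then be passed via the zalloc / zfree fields of z_stream before initialising the (de)compressor.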

As the first step, I've made an upstream PR to the official "raw" Rust zlib bindings to add such support.

I've also updated flate2 locally to use these updated bindings, and verified that it finally compiles and works great on the wasm32-unknown-unknown target, and even provides some nice size savings on other platforms!

Unfortunately, the raw bindings don't seem to be actively maintained these days, so the PR is still waiting for a review, and I've decided to explore yet another approach meanwhile.

libdeflate

I was looking through alternative wrappers for zlib as well as pure-Rust libraries in case I missed an even better solution.

There were mostly decoder implementations, some basic encoders, the already known bindings to zlib / cloudflare-zlib / miniz, the mentioned miniz port, bindings to the best-in-class-but-unbearably-slow zopfli encoder... and then there was one crate that caught my eye: libdeflater.

From the description:

Rust bindings to libdeflate. A high-performance library for working with gzip/zlib/deflate data.

Warning: libdeflate is for specialized use-cases. You should use something like flate2 if you want a general-purpose deflate library.

libdeflate is optimal in applications that have all the input data up front and have a mechanism for chunking large input datasets (e.g. genomic bam files, some object stores, specialized backends, game netcode packets). It has a much simpler API than zlib but can't stream data.

Usually I'm cautious about "high-performance" claims in READMEs, unless they're backed by reproducible benchmarks as well as an explanation of which corners have been cut to achieve such performance, but this description checked all the boxes and made the library hard to pass by.

It seemed particularly intriguing that libdeflate chose to focus on optimising compression (both speed- and ratio-wise) just for fixed-size data, rather than attempting to replicate the full streaming zlib API.

This matches our use-case (fixed-size image data) perfectly, so I decided to go ahead, integrate it with OxiPNG, and compare it against the cloudflare-zlib implementation on native x64. Benchmarking against the corpus of test files in the OxiPNG repo gave mixed, but promising results.

Check out the spreadsheet for raw numbers, or just the highlights below for the improvement/regression distribution diagrams:

Time difference: [chart]

IDAT size difference: [chart]

Total size difference: [chart]

If we ignore some outliers (mainly tiny files that contribute too much when converted to relative differences), it's clear that in most cases OxiPNG + libdeflate provides a better compression ratio, and does it even faster than OxiPNG + SIMD-optimised cloudflare-zlib. Seems promising!

If you look at the code in the PR, you might say that it's not a fair comparison, because libdeflate doesn't provide as many fine-tuning knobs as zlib, so OxiPNG ends up running far fewer trial iterations with libdeflate than it otherwise would.

You would be correct in saying so, and in other cases (cloudflare-)zlib could still be the right choice, but for our use-case it's the end result that matters, not the speed of a single iteration. We don't really care about the lost fine-tuning knobs either - instead, we just use the maximum compression level (12 in libdeflate), and as long as that provides a better ratio/speed balance, I'd say we take it.
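
For reference, this is roughly what a single libdeflate-side compression call looks like - a sketch based on libdeflater's documented API, not the exact OxiPNG integration code:

```rust
use libdeflater::{CompressionLvl, Compressor};

// Compress a fully buffered IDAT payload at libdeflate's maximum level (12).
// Error handling and integration details are simplified for the sketch.
fn zlib_compress_max(data: &[u8]) -> Vec<u8> {
    let mut compressor = Compressor::new(CompressionLvl::new(12).unwrap());
    let mut out = vec![0u8; compressor.zlib_compress_bound(data.len())];
    let written = compressor.zlib_compress(data, &mut out).unwrap();
    out.truncate(written);
    out
}
```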

This integration is now merged upstream in OxiPNG, so it was time to test it in the Squoosh PR again.

One limitation of the WebAssembly target is that we can't yet benefit from the SIMD optimisations, so we lose some of the native libdeflate speed improvements. On the other hand, we couldn't benefit from SIMD in cloudflare-zlib either, so I'd say it's fair game in the end.

Moreover, unlike the other solutions mentioned, libdeflate already provides fallbacks for unsupported platforms, so that's one less thing to worry about, which made the integration quite straightforward.

Let's take one more look at the updated numbers for the files mentioned at the beginning of the post, but now comparing OptiPNG against the OxiPNG + libdeflate integration in Squoosh:

| File | optipng | oxipng | optipng time (ms) | oxipng time (ms) |
| --- | --- | --- | --- | --- |
| Device screen demo | 1.49MB | 1.42MB | 76909 | 6991 |
| Google logo (53 colours) | 10.3KB | 9.37KB | 1975 | 832 |
| Android logo (53 colours) | 7.22KB | 6.75KB | 1191 | 412 |
| US flag (53 colours) | 12.3KB | 10.9KB | 2841 | 963 |
| UK flag (53 colours) | 3.48KB | 3.72KB | 2438 | 1028 |

Just like in the general test suite, there are some regressions, but usually we get a better ratio, and always a much higher throughput.

What now?

Now that all the PRs are merged upstream, I have more potential optimisations to play with.

One, already mentioned at the beginning of the post, is utilising WebAssembly threads on supported platforms. Even though we are already much faster (up to 11x for the files above) than where we started, we could do even better by leveraging OxiPNG's multithreading support, now that the basic integration is complete and this path is unlocked.

Another potentially interesting idea is to add a mode that would use both libdeflate and zlib in parallel in OxiPNG.

This would eliminate any regressions by always choosing the best encoder possible, but at the cost of eliminating any speed wins as well (since we would now have to try even more encoder + options combinations than we started with).

Also, it would still require getting at least one of the high-level wrappers to compile with both cloudflare-zlib and regular zlib, depending on the target platform, to make it work on WebAssembly. All in all, it's not yet clear whether this path is worth pursuing, or whether it's better to find and report specific poorly compressible patterns to libdeflate and hope that they can be improved upstream.

Finally, we could play with upcoming WebAssembly + SIMD support in both libdeflate and cloudflare-zlib to get even faster single-thread compression. There are some potential big upcoming changes to the bytecode, so we wouldn't want to ship code using SIMD to production yet, but nothing stops us from already playing with it in forks / branches by using corresponding intrinsics.

But these are all separate stories for later 🙂 Stay tuned and stay safe!

