Open-Source Compression

Table of contents:

A Legacy Rooted in Unix
The Modern Toolkit – XZ, LZ4, and Zstandard
When Compression Meets Security
Choosing the Right Tool
Conclusion

There is a quiet kind of magic happening every time you download a kernel tarball, install a package from your distro's repository, or pull a container image. Bytes are shuffled, redundancies stripped out, and data collapses to a fraction of its original size — all before it even touches your storage. That magic has a name, and it belongs firmly to the open-source world. Compression is not glamorous, but without it, the modern software ecosystem as we know it would be considerably slower, costlier, and more cumbersome. This article takes a friendly but thorough look at the open-source compression tools that power BSD, Linux, Unix, and independent distributions worldwide — who built them, how they work, and what every user and administrator ought to know about them.


A Legacy Rooted in Unix

The story of open-source compression begins with a problem: patents. The early Unix compress utility, widely used across Unix-like systems, relied on the Lempel–Ziv–Welch (LZW) algorithm, which was encumbered by patents held by Unisys and IBM. The Unisys patent did not expire until 2003 in the United States and 2004 in several other countries. That constraint spurred the free software community to build its own alternatives, and the first significant fruit of that effort arrived on 31 October 1992.

That was the day Jean-Loup Gailly and Mark Adler publicly released version 0.1 of gzip — GNU Zip — as part of the GNU Project. Gailly designed the file format, later formalised as RFC 1952, whilst Adler wrote the decompression routines. At its core, gzip uses the DEFLATE algorithm, a combination of LZ77 dictionary-based matching and Huffman coding that offered a clean, patent-free alternative to LZW. It became an immediate fixture in the Unix and Linux communities and remains so today. The latest stable release, version 1.14, arrived in April 2025, and the project is maintained under the GNU General Public Licence v3 or later. For practical purposes, gzip supports nine compression levels: the default (level 6) is a sensible balance between speed and file size reduction, whilst level 9 maximises compression at the cost of more CPU time.
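The effect of the compression level is easy to observe with Python's standard gzip module, which writes the same RFC 1952 format. The following is a minimal sketch rather than a benchmark: the sample text and the helper name gzip_sizes are illustrative assumptions, and real-world results depend heavily on the input.

```python
import gzip


def gzip_sizes(data: bytes, levels=(1, 6, 9)) -> dict:
    """Return the compressed size in bytes for each requested gzip level."""
    return {level: len(gzip.compress(data, compresslevel=level)) for level in levels}


# Highly repetitive sample text, chosen only for illustration.
sample = b"The quick brown fox jumps over the lazy dog. " * 2000

for level, size in gzip_sizes(sample).items():
    print(f"level {level}: {len(sample)} -> {size} bytes")

# Decompression does not need to know which level produced the stream.
assert gzip.decompress(gzip.compress(sample, compresslevel=9)) == sample
```

Whatever level is chosen, the output is an ordinary gzip stream, so the level only affects the compressing side.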

gzip is fine for individual files, but it was never designed to be an archiver. That is where tar enters the picture — not as a compressor itself, but as an archiving layer that bundles multiple files or directories before a compressor is applied. The classic .tar.gz (or .tgz) combination of tar and gzip became the lingua franca of source code distribution in the open-source world, a convention that endures to this day.

In July 1996, Julian Seward released bzip2, a free and open-source alternative that took a different algorithmic path. It uses the Burrows–Wheeler transform (BWT), a reversible block-sorting technique introduced in a 1994 technical report by Michael Burrows and David J. Wheeler at Digital Equipment Corporation, followed by a move-to-front transform and Huffman coding. As a result, bzip2 typically achieves better compression ratios than gzip, particularly on structured text and source code, at the cost of slower compression and higher memory use. Files are processed in independent blocks whose size can be set from 100 kB to 900 kB. Version 1.0 arrived in late 2000, and the most recent stable release, 1.0.8, came in July 2019. Seward's creation spread rapidly across Unix-like systems and became the compression format of choice for many source tarballs and distro packages through the 2000s. The bzip2 licence is a modified zlib licence, making it permissively distributable.
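To make the block-sorting idea concrete, here is a toy sketch of the Burrows–Wheeler transform and the move-to-front step. It is a textbook illustration, not bzip2's implementation: the real tool uses far more efficient sorting and additional run-length stages, and the sentinel character and function names below are assumptions made for the example.

```python
def bwt(text: str, end: str = "\0") -> str:
    """Naive Burrows-Wheeler transform: sort all rotations, keep the last column."""
    assert end not in text, "sentinel must not occur in the input"
    s = text + end
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rotation[-1] for rotation in rotations)


def inverse_bwt(last: str, end: str = "\0") -> str:
    """Invert the transform by repeatedly prepending the last column and sorting."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    # The row that ends with the sentinel is the original text plus sentinel.
    original = next(row for row in table if row.endswith(end))
    return original[:-1]


def move_to_front(text: str) -> list:
    """Move-to-front coding: a run of equal symbols becomes a run of zeros."""
    alphabet = sorted(set(text))
    indices = []
    for ch in text:
        idx = alphabet.index(ch)
        indices.append(idx)
        alphabet.insert(0, alphabet.pop(idx))
    return indices


transformed = bwt("banana")
print(repr(transformed))           # equal letters tend to cluster together
print(move_to_front(transformed))  # clusters turn into small numbers
assert inverse_bwt(transformed) == "banana"
```

The transform itself saves no space. It only rearranges the block so that the later move-to-front and Huffman stages see long runs and small values, which is where the actual size reduction comes from.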

Together, gzip and bzip2 established the cultural norms around open-source compression — single-file compressors paired with tar for archiving, straightforward command-line interfaces, and freedom from proprietary encumbrances.


The Modern Toolkit – XZ, LZ4, and Zstandard

As storage and bandwidth evolved, the appetite for even better compression grew — and so did the tools to satisfy it. Three compressors in particular have defined the contemporary open-source landscape: XZ Utils, LZ4, and Zstandard.

XZ Utils — maintained by Lasse Collin and The Tukaani Project — is the successor to LZMA Utils and provides lossless compression and decompression for Unix-like operating systems, with Windows support from version 5.0 onwards. Its native format, .xz, uses the LZMA2 algorithm, which routinely achieves compression ratios that surpass both gzip and bzip2, particularly for binary executables and kernel images. The trade-off is significant: xz is considerably slower to compress than either of its predecessors and can consume upwards of 600 MB of RAM at the highest compression settings (level 9e, or "extreme" mode). Decompression, however, is comparatively fast. The Linux kernel ships an embedded XZ decompressor — XZ Embedded — for decompressing kernel images and initramfs archives, and .xz is the standard format for Linux kernel release tarballs distributed from kernel.org. The most recent stable release at the time of writing is 5.8.2, published in December 2025, and is available from the Tukaani Project's GitHub repository.
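Python's standard lzma module wraps liblzma, so the preset levels described above can be tried without leaving the interpreter. This is a small sketch under stated assumptions: the helper xz_compress and the sample data are hypothetical, and the demonstration deliberately stays at modest presets because the highest ones need several hundred megabytes of memory.

```python
import lzma


def xz_compress(data: bytes, preset: int = 6, extreme: bool = False) -> bytes:
    """Compress to the .xz container; extreme=True mirrors the 'e' suffix of the xz tool."""
    if extreme:
        preset |= lzma.PRESET_EXTREME
    return lzma.compress(data, format=lzma.FORMAT_XZ, preset=preset)


# Illustrative input only; ratios on real binaries will differ.
sample = bytes(range(256)) * 400 + b"kernel image placeholder " * 4000

for preset in (0, 6):
    blob = xz_compress(sample, preset=preset)
    print(f"preset {preset}: {len(sample)} -> {len(blob)} bytes")

# xz_compress(sample, preset=9, extreme=True) would correspond to `xz -9e`,
# at the cost of far more memory and CPU time.
assert lzma.decompress(xz_compress(sample)) == sample
```

Decompression needs much less memory than compression at the same preset, which is part of why the format suits a compress-once, download-many workflow.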

LZ4 takes the opposite philosophical approach. Created by Yann Collet — who would later also develop Zstandard — and first released in April 2011, LZ4 is a lossless compression algorithm that unapologetically prioritises speed over ratio. It belongs to the LZ77 family of byte-oriented compression schemes and deliberately omits an entropy coding stage (such as Huffman coding), trading some compression density for raw throughput. In practice, LZ4 delivers compression speeds exceeding 500 MB/s per core and decompression speeds reaching multiple GB/s per core — approaching RAM speed limits on multi-core systems. This makes it a compelling choice wherever data needs to be compressed and decompressed at near-wire speed: real-time logging pipelines, fast filesystems, and in-memory caching systems, for example. The Linux kernel has supported LZ4 for SquashFS since version 3.19-rc1, and the ZFS filesystem implementations on FreeBSD, Illumos, and Linux all support LZ4 for on-the-fly compression. The reference implementation is written in C and licensed under the BSD 2-Clause licence, with ports and bindings available in Java, C#, Rust, Python, and others. The latest stable release is 1.10.0, from July 2024.

Zstandard, commonly known as zstd, is perhaps the most consequential compression development of the past decade. It was developed by Yann Collet at Meta (then Facebook) and open-sourced in August 2016. Its design goal was direct and ambitious: to improve simultaneously on compression speed, decompression speed, and compression ratio relative to zlib, the ubiquitous library that implements the same DEFLATE algorithm as gzip. Zstandard combines LZ77-style dictionary matching over a large search window with a fast entropy-coding stage that uses Huffman coding and Finite State Entropy (FSE), a variant of Asymmetric Numeral Systems (ANS). The format was published as IETF RFC 8478 in 2018, which was later superseded by RFC 8878. The reference library is dual-licensed under BSD-3-Clause or GPL-2.0-or-later and is written in C. Critically, zstd supports an unusually wide range of compression levels: negative levels trade ratio for maximum speed, levels 1 to 19 cover the normal range, and the --ultra flag unlocks levels 20 to 22. That range makes it genuinely versatile across use cases. It also features a dictionary compression mode designed specifically to improve performance on small data payloads, a common and previously awkward problem for general-purpose compressors.
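The small-payload problem, and why a shared dictionary helps, can be illustrated with the standard library alone. Note the substitution: the sketch below uses zlib's preset-dictionary feature (the zdict argument), not Zstandard's own dictionary format or its trained dictionaries, so it shows the principle rather than zstd's behaviour. The payload and the hand-written dictionary are hypothetical.

```python
import zlib

# Hypothetical shared dictionary: byte strings expected to recur in most payloads.
SHARED_DICT = (
    b'{"user_id": , "status": "suspended", "status": "active", '
    b'"region": "eu-west-1"}'
)


def deflate(payload: bytes, zdict: bytes = None) -> bytes:
    """Compress with zlib, optionally priming the window with a preset dictionary."""
    if zdict is None:
        comp = zlib.compressobj(level=9)
    else:
        comp = zlib.compressobj(level=9, zdict=zdict)
    return comp.compress(payload) + comp.flush()


def inflate(blob: bytes, zdict: bytes = None) -> bytes:
    """Decompress; the same dictionary must be supplied if one was used."""
    if zdict is None:
        decomp = zlib.decompressobj()
    else:
        decomp = zlib.decompressobj(zdict=zdict)
    return decomp.decompress(blob) + decomp.flush()


payload = b'{"user_id": 1042, "status": "active", "region": "eu-west-1"}'

plain = deflate(payload)
primed = deflate(payload, SHARED_DICT)
print(f"raw {len(payload)} B, no dictionary {len(plain)} B, with dictionary {len(primed)} B")

assert inflate(primed, SHARED_DICT) == payload
```

A short message contains too little repetition for the compressor to exploit on its own. The dictionary supplies that history up front, which is the same idea zstd applies with dictionaries built from representative samples.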

Adoption of Zstandard has been sweeping. The Linux kernel uses it for module and filesystem compression; Fedora switched its RPM package compression to zstd as far back as 2019; Arch Linux and Ubuntu have similarly adopted it for packages; the Btrfs filesystem supports zstd natively for on-the-fly data compression; and both Chrome and Firefox added Content-Encoding: zstd HTTP support in 2024. Meta uses it across its entire data infrastructure, and it has found homes in data systems such as Redis and Apache Hadoop. The most recent stable release at the time of writing is 1.5.7, published in February 2025.

For GNU and BSD tar users, practical integration with all these tools is straightforward: tar czf for gzip, tar cjf for bzip2, tar cJf for xz, and tar --zstd -cf for Zstandard. BSD tar additionally supports LZ4 natively, whilst GNU tar can achieve equivalent results via --use-compress-program=lz4.
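The same archive-then-compress layering is available programmatically. The sketch below uses Python's standard tarfile module to build gzip, bzip2, and xz archives in memory; the helper names and file contents are assumptions for illustration, and Zstandard and LZ4 are left out because this sketch restricts itself to the three codecs it can rely on in the standard library.

```python
import io
import tarfile


def make_archive(files: dict, mode: str) -> bytes:
    """Build a compressed tar archive in memory from {name: content} pairs."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode=mode) as tar:
        for name, content in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(content)
            tar.addfile(info, io.BytesIO(content))
    return buffer.getvalue()


def list_archive(blob: bytes) -> list:
    """List member names; "r:*" lets tarfile detect the compression itself."""
    with tarfile.open(fileobj=io.BytesIO(blob), mode="r:*") as tar:
        return tar.getnames()


files = {"README": b"hello\n" * 100, "src/main.c": b"int main(void) { return 0; }\n"}

# "w:gz", "w:bz2", and "w:xz" play the roles of tar czf, cjf, and cJf.
for mode in ("w:gz", "w:bz2", "w:xz"):
    blob = make_archive(files, mode)
    print(mode, len(blob), "bytes", list_archive(blob))
```

As with the command-line tools, tar handles the bundling and the compressor only ever sees a single stream.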


When Compression Meets Security

The concentration of compression tooling in critical infrastructure — package managers, kernel images, SSH-linked libraries — means that vulnerabilities in these tools carry extraordinary blast radius. That reality was brought into sharp focus in March 2024 with the discovery of CVE-2024-3094, commonly known as the XZ Utils backdoor.

On 29 March 2024, Andres Freund, a software engineer at Microsoft, publicly disclosed that versions 5.6.0 and 5.6.1 of XZ Utils had been deliberately tampered with. The build-time code that activated the backdoor was shipped only in the release tarballs, not in the public git repository, and the backdoor exploited the indirect linkage of liblzma (the compression library at XZ Utils' core) into OpenSSH via systemd on affected Linux systems. When present, it allowed an attacker possessing a specific Ed448 private key to execute arbitrary remote code, bypassing SSH authentication entirely. The vulnerability received a CVSS score of 10, the maximum possible. CISA advised administrators to downgrade to a version of XZ Utils earlier than 5.6.0. The backdoor was discovered only because Freund noticed unusually high CPU usage and slow logins in sshd while benchmarking on a Debian unstable system, which was a matter of luck rather than systemic detection.

The supply chain dimensions of the attack are particularly sobering. The threat actor, operating under the name "Jia Tan," spent more than two years cultivating trust within the XZ project, gradually assuming maintainer responsibilities from Lasse Collin, who had openly disclosed personal difficulties including mental health challenges, before shipping the malicious code in the release tarballs. Multiple fictitious community accounts were used to apply social pressure on Collin to accelerate code merges. The attack exploited not a flaw in the compression algorithm itself, but the structural vulnerability of a widely depended-upon piece of infrastructure maintained by a single, under-resourced individual. The Open Source Security Foundation (OpenSSF) has cited the XZ incident as a defining example of the growing supply chain security challenge facing the ecosystem in 2025 and beyond.

The incident underscores several practical lessons for administrators and users across BSD, Linux, and Unix systems. Verifying package signatures from your distribution's official repositories matters and should be a baseline habit. Keeping systems patched promptly — particularly for foundational libraries like compression tools — is not optional. Where possible, reproducible builds that allow output binaries to be independently verified against source code provide a meaningful additional layer of assurance. The OpenSSF's Scorecard and similar tooling can help organisations assess the security posture of the open-source components in their stacks.
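As one small, concrete habit, a downloaded tarball can be checked against the SHA-256 digest its project publishes. The sketch below assumes such a digest is available from a trusted channel; the function names are illustrative. A matching checksum only shows the file is the one the publisher listed. It is not a substitute for verifying a cryptographic signature, and it would not have caught a backdoor that the release itself contained.

```python
import hashlib
import hmac


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large tarballs need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path: str, expected_hex: str) -> bool:
    """Compare the file's digest with the published value, ignoring case and whitespace."""
    return hmac.compare_digest(sha256_of(path), expected_hex.strip().lower())
```

A typical use would be calling matches_published_digest with the path to the downloaded archive and the hex string copied from the project's release page, and refusing to unpack the archive when it returns False.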


Choosing the Right Tool

Understanding the landscape of compression tools is one thing; knowing which one to reach for in a given situation is another. The choice usually comes down to three variables: the speed of compression, the speed of decompression, and the resulting file size — and these are always in tension with one another.
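Those three variables are simple to measure on your own data. The following sketch times the three codecs that ship with Python's standard library, each at its library default level; zstd and LZ4 need third-party bindings and are left out. The sample data and the measure helper are assumptions, and the numbers it prints say little about other inputs or other machines.

```python
import bz2
import gzip
import lzma
import time

# Each entry maps a codec name to its (compress, decompress) functions.
CODECS = {
    "gzip": (gzip.compress, gzip.decompress),
    "bzip2": (bz2.compress, bz2.decompress),
    "xz": (lzma.compress, lzma.decompress),
}


def measure(data: bytes) -> dict:
    """Return ratio and wall-clock timings per codec for one non-empty input."""
    results = {}
    for name, (compress, decompress) in CODECS.items():
        start = time.perf_counter()
        blob = compress(data)
        middle = time.perf_counter()
        restored = decompress(blob)
        end = time.perf_counter()
        assert restored == data
        results[name] = {
            "ratio": len(data) / len(blob),
            "compress_s": middle - start,
            "decompress_s": end - middle,
        }
    return results


# Illustrative log-like input; substitute a sample of your real workload.
sample = b"2025-01-01T00:00:00Z INFO request served in 12 ms\n" * 4000

for name, row in measure(sample).items():
    print(f"{name:5s} ratio {row['ratio']:8.1f}  "
          f"compress {row['compress_s']:.4f}s  decompress {row['decompress_s']:.4f}s")
```

Running the same measurement on a representative sample of your own files is usually more informative than any published comparison.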

For long-term archival where file size is the priority and compression time is irrelevant — distributing software releases, storing kernel tarballs, archiving data for cold storage — xz remains the standard choice. Its LZMA2 algorithm consistently achieves the smallest output files of the common open-source compressors, and since archives in this category are compressed once but may be downloaded many thousands of times, the slow compression speed is an acceptable trade-off.

For everyday system administration tasks, log compression, database backups, and general data reduction where both ratio and speed matter, Zstandard is increasingly the right answer. Its ability to scale across a wide range of compression levels — including its fast modes for real-time scenarios — makes it genuinely all-purpose. The broad adoption of zstd across major distributions and filesystems also means that the tooling to work with .zst files is reliably available across the ecosystem.

For scenarios where decompression speed is the dominant concern — live filesystem compression, real-time pipelines, in-memory caching, and anywhere that data is compressed once but decompressed constantly — LZ4 is the obvious candidate. Its multi-GB/s decompression throughput is difficult to match, and its permissive BSD licence means it integrates smoothly into virtually any project.

gzip and bzip2 remain relevant not as primary choices for new work, but because of the vast existing corpus of .gz and .bz2 files in the wild — source tarballs, legacy archives, older distribution packages — and because gzip in particular is essentially universal: every Unix-like system you will ever encounter can handle it. Compatibility remains a genuine virtue.

For users and administrators on BSD systems, it is worth noting that bsdtar (from the libarchive project) provides native support for a broader range of compression formats than GNU tar, including LZ4, and is the default tar implementation on FreeBSD and macOS. libarchive itself is a foundational, cross-platform library underpinning compression and archiving across a wide range of BSD and Linux tools.


Conclusion

Open-source compression is foundational infrastructure: unglamorous, often invisible, and absolutely essential. From the GNU Project's response to patent-encumbered Unix utilities in 1992, through to the widespread adoption of Zstandard across modern Linux distributions and filesystems today, the trajectory has been one of continuous improvement driven by community need and individual ingenuity. The XZ Utils incident of 2024 is a reminder that the same dependence that makes these tools so critical also makes them worthy of ongoing scrutiny, community support, and supply chain vigilance. Knowing your tools — their strengths, their trade-offs, and their maintenance status — is not merely academic. It is good practice.


Disclaimer: All trade names, trademarks, and product names referenced in this article — including but not limited to GNU, gzip, bzip2, XZ Utils, LZ4, Zstandard (zstd), Btrfs, ZFS, FreeBSD, Fedora, Ubuntu, Arch Linux, Debian, Meta, and Microsoft — are the property of their respective owners. The Distrowrite Project strives for accuracy and factual integrity in all published content; however, readers are encouraged to consult official documentation and primary sources for the most current information. Nothing in this article constitutes an endorsement of, or instruction for, any activity involving malware, backdoors, exploits, viruses, or any form of harmful content that may compromise the integrity of networks, devices, or other infrastructure.

