How PZIP Compresses CSV Files Up to 69% Smaller Than LZMA
PZIP Team, February 4, 2026
CSV files are everywhere. Data lakes, ETL pipelines, analytics exports, and research datasets all use them as the lingua franca of structured data. Yet general-purpose algorithms compress them far less well than their structure allows.
Why General Compressors Struggle with CSV
LZMA and gzip look for repeated byte patterns. But a CSV file's structure isn't at the byte level — it's at the column level. Column 1 might be sequential IDs. Column 3 might be timestamps with millisecond precision. Column 7 might be categorical values from a small dictionary.
General compressors can't exploit any of this. PZIP can.
PZIP's CSV Strategy
- Schema detection: Parse headers, detect column types (integer, float, timestamp, categorical, text)
- Column separation: Transpose row-major to column-major for better compression
- Type-specific encoding:
  - Sequential integers → start + step + count (3 numbers instead of N)
  - Timestamps → base + deltas (small integers)
  - Categoricals → dictionary + indices
  - Floats → fixed-point or Gorilla encoding
- Residual compression: Whatever remains after these encodings goes through LZMA, which compresses it better than it would the original row-major text
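To make the column-level ideas above concrete, here is a minimal Python sketch of column separation and three of the type-specific encodings. It is an illustration under stated assumptions, not PZIP's actual implementation: the function names are hypothetical, timestamps are assumed to be integer milliseconds, and float encoding (fixed-point or Gorilla) is left out.

```python
import csv
import io
from itertools import accumulate

# Hypothetical helpers for illustration only; these are not PZIP's API.

def to_columns(csv_text):
    """Transpose row-major CSV text into a {header: [values]} mapping."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    return {name: [row[i] for row in body] for i, name in enumerate(header)}

def encode_sequence(values):
    """Return (start, step, count) if values form an arithmetic sequence, else None."""
    if len(values) < 2:
        return None
    step = values[1] - values[0]
    if all(b - a == step for a, b in zip(values, values[1:])):
        return values[0], step, len(values)
    return None

def decode_sequence(start, step, count):
    return [start + i * step for i in range(count)]

def encode_deltas(values):
    """Return (base, deltas) for a non-empty list of integers."""
    base = values[0]
    deltas = [b - a for a, b in zip(values, values[1:])]
    return base, deltas

def decode_deltas(base, deltas):
    # Running sum that starts from the base value.
    return list(accumulate(deltas, initial=base))

def encode_dictionary(values):
    """Return (dictionary, indices), with dictionary entries in first-seen order."""
    dictionary = list(dict.fromkeys(values))
    position = {value: i for i, value in enumerate(dictionary)}
    return dictionary, [position[value] for value in values]

def decode_dictionary(dictionary, indices):
    return [dictionary[i] for i in indices]

# Made-up sample data, assuming integer-millisecond timestamps.
sample = (
    "id,ts_ms,status\n"
    "1,1700000000000,ok\n"
    "2,1700000000250,ok\n"
    "3,1700000000375,error\n"
)
columns = to_columns(sample)
ids = [int(v) for v in columns["id"]]
timestamps = [int(v) for v in columns["ts_ms"]]

print(encode_sequence(ids))                  # (1, 1, 3)
print(encode_deltas(timestamps))             # (1700000000000, [250, 125])
print(encode_dictionary(columns["status"]))  # (['ok', 'error'], [0, 0, 1])
```

Each encoder has a matching decoder, so every column can be rebuilt exactly. A real tool would also need to preserve the original text formatting (quoting, leading zeros, line endings) to stay byte-exact.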
Results
On 153 real-world CSV files, PZIP achieves up to 68.8% smaller output than LZMA-9. The median improvement is 14.3%.
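For readers who want to pin down what "X% smaller" means, the sketch below shows one common way to compute it, assuming the figure is the size reduction relative to the LZMA output for each file. The function name and the file sizes are invented for illustration and are not taken from the benchmark.

```python
from statistics import median

def percent_smaller(baseline_size, candidate_size):
    """Size reduction of candidate relative to baseline, in percent."""
    return 100.0 * (baseline_size - candidate_size) / baseline_size

# Invented (baseline_bytes, candidate_bytes) pairs, not benchmark data.
sizes = [(1000, 900), (2000, 1500), (500, 200)]
improvements = [percent_smaller(b, c) for b, c in sizes]

print(improvements)          # [10.0, 25.0, 60.0]
print(max(improvements))     # 60.0  -> the "up to" figure
print(median(improvements))  # 25.0  -> the median figure
```

Under this reading, "up to" reports the single best file, while the median describes the typical file.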
All results are byte-exact round-trip verified. decompress(compress(file)) == file. Always.
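A round-trip check of this kind can be sketched in a few lines of Python. Because PZIP's programmatic interface is not described here, the example uses the standard library's `lzma` module as a stand-in codec; `verify_round_trip` is a hypothetical helper that accepts any pair of compress and decompress functions.

```python
import lzma

def verify_round_trip(data, compress=lzma.compress, decompress=lzma.decompress):
    """Return True if decompress(compress(data)) reproduces data byte for byte."""
    return decompress(compress(data)) == data

# Made-up CSV bytes; lzma stands in for the codec under test.
original = b"id,ts_ms,status\n1,1700000000000,ok\n2,1700000000250,ok\n"
print(verify_round_trip(original))  # True
```

Comparing raw bytes rather than parsed rows is what makes the check byte-exact: differences in quoting, whitespace, or line endings would all be caught.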