CSV files are everywhere. Data lakes, ETL pipelines, analytics exports, research datasets: CSV is the lingua franca of structured data. Yet general-purpose compressors leave much of its redundancy untouched.

Why General Compressors Struggle with CSV

LZMA and gzip look for repeated byte sequences. But most of a CSV file's structure lives at the column level, not the byte level, and the row-major layout interleaves unrelated columns so that related values end up far apart. Column 1 might be sequential IDs. Column 3 might be timestamps with millisecond precision. Column 7 might be categorical values from a small dictionary.

General compressors can't exploit any of this. PZIP can.

PZIP's CSV Strategy

1. Schema detection: Parse the headers and detect each column's type (integer, float, timestamp, categorical, text).
2. Column separation: Transpose the row-major data to column-major so that values of the same type sit next to each other.
3. Type-specific encoding:
   - Sequential integers → start + step + count (3 numbers instead of N)
   - Timestamps → base + deltas (small integers)
   - Categoricals → dictionary + indices
   - Floats → fixed-point or Gorilla encoding
4. Residual compression: The data left over after these encodings is passed to LZMA, where it compresses better than the original rows would.
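
To make the steps above concrete, here is a minimal Python sketch of column separation plus three of the type-specific encodings. It is not PZIP's actual code: the function names, the tuple layout, the timestamp format, and the "at most half the values are distinct" rule for categoricals are all assumptions chosen for illustration. Float encoding is left out, and the sketch round-trips column values only, not the original CSV bytes (quoting and line endings would need separate handling).

```python
import csv
import io
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f"  # assumed timestamp layout for this sketch


def transpose(csv_text):
    """Return (header, columns): row-major CSV text turned column-major."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, body = rows[0], rows[1:]
    return header, [list(col) for col in zip(*body)]


def _canonical_int(value):
    """Return int(value) only if printing it back gives the same string."""
    try:
        number = int(value)
    except ValueError:
        return None
    return number if str(number) == value else None


def _to_millis(value):
    """Parse to epoch milliseconds, or None if the text would not round-trip."""
    try:
        parsed = datetime.strptime(value, TS_FORMAT)
    except ValueError:
        return None
    if parsed.isoformat(timespec="milliseconds") != value:
        return None
    return (parsed - EPOCH) // timedelta(milliseconds=1)


def _from_millis(millis):
    return (EPOCH + timedelta(milliseconds=millis)).isoformat(timespec="milliseconds")


def encode_column(values):
    """Pick one encoding per column (hypothetical heuristics)."""
    ints = [_canonical_int(v) for v in values]
    if len(ints) >= 2 and None not in ints:
        step = ints[1] - ints[0]
        if all(b - a == step for a, b in zip(ints, ints[1:])):
            return ("sequence", ints[0], step, len(ints))  # start + step + count
    millis = [_to_millis(v) for v in values]
    if millis and None not in millis:
        deltas = [b - a for a, b in zip(millis, millis[1:])]
        return ("timestamps", millis[0], deltas)  # base + deltas
    distinct = sorted(set(values))
    if len(distinct) * 2 <= len(values):  # assumed threshold for "categorical"
        index = {v: i for i, v in enumerate(distinct)}
        return ("dictionary", distinct, [index[v] for v in values])
    return ("raw", list(values))  # left for the residual compressor


def decode_column(encoded):
    """Invert encode_column, returning the original list of strings."""
    kind = encoded[0]
    if kind == "sequence":
        _, start, step, count = encoded
        return [str(start + i * step) for i in range(count)]
    if kind == "timestamps":
        _, base, deltas = encoded
        millis = [base]
        for delta in deltas:
            millis.append(millis[-1] + delta)
        return [_from_millis(m) for m in millis]
    if kind == "dictionary":
        _, distinct, indices = encoded
        return [distinct[i] for i in indices]
    return list(encoded[1])


SAMPLE = (
    "id,ts,status,note\n"
    "100,2024-01-01T00:00:00.000,ok,first\n"
    "101,2024-01-01T00:00:00.250,ok,second\n"
    "102,2024-01-01T00:00:01.000,error,third\n"
    "103,2024-01-01T00:00:01.125,ok,fourth\n"
)

header, columns = transpose(SAMPLE)
encoded = [encode_column(col) for col in columns]
assert [decode_column(e) for e in encoded] == columns
```

Each branch checks that the parsed value prints back to the exact input string before committing to an encoding. Values such as `007` or a timestamp in a different layout therefore fall through to the dictionary or raw path instead of being silently normalized, which is what a lossless encoder needs.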

Results

On 153 real-world CSV files, PZIP achieves up to 68.8% smaller output than LZMA-9. The median improvement is 14.3%.
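
For readers who want to run the same kind of comparison on their own files, the sketch below shows one common way to compute "percent smaller than a baseline" and its median. The formula is an assumption about how such figures are usually derived, and the byte counts are made-up placeholders, not measurements from the benchmark.

```python
from statistics import median


def percent_smaller(candidate_size, baseline_size):
    """Percentage by which candidate_size undercuts baseline_size."""
    return 100.0 * (baseline_size - candidate_size) / baseline_size


# Hypothetical (baseline_bytes, candidate_bytes) pairs, one per file.
sizes = [(1000, 900), (2000, 1500), (500, 480)]

savings = [percent_smaller(candidate, baseline) for baseline, candidate in sizes]
best_case = max(savings)
typical = median(savings)
```

The maximum and the median answer different questions: the first describes the best file in the set, while the second describes a typical one.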

All results are byte-exact round-trip verified. decompress(compress(file)) == file. Always.
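
A round-trip check of this kind can be sketched in a few lines. PZIP's own API is not shown here, so Python's standard `lzma` module stands in as the codec, and reading "LZMA-9" as preset 9 is an assumption. `round_trips` and `lzma9_compress` are hypothetical helper names.

```python
import lzma


def round_trips(data, compress, decompress):
    """True if decompress(compress(data)) reproduces data byte for byte."""
    return decompress(compress(data)) == data


def lzma9_compress(data):
    # Stand-in codec: standard-library LZMA at its highest preset.
    return lzma.compress(data, preset=9)


# Compare raw bytes, not parsed rows, so quoting and CRLF line endings count.
sample = b'id,status\r\n1,ok\r\n2,ok\r\n3,"needs, review"\r\n'
assert round_trips(sample, lzma9_compress, lzma.decompress)
```

Comparing bytes rather than parsed rows matters for CSV, because two files can parse to identical tables while differing in quoting, trailing newlines, or line endings.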

See CSV benchmarks or try it on your own CSV files.