
On-the-Fly Decompression of Compressed Document Formats for Improved Deduplication and Compression #8627

Open
Beiri22 opened this issue Jan 6, 2025 · 7 comments


@Beiri22

Beiri22 commented Jan 6, 2025

Have you checked borgbackup docs, FAQ, and open GitHub issues?

yes, hopefully good enough

Is this a BUG / ISSUE report or a QUESTION?

Feature Proposal

Problem Description

Modern document formats are compressed containers (e.g., ZIP-based) that hinder deduplication and compression in BorgBackup. Minor changes in content often result in significantly different binary representations, reducing storage efficiency. Could we improve deduplication and compression efficiency for inherently compressed document formats like .docx, .pptx, and .odp by decompressing and recompressing them without compression (store) during the backup process?

Proposed Solution

Implement an optional feature to:

  • Detect compressed container formats during backup.
  • Decompress them on-the-fly.
  • Recompress with "store" mode (no compression) for efficient deduplication and compression.
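The rewrite step could be sketched roughly like this (a minimal sketch, not borg code; the function name and standalone framing are hypothetical):

```python
import io
import zipfile

def rewrite_stored(zip_bytes: bytes) -> bytes:
    """Rewrite a ZIP container so every member is stored uncompressed.

    Sketch of the proposed preprocessing step: entry metadata is copied
    over, but the result is not guaranteed to be byte-identical to any
    archive the original application would produce.
    """
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as src, \
         zipfile.ZipFile(out, "w", compression=zipfile.ZIP_STORED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            # Copy the entry header but force "store" (method 0).
            info.compress_type = zipfile.ZIP_STORED
            dst.writestr(info, data)
    return out.getvalue()
```

The uncompressed members then expose stable XML content to the chunker, so small document edits only touch a few chunks.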

Benefits

  • Enhanced deduplication by exposing internal structures.
  • Improved compression ratios.
  • Reduced storage requirements for frequently updated files.

Challenges and Mitigation

  • What to do on extraction: leave the files uncompressed? Recompress to a certain level?
@ThomasWaldmann
Member

IIRC there was a similar discussion about image recompression, maybe you can find it.

@RonnyPfannschmidt
Contributor

I believe the conclusion was that bit-identical reconstruction was more important, and that different compression implementations can produce different but equally valid results

@Beiri22
Author

Beiri22 commented Jan 7, 2025

My issue was more about exposing the internal structures of those files: their contents may change only slightly, yet due to compression the resulting binaries are very different and deduplicate poorly

@RonnyPfannschmidt
Contributor

IMHO that's a job for document management

Backup needs to be bit identical

@ThomasWaldmann
Member

@Beiri22 a backup tool must restore the bit-identical file, as it was at the time of backup.

So, if we "unzip" to expose the raw files / boundaries between files and back up that, we would need to "zip" at extract time. And that would likely result in a non-identical zip archive, even if the contents are the same.
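The problem can be illustrated with plain zlib (the payload here is a made-up stand-in for a document member): two valid deflate streams of the same data need not be byte-identical, so there is no safe way to pick "the" recompression at extract time.

```python
import zlib

# Hypothetical payload standing in for a .docx member such as document.xml.
payload = b"<w:p>unchanged paragraph</w:p>" * 500

fast = zlib.compress(payload, level=1)   # e.g. the original application's choice
small = zlib.compress(payload, level=9)  # e.g. what we pick at extract time

# Both streams are valid and decode to the same bytes...
assert zlib.decompress(fast) == zlib.decompress(small) == payload
# ...but the compressed bytes differ, so the restored container would
# not be bit-identical to the one that was backed up.
assert fast != small
```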

@Beiri22
Author

Beiri22 commented Jan 7, 2025

I see; but for archival purposes this could be a good, let's say, separate tool to prepare your data

@ThomasWaldmann
Member

Related: #63
