
On-the-Fly Decompression of Compressed Document Formats for Improved Deduplication and Compression #8627

Open
Beiri22 opened this issue Jan 6, 2025 · 7 comments


@Beiri22

Beiri22 commented Jan 6, 2025

Have you checked borgbackup docs, FAQ, and open GitHub issues?

yes, hopefully good enough

Is this a BUG / ISSUE report or a QUESTION?

Feature Proposal

Problem Description

Modern document formats are compressed containers (e.g., ZIP-based) that hinder deduplication and compression in BorgBackup. Minor changes in content often result in significantly different binary representations, reducing storage efficiency. Could we improve deduplication and compression efficiency for inherently compressed document formats like .docx, .pptx, and .odp by decompressing and recompressing them without compression (store) during the backup process?

Proposed Solution

Implement an optional feature to:

  • Detect compressed container formats during backup.
  • Decompress them on-the-fly.
  • Recompress with "store" mode (no compression) for efficient deduplication and compression.
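The rewrite step could be sketched roughly like this (a minimal sketch, not borg code; the function name and standalone framing are hypothetical):

```python
import io
import zipfile

def rewrite_stored(zip_bytes: bytes) -> bytes:
    """Rewrite a ZIP container so every member is stored uncompressed.

    Sketch of the proposed preprocessing step: entry metadata is copied
    over, but the result is not guaranteed to be byte-identical to any
    archive the original application would produce.
    """
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as src, \
         zipfile.ZipFile(out, "w", compression=zipfile.ZIP_STORED) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            # Copy the entry header but force "store" (method 0).
            info.compress_type = zipfile.ZIP_STORED
            dst.writestr(info, data)
    return out.getvalue()
```

The uncompressed members then expose stable XML content to the chunker, so small document edits only touch a few chunks.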

Benefits

  • Enhanced deduplication by exposing internal structures.
  • Improved compression ratios.
  • Reduced storage requirements for frequently updated files.

Challenges and Mitigation

  • What to do on extraction: leave the files uncompressed? Recompress to a certain level?
@ThomasWaldmann
Member

IIRC there was a similar discussion about image recompression, maybe you can find it.

@RonnyPfannschmidt
Contributor

I believe the conclusion was that bit-identical reconstruction was more important, and that different compression implementations can produce different but equally valid results

@Beiri22
Author

Beiri22 commented Jan 7, 2025

My issue was more about exposing the internal structures of those files: their contents may change only slightly, yet due to compression the resulting binaries are very different and deduplicate poorly

@RonnyPfannschmidt
Contributor

IMHO that's a job for document management

Backup needs to be bit identical

@ThomasWaldmann
Member

@Beiri22 a backup tool must restore the bit-identical file, as it was at the time of backup.

So, if we "unzip" to expose the raw files / boundaries between files and back up that, we would need to "zip" at extract time. And that would likely result in a non-identical zip archive, even if the contents are the same.
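The problem can be illustrated with plain zlib (the payload here is a made-up stand-in for a document member): two valid deflate streams of the same data need not be byte-identical, so there is no safe way to pick "the" recompression at extract time.

```python
import zlib

# Hypothetical payload standing in for a .docx member such as document.xml.
payload = b"<w:p>unchanged paragraph</w:p>" * 500

fast = zlib.compress(payload, level=1)   # e.g. the original application's choice
small = zlib.compress(payload, level=9)  # e.g. what we pick at extract time

# Both streams are valid and decode to the same bytes...
assert zlib.decompress(fast) == zlib.decompress(small) == payload
# ...but the compressed bytes differ, so the restored container would
# not be bit-identical to the one that was backed up.
assert fast != small
```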

@Beiri22
Author

Beiri22 commented Jan 7, 2025

I see; but for archival purposes this could be a good, let's say, separate tool to prepare your data

@ThomasWaldmann
Member

Related: #63
