introduce EOTDL_CACHE_PATH to optionally support additional cache #284

Open
wants to merge 1 commit into
base: main

achtsnits
Collaborator

Introduce EOTDL_CACHE_PATH to optionally support an additional cache of deduplicated files, so that dataset/model version folders can hold symlinks to individual versioned files.

goal:

EOTDL already stores data in a deduplicated manner server-side (in object storage), tracking changes at the dataset/model level by identifying which files have changed between versions and maintaining versioning for individual files. On the metadata side, each dataset/model version tracks the specific versions of individual files that belong to it, ensuring precise file-to-version mapping.

For example, in object storage, files might appear as file-a.jpg_1, file-a.jpg_2, file-b.jpg_1, and file-c.jpg_1. A dataset x version 1 could reference file-a.jpg_1 and file-b.jpg_1, while dataset x version 2 might reference the updated file-a.jpg_2 and switch to file-c.jpg_1 instead of file-b.jpg_1. Locally, all files are currently downloaded to ~/.cache/eotdl/datasets or ~/.cache/eotdl/models, with dataset version folders (e.g., ~/.cache/eotdl/datasets/x/v1) containing full copies of the files.
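The mapping in this example can be sketched in a few lines of Python. This is only an illustration of the idea, not EOTDL's actual metadata format: the `manifests` dictionary and the helpers `object_keys` and `new_objects` are hypothetical names, and the `name_version` key scheme is taken from the example above.

```python
# Hypothetical manifests for dataset x: dataset version -> {file name: file version}.
manifests = {
    1: {"file-a.jpg": 1, "file-b.jpg": 1},
    2: {"file-a.jpg": 2, "file-c.jpg": 1},
}


def object_keys(manifest):
    """Return the deduplicated object names (name_version) a dataset version references."""
    return sorted(f"{name}_{version}" for name, version in manifest.items())


def new_objects(old, new):
    """Return the objects referenced by `new` that `old` does not reference."""
    return sorted(set(object_keys(new)) - set(object_keys(old)))


print(object_keys(manifests[1]))
print(new_objects(manifests[1], manifests[2]))
```

Under these assumptions, moving from version 1 to version 2 only requires the objects returned by `new_objects`, rather than a full copy of the dataset.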

This approach works at dataset/model-level granularity, which can be inefficient for datasets with minor changes, such as label updates: each change creates an entirely new dataset version, and its local folder holds full copies of all files, including unchanged ones.

This pull request introduces an optional feature to enable finer-grained, individual file versioning on the client side by configuring a global deduplicated cache path (EOTDL_CACHE_PATH, disabled by default). Deduplicated files are stored in this global cache, while the dataset version folders (e.g., ~/.cache/eotdl/datasets/x/v1) contain symlinks pointing to the appropriate file versions (e.g., file-a.jpg -> file-a.jpg_2).
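A minimal sketch of the symlink layout described above, assuming a POSIX filesystem. The function names `dedup_cache_path` and `link_version` and the manifest shape are hypothetical and are not taken from the PR's code; only the `EOTDL_CACHE_PATH` variable name and the `name_version` file naming come from the description.

```python
import os
from pathlib import Path


def dedup_cache_path():
    """Return the global deduplicated cache path, or None if the feature is disabled."""
    value = os.environ.get("EOTDL_CACHE_PATH")
    return Path(value) if value else None


def link_version(cache_dir, version_dir, manifest):
    """Populate version_dir with symlinks into cache_dir.

    manifest maps a file name to its file version (hypothetical shape),
    e.g. {"file-a.jpg": 2} creates file-a.jpg -> <cache_dir>/file-a.jpg_2.
    """
    cache_dir = Path(cache_dir)
    version_dir = Path(version_dir)
    version_dir.mkdir(parents=True, exist_ok=True)
    for name, version in manifest.items():
        link = version_dir / name
        # Remove a stale link or file so the version folder can be re-linked.
        if link.is_symlink() or link.exists():
            link.unlink()
        link.symlink_to(cache_dir / f"{name}_{version}")
```

For example, `link_version(cache, "~/.cache/eotdl/datasets/x/v2", {"file-a.jpg": 2, "file-c.jpg": 1})` would leave the v2 folder holding only links, provided the path has been expanded with `Path.expanduser()` first.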

This feature gives local users some optional flexibility, but more importantly it benefits platform operators who bundle the EOTDL CLI or Python library. By configuring a global, shared, read-only cache, operators can make data available in ~/.cache/eotdl/datasets/x; if changes are needed, a symbolic link can be replaced with the updated file content. This makes it possible to support different versions of datasets/models for all platform users at the same time.
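Replacing a symbolic link with updated content needs some care, because writing through the link would modify the shared cached file. A hedged sketch of one way to do it, with `replace_link_with_content` as a hypothetical helper that is not part of the PR:

```python
from pathlib import Path


def replace_link_with_content(path, data):
    """Replace a symlink in a version folder with a regular file holding `data`.

    The link is removed first, so the write never follows it into the
    shared cache; the cached file itself is left untouched.
    """
    path = Path(path)
    if path.is_symlink():
        path.unlink()  # removes only the link, not its target
    path.write_bytes(data)
```

With a read-only shared cache, writing through the link would typically fail anyway. Removing the link first keeps the change local to one user's version folder.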


vercel bot commented Jan 25, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

| Name | Status | Preview | Comments | Updated (UTC) |
| --- | --- | --- | --- | --- |
| eotdl | ❌ Failed (Inspect) | | | Jan 25, 2025 0:19am |

2 participants