introduce EOTDL_CACHE_PATH to optionally support additional cache #284

achtsnits · 2025-01-25T12:19:06Z

Introduce EOTDL_CACHE_PATH to optionally support an additional cache of deduplicated files, so that the individual versioned dataset/model files can be symlinked from it.

goal:

EOTDL already stores data in a deduplicated manner server-side (in object storage), tracking changes at the dataset/model level by identifying which files have changed between versions and maintaining versioning for individual files. On the metadata side, each dataset/model version tracks the specific versions of individual files that belong to it, ensuring precise file-to-version mapping.

For example, in object storage, files might appear as file-a.jpg_1, file-a.jpg_2, file-b.jpg_1, and file-c.jpg_1. A dataset x version 1 could reference file-a.jpg_1 and file-b.jpg_1, while dataset x version 2 might reference the updated file-a.jpg_2 and switch to file-c.jpg_1 instead of file-b.jpg_1. Locally, all files are currently downloaded to ~/.cache/eotdl/datasets or ~/.cache/eotdl/models, with dataset version folders (e.g., ~/.cache/eotdl/datasets/x/v1) containing full copies of the files.
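The mapping in this example can be sketched as a small data structure. This is only an illustration: the `MANIFEST` layout and the helper names below are hypothetical and are not taken from the EOTDL code base.

```python
# Hypothetical manifest for dataset x: dataset version -> {file name: file version}.
MANIFEST = {
    1: {"file-a.jpg": 1, "file-b.jpg": 1},
    2: {"file-a.jpg": 2, "file-c.jpg": 1},
}


def object_keys(manifest, version):
    """Return the object-storage keys referenced by one dataset version."""
    return sorted(
        f"{name}_{file_version}"
        for name, file_version in manifest[version].items()
    )


def changed_files(manifest, old, new):
    """Return the keys referenced by `new` that `old` does not reference."""
    return sorted(set(object_keys(manifest, new)) - set(object_keys(manifest, old)))
```

With this layout, moving from version 1 to version 2 only involves the keys returned by `changed_files(MANIFEST, 1, 2)`, which is the granularity the server already tracks.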

This approach, which uses dataset/model-level granularity, can be inefficient for datasets with minor changes, such as label updates: every new dataset version is stored locally as a full copy, even when only a few files differ from the previous version.

This pull request introduces an optional feature to enable finer-grained, individual file versioning on the client side by configuring a global deduplicated cache path (EOTDL_CACHE_PATH, disabled by default). Deduplicated files are stored in this global cache, while the dataset version folders (e.g., ~/.cache/eotdl/datasets/x/v1) contain symlinks pointing to the appropriate file versions (e.g., file-a.jpg -> file-a.jpg_2).
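A minimal sketch of how such a layout could be built, assuming file names without subdirectories and a caller-supplied `fetch(key, dest)` download function. Both `materialize` and `fetch` are hypothetical names used for illustration; this is not the code changed in this pull request.

```python
import os
from pathlib import Path


def cache_path_from_env():
    """Return the deduplicated cache path, or None when the feature is disabled."""
    return os.environ.get("EOTDL_CACHE_PATH") or None


def materialize(cache_dir, version_dir, files, fetch):
    """Populate version_dir for one dataset/model version.

    files maps a file name to its file version, e.g. {"file-a.jpg": 2}.
    fetch(key, dest) is a hypothetical downloader that writes object `key` to `dest`.
    """
    version_dir = Path(version_dir)
    version_dir.mkdir(parents=True, exist_ok=True)
    if cache_dir is not None:
        cache_dir = Path(cache_dir)
        cache_dir.mkdir(parents=True, exist_ok=True)
        cache_dir = cache_dir.resolve()  # absolute targets keep the links valid

    for name, file_version in files.items():
        key = f"{name}_{file_version}"
        dest = version_dir / name
        if cache_dir is None:
            # Feature disabled: keep a full copy in the version folder.
            fetch(key, dest)
            continue
        cached = cache_dir / key
        if not cached.exists():
            # Each file version is downloaded at most once.
            fetch(key, cached)
        if dest.is_symlink() or dest.exists():
            dest.unlink()
        dest.symlink_to(cached)  # e.g. file-a.jpg -> <cache>/file-a.jpg_2
```

A call such as `materialize(cache_path_from_env(), version_dir, files, fetch)` would then fall back to full copies whenever EOTDL_CACHE_PATH is unset.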

This feature gives local users some optional flexibility, but more importantly it benefits platform operators who bundle the EOTDL CLI or Python library. By configuring a global, shared, read-only cache, operators can make data available in ~/.cache/eotdl/datasets/x; if changes are needed, symbolic links can be replaced with the updated file content. This makes it possible to support different versions of datasets/models for all platform users at the same time.
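One reading of "replacing a symbolic link with file content" is sketched below: the link in the version folder is swapped for a private, writable copy, while the shared read-only cache stays untouched. The function name `replace_link_with_copy` is hypothetical and the sketch assumes a POSIX-style file system.

```python
import os
import shutil
from pathlib import Path


def replace_link_with_copy(link):
    """Replace a symlink with a regular file holding a copy of its target."""
    link = Path(link)
    if not link.is_symlink():
        return  # already a regular file, nothing to do
    target = link.resolve(strict=True)  # fails if the cached file is missing
    tmp = link.with_name(link.name + ".tmp")
    shutil.copyfile(target, tmp)
    # Renaming over the link replaces the link itself, not the cached file.
    os.replace(tmp, link)
```

After this call the file in the version folder can be edited or overwritten with updated content without affecting other users of the shared cache.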


vercel · 2025-01-25T12:19:11Z

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
eotdl	❌ Failed (Inspect)			Jan 25, 2025 0:19am

introduce EOTDL_CACHE_PATH to optionally support additional cache (de…

b48b58e

…duplicated files) allowing to create symlinks from individual versioned dataset/model files

achtsnits requested a review from juansensio January 25, 2025 12:19

achtsnits assigned juansensio Jan 25, 2025
