introduce EOTDL_CACHE_PATH to optionally support additional cache #284
Introduce EOTDL_CACHE_PATH to optionally support an additional cache of deduplicated files, so that the files in each versioned dataset/model folder can be symlinks into that cache.
goal:
EOTDL already stores data in a deduplicated manner server-side (in object storage), tracking changes at the dataset/model level by identifying which files have changed between versions and maintaining versioning for individual files. On the metadata side, each dataset/model version tracks the specific versions of individual files that belong to it, ensuring precise file-to-version mapping.
For example, in object storage, files might appear as file-a.jpg_1, file-a.jpg_2, file-b.jpg_1, and file-c.jpg_1. A dataset x version 1 could reference file-a.jpg_1 and file-b.jpg_1, while dataset x version 2 might reference the updated file-a.jpg_2 and switch to file-c.jpg_1 instead of file-b.jpg_1. Locally, all files are currently downloaded to ~/.cache/eotdl/datasets or ~/.cache/eotdl/models, with dataset version folders (e.g., ~/.cache/eotdl/datasets/x/v1) containing full copies of the files.
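The mapping in this example can be sketched in a few lines of Python. This is only an illustration of the idea, not EOTDL's actual metadata format: the `manifests` dictionary and both helper functions are hypothetical names introduced here.

```python
# Hypothetical manifests: dataset version -> {logical file name: file version}.
# This mirrors the example above; it is not EOTDL's real metadata schema.
manifests = {
    1: {"file-a.jpg": 1, "file-b.jpg": 1},
    2: {"file-a.jpg": 2, "file-c.jpg": 1},
}


def stored_objects(manifest):
    """Return the object-storage names referenced by one dataset version."""
    return {f"{name}_{version}" for name, version in manifest.items()}


def new_objects(old, new):
    """Return the objects referenced by `new` that `old` does not reference."""
    return stored_objects(new) - stored_objects(old)


v1 = stored_objects(manifests[1])
v2 = stored_objects(manifests[2])

# Only these objects would need fetching when moving from version 1 to 2.
to_download = sorted(new_objects(manifests[1], manifests[2]))
print(to_download)
```

With per-file versions tracked like this, moving between dataset versions only requires the objects that actually differ.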
This approach works at dataset/model-level granularity and can be inefficient for datasets with minor changes, such as label updates, because each new dataset version folder holds a full copy of every file, including the unchanged ones.
This pull request introduces an optional feature to enable finer-grained, individual file versioning on the client side by configuring a global deduplicated cache path (EOTDL_CACHE_PATH, disabled by default). Deduplicated files are stored in this global cache, while the dataset version folders (e.g., ~/.cache/eotdl/datasets/x/v1) contain symlinks pointing to the appropriate file versions (e.g., file-a.jpg -> file-a.jpg_2).
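A minimal sketch of the resulting layout, assuming a POSIX file system with symlink support. The function `link_version`, the manifest shape, and the temporary directories are hypothetical and stand in for the real cache path and version folders; how EOTDL itself reads `EOTDL_CACHE_PATH` is not shown here.

```python
import os
import tempfile
from pathlib import Path


def link_version(cache_dir: Path, version_dir: Path, manifest: dict) -> None:
    """Populate a dataset version folder with symlinks into the dedup cache.

    `manifest` maps a logical file name to its file version (hypothetical shape).
    """
    version_dir.mkdir(parents=True, exist_ok=True)
    for name, file_version in manifest.items():
        target = cache_dir / f"{name}_{file_version}"
        link = version_dir / name
        # Replace any stale link or file left over from an earlier run.
        if link.is_symlink() or link.exists():
            link.unlink()
        link.symlink_to(target)


# Demo in a throwaway directory instead of the real ~/.cache/eotdl tree.
root = Path(tempfile.mkdtemp())
cache = root / "dedup-cache"  # stands in for the EOTDL_CACHE_PATH location
cache.mkdir()
(cache / "file-a.jpg_1").write_text("a v1")
(cache / "file-a.jpg_2").write_text("a v2")
(cache / "file-c.jpg_1").write_text("c v1")

v2_dir = root / "datasets" / "x" / "v2"
link_version(cache, v2_dir, {"file-a.jpg": 2, "file-c.jpg": 1})

for entry in sorted(v2_dir.iterdir()):
    print(entry.name, "->", os.readlink(entry))
```

Each file version is stored once in the cache, and every dataset version folder that needs it holds only a link.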
This feature gives local users some optional flexibility, but more importantly it benefits platform operators who bundle the EOTDL CLI or Python library. By configuring a global, shared, read-only cache, operators can make data available in ~/.cache/eotdl/datasets/x, and if changes are needed, symbolic links can be replaced with the updated file content. This makes it possible to support different versions of datasets/models for all platform users at the same time.