Osdf cache support #221

Merged
merged 4 commits into from
Dec 16, 2024

Conversation

atripathy86
Contributor

Related Issues / Pull Requests

#220

Description

Adds functionality for downloading and verifying OSDF artifacts from caches with MD5 hash validation

What changes are proposed in this pull request?

  • Bug fix (non-breaking change which fixes an issue).
  • New feature (non-breaking change which adds functionality).
  • Breaking change (fix or feature that would cause existing functionality to not work as expected; for instance,
    examples in this repository need to be updated too).
  • This change requires a documentation update.

Checklist:

  • My code follows the style guidelines of this project (PEP-8 with Google-style docstrings).
  • My code modifies existing public API, or introduces new public API, and I updated or wrote docstrings that
    use Google-style formatting and any other formatting that is supported by mkdocs and the plugins this project
    uses.
  • I have commented my code.
  • My code requires documentation updates, and I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works.
  • New and existing unit tests pass locally with my changes.

Example Usage

  • Enhances the `cmf init osdfremote` command with `--cache` as an optional argument
$ cmf init osdfremote --path https://sdsc-origin.nationalresearchplatform.org:8443/nrp/fdp/cmf_test \
        --cache https://osdf-director.osg-htc.org/ \
        --key-id c2a5 \
        --key-path ~/.ssh/fdp.pem \
        --key-issuer https://t.nationalresearchplatform.org/fdp \
        --git-remote-url https://github.com/user/experiment-repo.git
git_dir /home/tripataa/osdf_pull_test/.git
Starting git init.
*** Note: CMF will check out a new branch in git to commit the metadata files ***
*** The checked out branch is master. ***
git init complete.
Starting cmf init.
Setting 'osdf' as a default remote.
cmf init complete.
  • Example of output when pulling from cache
$ cmf artifact pull -p Test-env
Fetching artifact=artifacts/data.xml.gz, surl=https://osdf-director.osg-htc.org/nrp/fdp/cmf_test/23/6d9502e0283d91f689d7038b8508a2 to /home/tripataa/osdf_pull_test/artifacts/data.xml.gz
object artifacts/data.xml.gz downloaded at /home/tripataa/osdf_pull_test/artifacts/data.xml.gz in 0.05 seconds and matches MLMD records.
Fetching artifact=artifacts/parsed/train.tsv, surl=https://osdf-director.osg-htc.org/nrp/fdp/cmf_test/32/b715ef0d71ff4c9e61f55b09c15e75 to /home/tripataa/osdf_pull_test/artifacts/parsed/train.tsv
object artifacts/parsed/train.tsv downloaded at /home/tripataa/osdf_pull_test/artifacts/parsed/train.tsv in 0.09 seconds and matches MLMD records.
Fetching artifact=artifacts/parsed/test.tsv, surl=https://osdf-director.osg-htc.org/nrp/fdp/cmf_test/6f/597d341ceb7d8fbbe88859a892ef81 to /home/tripataa/osdf_pull_test/artifacts/parsed/test.tsv
object artifacts/parsed/test.tsv downloaded at /home/tripataa/osdf_pull_test/artifacts/parsed/test.tsv in 0.02 seconds and matches MLMD records.
Done
  • Example of output when pulling from the cache fails and the pull falls back to the specified origin
    • A failure can be caused by a hash mismatch or a timeout. The example below uses a wrong cache path, so each
      cache request times out and the artifact is then recovered from the origin.
$ cmf artifact pull -p Test-env
Fetching artifact=artifacts/data.xml.gz, surl=https://osdf-director.osg-htc.org//nrp/fdp/cmf_test/nrp/fdp/cmf_test/23/6d9502e0283d91f689d7038b8508a2 to /home/tripataa/osdf_pull_test/artifacts/data.xml.gz
Failed to download and verify file from cache: The request timed out.
Trying Origin at https://sdsc-origin.nationalresearchplatform.org:8443/nrp/fdp/cmf_test/23/6d9502e0283d91f689d7038b8508a2
Fetching artifact=artifacts/data.xml.gz, surl=https://sdsc-origin.nationalresearchplatform.org:8443/nrp/fdp/cmf_test/23/6d9502e0283d91f689d7038b8508a2 to /home/tripataa/osdf_pull_test/artifacts/data.xml.gz
object artifacts/data.xml.gz downloaded at /home/tripataa/osdf_pull_test/artifacts/data.xml.gz in 0.05 seconds and matches MLMD records.
Fetching artifact=artifacts/parsed/train.tsv, surl=https://osdf-director.osg-htc.org//nrp/fdp/cmf_test/nrp/fdp/cmf_test/32/b715ef0d71ff4c9e61f55b09c15e75 to /home/tripataa/osdf_pull_test/artifacts/parsed/train.tsv
Failed to download and verify file from cache: The request timed out.
Trying Origin at https://sdsc-origin.nationalresearchplatform.org:8443/nrp/fdp/cmf_test/32/b715ef0d71ff4c9e61f55b09c15e75
Fetching artifact=artifacts/parsed/train.tsv, surl=https://sdsc-origin.nationalresearchplatform.org:8443/nrp/fdp/cmf_test/32/b715ef0d71ff4c9e61f55b09c15e75 to /home/tripataa/osdf_pull_test/artifacts/parsed/train.tsv
object artifacts/parsed/train.tsv downloaded at /home/tripataa/osdf_pull_test/artifacts/parsed/train.tsv in 0.09 seconds and matches MLMD records.
Fetching artifact=artifacts/parsed/test.tsv, surl=https://osdf-director.osg-htc.org//nrp/fdp/cmf_test/nrp/fdp/cmf_test/6f/597d341ceb7d8fbbe88859a892ef81 to /home/tripataa/osdf_pull_test/artifacts/parsed/test.tsv
Failed to download and verify file from cache: The request timed out.
Trying Origin at https://sdsc-origin.nationalresearchplatform.org:8443/nrp/fdp/cmf_test/6f/597d341ceb7d8fbbe88859a892ef81
Fetching artifact=artifacts/parsed/test.tsv, surl=https://sdsc-origin.nationalresearchplatform.org:8443/nrp/fdp/cmf_test/6f/597d341ceb7d8fbbe88859a892ef81 to /home/tripataa/osdf_pull_test/artifacts/parsed/test.tsv
object artifacts/parsed/test.tsv downloaded at /home/tripataa/osdf_pull_test/artifacts/parsed/test.tsv in 0.02 seconds and matches MLMD records.
Done
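
The cache-then-origin behaviour shown in the output above can be sketched roughly as follows. This is a minimal illustration, not the code merged in this PR: `pull_with_fallback` and the `fetch` callable are hypothetical names, and `fetch` is assumed to download one URL, verify it, and return `True` on success (raising `OSError`, for example a timeout, on a transport failure).

```python
from typing import Callable, Optional


def pull_with_fallback(
    cache_url: Optional[str],
    origin_url: str,
    fetch: Callable[[str], bool],
) -> str:
    """Try the cache first, then the origin; return the URL that succeeded.

    `fetch` is a hypothetical callable (an assumption of this sketch) that
    downloads and verifies one URL, returning True when the file is good.
    """
    if cache_url is not None:
        try:
            if fetch(cache_url):
                return cache_url
            print("Failed to download and verify file from cache: hash mismatch.")
        except OSError as exc:  # TimeoutError is a subclass of OSError
            print(f"Failed to download and verify file from cache: {exc}")
        print(f"Trying Origin at {origin_url}")

    # No cache configured, or the cache attempt failed: use the origin.
    if fetch(origin_url):
        return origin_url
    raise RuntimeError(f"Could not fetch artifact from origin {origin_url}")
```

Passing the downloader in as a callable keeps the fallback decision separate from the transport, so the logic can be exercised without any network access.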

…aches with MD5 hash validation

- Add optional argument `--cache` for specifying the OSDF cache path.
- Implement `download_and_verify_file` function to download a file,
  write it to disk, and verify its MD5 hash.
- Identify the hash recorded in the MLMD file.
- If the calculated hash and the recorded hash match, `cmf artifact pull`
  is successful.
- If the cache is not specified or the hashes don't match, pull from the
  origin and assume it is correct.
- Add `--cache` as an optional argument in `cmf_client.md`.
- If a cache or redirector URL is not specified, cmf pulls from the
  origin recorded in the MLMD file.
- Parse the user-supplied cache down to its scheme+netloc only; skip the path.
- The assumption is that the path from MLMD will be more accurate.
- `cached_url` is the netloc of the supplied cache plus the path from the
  MLMD records.
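
The URL construction and hash check described in this commit message might look something like the sketch below. It is an illustration under stated assumptions rather than the merged implementation: `build_cached_url` and `md5_matches` are hypothetical helper names (the PR's own function is `download_and_verify_file`, whose signature is not shown here), and only the standard library is used.

```python
import hashlib
from urllib.parse import urlparse


def build_cached_url(cache: str, origin_url: str) -> str:
    """Combine the cache's scheme+netloc with the path recorded for the origin.

    Any path on the user-supplied cache URL is ignored, on the assumption
    that the path from the MLMD record is the accurate one.
    """
    cache_parts = urlparse(cache)
    origin_parts = urlparse(origin_url)
    return f"{cache_parts.scheme}://{cache_parts.netloc}{origin_parts.path}"


def md5_matches(path: str, expected_md5: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the file at `path` has the expected MD5 hex digest."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        # Read in chunks so large artifacts are not loaded into memory at once.
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.lower()
```

With the values from the example above, `build_cached_url("https://osdf-director.osg-htc.org/", "https://sdsc-origin.nationalresearchplatform.org:8443/nrp/fdp/cmf_test/23/6d9502e0283d91f689d7038b8508a2")` yields the cache URL seen in the successful pull output. A cache argument that carries its own path produces the same result, because that path is dropped.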
@atripathy86
Contributor Author

A comment got left out. Restored it from main. Therefore needs the merge commit.

@annmary-roy annmary-roy merged commit bf9c559 into master Dec 16, 2024