feat(ci): Mirror NCBI datasets results to Hetzner object storage to reduce NIH load and get independence from external service #3709

Merged: 11 commits into main from datasets-mirroring, Feb 18, 2025

@corneliusroemer (Contributor) commented Feb 17, 2025

Mirror NCBI Datasets zip archives to Hetzner object storage to reduce load on NIH servers and decouple our CI from external services.

Right now all Loculus/PPX organisms are covered; Taxon IDs for the Genspectrum organisms can easily be added, @anna-parker.

The scrape runs once a day. Each zip archive can be downloaded at:

https://hel1.your-objectstorage.com/loculus-public/mirror/{TAXONID}.zip

Here TAXONID is the NCBI taxon ID being scraped, e.g. 10244 for mpox.
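
For illustration, a minimal sketch of building the mirror URL for a taxon ID and streaming the archive to disk. Only the URL pattern comes from this PR; the names `MIRROR_BASE`, `mirror_url` and `download_mirror` and the 60-second timeout are assumptions made for the example, not code from the workflow.

```python
import shutil
import urllib.request
from pathlib import Path

# URL pattern taken from the PR description; the constant name is hypothetical.
MIRROR_BASE = "https://hel1.your-objectstorage.com/loculus-public/mirror"


def mirror_url(taxon_id: int) -> str:
    """Return the public mirror URL for one NCBI taxon ID."""
    return f"{MIRROR_BASE}/{taxon_id}.zip"


def download_mirror(taxon_id: int, dest_dir: Path) -> Path:
    """Stream the mirrored zip for `taxon_id` into `dest_dir` and return its path."""
    dest = dest_dir / f"{taxon_id}.zip"
    # The timeout value is an arbitrary choice for this sketch.
    with urllib.request.urlopen(mirror_url(taxon_id), timeout=60) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest
```

For mpox, `mirror_url(10244)` gives `https://hel1.your-objectstorage.com/loculus-public/mirror/10244.zip`.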

As a next step, we can optionally use these URLs to download the zip archives in ingest.
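
One way ingest could adopt the mirror is to try it first and fall back to the original source if the mirror is unreachable. The sketch below is not part of this PR: `fetch_with_fallback`, the `Fetcher` type and the `"mirror"`/`"ncbi"` labels are hypothetical, and the two fetchers are passed in so that no particular download API is assumed.

```python
from typing import Callable, Tuple

# A fetcher takes a taxon ID and returns the zip archive as bytes (hypothetical type).
Fetcher = Callable[[int], bytes]


def fetch_with_fallback(taxon_id: int, primary: Fetcher, fallback: Fetcher) -> Tuple[bytes, str]:
    """Try `primary` (e.g. the mirror); on a network/IO error use `fallback`.

    Returns the archive bytes and a label saying which source served them.
    """
    try:
        return primary(taxon_id), "mirror"
    except OSError as err:  # urllib.error.URLError is a subclass of OSError
        print(f"mirror failed for taxon {taxon_id}: {err}; falling back")
        return fallback(taxon_id), "ncbi"
```

Keeping the fetchers injectable also makes the fallback path easy to exercise in tests without touching the network.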

What happens if I overwrite an object and the upload fails midway?

  • Single PUT: The old object remains unchanged; no partial overwrite.
  • Multipart Upload: Parts already uploaded are kept until the upload is completed or aborted; the existing object stays unchanged until the upload completes.

What happens if someone downloads while I overwrite an object?

  • They get the old version until the new upload is fully completed.
  • No partial or mixed content; updates are atomic.

Does S3 allow partial writes when overwriting?

  • No, an object is only replaced when the upload fully succeeds.
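
Because parts of an incomplete multipart upload stay in storage until the upload is completed or aborted, a periodic cleanup can be useful. Below is a minimal sketch, assuming an S3-compatible client that exposes `list_multipart_uploads` and `abort_multipart_upload` as boto3's S3 client does. The function name, the `max_age` parameter and the bucket and prefix shown afterwards are assumptions for illustration, and whether Hetzner object storage supports these calls is not confirmed in this PR.

```python
from datetime import datetime, timedelta, timezone
from typing import Any, List


def abort_stale_multipart_uploads(client: Any, bucket: str, prefix: str, max_age: timedelta) -> List[str]:
    """Abort multipart uploads under `prefix` started more than `max_age` ago.

    `client` is expected to behave like a boto3 S3 client. Returns the keys
    whose uploads were aborted. Only the first page of results is handled.
    """
    cutoff = datetime.now(timezone.utc) - max_age
    response = client.list_multipart_uploads(Bucket=bucket, Prefix=prefix)
    aborted: List[str] = []
    # "Uploads" is absent from the response when nothing is in progress.
    for upload in response.get("Uploads", []):
        if upload["Initiated"] < cutoff:
            client.abort_multipart_upload(
                Bucket=bucket, Key=upload["Key"], UploadId=upload["UploadId"]
            )
            aborted.append(upload["Key"])
    return aborted
```

With boto3 installed, a client could be created with `boto3.client("s3", endpoint_url="https://hel1.your-objectstorage.com")` and passed in with, for example, bucket `loculus-public` and prefix `mirror/`. Both values are inferred from the download URL and may not match the actual configuration.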

@corneliusroemer changed the title from "Datasets mirroring" to "feat(ci): Mirror NCBI datasets results to Hetzner object storage to reduce NIH load and get independence from external service" on Feb 18, 2025
@theosanderson (Member) left a comment

Nice!

@corneliusroemer merged commit 5a041d8 into main on Feb 18, 2025
16 checks passed
@corneliusroemer deleted the datasets-mirroring branch on February 18, 2025, 22:12

2 participants