feat(ci): Mirror NCBI datasets results to Hetzner object storage to reduce NIH load and get independence from external service #3709

corneliusroemer · 2025-02-17T21:51:25Z

Mirror NCBI datasets zip folders to Hetzner object storage to reduce load on NIH servers and uncouple our CI from external services.

Right now all Loculus/PPX organisms are covered; we can easily add taxon IDs for the Genspectrum ones, @anna-parker.

The scrape runs once a day. Each taxon's zip archive can be downloaded at:

https://hel1.your-objectstorage.com/loculus-public/mirror/{TAXONID}.zip

where TAXONID is the NCBI taxon ID we scrape, e.g. 10244 for mpox.
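A minimal sketch of how a consumer might build the mirror URL and fetch an archive, using only the Python standard library. The URL pattern is the one given above; the function names `mirror_url` and `download_mirror` are hypothetical and not part of this PR.

```python
import shutil
import urllib.request
from pathlib import Path

# URL pattern as given in this PR description.
MIRROR_URL_TEMPLATE = (
    "https://hel1.your-objectstorage.com/loculus-public/mirror/{taxon_id}.zip"
)


def mirror_url(taxon_id: int) -> str:
    """Return the mirror URL for one NCBI taxon ID."""
    return MIRROR_URL_TEMPLATE.format(taxon_id=taxon_id)


def download_mirror(taxon_id: int, dest_dir: Path) -> Path:
    """Download the mirrored zip for `taxon_id` into `dest_dir` (hypothetical helper)."""
    dest = dest_dir / f"{taxon_id}.zip"
    # urlopen raises urllib.error.HTTPError on 4xx/5xx responses,
    # e.g. when a taxon ID has not been mirrored.
    with urllib.request.urlopen(mirror_url(taxon_id), timeout=60) as response:
        with open(dest, "wb") as out:
            shutil.copyfileobj(response, out)
    return dest
```

For example, `download_mirror(10244, Path("."))` would fetch the mpox archive into the current directory.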

As a next step, we can optionally use these paths to download the zip folders in ingest.

What happens if I overwrite an object and the upload fails midway?

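Single PUT: The old object remains unchanged; there is no partial overwrite.
Multipart upload: The old object also stays in place until the upload completes. Parts that were already uploaded are kept until the upload is either completed or aborted, so a failed upload must be retried to completion or aborted.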
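The two cases above can be sketched as follows. This is an illustration under stated assumptions, not the upload code used by this PR: `client` is assumed to be an S3-compatible client exposing the boto3 method names `put_object`, `list_multipart_uploads`, and `abort_multipart_upload` (for example one created with `boto3.client("s3", endpoint_url=...)`), and the `mirror/` prefix is inferred from the download URL. The helper names are hypothetical.

```python
from typing import Any


def upload_single_put(client: Any, bucket: str, key: str, path: str) -> None:
    """Upload `path` in one PUT request.

    If the request fails midway, the previously stored object stays as it was.
    """
    with open(path, "rb") as body:
        client.put_object(Bucket=bucket, Key=key, Body=body)


def abort_incomplete_uploads(client: Any, bucket: str, prefix: str = "mirror/") -> int:
    """Abort multipart uploads under `prefix` that were started but never completed.

    Returns the number of uploads aborted. Only the first page of results is
    handled; a real cleanup job would also follow the pagination markers.
    """
    listing = client.list_multipart_uploads(Bucket=bucket, Prefix=prefix)
    uploads = listing.get("Uploads", [])
    for upload in uploads:
        client.abort_multipart_upload(
            Bucket=bucket, Key=upload["Key"], UploadId=upload["UploadId"]
        )
    return len(uploads)
```

Passing the client in as an argument keeps the sketch independent of any particular SDK setup or credentials handling.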

What happens if someone downloads while I overwrite an object?

They get the old version until the new upload is fully completed.
No partial or mixed content; updates are atomic.

Does S3 allow partial writes when overwriting?

No, an object is only replaced when the upload fully succeeds.
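Given these guarantees, a reader should always receive a complete archive. A consumer such as ingest could still verify this defensively after downloading; the sketch below uses the standard-library `zipfile` module, and the helper name `is_complete_zip` is hypothetical.

```python
import io
import zipfile


def is_complete_zip(data: bytes) -> bool:
    """Return True if `data` is a readable zip whose members all pass their CRC check."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            # testzip() returns the name of the first corrupt member,
            # or None if every member is intact.
            return archive.testzip() is None
    except zipfile.BadZipFile:
        # Raised, for instance, when the end-of-archive record is missing
        # because the file was truncated.
        return False
```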

theosanderson

Nice!

corneliusroemer requested a review from anna-parker February 17, 2025 23:32

corneliusroemer added 11 commits February 18, 2025 20:01

Try out mirroring

b628a6d

fix mamba

5bd5fcc

Let's have some more fun

1b251a3

try one more

59d58fc

getting there

6eb33d9

Scrape all ppx organisms

2e84d12

fix

d26bfba

don't fail fast

33dee43

fix

57fdae6

refactor

c2322c3

remove on push

d2935fe

corneliusroemer force-pushed the datasets-mirroring branch from af7fa34 to d2935fe Compare February 18, 2025 19:01

corneliusroemer changed the title ~~Datasets mirroring~~ feat(ci): Mirror NCBI datasets results to Hetzner object storage to reduce NIH load and get independence from external service Feb 18, 2025

theosanderson approved these changes Feb 18, 2025


corneliusroemer merged commit 5a041d8 into main Feb 18, 2025
16 checks passed

corneliusroemer deleted the datasets-mirroring branch February 18, 2025 22:12

pathoplexus-bot mentioned this pull request Feb 19, 2025

Update Loculus version to 15634c pathoplexus/pathoplexus#397

Merged
