Make pipeline version per Organism #3534

Open
wants to merge 42 commits into
base: main

fhennig
Contributor

@fhennig fhennig commented Jan 16, 2025

resolves #1728

preview URL: https://pipeline-v-per-organism.loculus.org

Summary

The current_processing_pipeline table now has a new column: organism. We now store the current pipeline for each organism.
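As a rough illustration of the summary above, here is a minimal sketch of a per-organism version table, using Python's built-in sqlite3 as a stand-in for the real database. The column names and types, the `current_version` helper, and the organism names are assumptions made for illustration; they are not taken from the actual migration in this PR.

```python
import sqlite3


def create_table(conn):
    # Hypothetical schema: one row per organism instead of one global row.
    conn.execute(
        "CREATE TABLE current_processing_pipeline ("
        " organism TEXT PRIMARY KEY,"
        " version INTEGER NOT NULL)"
    )


def current_version(conn, organism):
    # Look up the current pipeline version for a single organism.
    row = conn.execute(
        "SELECT version FROM current_processing_pipeline WHERE organism = ?",
        (organism,),
    ).fetchone()
    return None if row is None else row[0]


conn = sqlite3.connect(":memory:")
create_table(conn)
conn.executemany(
    "INSERT INTO current_processing_pipeline (organism, version) VALUES (?, ?)",
    [("organism-a", 1), ("organism-b", 2)],
)
```

With this shape, every lookup has to name an organism, and an organism that has no row yet yields `None` rather than a shared default.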

TODO: How does this affect users? Cornelius said there was previously a constraint to have to use the same version in the values.yaml for all organisms, does that still need to be removed?

Implementation notes

Before this change, we stored "V1" in this table (current_processing_pipeline) - that was simple and could be done without knowing the user configuration. Some of the additional complexity in this PR arises from having to configure a CurrentPipelineVersion for each organism (and also, for example, from having to reset it in the tests).

Future work could include either making entries in the table optional and filling them in only once the first sequence for an organism is submitted, or introducing a mechanism to register pipeline versions via the API. I opted against making any of these changes to keep the PR small and not change too much of how things currently function.

Testing

  • Existing tests were updated
  • A backend test was added to check that if the pipeline version is updated for one organism, another organism's pipeline version is not changed.
  • A backend test was added to check that if the initialization runs again, we don't insert v1 rows for organisms that are already in the table.
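The two behaviours checked by the new backend tests can be sketched as follows. This uses Python's sqlite3 rather than the backend's actual code; `initialize`, `set_version`, `versions`, and the organism names are hypothetical, and `INSERT OR IGNORE` is SQLite's way of skipping rows that already exist.

```python
import sqlite3


def initialize(conn, organisms):
    # Safe to run on every startup: existing rows are left untouched.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS current_processing_pipeline ("
        " organism TEXT PRIMARY KEY,"
        " version INTEGER NOT NULL)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO current_processing_pipeline (organism, version)"
        " VALUES (?, 1)",
        [(organism,) for organism in organisms],
    )


def set_version(conn, organism, version):
    # Only the row of the given organism is modified.
    conn.execute(
        "UPDATE current_processing_pipeline SET version = ? WHERE organism = ?",
        (version, organism),
    )


def versions(conn):
    # Return the whole table as {organism: version}.
    return dict(
        conn.execute("SELECT organism, version FROM current_processing_pipeline")
    )


conn = sqlite3.connect(":memory:")
initialize(conn, ["organism-a", "organism-b"])
set_version(conn, "organism-a", 2)
# A second initialization (e.g. after a restart) with one extra organism.
initialize(conn, ["organism-a", "organism-b", "organism-c"])
```

After the second `initialize` call, organism-a keeps version 2, organism-b stays at 1, and only the newly configured organism-c gets a fresh version 1 row.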

Docs

Document pipeline version concept better. Relevant link: https://github.com/loculus-project/loculus/blob/main/preprocessing/specification.md#reprocessing

Screenshot

backend change - n/a

PR Checklist

  • All necessary documentation has been adapted.
  • The implemented feature is covered by an appropriate test.

@fhennig fhennig self-assigned this Jan 16, 2025
@fengelniederhammer
Contributor

Wow, the pipeline version in the backend is not per organism? But in the values.yml it's per organism, isn't it?

@fhennig
Contributor Author

fhennig commented Jan 16, 2025

What do you mean? The pipeline version is internal, so it doesn't appear in the values.yaml if I'm not mistaken.

@chaoran-chen
Member

The pipeline version also occurs in the values.yaml because it's also used to deploy the pipelines.

@fengelniederhammer
Contributor

  preprocessing:
    - &preprocessing
      version: 1

This is the pipeline version, isn't it?

@theosanderson
Member

theosanderson commented Jan 16, 2025

Currently, for the backend (i.e. in the database), there is a single "latest pipeline version". So if you make a new pipeline version for one organism in the values.yaml, you have to do so for all organisms. This issue is (intended to be) about addressing that.

@fhennig
Contributor Author

fhennig commented Jan 23, 2025

Note to self: The table now starts out empty, somehow it needs to be filled. (I don't think there is a way around the table starting empty now? Because we don't know which organisms there are)

@fhennig fhennig force-pushed the pipeline-v-per-organism branch from 170d08b to b6997b9 Compare January 23, 2025 11:54
@fhennig
Contributor Author

fhennig commented Jan 23, 2025

Update: I found a way to initialize the table on backend-startup.

I have now marked some places in the schema that still need to be adjusted; they currently lead to: "ERROR: more than one row returned by a subquery used as an expression"
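For context: PostgreSQL raises this error when a subquery used as a scalar value, such as `(SELECT version FROM current_processing_pipeline)`, returns more than one row, which happens as soon as the table holds one row per organism. One way to adjust such a place is to correlate the subquery on the organism. The sketch below shows that idea in Python's sqlite3; the `processed_entries` table, its columns, and the data are hypothetical and not the actual schema. Note that SQLite would silently use the first row of an uncorrelated subquery instead of raising an error, so only the adjusted form is shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE current_processing_pipeline (
        organism TEXT PRIMARY KEY,
        version INTEGER NOT NULL
    );
    -- Hypothetical table of processed entries, for illustration only.
    CREATE TABLE processed_entries (
        accession TEXT,
        organism TEXT,
        pipeline_version INTEGER
    );
    INSERT INTO current_processing_pipeline VALUES ('organism-a', 2), ('organism-b', 1);
    INSERT INTO processed_entries VALUES
        ('A1', 'organism-a', 1),
        ('A2', 'organism-a', 2),
        ('B1', 'organism-b', 1);
    """
)

# The subquery is correlated on organism, so it yields at most one row
# for each outer row, no matter how many organisms are in the table.
QUERY = """
    SELECT e.accession
    FROM processed_entries AS e
    WHERE e.pipeline_version = (
        SELECT c.version
        FROM current_processing_pipeline AS c
        WHERE c.organism = e.organism
    )
    ORDER BY e.accession
"""

up_to_date = [row[0] for row in conn.execute(QUERY)]
```

Here `up_to_date` contains only the entries processed with the current version of their own organism.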

@fhennig fhennig added the preview Triggers a deployment to argocd label Jan 23, 2025
@fhennig
Contributor Author

fhennig commented Jan 23, 2025

From all the places I had to touch, I think the general handling of this table isn't the best unfortunately. It is different to all the other tables because it starts out with data, and all the others don't.

@fhennig fhennig force-pushed the pipeline-v-per-organism branch from 1164bc1 to 8009b7f Compare January 28, 2025 10:08
@fhennig
Contributor Author

fhennig commented Jan 28, 2025

Ok, at least some of the test failures are because the Flyway property I added cannot be injected. Maybe I can create a dummy bean for it in the test, or I can somehow make the dependency optional.

Ideally I'd solve the problem of "run x after flyway is finished running" in a different way maybe? Maybe the PostConstruct shouldn't be in the SubmissionDatabaseService, but somewhere else.

Also for anyone who is following along, I wanted to do it this way as a quick way to get it to run, and then iterate on it to make it prettier/less invasive, just so y'all know.

@fhennig fhennig force-pushed the pipeline-v-per-organism branch from fdc37b2 to 3be6556 Compare January 28, 2025 14:11
@fhennig fhennig marked this pull request as ready for review January 29, 2025 09:28
@fhennig
Contributor Author

fhennig commented Jan 29, 2025

I've moved the PR out of Draft, because I'm done with the code+tests. I'll look at the docs today.

@fhennig
Contributor Author

fhennig commented Jan 29, 2025

I have added some docs stuff:

image

Not sure if the new subsection is a good idea or not. But I added a page to explain a bit more what a preprocessing pipeline even is, taking some of the text from the specification.

I imagine that in the near future it'd be good to have a couple of sections like "setup", "preprocessing pipeline", "user management", "" sort of like chapters in a manual.

@fhennig fhennig requested a review from chaoran-chen January 29, 2025 10:24
Member

@chaoran-chen chaoran-chen left a comment

Thank you very much, Felix! The code looks good to me and the tests are reassuring!

As it is a rather complicated change and it is not easy to test manually, I think that it would be great to have another review by someone. @corneliusroemer, @theosanderson, would you like to take a look?

Contributor

@corneliusroemer corneliusroemer left a comment

I'll review this by Thursday EOD

@corneliusroemer
Contributor

corneliusroemer commented Jan 29, 2025

A simple/cheap way to test is to use different values for different pipelines in our default values.yaml

There are lines like this per organism (right now version is shared, we can now undo this constraint)

  preprocessing:
    - &preprocessing
      version: 1
      image: ghcr.io/loculus-project/preprocessing-nextclade

Update: I've now done this in dfdc7d3

@fhennig fhennig force-pushed the pipeline-v-per-organism branch from 5d6d7df to e6d12c1 Compare January 31, 2025 11:51
@fhennig
Contributor Author

fhennig commented Jan 31, 2025

> That's not quite true, it changes what's allowed in the values.yaml: processing version no longer has to be identical across all organisms.

Ah yes, good point! I meant that it isn't breaking (but I know that we use 'breaking' a little differently). From what I gathered, this constraint isn't enforced anywhere though, or is it? AFAICT the Helm chart just passes the version through.

Member

@chaoran-chen chaoran-chen left a comment

Thanks for fixing!

@fhennig
Contributor Author

fhennig commented Feb 3, 2025

I want to test this on staging before merging.

Labels
preview Triggers a deployment to argocd update_db_schema
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make pipeline version scoped at organism level rather than global
6 participants