-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add sample and library metadata to colData for merged objects #630
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good overall. I had a few fairly small suggestions, with some being just kind of safety changes, and one preference on variable naming.
The one thing I think may be missing is that the merge report process inputs will have to be changed. I think it sneaks in here because we don't currently test this workflow as part of the stub checks. We should probably add that to the GHA checks before we go too much further (hopefully now is a good time).
merge.nf
Outdated
output: | ||
tuple val(merge_group_id), val(has_adt), path(merged_sce_file) | ||
tuple val(merge_group_id), val(has_adt), val(is_multiplexed), path(merged_sce_file) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a thought that since is_multiplexed
is going to be dropped later, it might make sense to put it at the end? Then you can use .take()
to simplify the removal to just drop the last element.
@@ -162,5 +169,13 @@ workflow { | |||
generate_merge_report(merge_sce.out, file(merge_template)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You will need to update the generate_merge_report
process to expect the extra value in the tuple, right?
Thanks for the suggestions @jashapiro. I updated the variable name and use I did try to add a stub run of the merge workflow to the GHA for checking the stub, however, we don't have the input files available on the repo. And I can't commit them unless we update our config for |
You can skip the precommit hooks with
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A couple quick things...
Co-authored-by: Joshua Shapiro <[email protected]>
Co-authored-by: Joshua Shapiro <[email protected]>
This reverts commit 77e7dd0.
Okay the tests are finally working as they should. I added a new stub library for the first project so that we have two input files. I also had to actually add the output for that project. I added all of the output, but do we just want to include the |
Co-authored-by: Joshua Shapiro <[email protected]>
I think if we can get away with just including the |
Okay, I removed all the extra files so we are just saving the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, but I realized there is one question I have based on the fact that you didn't add the "fastq" files for the STUBR16. I think this may mean we are actually well off for #639, but I was still curious what was happening here.
Or maybe it means bad things are happening... it looks like it did run the STUBR16 process all the way through. Which seems less than ideal. (It would probably fail with real files when trying to run alevin, but that's still not great) |
I honestly didn't even think to add them. Should I do that? The workflow seems to be running without them... that seems kind of concerning... |
I think it works because the files are never actually used in the stub, and the |
Closes #627
This PR updates the merged objects to include any sample metadata and some library metadata in the
colData
rather than in themetadata(sce)
.scpcaTools::metadata_to_coldata()
function to add the sample metadata to thecolData
. Then I remove thesample_metadata
from the metadata in the object so it's only present in thecolData
. Because we are removing it, I am able to add a check for if thesample_metadata
is present in the metadata prior to conversion to AnnData so we don't accidentally try and add it to thecolData
twice.library_id
and asample_id
column. However, we demultiplex using multiple methods, so I don't want us to decide which method to use to assign samples. I think it would be less confusing to just include thesample_metadata
as part of the metadata, like we do for all other single libraries.is_multiplexed
argument to the script and then pass that in from Nextflow in a similar way to thehas_adt
. This means that if a project contains any multiplexed libraries then no sample metadata will be added to thecolData
, which I think is what we want to do.AnnData
.tech_version
andassay_ontology_term_id
. So I explicitly grab those frommetadata$library_metadata
and join with thecolData
. Let me know if there are other pieces of library metadata that should be included. See https://github.com/AlexsLemonade/scpca-docs/blob/main/docs/sce_file_contents.md#experiment-metadata for a full list.Edit: I also chose to add
seq_unit
because I want to be able to display it in the merged report.