
Add sample and library metadata to colData for merged objects #630

Merged: 20 commits from allyhawkins/metadata-to-coldata into development, Jan 3, 2024

Conversation

@allyhawkins (Member) commented Dec 20, 2023

Closes #627

This PR updates the merged objects to include any sample metadata and some library metadata in the colData rather than in the metadata(sce).

  • For all non-multiplexed merged SCEs, I use the existing scpcaTools::metadata_to_coldata() function to add the sample metadata to the colData, then remove sample_metadata from the object's metadata so it is only present in the colData. Because it is removed, I can check whether sample_metadata is still present in the metadata prior to AnnData conversion, so we don't accidentally try to add it to the colData twice. (See the R sketch after this list.)
  • I chose not to do this for multiplexed objects because we would have to merge by both library_id and a sample_id column, and since we demultiplex using multiple methods, I don't want us to decide which method's sample assignments to use. I think it is less confusing to just include sample_metadata as part of the metadata, as we do for all other single libraries.
  • To accommodate this, I added an is_multiplexed argument to the script and then pass that in from Nextflow in the same way as has_adt. This means that if a project contains any multiplexed libraries, then no sample metadata will be added to the colData, which I think is what we want.
  • Related to the point above, I filtered the output from merging SCEs so that multiplexed libraries are not converted to AnnData.
  • For the library metadata, the only fields I thought needed to be easily accessible were tech_version and assay_ontology_term_id, so I explicitly grab those from metadata$library_metadata and join them with the colData. Let me know if there are other pieces of library metadata that should be included; see https://github.com/AlexsLemonade/scpca-docs/blob/main/docs/sce_file_contents.md#experiment-metadata for a full list.
    Edit: I also chose to add seq_unit because I want to be able to display it in the merged report.
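For concreteness, here is a minimal R sketch of the colData changes described above. It assumes a merged object named merged_sce whose metadata() holds sample_metadata and a per-library library_metadata list; the exact arguments of scpcaTools::metadata_to_coldata() may differ from what is shown here.

    # illustrative sketch only; not the PR's actual code
    suppressPackageStartupMessages({
      library(SingleCellExperiment)
      library(dplyr)
    })

    # move the sample metadata stored in metadata(merged_sce) into the colData
    merged_sce <- scpcaTools::metadata_to_coldata(merged_sce)

    # drop it from metadata() so it lives only in the colData; the AnnData
    # conversion step can then check for its presence and skip re-adding it
    metadata(merged_sce)$sample_metadata <- NULL

    # assume library_metadata is a list of per-library lists: bind it into a
    # data frame and keep only the fields called out above
    library_df <- bind_rows(metadata(merged_sce)$library_metadata) |>
      select(library_id, tech_version, assay_ontology_term_id, seq_unit)

    # join the selected library fields into the existing colData by library_id
    coldata_df <- colData(merged_sce) |>
      as.data.frame() |>
      left_join(library_df, by = "library_id")

    colData(merged_sce) <- DataFrame(coldata_df, row.names = colnames(merged_sce))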

@jashapiro (Member) left a comment

This looks good overall. I had a few fairly small suggestions, some of which are just safety changes, and one preference on variable naming.

The one thing I think may be missing is that the merge report process inputs will have to be changed. I think it snuck through because we don't currently test this workflow as part of the stub checks. We should probably add that to the GHA checks before we go too much further (hopefully now is a good time).

merge.nf (outdated diff)

    output:
    -   tuple val(merge_group_id), val(has_adt), path(merged_sce_file)
    +   tuple val(merge_group_id), val(has_adt), val(is_multiplexed), path(merged_sce_file)
@jashapiro (Member) commented:

Just a thought that since is_multiplexed is going to be dropped later, it might make sense to put it at the end? Then you can use .take() to simplify the removal to just drop the last element.
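Something like this, as a sketch (channel and variable names here are illustrative, not the PR's actual code):

    // if the output emits the flag last, e.g.
    //   tuple val(merge_group_id), val(has_adt), path(merged_sce_file), val(is_multiplexed)
    // then downstream consumers that don't need the flag can simply do:
    merge_sce.out
        .map{ it.take(3) } // Groovy's take(3) keeps the first three elements, dropping the flag
        .set{ merged_no_flag_ch }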

@@ -162,5 +169,13 @@ workflow {
generate_merge_report(merge_sce.out, file(merge_template))
@jashapiro (Member) commented:

You will need to update the generate_merge_report process to expect the extra value in the tuple, right?
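For example, something along these lines (an assumption about the process definition, not the actual code):

    // generate_merge_report, updated to expect the extra value in the incoming tuple
    input:
        tuple val(merge_group_id), val(has_adt), val(is_multiplexed), path(merged_sce_file)
        path(merge_template)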

@allyhawkins (Member, Author)

Thanks for the suggestions @jashapiro. I updated the variable name to use multiplexed throughout instead of is_multiplexed. I also updated the order of the inputs and outputs and used .take() instead of an extra map command to remove one element.

I did try to add a stub run of the merge workflow to the GHA stub checks; however, we don't have the input files available in the repo. And I can't commit them unless we update our pre-commit config to allow .rds files. Do we want to do that? Or do you have another suggestion on how to handle that?

@jashapiro (Member)

> And I can't commit them unless we update our pre-commit config to allow .rds files. Do we want to do that? Or do you have another suggestion on how to handle that?

You can skip the pre-commit hooks with --no-verify, which I think is what we would want to do for a case like this:

    git commit -m 'add stub rds files' --no-verify

@jashapiro (Member) left a comment

A couple quick things...

@allyhawkins (Member, Author)

Okay, the tests are finally working as they should. I added a new stub library for the first project so that we have two input files. I also had to actually add the output for that project. I added all of the output, but do we just want to include the processed.rds file?

@jashapiro (Member) commented Jan 2, 2024

> Okay, the tests are finally working as they should. I added a new stub library for the first project so that we have two input files. I also had to actually add the output for that project. I added all of the output, but do we just want to include the processed.rds file?

I think if we can get away with just including the processed.rds files in the stub output directory that is probably the way to go.

@allyhawkins (Member, Author)

> > Okay, the tests are finally working as they should. I added a new stub library for the first project so that we have two input files. I also had to actually add the output for that project. I added all of the output, but do we just want to include the processed.rds file?
>
> I think if we can get away with just including the processed.rds files in the stub output directory that is probably the way to go.

Okay, I removed all the extra files so we are just saving the processed.rds files in the repo.

@allyhawkins requested a review from jashapiro, January 2, 2024 22:37
@jashapiro (Member) left a comment

LGTM, but I realized there is one question I have based on the fact that you didn't add the "fastq" files for the STUBR16. I think this may mean we are actually well off for #639, but I was still curious what was happening here.

@jashapiro (Member)

> LGTM, but I realized there is one question I have based on the fact that you didn't add the "fastq" files for the STUBR16. I think this may mean we are actually well off for #639, but I was still curious what was happening here.

Or maybe it means bad things are happening... it looks like it did run the STUBR16 process all the way through. Which seems less than ideal. (It would probably fail with real files when trying to run alevin, but that's still not great)

@allyhawkins (Member, Author)

> LGTM, but I realized there is one question I have based on the fact that you didn't add the "fastq" files for the STUBR16. I think this may mean we are actually well off for #639, but I was still curious what was happening here.

I honestly didn't even think to add them. Should I do that? The workflow seems to be running without them... that seems kind of concerning...

@jashapiro (Member) commented Jan 3, 2024

> > LGTM, but I realized there is one question I have based on the fact that you didn't add the "fastq" files for the STUBR16. I think this may mean we are actually well off for #639, but I was still curious what was happening here.
>
> I honestly didn't even think to add them. Should I do that? The workflow seems to be running without them... that seems kind of concerning...

I think it works because the files are never actually used in the stub, and the file() call doesn't check that the files exist by default. So I think we are actually okay leaving it as is, and we can use this absence when implementing #639.
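For what it's worth, Nextflow's file() can be asked to validate paths up front, which would make missing stub inputs fail fast; a quick sketch using a hypothetical parameter name:

    // default: no existence check, so missing stub fastq files pass through silently
    def fastq_dir = file(params.fastq_dir)

    // opt-in validation: raises an error if the path does not exist
    def fastq_dir_checked = file(params.fastq_dir, checkIfExists: true)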

@allyhawkins merged commit 331e34b into development on Jan 3, 2024 (3 checks passed)
@allyhawkins deleted the allyhawkins/metadata-to-coldata branch January 3, 2024 16:37