
Add sample and library metadata to colData for merged objects #630

Merged: 20 commits from allyhawkins/metadata-to-coldata into development, Jan 3, 2024

Conversation

@allyhawkins (Member) commented Dec 20, 2023

Closes #627

This PR updates the merged objects to include any sample metadata and some library metadata in the colData rather than in the metadata(sce).

  • For all non-multiplexed merged SCEs, I use the existing scpcaTools::metadata_to_coldata() function to add the sample metadata to the colData, then remove sample_metadata from the object's metadata so it is only present in the colData. Because it is removed, I can check whether sample_metadata is still present in the metadata prior to AnnData conversion, so we don't accidentally try to add it to the colData twice. (See the R sketch after this list.)
  • I chose not to do this for multiplexed objects because we would have to merge by both library_id and a sample_id column, and since we demultiplex using multiple methods, I don't want us to decide which method's sample assignments to use. I think it is less confusing to just include sample_metadata as part of the metadata, as we do for all other single libraries.
  • To accommodate this, I added an is_multiplexed argument to the script and then pass that in from Nextflow in the same way as has_adt. This means that if a project contains any multiplexed libraries, then no sample metadata will be added to the colData, which I think is what we want.
  • Related to the point above, I filtered the output from merging SCEs so that multiplexed libraries are not converted to AnnData.
  • For the library metadata, the only fields I thought needed to be easily accessible were tech_version and assay_ontology_term_id, so I explicitly grab those from metadata$library_metadata and join them with the colData. Let me know if there are other pieces of library metadata that should be included; see https://github.com/AlexsLemonade/scpca-docs/blob/main/docs/sce_file_contents.md#experiment-metadata for a full list.
    Edit: I also chose to add seq_unit because I want to be able to display it in the merged report.
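For concreteness, here is a minimal R sketch of the colData changes described above. It assumes a merged object named merged_sce whose metadata() holds sample_metadata and a per-library library_metadata list; the exact arguments of scpcaTools::metadata_to_coldata() may differ from what is shown here.

    # illustrative sketch only; not the PR's actual code
    suppressPackageStartupMessages({
      library(SingleCellExperiment)
      library(dplyr)
    })

    # move the sample metadata stored in metadata(merged_sce) into the colData
    merged_sce <- scpcaTools::metadata_to_coldata(merged_sce)

    # drop it from metadata() so it lives only in the colData; the AnnData
    # conversion step can then check for its presence and skip re-adding it
    metadata(merged_sce)$sample_metadata <- NULL

    # assume library_metadata is a list of per-library lists: bind it into a
    # data frame and keep only the fields called out above
    library_df <- bind_rows(metadata(merged_sce)$library_metadata) |>
      select(library_id, tech_version, assay_ontology_term_id, seq_unit)

    # join the selected library fields into the existing colData by library_id
    coldata_df <- colData(merged_sce) |>
      as.data.frame() |>
      left_join(library_df, by = "library_id")

    colData(merged_sce) <- DataFrame(coldata_df, row.names = colnames(merged_sce))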

@jashapiro (Member) left a comment

This looks good overall. I had a few fairly small suggestions, some of which are just safety changes, and one preference on variable naming.

The one thing I think may be missing is that the merge report process inputs will have to be changed. I think it snuck through because we don't currently test this workflow as part of the stub checks. We should probably add that to the GHA checks before we go too much further (hopefully now is a good time).

merge.nf (outdated diff)

    output:
    -   tuple val(merge_group_id), val(has_adt), path(merged_sce_file)
    +   tuple val(merge_group_id), val(has_adt), val(is_multiplexed), path(merged_sce_file)
@jashapiro (Member) commented:

Just a thought that since is_multiplexed is going to be dropped later, it might make sense to put it at the end? Then you can use .take() to simplify the removal to just drop the last element.
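Something like this, as a sketch (channel and variable names here are illustrative, not the PR's actual code):

    // if the output emits the flag last, e.g.
    //   tuple val(merge_group_id), val(has_adt), path(merged_sce_file), val(is_multiplexed)
    // then downstream consumers that don't need the flag can simply do:
    merge_sce.out
        .map{ it.take(3) } // Groovy's take(3) keeps the first three elements, dropping the flag
        .set{ merged_no_flag_ch }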

@@ -162,5 +169,13 @@ workflow {
generate_merge_report(merge_sce.out, file(merge_template))
@jashapiro (Member) commented:

You will need to update the generate_merge_report process to expect the extra value in the tuple, right?
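For example, something along these lines (an assumption about the process definition, not the actual code):

    // generate_merge_report, updated to expect the extra value in the incoming tuple
    input:
        tuple val(merge_group_id), val(has_adt), val(is_multiplexed), path(merged_sce_file)
        path(merge_template)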

@allyhawkins (Member, Author)

Thanks for the suggestions @jashapiro. I updated the variable name to use multiplexed throughout instead of is_multiplexed. I also updated the order of the inputs and outputs and used .take() instead of an extra map command to remove one element.

I did try to add a stub run of the merge workflow to the GHA stub checks; however, we don't have the input files available in the repo. And I can't commit them unless we update our pre-commit config to allow .rds files. Do we want to do that? Or do you have another suggestion on how to handle that?

@jashapiro (Member)

> And I can't commit them unless we update our pre-commit config to allow .rds files. Do we want to do that? Or do you have another suggestion on how to handle that?

You can skip the pre-commit hooks with --no-verify, which I think is what we would want to do for a case like this:

    git commit -m 'add stub rds files' --no-verify

@jashapiro (Member) left a comment

A couple quick things...

@allyhawkins (Member, Author)

Okay, the tests are finally working as they should. I added a new stub library for the first project so that we have two input files. I also had to actually add the output for that project. I added all of the output, but do we just want to include the processed.rds file?

@jashapiro (Member) commented Jan 2, 2024

> Okay, the tests are finally working as they should. I added a new stub library for the first project so that we have two input files. I also had to actually add the output for that project. I added all of the output, but do we just want to include the processed.rds file?

I think if we can get away with just including the processed.rds files in the stub output directory that is probably the way to go.

@allyhawkins (Member, Author)

> > Okay, the tests are finally working as they should. I added a new stub library for the first project so that we have two input files. I also had to actually add the output for that project. I added all of the output, but do we just want to include the processed.rds file?
>
> I think if we can get away with just including the processed.rds files in the stub output directory that is probably the way to go.

Okay, I removed all the extra files so we are just saving the processed.rds files in the repo.

@allyhawkins requested a review from jashapiro, January 2, 2024 22:37
@jashapiro (Member) left a comment

LGTM, but I realized there is one question I have based on the fact that you didn't add the "fastq" files for the STUBR16. I think this may mean we are actually well off for #639, but I was still curious what was happening here.

@jashapiro (Member)

> LGTM, but I realized there is one question I have based on the fact that you didn't add the "fastq" files for the STUBR16. I think this may mean we are actually well off for #639, but I was still curious what was happening here.

Or maybe it means bad things are happening... it looks like it did run the STUBR16 process all the way through. Which seems less than ideal. (It would probably fail with real files when trying to run alevin, but that's still not great)

@allyhawkins (Member, Author)

> LGTM, but I realized there is one question I have based on the fact that you didn't add the "fastq" files for the STUBR16. I think this may mean we are actually well off for #639, but I was still curious what was happening here.

I honestly didn't even think to add them. Should I do that? The workflow seems to be running without them... that seems kind of concerning...

@jashapiro (Member) commented Jan 3, 2024

> > LGTM, but I realized there is one question I have based on the fact that you didn't add the "fastq" files for the STUBR16. I think this may mean we are actually well off for #639, but I was still curious what was happening here.
>
> I honestly didn't even think to add them. Should I do that? The workflow seems to be running without them... that seems kind of concerning...

I think it works because the files are never actually used in the stub, and the file() call doesn't check that the files exist by default. So I think we are actually okay leaving it as is, and we can use this absence when implementing #639.
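For what it's worth, Nextflow's file() can be asked to validate paths up front, which would make missing stub inputs fail fast; a quick sketch using a hypothetical parameter name:

    // default: no existence check, so missing stub fastq files pass through silently
    def fastq_dir = file(params.fastq_dir)

    // opt-in validation: raises an error if the path does not exist
    def fastq_dir_checked = file(params.fastq_dir, checkIfExists: true)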

@allyhawkins merged commit 331e34b into development on Jan 3, 2024 (3 checks passed)
@allyhawkins deleted the allyhawkins/metadata-to-coldata branch January 3, 2024 16:37