
Results interpretation - overestimation? #163

Closed
giorgiatosoni opened this issue Oct 27, 2022 · 26 comments
@giorgiatosoni

giorgiatosoni commented Oct 27, 2022

Hi,

I would like some help: I'm running CellBender on a "problematic" sample (a trial of a new nuclei isolation protocol that resulted in lower quality). We know that many cells need to be discarded from this sample (which we achieved with stringent filtering on nUMI and %mt). I decided to try CellBender to see if I could get a better estimate of which cells are worth keeping for downstream analysis, but CellBender seems to keep all the cells, and I'm not sure what is actually a cell and what is ambient or a low-quality cell in the Cell Ranger output. I used the following parameters:

cellbender remove-background \
    --input ./raw_S2R2/raw_feature_bc_matrix.h5 \
    --output ./raw_S2R2/cellbender_matrix.h5 \
    --expected-cells 800 \
    --total-droplets-included 6000 \
    --fpr 0.01 \
    --epochs 150


and got the attached pdf output.
s2r2.pdf

Now my questions are:

  • What does this output mean? Is this sample not suitable for CellBender, or did I just use the wrong parameters?

  • Since the outputs of Cell Ranger and CellBender are comparable in the number of cells kept, can the CellBender output be considered cleaner because it should also have removed some background from the new count matrix, or are the CellBender results not trustworthy?

I hope it is clear and thanks very much in advance!

@giorgiatosoni
Author


Question 2 I think is answered by #134 (?)

@sjfleming
Member

sjfleming commented Oct 28, 2022

Hi @giorgiatosoni, you're right that this is related to #134. CellBender unfortunately seems to struggle to call cells on some experiments. I am actively working on this.

Currently, my answers to your questions would be:

  • The output means that CellBender is over-calling cells. The sample should still be suitable. It's possible that by tweaking parameters, we can get it to work better.
  • Yes, the output can still be considered cleaner in most cases I've seen. But it is probably still worth making a few tweaks to get cell calling to work appropriately.

I would suggest trying

cellbender remove-background \
    --input ./raw_S2R2/raw_feature_bc_matrix.h5 \
    --output ./raw_S2R2/cellbender_matrix.h5 \
    --expected-cells 300 \
    --total-droplets-included 4000 \
    --fpr 0.01 \
    --epochs 150

you could even try --total-droplets-included 3000 and compare. Hopefully one of those settings will get you something more reasonable! Let me know...
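For reference, here is a minimal sketch (not part of CellBender; it assumes scanpy and matplotlib are installed, and uses the raw matrix path from this thread) of how you could plot the UMI rank curve to eyeball sensible values for --expected-cells and --total-droplets-included:

import numpy as np
import matplotlib.pyplot as plt
import scanpy as sc

# Load the Cell Ranger raw matrix and compute total UMI counts per barcode.
adata = sc.read_10x_h5("./raw_S2R2/raw_feature_bc_matrix.h5")
umis = np.asarray(adata.X.sum(axis=1)).ravel()
umis = np.sort(umis)[::-1]  # rank-ordered, highest first

# Log-log rank plot, with the candidate parameter values marked.
plt.loglog(np.arange(1, len(umis) + 1), umis)
plt.axvline(300, linestyle="--", label="--expected-cells 300")
plt.axvline(4000, linestyle=":", label="--total-droplets-included 4000")
plt.xlabel("barcode rank")
plt.ylabel("UMI counts per barcode")
plt.legend()
plt.savefig("umi_rank_curve.png", dpi=150)

Roughly, --expected-cells should land before the knee where the curve drops off, and --total-droplets-included should land well into the flat empty-droplet plateau.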

@giorgiatosoni
Author

Hi @sjfleming, thank you very much for your answer!

I think with the new parameters the cell call is much more realistic: 1752 called cells vs 7998 called cells with the previous parameters.

However, I see that with the new parameters the training plot looks a bit weird; what do you think?

s2r2_new parameters.pdf

@giorgiatosoni
Author

Also, an update on "Yes, the output can still be considered cleaner in most cases I've seen. But it is probably still worth making a few tweaks to get cell calling to work appropriately.":
Using the initial parameters that over-call cells, I don't see any difference at the counts level: the difference in the number of counts per cell pre- and post-CellBender is 0.
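For anyone who wants to reproduce this check, here is a minimal sketch; it assumes both files load with scanpy.read_10x_h5 and that the filtered output from the command above is named cellbender_matrix_filtered.h5:

import numpy as np
import scanpy as sc

# Load the Cell Ranger raw matrix and the CellBender filtered output
# (assumed file name based on the --output used above).
raw = sc.read_10x_h5("./raw_S2R2/raw_feature_bc_matrix.h5")
cb = sc.read_10x_h5("./raw_S2R2/cellbender_matrix_filtered.h5")

# Restrict the raw matrix to the barcodes CellBender kept, then compare totals.
raw = raw[cb.obs_names].copy()
removed = np.asarray(raw.X.sum(axis=1)).ravel() - np.asarray(cb.X.sum(axis=1)).ravel()
print("median counts removed per barcode:", np.median(removed))
print("fraction of barcodes with zero counts removed:", np.mean(removed == 0))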

@sjfleming
Member

@giorgiatosoni , great, I think your second plot looks excellent. That's exactly what I would hope to see. The training plot also looks fine to me; I don't see any problem with it.

As for the initial run, do you mean that CellBender removed zero counts the first time?

@giorgiatosoni
Author

@sjfleming nice thanks!!

Yes exactly, it didn't remove any counts, probably because it failed to determine what was ambient?

@sjfleming
Member

Huh, I did not expect that... well, thanks for letting me know! I should keep that in mind in future...

@giorgiatosoni
Author

Hi,

Coming back to the overestimation problem, I have another sample for which I tried different parameters, but all the nuclei are kept after running CellBender. The parameters tried are:

cellbender remove-background \
    --input raw_feature_bc_matrix.h5 \
    --output output.h5 \
    --cuda \
    --expected-cells 2000 \
    --total-droplets-included 40000 \
    --fpr 0.01 \
    --epochs 150

and

cellbender remove-background \
    --input raw_feature_bc_matrix.h5 \
    --output output.h5 \
    --cuda \
    --expected-cells 4000 \
    --total-droplets-included 30000 \
    --fpr 0.01 \
    --epochs 150


These are the resulting summaries:

output.pdf
ADN__1c5e51__S8_10X_210122_cellbender_rough.pdf

Any suggestions??
Thanks a lot :)

@sjfleming
Member

Hmm, I have to say I'm not sure. How about trying

cellbender remove-background \
    --input raw_feature_bc_matrix.h5 \
    --output output.h5 \
    --cuda \
    --expected-cells 1000 \
    --total-droplets-included 20000 \
    --fpr 0.01 \
    --epochs 150

and maybe, for good measure, add

--low-count-threshold 100

Does the log file indicate that CellBender is accurately estimating the counts in cells and the counts in empty droplets? (--low-count-threshold might help if there's a problem there... since this dataset has high ambient)

@giorgiatosoni
Author

Thank you, I'm going to try this! I'm not sure if the CellBender estimation is correct here...

cellbender:remove-background: Prior on counts in empty droplets is 17
cellbender:remove-background: Prior on counts for cells is 31234
cellbender:remove-background: Excluding barcodes with counts below 15
cellbender:remove-background: Using 4000 probable cell barcodes, plus an additional 26000 barcodes, and 69776 empty droplets.
cellbender:remove-background: Largest surely-empty droplet has 824 UMI counts.

@sjfleming
Member

Ah yeah, it's getting the empty droplet counts wrong :(

Hopefully the --low-count-threshold 100 can fix that. Currently it's getting fooled into thinking that second plateau way out near 10 UMI counts is the empty droplets... but that's not it.
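One way to see this (a minimal sketch, not CellBender's internal heuristic; it assumes scanpy and matplotlib) is to histogram log10 UMI counts per barcode: the low-count "junk" plateau near ~10 counts and the real empty-droplet plateau show up as separate modes, and --low-count-threshold should sit between them.

import numpy as np
import matplotlib.pyplot as plt
import scanpy as sc

# Total UMI counts per barcode from the raw matrix.
adata = sc.read_10x_h5("raw_feature_bc_matrix.h5")
umis = np.asarray(adata.X.sum(axis=1)).ravel()
umis = umis[umis > 0]

# Histogram on a log10 scale, with the candidate threshold marked.
plt.hist(np.log10(umis), bins=100)
plt.axvline(np.log10(100), linestyle="--", label="--low-count-threshold 100")
plt.xlabel("log10 UMI counts per barcode")
plt.ylabel("number of barcodes")
plt.legend()
plt.savefig("umi_count_histogram.png", dpi=150)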

@giorgiatosoni
Author

Oh I see.. thanks for the explanation!

@giorgiatosoni
Author

Hi @sjfleming, getting back to this after some time..

Just wanted to let you know that --low-count-threshold 100 did the trick; the output looks much better now: S8R5_output.pdf
Just to understand things a bit better: as parameters I again used --expected-cells 4000 --total-droplets-included 30000, and from the CellBender filtered output I got 12k called cells. Is such an overestimation expected (getting 12k cells instead of the specified expected 4k), or should I play with the parameters further?

Also, we are planning to use CellBender for quite a big analysis; I was wondering if you have an estimated release date for v3 :)

Thanks a lot for all your help, it's much appreciated!!

Best,
giorgia

@ccruizm

ccruizm commented Feb 18, 2023

Also, I have the same question! When will v3 be released? :D

@sjfleming
Member

Hi @giorgiatosoni , great! That kind of "overestimation" is expected in this case and here's why:
[Screenshot: the UMI curve from the report above, with the long tail of above-plateau droplets highlighted]

For the UMI curve above, we can see that, while there does appear to be a sharp drop-off pretty close to the expected 4k cells, there is a long tail of droplets with counts that are well above the "empty droplet plateau level" which seems to be about 1000 UMI counts in this experiment.

CellBender takes into account UMI counts as well as the gene expression profile itself, and the algorithm has apparently determined that those highlighted droplets are sufficiently different from the completely empty droplets that the posterior probability of being "non-empty" is high.

But it's worth keeping in mind: although the plot says "cell probability", it's really the probability that a droplet is not empty. The subtle difference is that you should still do cell QC after CellBender. While CellBender has identified about 12k non-empty droplets, not all of these are going to be great quality droplets you want to use in your analysis.

But the nice thing is, a lot of the time CellBender + cell QC will pick up on cell types that might have been missed by cell calling with CellRanger + cell QC. CellBender (without additional cell QC) will include more droplets with poor-quality / dying cells, though, since those droplets are in fact not empty.
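As a concrete illustration of the "CellBender + cell QC" step (a minimal sketch, assuming scanpy, that the filtered output file is named output_filtered.h5 as in the run above, and with placeholder QC thresholds rather than recommendations):

import scanpy as sc

# Start from CellBender's filtered output, i.e. the ~12k non-empty droplets.
adata = sc.read_10x_h5("output_filtered.h5")
adata.var_names_make_unique()

# Standard cell QC: mitochondrial fraction and a minimum UMI count.
adata.var["mt"] = adata.var_names.str.startswith(("MT-", "mt-"))
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], percent_top=None, log1p=False, inplace=True)
keep = (adata.obs["pct_counts_mt"] < 5) & (adata.obs["total_counts"] > 500)
adata = adata[keep].copy()
print(adata.n_obs, "droplets pass QC out of the non-empty droplets CellBender kept")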

@sjfleming
Member

@giorgiatosoni and @ccruizm , great question about timing for v3 release. I really hope to have it ready by March 10.

@giorgiatosoni
Author

Thank you so much @sjfleming, and looking forward to using v3!

@wmacnair

wmacnair commented May 5, 2023

Hi @sjfleming, thanks for this helpful discussion.

I was wondering if you could expand a bit more on how the --low-count-threshold parameter works. My guess is that cellbender does something like take the minimum of the cutoffs implied by --low-count-threshold and --total-droplets-included, and this determines the set of barcodes that cellbender "sees". Is this right? Or is it more subtle than that?

Thanks!
Will

@sjfleming
Member

Update on the timing for v3 release: follow this PR
#189

@sjfleming
Member

Hi @wmacnair ,
The --low-count-threshold parameter is used in the process of trying to figure out (1) what the prior on cell UMI counts should be, (2) what the prior on empty droplet UMI counts should be, and (3, optionally) how many expected cells and total droplets to use, if the user did not input these values.

Unfortunately it's a little more subtle than that. There are a number of heuristics currently used to figure out (1), (2), and (3), and they all center around looking at the full UMI curve to guess the values. --low-count-threshold comes into the picture because if you include hundreds of thousands of 1-count droplets, they mess up my heuristics... so I wanted a way to completely ignore droplets that I thought were true garbage during this process. Not cells and not empties: these droplets are things more like cell barcode sequencing errors and the like. The intention is to get rid of droplets (during the heuristic estimation process) that are "past the empty droplet plateau". Usually it is sufficient to get rid of droplets below 15 UMI counts.

But in datasets that are very clean, and in particular if the real "empty droplet plateau" is at something like 10 or 15 UMI counts per empty droplet, then this default value becomes a problem, and it should be changed to more like 5. The rule of thumb is that this value should be below the empty droplet plateau and just cut off the long tail of very-low-count droplets.
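To make the rule of thumb concrete, here is a rough illustrative check (emphatically not CellBender's actual heuristic; it assumes scanpy, and reuses the --total-droplets-included value from this thread) that estimates the empty-droplet plateau level and compares it to a candidate threshold:

import numpy as np
import scanpy as sc

# Rank-ordered total UMI counts per barcode from the raw matrix.
adata = sc.read_10x_h5("raw_feature_bc_matrix.h5")
umis = np.sort(np.asarray(adata.X.sum(axis=1)).ravel())[::-1]

# Crude plateau estimate: median counts of barcodes ranked just beyond
# --total-droplets-included, which should be surely-empty droplets.
total_droplets = 30000
plateau = np.median(umis[total_droplets:2 * total_droplets])
low_count_threshold = 100  # candidate value

print("estimated empty-droplet plateau:", plateau, "UMIs per droplet")
print("threshold sits below the plateau?", low_count_threshold < plateau)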

@wmacnair

Hi @sjfleming

Thanks for this explanation, that makes sense. Yes, developing reliable heuristics to parse out the UMI curve is hard! Especially in the context of deeply sequenced and "dirty" clinical samples, where the empties plateau can have UMI counts in the hundreds or even thousands, the knee going from cells to empties is not sharp at all, and even the "errors" plateau might be at 100...

I think something relatively simple and helpful to add would be an additional plot in your diagnostics showing the priors learned by cellbender. That way it would be more obvious to users whether or not cellbender matched their expectations of where the cells and empties are. Low priority, but it would be helpful I think :)

Thanks again!
Will

@wmacnair

On a slightly different topic, as I wasn't sure where to put it - do you know this paper?
Neuronal ambient RNA contamination causes misinterpreted and masked cell types in brain single-nuclei datasets

An interesting claim here is that they see two distinct types of ambient contamination, one cytoplasmic, and one nuclear. This fits with my own observations of our data, i.e. two distinct profiles of ambient in some samples.

So perhaps v4 could include two distinct ambient profiles? ;)

Will

@sjfleming
Member

Thanks for the suggestion @wmacnair , I think a plot that shows the cellbender priors would be a nice addition!

I have come across that paper yeah, but I didn't see the part about two distinct types of ambient. That's an interesting observation. I'll have to read more carefully to understand exactly what they mean by that!

@wmacnair

I think the idea is that you get two different biological processes leading to different ambient profiles. One is from the cytoplasm, and I think has a lot of MT- gene expression, while the other is from nuclei that have lysed. I think this could explain some cases where cellbender doesn't behave as expected, e.g. not fully removing MT- expression. (Although here I also wonder whether sometimes there are mitochondria stuck to nuclei 😅.)

It feels like in principle this could be ok to implement within your current framework, although probably a decent amount of work. Something like a k parameter for number of ambient profiles, a clustering step on the "definitely ambient" cells to find those k ambient profiles, then amend the model so that each cell has a mixture across those ambient profiles.
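A minimal sketch of that clustering step (purely illustrative, not an existing CellBender feature; it assumes scanpy and scikit-learn, with placeholder count thresholds and k=2):

import numpy as np
import scanpy as sc
from sklearn.cluster import KMeans

# Take a band of "definitely ambient" droplets (placeholder count thresholds).
adata = sc.read_10x_h5("raw_feature_bc_matrix.h5")
adata.var_names_make_unique()
umis = np.asarray(adata.X.sum(axis=1)).ravel()
empties = adata[(umis > 20) & (umis < 200)].copy()

# Turn each droplet into a fractional expression profile, reduce with PCA,
# and cluster into k candidate ambient profiles.
sc.pp.normalize_total(empties, target_sum=1.0)
sc.pp.pca(empties, n_comps=20)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(empties.obsm["X_pca"])

# Inspect the top genes of each candidate ambient profile (e.g. MT- vs nuclear).
for k in np.unique(labels):
    profile = np.asarray(empties.X[labels == k].mean(axis=0)).ravel()
    top = empties.var_names[np.argsort(profile)[::-1][:10]]
    print("ambient cluster", k, ":", list(top))

If the two clusters separate cleanly and one is dominated by MT- genes while the other looks nuclear, that would support the two-ambient-profiles picture.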

@sjfleming
Member

It's an interesting thought, but yeah, it would be a different noise model. My current thinking would still be that even if you have two different biological processes leading to different ambient profiles... like your example of nuclear ambient and cytoplasmic ambient, which is a good example... I'd still expect those ambient RNAs to mix completely once they are cell-free, and to form one combined "ambient RNA" profile from which the empty droplets sample their counts. Do you agree?

But I do agree with you... there is one thing that CellBender does not model currently, which can happen... things like mitochondria sticking to a nucleus in snRNA-seq. I have seen pictures of this kind of thing under the microscope. Bits of organelles, etc. "Debris". Because that's another level of complexity, where some nuclei might have "junk" attached and others completely do not, CellBender doesn't try to model it. That's a real hard problem at that point... because being able to tell whether something is "an added, unwanted mitochondrion or organelle" versus true biological differences in gene expression in the nucleus is going to be very tough. Right now the thinking is that those nuclei which have the "extra cytoplasmic stuff" should get QCed out (by some downstream tool after CellBender) due to their high MT fraction, or high exonic read fraction.

@wmacnair

wmacnair commented Jun 5, 2023

Regarding the "sticky" things - e.g. mitochondria, organelles, ER - yes I also strongly think that's the case. I actually did some analysis of some of our snRNAseq data, and for some samples the two assumptions "MT- gene expression is purely ambient" and "the ambient profile is completely captured by the empties" can't both be true. As in, if you remove the maximum amount of ambient consistent with true expression being >= 0, there is still some MT- left; and if you use MT- levels to determine how much ambient to remove, this implies true expression < 0...

I slightly disagree on how to address this, and I don't think we necessarily want to exclude these nuclei via QC. If they're only contaminated by mitochondria, what is the rationale to exclude them? I've been excluding the MT- genes and keeping the nuclei, partly because if I don't do that we lose even more of our data to QC! And I do think that QC is still not well understood, in particular that we have pretty arbitrary rationales for excluding cells, without data to link them to true biology, e.g. what number of UMIs corresponds to an intact cell? See thread here.

Regarding whether "ambient RNA" is one nicely mixed profile, maybe...?? I think I've been surprised enough by biology to really not be sure what the truth is! It could also be the case that what contaminates a given droplet is "lumpy", so e.g. x * well-mixed ambient + n * organelle chunks. Or maybe better phrased: contamination by ambient RNA could be well-mixed, but actually there are simultaneous multiple contamination processes... 😅

What this all points to is the need for a gold standard dataset to allow proper benchmarking of methods like CellBender. Something like: make a pure prep of nuclei (from one human donor); add ambient RNA in differing proportions (from a different human donor with a different genetic background); use SNPs to determine whether reads are nuclei/ambient. (Or you might be able to use some other labelling approach.)

Cheers
Will
