The current implementation filters out every sample in which at least one cell type has <5 cells. When I apply it to different datasets, many of them have random drop-outs at the sample level. For example, the dataset below (ref) has 22 samples and 15 cell types (after filtering). Every cell type has at least 15 samples with >=5 cells, but because the drop-outs are random, after the package's filtering only 5 (!) samples have >=5 cells in all cell types. Here is the cell type x sample table:
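The combinatorial effect is easy to reproduce with a small sketch (the counts and the 15% drop-out rate below are invented for illustration, not taken from the actual dataset): even when each cell type individually passes the threshold in most samples, requiring *all* cell types to pass at once leaves only a handful of samples.

```python
import numpy as np

rng = np.random.default_rng(0)
n_cell_types, n_samples, min_cells = 15, 22, 5

# Hypothetical cell type x sample count table: most entries are well
# above the threshold, but each cell type independently drops out
# (falls below min_cells) in a few random samples.
counts = rng.integers(20, 200, size=(n_cell_types, n_samples))
dropout = rng.random((n_cell_types, n_samples)) < 0.15
counts[dropout] = rng.integers(0, min_cells, size=int(dropout.sum()))

# Per cell type, most samples pass the threshold...
per_type = (counts >= min_cells).sum(axis=1)
print("samples passing per cell type:", per_type)

# ...but requiring EVERY cell type to pass keeps far fewer samples
# (roughly 0.85**15, i.e. under 10% of them, in expectation).
passing_all = int((counts >= min_cells).all(axis=0).sum())
print("samples passing in all cell types:", passing_all)
```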
I'm not sure this can be solved properly just by tuning the sample filtering. The best I could get left me with 12 cell types and 13 samples, and it also required setting donor_min_cells=2 and spending a lot of time playing with the filtering parameters. Is there maybe a way to let the algorithm accept zeros (or just low counts)? The issue really hurts the usability of the package: dropping half of the samples kills the signal.
If I remember correctly, you mentioned that the current decomposition doesn't allow NAs. But maybe samples with a low number of cells could be down-weighted somehow? Then setting donor_min_cells=1 would not introduce too much noise.
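One way the down-weighting could look, as a sketch only (the function name, the pseudobulk layout, and the prior strength `k` are all made up here; this is plain NumPy, not the package's implementation): shrink each donor's pseudobulk profile toward the cross-donor mean, with a weight that grows with the donor's cell count, so low-count donors contribute little of their own noisy signal.

```python
import numpy as np

def shrink_pseudobulk(expr, n_cells, k=10.0):
    """Shrink each donor's expression toward the cross-donor mean.

    expr    : (donors, genes) pseudobulk matrix for one cell type
    n_cells : (donors,) number of cells each donor contributed
    k       : prior strength; a donor with n cells gets weight n / (n + k)

    Donors with many cells keep their own profile (weight -> 1);
    donors with few or zero cells are pulled toward the mean
    (weight -> 0), so donor_min_cells=1 would add little noise.
    """
    # Mean over donors that actually have cells in this population
    has_cells = n_cells > 0
    mean_expr = expr[has_cells].mean(axis=0)
    w = (n_cells / (n_cells + k))[:, None]
    return w * expr + (1.0 - w) * mean_expr
```

A donor with zero cells ends up exactly at the cross-donor mean, while a donor with thousands of cells is left essentially untouched.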
Hi Viktor, thanks for bringing this up, as it's something I've been meaning to address. For now, I would suggest using a smaller list of cell types (probably the largest cell types), so that more donors can be included.
I like the idea of down-weighting. Perhaps we could implement something which shrinks a donor's expression of a given gene in a given cell type toward the mean expression across donors (the ones with cells in that population), with the strength of the regularization depending on the number of cells that donor has for that cell type. An alternative solution would be to set NA values for donors with 0 cells (or very few cells) and apply an imputation method such as imputePCA() (see https://www.rdocumentation.org/packages/missMDA/versions/1.18/topics/imputePCA). This could be applied directly to the normalized pseudobulk tensor unfolded along the donor dimension, i.e. to a matrix of dimension donors x (gene-cell-type combinations).
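The imputation route can be sketched in a few lines, assuming a NumPy stand-in for imputePCA() (the function below is a generic EM-style low-rank imputer, not the missMDA implementation, and the unfolded matrix shape is illustrative): mark low-cell donor entries as NaN in the donor-unfolded tensor, then repeatedly fill the missing cells from a rank-k SVD reconstruction.

```python
import numpy as np

def iterative_pca_impute(X, rank=2, n_iter=100):
    """Fill NaNs in X with a low-rank reconstruction (EM-style PCA imputation).

    X : (donors, gene-cell-type combinations) matrix, NaN marks entries
        where a donor had too few cells in that cell type.
    """
    missing = np.isnan(X)
    # Initialize missing entries with the column means of the observed data
    filled = np.where(missing, np.nanmean(X, axis=0), X)
    for _ in range(n_iter):
        mu = filled.mean(axis=0)
        U, s, Vt = np.linalg.svd(filled - mu, full_matrices=False)
        low_rank = (U[:, :rank] * s[:rank]) @ Vt[:rank] + mu
        # Only the missing cells are updated; observed values stay fixed
        filled[missing] = low_rank[missing]
    return filled
```

Observed entries are never modified, so a donor's well-covered cell types keep their measured pseudobulk values while the low-cell entries are borrowed from the low-rank structure across donors.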
I can make a new branch to start trying out different approaches.
Thanks for the quick reply, Jonathan! There's no urgency on my side, and for this particular analysis I don't want to do sophisticated preprocessing. But your suggestions look solid. I don't know when I'll next need the method, but if I try something, I'll definitely let you know.