The algorithm filters samples too severely #9

Open
VPetukhov opened this issue Oct 20, 2021 · 2 comments

Comments

@VPetukhov

The current implementation filters out every sample in which at least one cell type has <5 cells. Many of the datasets I have applied it to have random drop-outs at the sample level. For example, the dataset below (ref) has 22 samples and 15 cell types (after filtration). Each cell type has at least 15 samples with >=5 cells, but because the drop-outs are random, only 5 (!) samples are left after the package filtration with >=5 cells in every cell type. Here is the cell type x sample table:

.                               S1   S2   S3   S4   S5   S6   S7   S8   S9  S10  S11  S12  S13  S14  S15  S16  S17  S18  S19  S20  S21  S22
  AT2                         1219 1044   30    2   36   29   19   61  328  504 1120   42    4    4   98  109 1838  137  395   99  450   50
  cDCs                          22   12   27   43  161   42   83  145    2   19   12   45    9    1    2    6  154  119   85   63   73   87
  Ciliated                       8   34  143  106   89  122  120  313  449  892  206   32   14  294   47  252 1269 4210 4289  186  422  479
  Endothelial Cells             62  187  553   50  229   89  471  406  314   49  348  707   93  448    0   94 1410  190  274  167   43   73
  Lymphatic Endothelial Cells    8   19  141    4    7    3   23   14   15  143   47    8    5    9    0    1  478   25    3    6   10   10
  Macrophages                  292  748  958  431 2122  741 2422 1799  791  169 2073 4492   20  538 1923  380 5372 1011  971  322  290  646
  Mast Cells                     6    1  179   15   12    8   75  142    5   14   23   10    0    4   12   11    0   65   16    1    3   12
  Monocytes                    437  349  863  243   18    4 1042  438   39   34  918  100   89   13   12    9   40  100  108   79   29    3
  MUC5B+                         3    1    1   15   41   35   32   53  116   26   60   16    0   22   36    9  578  259  457   30   88  343
  Myofibroblasts                 2   14   11   53  117   47   94  104   36   56   30   41    5    1    0   26  324   10   20    4   11    3
  Proliferating Macrophages      3    7   11    4   78   33   18   25   39   11    9   98    0   11   11    9  497   51   22   24    9   43
  SCGB3A2+                       2   20    7   46   61  154  207  180    8   40    9    4    2    4    1   19  641  439  663   17   56  341
  SCGB3A2+ SCGB1A1+              1   12    0   30   18   34   31   51   34  114   32   11    3   12    3   43   58  161  150   33  210   35
  Smooth Muscle Cells            1   24   14   34   32   44   28   72   15    2   36   86    8    2    0   29  290   14   13    4    2    5
  T Cells                        9    1  221    2   14    2 1259  143   30  241 1584   40  474    2   20    4  243  529  206  386   29   12
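
For reference, the numbers above can be checked directly from the table; a minimal R sketch, assuming `counts` is the cell type x sample matrix shown (it is not a package object):

    min_cells <- 5

    # Samples kept by the current rule: every cell type must have >= min_cells cells
    kept <- colnames(counts)[colSums(counts < min_cells) == 0]
    length(kept)  # 5 samples survive (S5, S7, S8, S11, S18)

    # Per-cell-type view: samples each cell type could contribute on its own
    rowSums(counts >= min_cells)  # every cell type has >= 15 such samples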

I'm not sure this can be solved properly just by tuning the sample filtration. The best I could get left me with 12 cell types and 13 samples, and it required setting donor_min_cells=2 and a lot of time spent playing with the filtration parameters. Is there maybe a way to allow zeros (or just low counts) in the algorithm? The issue really hurts the usability of the package: dropping half of the samples kills the signal.
If I remember correctly, you mentioned that the current decomposition doesn't allow NAs. But maybe it's possible to down-weight samples with a low number of cells somehow? Then setting donor_min_cells=1 would not introduce too much noise.
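
As a quick illustration of that trade-off, relaxing the per-cell-type threshold on the same `counts` table gives the following (this does not reproduce the exact 12 x 13 result above, which also involved dropping cell types, but the count at t = 2 lines up with the donor_min_cells=2 run):

    # Samples surviving "every cell type has >= t cells" as t is relaxed
    sapply(c(5, 2, 1), function(t) sum(colSums(counts < t) == 0))
    # 5 13 18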

@j-mitchel
Collaborator

Hi Viktor, thanks for bringing this up, as it's something I've been meaning to address. For now, I would suggest using a smaller list of cell types (probably the largest cell types), so that more donors can be included.

I like the idea of down-weighting. Perhaps we could implement something that shrinks a donor's expression of a given gene in a given cell type toward the mean expression across donors (those with cells in that population), with the strength of the regularization depending on the number of cells that donor has for that cell type. An alternative would be to set NA values for donors with 0 cells (or very few cells) and apply an imputation method such as imputePCA() from missMDA (https://www.rdocumentation.org/packages/missMDA/versions/1.18/topics/imputePCA). This could be applied directly to the normalized pseudobulk tensor unfolded along the donor dimension, i.e. to a matrix of dimension donors x (gene-cell-type combinations).
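
A rough sketch of what those two options could look like on the unfolded pseudobulk. Nothing below is existing package code: `pb`, `n_cells`, the `GENE:celltype` column naming, and the constants are all placeholder assumptions.

    library(missMDA)

    # Assumed inputs: pb is a donors x (gene_celltype) normalized pseudobulk
    # matrix with columns named like "GENE:celltype"; n_cells is a
    # donors x cell-types matrix of cell counts.
    min_cells <- 5
    cols_for <- function(ct) which(endsWith(colnames(pb), paste0(":", ct)))

    ## Option A: shrink each donor's profile toward the cross-donor mean,
    ## with strength set by how many cells the donor has in that cell type.
    k <- 20  # made-up shrinkage constant
    pb_shrunk <- pb
    for (ct in colnames(n_cells)) {
      cols <- cols_for(ct)
      w <- n_cells[, ct] / (n_cells[, ct] + k)    # weight -> 0 as cells -> 0
      mu <- colMeans(pb[n_cells[, ct] > 0, cols, drop = FALSE])
      pb_shrunk[, cols] <- w * pb[, cols] + (1 - w) * rep(mu, each = nrow(pb))
    }

    ## Option B: mask donor/cell-type pairs with too few cells as NA and
    ## complete the matrix with missMDA::imputePCA().
    pb_na <- pb
    for (ct in colnames(n_cells)) {
      pb_na[n_cells[, ct] < min_cells, cols_for(ct)] <- NA
    }
    ncp <- estim_ncpPCA(pb_na, ncp.max = 5)$ncp   # choose number of components
    pb_imputed <- imputePCA(pb_na, ncp = ncp)$completeObs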

I can make a new branch to start trying out different approaches.

@VPetukhov
Author

Thanks for the quick reply, Jonathan! There is no urgency on my side, and for this particular part I don't want to do any sophisticated preprocessing. Otherwise, your suggestions look solid. I don't know when I'll next need the method, but if I try something, I will definitely let you know.
