The input for the CAVACHON model needs to be provided as a list of config files. An example of the minimum specification of a config file is described as follows:

```yaml
io:
  datadir: ${DATASET DIRECTORY}
modalities:
  - name: RNA Modality
    type: RNA
    h5ad: ${RNA H5AD File}
  - name: ATAC Modality
    type: ATAC
    h5ad: ${ATAC H5AD File}
model:
  name: CAVACHON
  components:
    - name: ATAC Component
      modalities:
        - name: ATAC
    - name: RNA Component
      conditioned_on_z_hat:
        - ATAC Component
      modalities:
        - name: RNA
```

Some other example templates can be found in `sample_data/config_templates`. To use a template, simply replace each `${VARIABLE}` with a custom value.

The config should be prepared in a hierarchical structure using YAML format.

- `io`: Inputs and Outputs.
- `analysis`: Analysis
  - `clustering`: list of Clustering
  - `conditional_attribution_scores`: list of [Conditional Attribution Scores](#conditional-attribution-scores)
  - `differential_analysis`: list of Differential Analysis
- `modalities`: list of Modalities.
  - `filters`: list of Filters.
- `samples`: Samples (optional).
  - `modalities`: list of Modality Files.
- `model`: Model
  - `dataset`: Dataset
  - `training`: Training
    - `optimizer`: Optimizer
  - `components`: list of Components
    - `modalities`: list of Modalities (in Component).

The configs for inputs and outputs are specified under the field `io`:

- `checkpointdir`:
  - required: `False`
  - defaults: `./`
  - type: `str`
  - description: the directory for the pretrained checkpoint and the saved model weights.
- `datadir`:
  - required: `False`
  - defaults: `./`
  - type: `str`
  - description: the directory of the input datasets.
- `outdir`:
  - required: `False`
  - defaults: `./`
  - type: `str`
  - description: the output directory.

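
As an illustration, a minimal sketch of an `io` block that sets all three directories might look like the following. The paths are hypothetical placeholders, not directories shipped with the repository.

```yaml
io:
  checkpointdir: ./checkpoints   # hypothetical path: pretrained and saved weights
  datadir: ./data                # hypothetical path: input datasets
  outdir: ./results              # hypothetical path: outputs
```

Any field left out falls back to the documented default of `./`.
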
The configs for analysis and visualization are specified under the field `analysis`:

- `clustering`:
  - required: `False`
  - type: `List[AnalysisClusteringConfig]`
  - description: the config for clustering. See Clustering for more details.
- `differential_analysis`:
  - required: `False`
  - type: `List[AnalysisDifferentialAnalysisConfig]`
  - description: the config for differential analysis. See Differential Analysis for more details.
- `embedding_methods`:
  - required: `False`
  - defaults: `[]`
  - type: `List[str]`
  - description: the embedding methods used to visualize the latent representation. Should be a list whose elements are `'pca'`, `'umap'` or `'tsne'`.
- `annotation_colnames`:
  - required: `False`
  - defaults: `[]`
  - type: `List[str]`
  - description: the annotations of cells used to group and color the embeddings and attribution scores.
- `conditional_attribution_scores`:
  - required: `False`
  - defaults: `[]`
  - description: the config for the attribution scores. See Conditional Attribution Scores for more details.

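
A short sketch of an `analysis` block is given below as an example only. It assumes a modality named `RNA Modality` and a component named `RNA Component` are defined elsewhere in the config, and `cell_type` is a hypothetical annotation column name.

```yaml
analysis:
  embedding_methods:
    - pca
    - umap
  annotation_colnames:
    - cell_type                  # hypothetical column name
  clustering:
    - modality: RNA Modality     # assumed to be defined under `modalities`
      component: RNA Component   # assumed to be defined under `model -> components`
```
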
The config for clustering.

- `modality`:
  - required: `True`
  - type: `str`
  - description: which modality of the outputs of the component to use.
- `component`:
  - required: `True`
  - type: `str`
  - description: the outputs of which component to use.

The config for differential analysis.

- `modality`:
  - required: `True`
  - type: `str`
  - description: which modality of the outputs of the component to use.
- `component`:
  - required: `True`
  - type: `str`
  - description: the outputs of which component to use.

The config for conditional attribution scores.

- `modality`:
  - required: `True`
  - type: `str`
  - description: which modality of the outputs of the component to use.
- `component`:
  - required: `True`
  - type: `str`
  - description: the outputs of which component to use.
- `with_respect_to`:
  - required: `True`
  - type: `str`
  - description: compute integrated gradients with respect to the latent representation of which component.

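
For instance, a sketch of one conditional attribution score entry could be written as follows. The modality and component names are assumptions and need to match names defined elsewhere in the config.

```yaml
analysis:
  conditional_attribution_scores:
    - modality: RNA Modality            # assumed modality name
      component: RNA Component          # component whose outputs are attributed
      with_respect_to: ATAC Component   # component whose latent representation is used
```
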
The configs for modalities (or data views) are specified under the field `modalities`. They specify the data distribution and type of each modality. See also Filters.

- `name`:
  - required: `False`
  - defaults: `modality/{i:02d}`
  - type: `str`
  - description: the name of the modality.
- `type`:
  - required: `True`
  - type: `str`
  - description: the type of the modality. Currently supports `'atac'` and `'rna'`.
- `dist`:
  - required: `False`
  - type: `str`
  - defaults: `'IndependentBernoulli'` for `type:atac`, `'IndependentZeroInflatedNegativeBinomial'` for `type:rna`.
  - description: the data distribution of the modality. Currently supports `'IndependentBernoulli'` and `'IndependentZeroInflatedNegativeBinomial'` (see `cavachon/distributions` for more details).
- `h5ad`:
  - required: `False`
  - type: `str`
  - description: the `h5ad` file name corresponding to the modality in directory `io/datadir` (see Inputs and Outputs). Alternatively, the data can be loaded with `mtx`, `features` and `barcodes` files specified in Samples. Note that `samples` configs will be ignored for the modality if `h5ad` is provided.
- `filters`:
  - required: `False`
  - type: `List[FilterConfig]`
  - defaults: `[]`
  - description: see Filters and `cavachon/config/FilterConfig.py` for more details.
- `batch_effect_colnames`:
  - required: `False`
  - type: `List[str]`
  - defaults: `[]`
  - description: the column names of the batch effects that need to be corrected.

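
As a rough example, two modalities with explicit distributions and one batch effect column might be declared as below. The file names and the `batch` column name are hypothetical, and the `dist` values simply restate the documented defaults.

```yaml
modalities:
  - name: RNA Modality
    type: rna
    dist: IndependentZeroInflatedNegativeBinomial
    h5ad: rna.h5ad               # hypothetical file name inside io/datadir
    batch_effect_colnames:
      - batch                    # hypothetical column name
  - name: ATAC Modality
    type: atac
    dist: IndependentBernoulli
    h5ad: atac.h5ad              # hypothetical file name inside io/datadir
```
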
The filters applied to each modality should be provided as a list of `FilterConfig` specifications. The filtering steps are executed sequentially in the order given in the list. The config should be put under `modalities -> [one of the modality configs] -> filters`. The `FilterConfig` specification is described as follows:

- `step`:
  - required: `True`
  - type: `str`
  - description: the type of the filtering step. Currently supports `FilterCells`, `FilterGenes`, and `FilterQC` (see `cavachon/filter/` for more details).
- `**kwargs`:
  - description: please replace `kwargs` with the arguments passed to `scanpy.pp.filter_cells` (for `FilterCells`) or `scanpy.pp.filter_genes` (for `FilterGenes`). For `FilterQC`, please see the following example.

```yaml
modalities:
  - name: ${Modality}
    ...
    filters:
      - step: FilterQC
        qc_vars:
          - ERCC
          - MT
        filter_threshold:
          - field: n_genes_by_counts
            operator: ge
            threshold: 500
          - field: pct_counts_ERCC
            operator: le
            threshold: 0.2
          - field: pct_counts_MT
            operator: le
            threshold: 0.2
      - step: FilterGenes
        min_counts: 25
      - step: FilterGenes
        min_cells: 10
      - step: FilterCells
        min_counts: 5
```

The configs for the sample (or experiment) files are specified under the field `samples`. Note that one sample can have multiple modalities (e.g. from single-cell multi-omics technology); the files of all samples will be merged into the corresponding modalities. See also Modality Files, Matrix File and Feature File.

- `name`:
  - required: `False`
  - defaults: `sample/{i:02d}`
  - type: `str`
  - description: the name of the sample.
- `modalities`:
  - required: `True`
  - type: `List[ModalityFileConfig]`
  - description: see Modality Files for more details.

The configs for the files of a modality from one sample. The config should be put under `samples -> modalities`.

- `name`:
  - required: `True`
  - type: `str`
  - description: the name of the corresponding modality. Must match one of the names specified in Modalities.
- `matrix`:
  - required: `True`
  - type: `ModalityFileMatrixConfig`
  - description: the config of the matrix file corresponding to the modality in directory `io/datadir` (see Matrix File and Inputs and Outputs).
- `barcodes`:
  - required: `True`
  - type: `ModalityFileFeatureConfig`
  - description: the config of the barcodes file (for the anchor indices) corresponding to the modality in directory `io/datadir` (see Feature File and Inputs and Outputs).
- `features`:
  - required: `True`
  - type: `ModalityFileFeatureConfig`
  - description: the config of the features file (e.g. gene annotations) corresponding to the modality in directory `io/datadir` (see Feature File and Inputs and Outputs).

The configs for the matrix file. Should be put under `samples -> modalities -> matrix`.

- `filename`:
  - required: `True`
  - type: `str`
  - description: the matrix file corresponding to the modality in directory `io -> datadir` (see Inputs and Outputs).
- `transpose`:
  - required: `False`
  - defaults: `False`
  - type: `bool`
  - description: whether the matrix is transposed (the matrix is transposed if it has vars as rows and obs as columns).

The configs for the features and barcodes files. Should be put under `samples -> modalities -> features` and `samples -> modalities -> barcodes`.

- `filename`:
  - required: `True`
  - type: `str`
  - description: the features file (e.g. gene annotations) or the barcodes file (for the anchor indices) corresponding to the modality in directory `io -> datadir` (see Inputs and Outputs).
- `has_headers`:
  - required: `False`
  - defaults: `False`
  - type: `bool`
  - description: whether or not the `features` or `barcodes` file has headers.
- `colnames`:
  - required: `False`
  - type: `List[str]`
  - description: the column names of the `features` or `barcodes` files (if `has_headers=False`).

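
Putting the sample, modality file, matrix file and feature file configs together, a sketch of a `samples` block might look like the following. All file names and column names are hypothetical, and the modality named `RNA Modality` is assumed to be declared under `modalities` without an `h5ad` field, since `samples` configs are ignored for a modality that provides one.

```yaml
samples:
  - name: sample/00
    modalities:
      - name: RNA Modality             # must match a name under `modalities`
        matrix:
          filename: rna_matrix.mtx     # hypothetical file name inside io/datadir
          transpose: false             # set to true if rows are vars and columns are obs
        barcodes:
          filename: rna_barcodes.tsv   # hypothetical file name
          has_headers: false
          colnames:
            - barcode                  # hypothetical column name
        features:
          filename: rna_features.tsv   # hypothetical file name
          has_headers: false
          colnames:
            - gene_id                  # hypothetical column names
            - gene_name
```
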
The configs for the model are specified under the field `model`. See also Components, Modalities (in Component), Training, Optimizer and Dataset.

- `name`:
  - required: `False`
  - defaults: `CAVACHON`
  - type: `str`
  - description: the name of the model.
- `load_weights`:
  - required: `True`
  - type: `bool`
  - description: whether or not to load the pretrained weights. If `True`, the pretrained checkpoint in `checkpointdir/model_name` will be loaded into the model. See Inputs and Outputs.
- `save_weights`:
  - required: `False`
  - type: `bool`
  - description: whether or not to save the weights. If `True`, the weights will be saved to `checkpointdir/model_name`. See Inputs and Outputs.
- `components`:
  - required: `True`
  - type: `List[ComponentConfig]`
  - description: see Components and `cavachon/config/ComponentConfig.py` for more details.
- `training`:
  - required: `False`
  - type: `TrainingConfig`
  - description: see Training and `cavachon/config/TrainingConfig.py` for more details.
- `dataset`:
  - required: `False`
  - type: `DatasetConfig`
  - description: see Dataset and `cavachon/config/DatasetConfig.py` for more details.

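
As a sketch of the top-level `model` fields, the block below trains from scratch and saves the resulting weights. The component and modality names are assumptions that need to match the rest of the config.

```yaml
model:
  name: CAVACHON
  load_weights: false     # do not load a pretrained checkpoint
  save_weights: true      # weights go to checkpointdir/model_name
  components:
    - name: RNA Component
      modalities:
        - name: RNA Modality   # assumed to be defined under `modalities`
```
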
The configs for the components in the model. See also Modalities (in Component).

- `name`:
  - required: `False`
  - defaults: `component/{i:02d}`
  - type: `str`
  - description: the name of the component.
- `n_encoder_layers`:
  - required: `False`
  - defaults: `3`
  - type: `int`
  - description: the number of hidden layers used in the encoder neural network.
- `n_latent_dims`:
  - required: `False`
  - defaults: `5`
  - type: `int`
  - description: the dimensionality of the latent space.
- `n_latent_priors`:
  - required: `False`
  - defaults: `n_latent_dims * 2 + 1`
  - type: `int`
  - description: the number of components of the Gaussian-mixture priors used to compute the KL-divergence and perform online clustering.
- `n_progressive_epochs`:
  - required: `False`
  - defaults: `500`
  - type: `int`
  - description: the number of progressive epochs used during the training process. The weight of the data likelihood will be scaled linearly with `epoch/n_progressive_epochs`.
- `conditioned_on_z`:
  - required: `False`
  - defaults: `[]`
  - type: `List[str]`
  - description: each string in the list needs to match the name of one of the specified Components. The current component will be conditionally independent of the specified components given the latent representation of the latter (excluding its ancestors). Note that the conditional independence relationships between components need to form a directed acyclic graph.
- `conditioned_on_z_hat`:
  - required: `False`
  - defaults: `[]`
  - type: `List[str]`
  - description: each string in the list needs to match the name of one of the specified Components. The current component will be conditionally independent of the specified components given the latent representation of the latter (including its ancestors). Note that the conditional independence relationships between components need to form a directed acyclic graph.
- `modalities`:
  - required: `True`
  - type: `List[Config]`
  - description: see Modalities (in Component).

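
A two-component sketch is shown below for illustration. The numeric values restate the documented defaults, and the modality names are assumed to be defined under `modalities`. The dependency runs in one direction only (RNA on ATAC), so the relationships form a directed acyclic graph.

```yaml
model:
  components:
    - name: ATAC Component
      n_encoder_layers: 3
      n_latent_dims: 5
      n_latent_priors: 11          # documented default: n_latent_dims * 2 + 1
      n_progressive_epochs: 500
      modalities:
        - name: ATAC Modality      # assumed modality name
    - name: RNA Component
      conditioned_on_z_hat:
        - ATAC Component           # must name another component
      modalities:
        - name: RNA Modality       # assumed modality name
```
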
The configs for the modalities in the component.

- `name`:
  - required: `True`
  - type: `str`
  - description: the name of the corresponding modality. Must match one of the names specified in Modalities.
- `n_decoder_layers`:
  - required: `False`
  - defaults: `3`
  - type: `int`
  - description: the number of hidden layers used in the decoder neural network.
- `save_z`:
  - required: `False`
  - defaults: `True`
  - type: `bool`
  - description: whether or not to save the predicted `z` and `z_hat` to `obsm` of the modality.
- `save_x`:
  - required: `False`
  - defaults: `True`
  - type: `bool`
  - description: whether or not to save the predicted `x_parameters` to `obsm` of the modality. Note that `x_parameters` will not be predicted by default if none of the modalities in the component set `save_x`.

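
For example, a component modality that keeps the latent representations but skips saving `x_parameters` could be sketched as follows. The names are assumptions that need to match the rest of the config.

```yaml
model:
  components:
    - name: RNA Component
      modalities:
        - name: RNA Modality     # assumed to be defined under `modalities`
          n_decoder_layers: 3
          save_z: true           # save z and z_hat to obsm
          save_x: false          # do not save x_parameters
```
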
The configs for the training process. See also Optimizer.

- `train`:
  - required: `False`
  - defaults: `True`
  - type: `bool`
  - description: whether or not to train or finetune the model.
- `early_stop`:
  - required: `False`
  - defaults: `True`
  - type: `bool`
  - description: whether or not to use early stopping when training the model. Ignored if `train=False`.
- `max_n_epochs`:
  - required: `False`
  - defaults: `1000`
  - type: `int`
  - description: the maximum number of epochs used during the training process (set globally for all components).
- `optimizer`:
  - required: `False`
  - defaults: `OptimizerConfig({'name': 'adam', 'learning_rate': 1e-4})`
  - type: `OptimizerConfig`
  - description: see Optimizer and `cavachon/config/OptimizerConfig` for more details.

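
A sketch of a `training` block, placed under `model` as in the hierarchy outlined earlier, might read as follows. The values restate the documented defaults.

```yaml
model:
  training:
    train: true
    early_stop: true      # ignored when train is false
    max_n_epochs: 1000    # applies to all components
```
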
The configs for the optimizer.

- `name`:
  - required: `False`
  - defaults: `adam`
  - type: `str`
  - description: the string representation of the TensorFlow Keras optimizer. See `tf.keras.optimizers` for more details.
- `learning_rate`:
  - required: `False`
  - defaults: `1e-4`
  - type: `float`
  - description: the learning rate for the specified optimizer.

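
The optimizer config nests under `model -> training`. A minimal sketch with the documented defaults is given below. The learning rate is written as `0.0001` rather than `1e-4` on the assumption that the file is read by a YAML 1.1 parser such as PyYAML, which resolves `1e-4` (no decimal point) as a string.

```yaml
model:
  training:
    optimizer:
      name: adam
      learning_rate: 0.0001   # same value as the documented default 1e-4
```
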
The configs for the dataset.

- `batch_size`:
  - required: `False`
  - defaults: `128`
  - type: `int`
  - description: the batch size used to train and evaluate the model. The higher the value, the more efficient the training process will be, but more memory will be used.
- `shuffle`:
  - required: `False`
  - defaults: `False`
  - type: `bool`
  - description: whether or not to shuffle the dataset during training.

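
As a final sketch, a `dataset` block under `model` could look like the following. The batch size is the documented default, while `shuffle: true` is an illustrative choice rather than a recommendation from the source.

```yaml
model:
  dataset:
    batch_size: 128
    shuffle: true    # illustrative; the documented default is False
```
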