Single modality tasks
Cell type annotation
- class dance.modules.single_modality.cell_type_annotation.ACTINN(*, hidden_dims=(100, 50, 25), lambd=0.01, device='cpu', random_seed=None)[source]
The ACTINN cell-type classification model.
- Parameters:
hidden_dims (Tuple[int, ...]) – Hidden layer dimensions.
lambd (float) – Regularization parameter.
device (str) – Training device.
random_seed (int | None) –
- compute_loss(z, y)[source]
Compute loss function.
- Parameters:
z (Tensor) – Output of forward propagation (cells by cell-types).
y (Tensor) – Cell labels (cells).
- Returns:
Loss.
- Return type:
torch.Tensor
- fit(x_train, y_train, *, batch_size=128, lr=0.01, num_epochs=50, print_cost=False, seed=None)[source]
Fit the classifier.
- Parameters:
x_train (Tensor) – Training data (cells by genes).
y_train (Tensor) – Training labels (cells by cell-types).
batch_size (int) – Training batch size.
lr (float) – Initial learning rate.
num_epochs (int) – Number of epochs to run.
print_cost (bool) – Print training loss if set to True.
seed (Optional[int]) – Random seed. If set to None, a random seed is used.
- predict(x)[source]
Predict cell labels.
- Parameters:
x (Tensor) – Gene expression input features (cells by genes).
- Returns:
Predicted cell-label indices.
- Return type:
torch.Tensor
- random_batches(x, y, batch_size=32, seed=None)[source]
Shuffle data and split into batches.
- Parameters:
x (Tensor) – Training data (cells by genes).
y (Tensor) – True labels (cells by cell-types).
batch_size (int) –
seed (int | None) –
- Yields:
Tuple[torch.Tensor, torch.Tensor] – Batch of training data (x, y).
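The shuffle-and-split behavior described for random_batches can be illustrated with a minimal sketch. It uses plain Python sequences rather than tensors, and the function name random_batches_sketch is hypothetical; it is not the DANCE implementation.

```python
import random
from typing import Iterator, List, Optional, Sequence, Tuple


def random_batches_sketch(
    x: Sequence, y: Sequence, batch_size: int = 32, seed: Optional[int] = None
) -> Iterator[Tuple[List, List]]:
    """Shuffle x and y with the same permutation and yield mini-batches."""
    if len(x) != len(y):
        raise ValueError("x and y must contain the same number of cells")
    rng = random.Random(seed)  # seed=None gives a different shuffle on each call
    order = list(range(len(x)))
    rng.shuffle(order)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        # The last batch may be smaller than batch_size.
        yield [x[i] for i in idx], [y[i] for i in idx]


cells = list(range(10))
cell_labels = list("abcdefghij")
batches = list(random_batches_sketch(cells, cell_labels, batch_size=4, seed=0))
```

Because features and labels are indexed with the same permutation, each cell stays paired with its label after shuffling.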
- class dance.modules.single_modality.cell_type_annotation.Celltypist(majority_voting=False, clf=None, scaler=None, description=None)[source]
The CellTypist cell annotation method.
- Parameters:
majority_voting (bool) – Whether to refine the predicted labels by running the majority voting classifier after over-clustering.
- fit(indata, labels=None, C=1.0, solver=None, max_iter=1000, n_jobs=None, use_SGD=False, alpha=0.0001, mini_batch=False, batch_number=100, batch_size=1000, epochs=10, balance_cell_type=False, feature_selection=False, top_genes=300, **kwargs)[source]
Train a celltypist model using mini-batch (optional) logistic classifier with a global solver or stochastic gradient descent (SGD) learning.
- Parameters:
indata (np.ndarray) – Input gene expression matrix (cell x gene).
labels (np.array) – 1-D numpy array indicating cell-type identities of each cell (in index of the cell-types).
C (float optional) – Inverse of L2 regularization strength for traditional logistic classifier. A smaller value can possibly improve model generalization at the cost of decreased accuracy. This argument is ignored if SGD learning is enabled (use_SGD = True). (Default: 1.0)
solver (str optional) – Algorithm to use in the optimization problem for traditional logistic classifier. The default behavior is to choose the solver according to the size of the input data. This argument is ignored if SGD learning is enabled (use_SGD = True).
max_iter (int optional) – Maximum number of iterations before reaching the minimum of the cost function. Try to decrease max_iter if the cost function does not converge for a long time. This argument is for both traditional and SGD logistic classifiers, and will be ignored if mini-batch SGD training is conducted (use_SGD = True and mini_batch = True). (Default: 1000)
n_jobs (int optional) – Number of CPUs used. Defaults to one CPU. -1 means all CPUs are used. This argument is for both traditional and SGD logistic classifiers.
use_SGD (bool optional) – Whether to implement SGD learning for the logistic classifier. (Default: False)
alpha (float optional) – L2 regularization strength for SGD logistic classifier. A larger value can possibly improve model generalization at the cost of decreased accuracy. This argument is ignored if SGD learning is disabled (use_SGD = False). (Default: 0.0001)
mini_batch (bool optional) – Whether to implement mini-batch training for the SGD logistic classifier. Setting to True may improve the training efficiency for large datasets (for example, >100k cells). This argument is ignored if SGD learning is disabled (use_SGD = False). (Default: False)
batch_number (int optional) – The number of batches used for training in each epoch. Each batch contains batch_size cells. For datasets which cannot be binned into batch_number batches, all batches will be used. This argument is relevant only if mini-batch SGD training is conducted (use_SGD = True and mini_batch = True). (Default: 100)
batch_size (int optional) – The number of cells within each batch. This argument is relevant only if mini-batch SGD training is conducted (use_SGD = True and mini_batch = True). (Default: 1000)
epochs (int optional) – The number of epochs for the mini-batch training procedure. The default values of batch_number, batch_size, and epochs together allow observing ~10^6 training cells. This argument is relevant only if mini-batch SGD training is conducted (use_SGD = True and mini_batch = True). (Default: 10)
balance_cell_type (bool optional) – Whether to balance the cell type frequencies in mini-batches during each epoch. Setting to True will sample rare cell types with a higher probability, ensuring close-to-even cell type distributions in mini-batches. This argument is relevant only if mini-batch SGD training is conducted (use_SGD = True and mini_batch = True). (Default: False)
feature_selection (bool optional) – Whether to perform two-pass data training where the first round is used for selecting important features/genes using SGD learning. If True, the training time will be longer. (Default: False)
top_genes (int optional) – The number of top genes selected from each class/cell-type based on their absolute regression coefficients. The final feature set is combined across all classes (i.e., union). (Default: 300)
**kwargs – Other keyword arguments passed to LogisticRegression (use_SGD = False) or SGDClassifier (use_SGD = True).
- Returns:
An instance of the Model trained by celltypist.
- Return type:
Model
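The "~10^6 training cells" figure quoted for the mini-batch defaults is simply the product of the three default values. The helper name cells_observed below is hypothetical and exists only to spell out that arithmetic.

```python
def cells_observed(batch_number: int = 100, batch_size: int = 1000, epochs: int = 10) -> int:
    """Total number of training cells drawn over all epochs of mini-batch SGD."""
    return batch_number * batch_size * epochs


total = cells_observed()
print(total)  # 1000000
```

Changing any one of the three arguments scales the number of observed cells proportionally, which is a quick way to reason about training cost before enabling mini_batch.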
- predict(x, as_obj=False, over_clustering=None, min_prop=0)[source]
Run the prediction and (optional) majority voting to annotate the input dataset.
- Parameters:
x (np.ndarray) – Input expression matrix (cell x gene).
as_obj (bool) – If set to True, return the prediction results as an AnnotationResult. Otherwise, return the predicted cell-label indexes as a 1-D numpy array. (Default: False)
over_clustering (Union[str, list, tuple, np.ndarray, pd.Series, pd.Index] optional) – This argument can be provided in several ways: 1) an input plain file with the over-clustering result of one cell per line; 2) a string key specifying an existing metadata column in the AnnData (pre-created by the user); 3) a Python list, tuple, numpy array, pandas series or index representing the over-clustering result of the input cells. If none of the above is provided, a heuristic over-clustering approach is used according to the size of the input data. Ignored if majority_voting is set to False.
min_prop (float optional) – For the dominant cell type within a subcluster, the minimum proportion of cells required to support naming of the subcluster by this cell type. Ignored if majority_voting is set to False. A subcluster that fails to pass this proportion threshold will be assigned 'Heterogeneous'. (Default: 0)
- Return type:
Union[ndarray, AnnotationResult]
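The interaction between over_clustering and min_prop can be sketched in a few lines. This is a simplified illustration in plain Python, not CellTypist's code: the function name majority_vote is hypothetical, and ties within a subcluster are broken by first occurrence, which is an assumption.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, List, Sequence


def majority_vote(
    predicted: Sequence[str], clusters: Sequence[Hashable], min_prop: float = 0.0
) -> List[str]:
    """Relabel each cell with the dominant predicted label of its subcluster."""
    members: Dict[Hashable, List[str]] = defaultdict(list)
    for label, cluster in zip(predicted, clusters):
        members[cluster].append(label)

    consensus: Dict[Hashable, str] = {}
    for cluster, labels in members.items():
        top_label, count = Counter(labels).most_common(1)[0]
        # Subclusters whose dominant label is too rare are marked heterogeneous.
        if count / len(labels) >= min_prop:
            consensus[cluster] = top_label
        else:
            consensus[cluster] = "Heterogeneous"
    return [consensus[c] for c in clusters]


pred = ["T", "T", "B", "B", "NK", "T"]
clus = [0, 0, 0, 1, 1, 1]
voted = majority_vote(pred, clus, min_prop=0.5)
```

In this toy input, subcluster 0 is two-thirds "T" and keeps that label, while subcluster 1 has no label reaching 50% and is assigned 'Heterogeneous'.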
- class dance.modules.single_modality.cell_type_annotation.SVM(args, prj_path='./', random_state=None)[source]
The SVM cell-type classification model.
- Parameters:
args (argparse.Namespace) – A Namespace containing the arguments of SVM. See the parser help for more information.
prj_path (str) – Project path.
random_state (int | None) –
- fit(x, y)[source]
Train the classifier.
- Parameters:
x (ndarray) – Training cell features.
y (ndarray) – Training labels.
- class dance.modules.single_modality.cell_type_annotation.ScDeepSort(dim_in, dim_hid, num_layers, species, tissue, *, dropout=0, batch_size=500, device='cpu')[source]
The ScDeepSort cell-type annotation model.
- Parameters:
dim_in (int) – Input dimension, i.e., the number of PCA components used for cell and gene features.
dim_hid (int) – Hidden dimension.
num_layers (int) – Number of convolution layers.
species (str) – Species name (only used for determining the read/write path).
tissue (str) – Tissue name (only used for determining the read/write path).
dropout (int) – Drop-out rate.
batch_size (int) – Batch size.
device (str) – Computation device, e.g., ‘cpu’, ‘cuda’.
- cal_loss(graph, idx)[source]
Calculate loss.
- Parameters:
graph (DGLGraph) – Input cell-gene graph object.
idx (Tensor) – 1-D tensor containing the indexes of the cell nodes to calculate the loss.
- Returns:
Averaged loss over all batches.
- Return type:
float
- evaluate(graph, idx, unsure_rate=2.0)[source]
Evaluate the model on certain cell nodes.
- Parameters:
idx (Tensor) – 1-D tensor containing the indexes of the cell nodes to be evaluated.
graph (DGLGraph) –
unsure_rate (float) –
- Returns:
The total number of correct predictions, the total number of unsure predictions, and the accuracy score.
- Return type:
Tuple[int, int, float]
- fit(graph, labels, epochs=300, lr=0.001, weight_decay=0, val_ratio=0.2)[source]
Train scDeepsort model.
- Parameters:
graph (DGLGraph) – Training graph.
labels (Tensor) – Node (cell, gene) labels, -1 for genes.
epochs (int) – Number of epochs to train the model.
lr (float) – Learning rate.
weight_decay (float) – Weight decay regularization strength.
val_ratio (float) – Ratio of the training data to hold out for validation.
- predict(graph, unsure_rate=2.0, return_unsure=False)[source]
Perform prediction on all test datasets.
- Parameters:
graph (DGLGraph) – Input cell-gene graph to be predicted.
unsure_rate (float) – Determine the threshold of the maximum predicted probability under which the predictions are considered uncertain.
return_unsure (bool) – If set to True, then return an indicator array that indicates whether a prediction is uncertain.
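How unsure_rate might translate into a probability threshold can be sketched as follows. The rule used here, threshold = unsure_rate / number of classes, is an assumption made for illustration and is not confirmed by the text above; the function name predict_with_unsure is likewise hypothetical.

```python
from typing import List, Sequence, Tuple


def predict_with_unsure(
    probs: Sequence[Sequence[float]], unsure_rate: float = 2.0
) -> Tuple[List[int], List[bool]]:
    """Return the arg-max label per cell and a flag for low-confidence calls."""
    num_classes = len(probs[0])
    # Assumed rule: a call is uncertain when its top probability is below
    # unsure_rate times the chance level (1 / num_classes).
    threshold = unsure_rate / num_classes
    labels: List[int] = []
    unsure: List[bool] = []
    for row in probs:
        best = max(range(num_classes), key=row.__getitem__)
        labels.append(best)
        unsure.append(row[best] < threshold)
    return labels, unsure


probs = [[0.7, 0.1, 0.1, 0.1], [0.4, 0.3, 0.2, 0.1]]
labels, unsure = predict_with_unsure(probs, unsure_rate=2.0)
```

With four classes and unsure_rate=2.0 the assumed threshold is 0.5, so the first cell is a confident call and the second is flagged as uncertain.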
- class dance.modules.single_modality.cell_type_annotation.SingleCellNet(num_trees=100)[source]
The SingleCellNet model.
- Parameters:
num_trees (int) – Number of trees in the random forest model.
- fit(x, y, num_rand=100, stratify=True, random_state=100)[source]
Train the SingleCellNet random forest model.
- Parameters:
x – Input features.
y – Labels.
stratify (bool) – Whether to use balanced class weights in the random forest model.
random_state (Optional[int]) – Random state.
num_rand (int) –
- predict(x)[source]
Predict cell type label.
- Parameters:
x – Input features.
- Returns:
The most likely cell-type label of each sample.
- Return type:
np.ndarray
- predict_proba(x)[source]
Calculate predicted probabilities.
- Parameters:
x – Input features.
- Returns:
Cell-type probability matrix where each row is a cell and each column is a cell-type. The values in the matrix indicate the predicted probability that the cell is a particular cell-type. The last column corresponds to the probability that the model could not confidently identify the cell type of the cell.
- Return type:
np.ndarray
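Reading labels off such a probability matrix can be sketched as below. The helper labels_from_proba and the "unknown" label name are hypothetical, and whether the library's predict maps the last column to a label in this way is an assumption.

```python
from typing import List, Sequence


def labels_from_proba(
    proba: Sequence[Sequence[float]],
    class_names: Sequence[str],
    unknown_name: str = "unknown",
) -> List[str]:
    """Pick the most probable column per cell; the last column means 'not confident'."""
    names = list(class_names) + [unknown_name]
    calls: List[str] = []
    for row in proba:
        if len(row) != len(names):
            raise ValueError("each row needs one column per class plus one extra column")
        best = max(range(len(row)), key=row.__getitem__)
        calls.append(names[best])
    return calls


proba = [[0.8, 0.1, 0.1], [0.2, 0.3, 0.5]]
calls = labels_from_proba(proba, ["B cell", "T cell"])
```

The second cell puts most of its probability mass in the extra column, so it is reported as unknown rather than forced into one of the trained cell types.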
Clustering
- class dance.modules.single_modality.clustering.GraphSC(agg='sum', activation='relu', in_feats=50, n_hidden=1, hidden_dim=200, hidden_1=300, hidden_2=0, dropout=0.1, n_layers=1, hidden_relu=False, hidden_bn=False, n_clusters=10, cluster_method='kmeans', num_workers=1, device='auto')[source]
GraphSC class.
- Parameters:
agg (str) – Aggregation layer.
activation (str) – Activation function.
in_feats (int) – Dimension of input feature.
n_hidden (int) – Number of hidden layers.
hidden_dim (int) – Input dimension of hidden layer 1.
hidden_1 (int) – Output dimension of hidden layer 1.
hidden_2 (int) – Output dimension of hidden layer 2.
dropout (float) – Dropout rate.
n_layers (int) – Number of graph convolutional layers.
hidden_relu (bool) – Whether to use ReLU activation in hidden layers.
hidden_bn (bool) – Whether to use batch norm in hidden layers.
cluster_method (Literal['kmeans', 'leiden']) – Method for clustering.
num_workers (int) – Number of workers.
device (str) – Computation device to use.
n_clusters (int) –
- fit(g, y=None, *, epochs=100, lr=1e-05, batch_size=128, show_epoch_ari=False, eval_epoch=False)[source]
Train graph-sc.
- Parameters:
g (DGLGraph) – Input cell-gene graph.
y (Optional[Any]) – Not used, for compatibility with the BaseClusteringMethod class.
epochs (int) – Number of epochs.
lr (float) – Learning rate.
batch_size (int) – Batch size.
show_epoch_ari (bool) – Show ARI score for each epoch.
eval_epoch (bool) – Evaluate every epoch.
- class dance.modules.single_modality.clustering.ScDCC(input_dim, z_dim, n_clusters, encodeLayer, decodeLayer, activation='relu', sigma=1.0, alpha=1.0, gamma=1.0, ml_weight=1.0, cl_weight=1.0, device='auto', pretrain_path=None)[source]
ScDCC class.
- Parameters:
input_dim (int) – Dimension of encoder input.
z_dim (int) – Dimension of embedding.
n_clusters (int) – Number of clusters.
encodeLayer (List[int]) – Dimensions of encoder layers.
decodeLayer (List[int]) – Dimensions of decoder layers.
activation (str) – Activation function.
sigma (float) – Parameter of Gaussian noise.
alpha (float) – Parameter of soft assign.
gamma (float) – Parameter of cluster loss.
ml_weight (float) – Parameter of must-link loss.
cl_weight (float) – Parameter of cannot-link loss.
device (str) – Computation device.
pretrain_path (str | None) –
- cluster_loss(p, q)[source]
Calculate cluster loss.
- Parameters:
p – Target distribution.
q – Soft label.
- Returns:
Cluster loss.
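A cluster loss over a target distribution p and soft labels q is commonly a Kullback–Leibler divergence, as in deep embedded clustering. The sketch below assumes that formulation, including the usual sharpening rule for building p from q and an average over cells; the function names and the eps constant are illustrative, not taken from DANCE.

```python
import math
from typing import List, Sequence

Matrix = Sequence[Sequence[float]]


def target_distribution(q: Matrix) -> List[List[float]]:
    """Sharpen soft labels: p_ij is proportional to q_ij^2 / sum_i q_ij."""
    n_clusters = len(q[0])
    freq = [sum(row[j] for row in q) for j in range(n_clusters)]  # soft cluster sizes
    p: List[List[float]] = []
    for row in q:
        weights = [row[j] ** 2 / freq[j] for j in range(n_clusters)]
        total = sum(weights)
        p.append([w / total for w in weights])
    return p


def cluster_loss_sketch(p: Matrix, q: Matrix, gamma: float = 1.0, eps: float = 1e-6) -> float:
    """gamma * mean over cells of KL(p_i || q_i)."""
    per_cell = [
        sum(pj * math.log(pj / (qj + eps)) for pj, qj in zip(p_row, q_row) if pj > 0)
        for p_row, q_row in zip(p, q)
    ]
    return gamma * sum(per_cell) / len(per_cell)


q = [[0.6, 0.4], [0.3, 0.7]]
p = target_distribution(q)
loss = cluster_loss_sketch(p, q)
```

The target distribution pushes each cell toward its currently favored cluster, so minimizing the divergence gradually hardens the soft assignments.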
- encodeBatch(X, batch_size=256)[source]
Batch encoder.
- Parameters:
X – Input features.
batch_size – Size of batch.
- Returns:
Embedding.
- fit(inputs, y=None, ml_ind1=array([], dtype=float64), ml_ind2=array([], dtype=float64), cl_ind1=array([], dtype=float64), cl_ind2=array([], dtype=float64), ml_p=1.0, cl_p=1.0, lr=1.0, batch_size=256, epochs=10, update_interval=1, tol=0.001, pt_batch_size=256, pt_lr=0.001, pt_epochs=400)[source]
Train model.
- Parameters:
inputs (Tuple[ndarray, ndarray, ndarray]) – A tuple containing (1) the input features, (2) the raw input features, and (3) the total counts per cell.
y (Optional[ndarray]) – True label. Used for model selection.
ml_ind1 (ndarray) – Index 1 of must-link pairs.
ml_ind2 (ndarray) – Index 2 of must-link pairs.
cl_ind1 (ndarray) – Index 1 of cannot-link pairs.
cl_ind2 (ndarray) – Index 2 of cannot-link pairs.
ml_p (float) – Parameter of must-link loss.
cl_p (float) – Parameter of cannot-link loss.
lr (float) – Learning rate.
batch_size (int) – Size of batch.
epochs (int) – Number of epochs.
update_interval (int) – Update interval of soft label and target distribution.
tol (float) – Tolerance for training loss.
pt_batch_size (int) – Pretrain batch size.
pt_lr (float) – Pretrain learning rate.
pt_epochs (int) – Pretrain epochs.
- forward(x)[source]
Forward propagation.
- Parameters:
x – Input features.
- Returns:
z0 – Embedding.
q – Soft label.
_mean – Data mean from ZINB.
_disp – Data dispersion from ZINB.
_pi – Data dropout probability from ZINB.
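The three ZINB outputs (mean, dispersion, dropout probability) parameterize a zero-inflated negative binomial distribution over counts. The sketch below evaluates the negative log-likelihood of a single count under the standard parameterization; it is an illustration with hypothetical function names, not the loss code used by the library.

```python
import math


def nb_log_prob(x: int, mean: float, disp: float) -> float:
    """Log-probability of count x under a negative binomial (mean, dispersion)."""
    return (
        math.lgamma(x + disp) - math.lgamma(disp) - math.lgamma(x + 1)
        + disp * math.log(disp / (disp + mean))
        + x * math.log(mean / (disp + mean))
    )


def zinb_nll(x: int, mean: float, disp: float, pi: float) -> float:
    """Negative log-likelihood of x under a zero-inflated negative binomial."""
    if x == 0:
        # A zero arises either from dropout (pi) or from the NB component itself.
        nb_zero = (disp / (disp + mean)) ** disp
        return -math.log(pi + (1.0 - pi) * nb_zero)
    return -(math.log(1.0 - pi) + nb_log_prob(x, mean, disp))


# The probabilities implied by the NLL should sum to (almost exactly) one.
total_prob = sum(math.exp(-zinb_nll(k, mean=2.0, disp=1.5, pi=0.3)) for k in range(200))
```

A larger pi shifts probability mass onto zero counts, which is how the model accounts for dropout events in single-cell data.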
- pairwise_loss(p1, p2, cons_type)[source]
Calculate pairwise loss.
- Parameters:
p1 – Distribution 1.
p2 – Distribution 2.
cons_type – Type of loss.
- Returns:
Pairwise loss.
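A pairwise constraint loss of this kind is often built from the inner product of the two cells' soft assignments: must-link pairs are rewarded for agreeing and cannot-link pairs for disagreeing. The sketch below assumes that form, and the "ML" and "CL" values for cons_type are assumptions made for illustration.

```python
import math
from typing import Sequence

Matrix = Sequence[Sequence[float]]


def pairwise_loss_sketch(p1: Matrix, p2: Matrix, cons_type: str) -> float:
    """Mean constraint loss over pairs of soft cluster assignments."""
    # Probability that both cells of a pair fall into the same cluster.
    same = [sum(a * b for a, b in zip(r1, r2)) for r1, r2 in zip(p1, p2)]
    if cons_type == "ML":  # must-link: penalize pairs that are likely to be split
        terms = [-math.log(s) for s in same]
    elif cons_type == "CL":  # cannot-link: penalize pairs that are likely to be merged
        terms = [-math.log(1.0 - s) for s in same]
    else:
        raise ValueError("cons_type must be 'ML' or 'CL'")
    return sum(terms) / len(terms)


agree = ([[0.9, 0.1]], [[0.9, 0.1]])
differ = ([[0.9, 0.1]], [[0.1, 0.9]])
ml_agree = pairwise_loss_sketch(*agree, "ML")
ml_differ = pairwise_loss_sketch(*differ, "ML")
cl_agree = pairwise_loss_sketch(*agree, "CL")
cl_differ = pairwise_loss_sketch(*differ, "CL")
```

The two constraint types mirror each other: assignments that agree give a small must-link loss and a large cannot-link loss, and vice versa.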
- predict(x=None)[source]
Get predictions from the trained model.
- Parameters:
x (Optional[Any]) – Not used, for compatibility with the BaseClusteringMethod class.
- Returns:
Predicted clustering assignment for each cell.
- Return type:
pred
- predict_proba(x=None)[source]
Get the predicted probabilities for each cell.
- Parameters:
x (Optional[Any]) – Not used, for compatibility with the BaseClusteringMethod class.
- Returns:
Predicted probability for each cell.
- Return type:
pred_prop
- pretrain(x, X_raw, n_counts, batch_size=256, lr=0.001, epochs=400)[source]
Pretrain autoencoder.
- Parameters:
x – Input features.
X_raw – Raw input features.
n_counts – Total counts for each cell.
batch_size – Size of batch.
lr – Learning rate.
epochs – Number of epochs.
- class dance.modules.single_modality.clustering.ScDSC(pretrain_path, sigma=1, n_enc_1=512, n_enc_2=256, n_enc_3=256, n_dec_1=256, n_dec_2=256, n_dec_3=512, n_z1=256, n_z2=128, n_z3=32, n_clusters=100, n_input=10, v=1, device='auto')[source]
ScDSC wrapper class.
- Parameters:
pretrain_path (str) – Path of saved autoencoder weights.
sigma (float) – Balance parameter.
n_enc_1 (int) – Output dimension of encoder layer 1.
n_enc_2 (int) – Output dimension of encoder layer 2.
n_enc_3 (int) – Output dimension of encoder layer 3.
n_dec_1 (int) – Output dimension of decoder layer 1.
n_dec_2 (int) – Output dimension of decoder layer 2.
n_dec_3 (int) – Output dimension of decoder layer 3.
n_z1 (int) – Output dimension of hidden layer 1.
n_z2 (int) – Output dimension of hidden layer 2.
n_z3 (int) – Output dimension of hidden layer 3.
n_clusters (int) – Number of clusters.
n_input (int) – Input feature dimension.
v (float) – Parameter of soft assignment.
device (str) – Computing device.
- fit(inputs, y, lr=0.001, epochs=300, bcl=0.1, cl=0.01, rl=1, zl=0.1, pt_epochs=200, pt_batch_size=256, pt_lr=0.001)[source]
Train model.
- Parameters:
inputs (Tuple[spmatrix, ndarray, ndarray, Series]) – A tuple containing (1) the adjacency matrix, (2) the input features, (3) the raw input features, and (4) the total counts for each cell.
y (ndarray) – Label.
lr (float) – Learning rate.
epochs (int) – Number of epochs.
bcl (float) – Parameter of binary crossentropy loss.
cl (float) – Parameter of Kullback–Leibler divergence loss.
rl (float) – Parameter of reconstruction loss.
zl (float) – Parameter of ZINB loss.
pt_epochs (int) –
pt_batch_size (int) –
pt_lr (float) –
- predict(x=None)[source]
Get predictions from the trained model.
- Parameters:
x (Optional[Any]) – Not used, for compatibility with the BaseClusteringMethod class.
- Returns:
Predicted clustering assignment for each cell.
- Return type:
pred
- predict_proba(x=None)[source]
Get the predicted probabilities for each cell.
- Parameters:
x (Optional[Any]) – Not used, for compatibility with the BaseClusteringMethod class.
- Returns:
Predicted probability for each cell.
- Return type:
pred_prop
- class dance.modules.single_modality.clustering.ScDeepCluster(input_dim, z_dim, encodeLayer=[], decodeLayer=[], activation='relu', sigma=1.0, alpha=1.0, gamma=1.0, device='cuda', pretrain_path=None)[source]
ScDeepCluster class.
- Parameters:
input_dim – Dimension of encoder input.
z_dim – Dimension of embedding.
encodeLayer – Dimensions of encoder layers.
decodeLayer – Dimensions of decoder layers.
activation – Activation function.
sigma – Parameter of Gaussian noise.
alpha – Parameter of soft assign.
gamma – Parameter of cluster loss.
device – Computing device.
pretrain_path (Optional[str]) – Path to pretrained weights.
- cluster_loss(p, q)[source]
Calculate cluster loss.
- Parameters:
p – Target distribution.
q – Soft label.
- Returns:
Cluster loss.
- Return type:
loss
- encodeBatch(x, batch_size=256)[source]
Batch encoder.
- Parameters:
x – Input features.
batch_size – Size of batch.
- Returns:
Embedding.
- Return type:
encoded
- fit(inputs, y, n_clusters=10, init_centroid=None, y_pred_init=None, lr=1, batch_size=256, epochs=10, update_interval=1, tol=0.001, pt_batch_size=256, pt_lr=0.001, pt_epochs=400)[source]
Train model.
- Parameters:
inputs (Tuple[ndarray, ndarray, ndarray]) – A tuple containing (1) the input features, (2) the raw input features, and (3) the total counts per cell.
y (ndarray) – True label. Used for model selection.
n_clusters (int) – Number of clusters.
init_centroid (Optional[List[int]]) – Initialization of centroids. If None, perform kmeans to initialize cluster centers.
y_pred_init (Optional[List[int]]) – Predicted label for initialization.
lr (float) – Learning rate.
batch_size (int) – Size of batch.
epochs (int) – Number of epochs.
update_interval (int) – Update interval of soft label and target distribution.
tol (float) – Tolerance for training loss.
pt_batch_size (int) – Pretraining batch size.
pt_lr (float) – Pretraining learning rate.
pt_epochs (int) – Pretraining epochs.
- forward(x)[source]
Forward propagation.
- Parameters:
x – Input features.
- Returns:
z0 – Embedding.
q – Soft label.
_mean – Data mean from ZINB.
_disp – Data dispersion from ZINB.
_pi – Data dropout probability from ZINB.
- forwardAE(x)[source]
Forward propagation of autoencoder.
- Parameters:
x – Input features.
- Returns:
z0 – Embedding.
_mean – Data mean from ZINB.
_disp – Data dispersion from ZINB.
_pi – Data dropout probability from ZINB.
- predict(x=None)[source]
Get predictions from the trained model.
- Parameters:
x (Optional[Any]) – Not used, for compatibility with the BaseClusteringMethod class.
- Returns:
Predicted clustering assignment for each cell.
- Return type:
pred
- predict_proba(x=None)[source]
Get the predicted probabilities for each cell.
- Parameters:
x (Optional[Any]) – Not used, for compatibility with the BaseClusteringMethod class.
- Returns:
Predicted probability for each cell.
- Return type:
pred_prop
- pretrain(x, x_raw, n_counts, batch_size=256, lr=0.001, epochs=400)[source]
Pretrain autoencoder.
- Parameters:
x – Input features.
x_raw – Raw input features.
n_counts – Total counts for each cell.
batch_size – Size of batch.
lr – Learning rate.
epochs – Number of epochs.
- class dance.modules.single_modality.clustering.ScTAG(n_clusters, k=3, hidden_dim=128, latent_dim=15, dec_dim=None, dropout=0.2, device='cuda', alpha=1.0, pretrain_path=None)[source]
The scTAG clustering model.
- Parameters:
n_clusters (int) – Number of clusters.
k (int) – Number of hops of TAG convolutional layer.
hidden_dim (int) – Dimension of hidden layer.
latent_dim (int) – Dimension of latent embedding.
dec_dim (Optional[int]) – Dimensions of decoder layers.
dropout (float) – Dropout rate.
device (str) – Computing device.
alpha (float) – Parameter of soft assign.
pretrain_path (Optional[str]) – Path to save the pretrained autoencoder. If not specified, then do not save/load.
- fit(inputs, y, *, epochs=300, pretrain_epochs=200, lr=0.0005, w_a=0.3, w_x=1, w_c=1.5, w_d=0, info_step=1, max_dist=20, min_dist=0.5, force_pretrain=False)[source]
Train model.
- Parameters:
inputs (Tuple[ndarray, ndarray, ndarray, ndarray]) – A tuple containing the adjacency matrix, the input feature, the raw input feature, and the total counts per cell array.
epochs (int) – Number of epochs.
lr (float) – Learning rate.
w_a (float) – Parameter of reconstruction loss.
w_x (float) – Parameter of ZINB loss.
w_c (float) – Parameter of clustering loss.
w_d (float) – Parameter of pairwise distance loss.
info_step (int) – Interval of showing pretraining loss.
min_dist (float) – Minimum distance of pairwise distance loss.
max_dist (float) – Maximum distance of pairwise distance loss.
force_pretrain (bool) – If set to True, pre-train the model even if pre-training has already been done or a pre-trained model file is available to load.
y (ndarray) –
pretrain_epochs (int) –
- forward(g, x_input)[source]
Forward propagation.
- Parameters:
g – Input graph.
x_input – Input features.
- Returns:
adj_out – Reconstructed adjacency matrix.
z – Embedding.
q – Soft label.
_mean – Data mean from ZINB.
_disp – Data dispersion from ZINB.
_pi – Data dropout probability from ZINB.
- predict(x=None)[source]
Get predictions from the trained model.
- Parameters:
x (Optional[Any]) – Not used, for compatibility with the base module class.
- Returns:
Prediction of given clustering method.
- Return type:
pred
- predict_proba(x=None)[source]
Get predicted probabilities for each cell.
- Parameters:
x (Optional[Any]) – Not used, for compatibility with the base module class.
- Returns:
Predicted probabilities for each cell.
- Return type:
pred_prob
- pretrain(adj, x, x_raw, n_counts, *, epochs=1000, info_step=10, lr=0.0005, w_a=0.3, w_x=1, w_d=0, min_dist=0.5, max_dist=20, force_pretrain=False)[source]
Pretrain autoencoder.
- Parameters:
adj – Adjacency matrix.
x – Input features.
x_raw – Raw input features.
n_counts – Total counts for each cell.
epochs (int) – Number of epochs.
info_step (int) – Interval of showing pretraining loss.
lr (float) – Learning rate.
w_a (float) – Parameter of reconstruction loss.
w_x (float) – Parameter of ZINB loss.
w_d (float) – Parameter of pairwise distance loss.
min_dist (float) – Minimum distance of pairwise distance loss.
max_dist (float) – Maximum distance of pairwise distance loss.
force_pretrain (bool) – If set to True, pre-train the model even if pre-training has already been done or a pre-trained model file is available to load.
Imputation
- class dance.modules.single_modality.imputation.DeepImpute(predictors, targets, dataset, sub_outputdim=512, hidden_dim=256, dropout=0.2, seed=1, gpu=-1)[source]
DeepImpute class.
- Parameters:
learning_rate (float optional) – learning rate
batch_size (int optional) – batch size
max_epochs (int optional) – maximum epochs
patience (int optional) – number of epochs to wait before stopping once the loss stops improving
gpu (int optional) – option to use gpu
loss (string optional) – loss function
output_prefix (string optional) – directory to save outputs
sub_outputdim (int optional) – output dimensions in each subnetwork
hidden_dim (int optional) – dimension of the dense layer in each subnetwork
verbose (int optional) – verbose option
seed (int optional) – random seed
architecture (optional) – network architecture
imputed_only (boolean optional) – whether to return imputed genes only
policy (string optional) – imputation policy
- build(inputdims, outputdims, device='cpu')[source]
Build model.
- Parameters:
inputdims (int) – number of neurons as input in the first layer
- Returns:
models – array of subnetworks
- Return type:
array
- fit(X, Y, mask=None, batch_size=64, lr=0.001, n_epochs=100, patience=5, train_idx=None)[source]
Train model.
- Parameters:
X_train (optional) – Training data including input genes
Y_train (optional) – Training data including target genes to be imputed
X_valid (optional) – Validation data including input predictor genes
Y_valid (optional) – Validation data including target genes to be imputed
predictors (array optional) – input genes as predictors for target genes
- Returns:
- Return type:
None
- load_model(model, i)[source]
Load model.
- Parameters:
model – model to be loaded
i (int) – index of the subnetwork to be loaded
- Returns:
loaded model
- Return type:
model
- predict(X_test, mask=None, test_idx=None, predict_raw=False)[source]
Get predictions from the trained model.
- Parameters:
targetgenes (array optional) – genes to be imputed
- Returns:
imputed – imputed gene expression
- Return type:
DataFrame
- save_model(model, optimizer, i)[source]
Save model.
- Parameters:
model – model to be saved
optimizer – optimizer
i (int) – index of the subnetwork to be saved
- Returns:
- Return type:
None
- score(true_expr, imputed_expr, mask=None, metric='MSE', test_idx=None)[source]
Scoring function of model.
- Parameters:
true_expr – True underlying expression values
imputed_expr – Imputed expression values
test_idx – index of testing cells
metric – Choice of scoring metric - ‘RMSE’ or ‘ARI’
- Returns:
evaluation score
- Return type:
score
- class dance.modules.single_modality.imputation.GraphSCI(num_cells, num_genes, dataset, dropout=0.1, gpu=-1, seed=1)[source]
GraphSCI model, a combination of AE and GNN.
- Parameters:
num_cells (int) – number of cells in expression data
num_genes (int) – number of genes in expression data
dataset (str) – name of training dataset
n_epochs (int optional) – number of training epochs
lr (float optional) – learning rate
weight_decay (float optional) – weight decay rate
dropout (float optional) – probability of weight dropout for training
gpu (int optional) – index of computing device, -1 for cpu.
- evaluate(features, features_raw, graph, mask=None, le=1, la=1, ke=1, ka=1)[source]
Evaluate function, returns loss and reconstructions of expression and adjacency.
- Parameters:
features – input features
features_raw – input raw features
adj_norm – normalized adjacency matrix of gene graph
adj_orig – training adjacency matrix of gene graph
size_factors – cell size factors for reconstruction
le (float optional) – parameter of expression loss
la (float optional) – parameter of adjacency loss
ke (float optional) – parameter of KL divergence of expression
ka (float optional) – parameter of KL divergence of adjacency
- fit(train_data, train_data_raw, graph, mask=None, le=1, la=1, ke=1, ka=1, n_epochs=100, lr=0.001, weight_decay=1e-05, train_idx=None)[source]
Data fitting function.
- Parameters:
train_data – input training features
train_data_raw – input raw training features
adj_train – training adjacency matrix of gene graph
train_size_factors – train size factors for cells
adj_norm_train – normalized training adjacency matrix of gene graph
le (float optional) – parameter of expression loss
la (float optional) – parameter of adjacency loss
ke (float optional) – parameter of KL divergence of expression
ka (float optional) – parameter of KL divergence of adjacency
- Returns:
- Return type:
None
- get_loss(batch, adj_orig, z_adj, z_adj_log_std, z_adj_mean, z_exp, mean, disp, pi, mask, le=1, la=1, ke=1, ka=1)[source]
Loss function for GraphSCI.
- Parameters:
batch – batch features
z_adj – reconstructed adjacency matrix
z_adj_std – standard deviation of distribution of z_adj
z_adj_mean – mean of distribution of z_adj
z_exp – reconstruction of expression values
mean – mean parameter of ZINB dist of z_exp
disp – dispersion parameter of ZINB dist of z_exp
pi – dropout parameter of ZINB dist of z_exp
sf – cell size factors
le (float optional) – parameter of expression loss
la (float optional) – parameter of adjacency loss
ke (float optional) – parameter of KL divergence of expression
ka (float optional) – parameter of KL divergence of adjacency
- Returns:
loss_adj (float) – loss of adjacency reconstruction
loss_exp (float) – loss of expression reconstruction
log_lik (float) – log likelihood loss value
kl (float) – Kullback–Leibler loss
loss (float) – log_lik - kl
- predict(data, data_raw, graph, mask=None)[source]
Predict function.
- Parameters:
data – input true expression data
data_raw – raw input true expression data
adj_norm – normalized adjacency matrix of gene graph
adj_orig – adjacency matrix of gene graph
size_factors – cell size factors for reconstruction
- Returns:
reconstructed expression data
- Return type:
z_exp
- score(true_expr, imputed_expr, mask=None, metric='MSE', log1p=True, test_idx=None)[source]
Scoring function of model.
- Parameters:
true_expr – True underlying expression values
imputed_expr – Imputed expression values
test_idx – index of testing cells
metric – Choice of scoring metric - ‘RMSE’ or ‘ARI’
- Returns:
evaluation score
- Return type:
score
- train(train_data, train_data_raw, graph, train_mask, valid_mask, le=1, la=1, ke=1, ka=1)[source]
Train function, gets loss and performs optimization step.
- Parameters:
train_data – input training features
train_data_raw – input raw training features
adj_orig – training adjacency matrix of gene graph
size_factors – train size factors for cells
adj_norm – normalized training adjacency matrix of gene graph
le (float optional) – parameter of expression loss
la (float optional) – parameter of adjacency loss
ke (float optional) – parameter of KL divergence of expression
ka (float optional) – parameter of KL divergence of adjacency
- Returns:
total_loss – loss value of training loop
- Return type:
float