Single modality tasks

Cell type annotation

class dance.modules.single_modality.cell_type_annotation.ACTINN(*, hidden_dims=(100, 50, 25), lambd=0.01, device='cpu', random_seed=None)[source]

The ACTINN cell-type classification model.

Parameters:
  • hidden_dims (Tuple[int, ...]) – Hidden layer dimensions.

  • lambd (float) – Regularization parameter.

  • device (str) – Training device.

  • random_seed (int | None) – Random seed; if set to None, no fixed seed is used.

compute_loss(z, y)[source]

Compute loss function.

Parameters:
  • z (Tensor) – Output of forward propagation (cells by cell-types).

  • y (Tensor) – Cell labels (cells).

Returns:

Loss.

Return type:

torch.Tensor

fit(x_train, y_train, *, batch_size=128, lr=0.01, num_epochs=50, print_cost=False, seed=None)[source]

Fit the classifier.

Parameters:
  • x_train (Tensor) – Training data (cells by genes).

  • y_train (Tensor) – Training labels (cells by cell-types).

  • batch_size (int) – Training batch size.

  • lr (float) – Initial learning rate.

  • num_epochs (int) – Number of epochs to run.

  • print_cost (bool) – Print training loss if set to True.

  • seed (Optional[int]) – Random seed; if set to None, no fixed seed is used.

predict(x)[source]

Predict cell labels.

Parameters:

x (Tensor) – Gene expression input features (cells by genes).

Returns:

Predicted cell-label indices.

Return type:

torch.Tensor

random_batches(x, y, batch_size=32, seed=None)[source]

Shuffle data and split into batches.

Parameters:
  • x (Tensor) – Training data (cells by genes).

  • y (Tensor) – True labels (cells by cell-types).

  • batch_size (int) – Batch size.

  • seed (int | None) – Random seed; if set to None, no fixed seed is used.

Yields:

Tuple[torch.Tensor, torch.Tensor] – Batch of training data (x, y).
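
A minimal usage sketch on synthetic tensors (shapes, label encoding, and hyperparameters are illustrative only, not recommended settings):

    import torch
    from dance.modules.single_modality.cell_type_annotation import ACTINN

    # Synthetic data: 100 cells x 50 genes, 3 cell types (one-hot labels).
    x = torch.rand(100, 50)
    y = torch.nn.functional.one_hot(torch.randint(0, 3, (100,)), num_classes=3).float()

    model = ACTINN(hidden_dims=(64, 32), lambd=0.01, device="cpu")
    model.fit(x, y, batch_size=32, lr=0.01, num_epochs=10)
    pred = model.predict(x)  # tensor of predicted cell-label indices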

class dance.modules.single_modality.cell_type_annotation.Celltypist(majority_voting=False, clf=None, scaler=None, description=None)[source]

The CellTypist cell annotation method.

Parameters:

majority_voting (bool) – Whether to refine the predicted labels by running the majority voting classifier after over-clustering.

fit(indata, labels=None, C=1.0, solver=None, max_iter=1000, n_jobs=None, use_SGD=False, alpha=0.0001, mini_batch=False, batch_number=100, batch_size=1000, epochs=10, balance_cell_type=False, feature_selection=False, top_genes=300, **kwargs)[source]

Train a celltypist model using a logistic classifier, optionally with mini-batch training, with either a global solver or stochastic gradient descent (SGD) learning.

Parameters:
  • indata (np.ndarray) – Input gene expression matrix (cell x gene).

  • labels (np.array) – 1-D numpy array indicating the cell-type identity of each cell (as indices into the cell types).

  • C (float optional) – Inverse of L2 regularization strength for the traditional logistic classifier. A smaller value can possibly improve model generalization at the cost of decreased accuracy. This argument is ignored if SGD learning is enabled (use_SGD = True). (Default: 1.0)

  • solver (str optional) – Algorithm to use in the optimization problem for traditional logistic classifier. The default behavior is to choose the solver according to the size of the input data. This argument is ignored if SGD learning is enabled (use_SGD = True).

  • max_iter (int optional) – Maximum number of iterations before reaching the minimum of the cost function. Try to decrease max_iter if the cost function does not converge for a long time. This argument is for both traditional and SGD logistic classifiers, and will be ignored if mini-batch SGD training is conducted (use_SGD = True and mini_batch = True). (Default: 1000)

  • n_jobs (int optional) – Number of CPUs used. Defaults to one CPU. -1 means all CPUs are used. This argument is for both traditional and SGD logistic classifiers.

  • use_SGD (bool optional) – Whether to implement SGD learning for the logistic classifier. (Default: False)

  • alpha (float optional) – L2 regularization strength for the SGD logistic classifier. A larger value can possibly improve model generalization at the cost of decreased accuracy. This argument is ignored if SGD learning is disabled (use_SGD = False). (Default: 0.0001)

  • mini_batch (bool optional) – Whether to implement mini-batch training for the SGD logistic classifier. Setting to True may improve the training efficiency for large datasets (for example, >100k cells). This argument is ignored if SGD learning is disabled (use_SGD = False). (Default: False)

  • batch_number (int optional) – The number of batches used for training in each epoch. Each batch contains batch_size cells. For datasets which cannot be binned into batch_number batches, all batches will be used. This argument is relevant only if mini-batch SGD training is conducted (use_SGD = True and mini_batch = True). (Default: 100)

  • batch_size (int optional) – The number of cells within each batch. This argument is relevant only if mini-batch SGD training is conducted (use_SGD = True and mini_batch = True). (Default: 1000)

  • epochs (int optional) – The number of epochs for the mini-batch training procedure. The default values of batch_number, batch_size, and epochs together allow observing ~10^6 training cells. This argument is relevant only if mini-batch SGD training is conducted (use_SGD = True and mini_batch = True). (Default: 10)

  • balance_cell_type (bool optional) – Whether to balance the cell type frequencies in mini-batches during each epoch. Setting to True will sample rare cell types with a higher probability, ensuring close-to-even cell type distributions in mini-batches. This argument is relevant only if mini-batch SGD training is conducted (use_SGD = True and mini_batch = True). (Default: False)

  • feature_selection (bool optional) – Whether to perform two-pass data training where the first round is used for selecting important features/genes using SGD learning. If True, the training time will be longer. (Default: False)

  • top_genes (int optional) – The number of top genes selected from each class/cell-type based on their absolute regression coefficients. The final feature set is combined across all classes (i.e., union). (Default: 300)

  • **kwargs – Other keyword arguments passed to LogisticRegression (use_SGD = False) or SGDClassifier (use_SGD = True).

Returns:

An instance of the Model trained by celltypist.

Return type:

Model

predict(x, as_obj=False, over_clustering=None, min_prop=0)[source]

Run the prediction and (optional) majority voting to annotate the input dataset.

Parameters:
  • x (np.ndarray) – Input expression matrix (cell x gene).

  • as_obj (bool) – If set to True, return the prediction results as an AnnotationResult object. Otherwise, return the predicted cell-label indices as a 1-D numpy array. (Default: False)

  • over_clustering (Union[str, list, tuple, np.ndarray, pd.Series, pd.Index] optional) – This argument can be provided in several ways: 1) an input plain file with the over-clustering result of one cell per line; 2) a string key specifying an existing metadata column in the AnnData (pre-created by the user); 3) a Python list, tuple, numpy array, pandas series, or index representing the over-clustering result of the input cells; 4) if none of the above is provided, a heuristic over-clustering approach will be used according to the size of the input data. Ignored if majority_voting is set to False.

  • min_prop (float optional) – For the dominant cell type within a subcluster, the minimum proportion of cells required to support naming of the subcluster by this cell type. Ignored if majority_voting is set to False. Subclusters that fail to pass this proportion threshold will be assigned 'Heterogeneous'. (Default: 0)

Return type:

Union[ndarray, AnnotationResult]
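
A minimal usage sketch on synthetic arrays (illustrative only; majority voting is left disabled so no over-clustering input is needed):

    import numpy as np
    from dance.modules.single_modality.cell_type_annotation import Celltypist

    # Synthetic data: 200 cells x 100 genes, 4 cell types as integer indices.
    x = np.random.rand(200, 100)
    labels = np.random.randint(0, 4, size=200)

    model = Celltypist(majority_voting=False)
    model.fit(x, labels=labels, max_iter=100)
    pred = model.predict(x)  # 1-D array of predicted cell-label indices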

class dance.modules.single_modality.cell_type_annotation.SVM(args, prj_path='./', random_state=None)[source]

The SVM cell-type classification model.

Parameters:
  • args (argparse.Namespace) – A Namespace containing the arguments of SVM. See the parser help document for more info.

  • prj_path (str) – Project path.

  • random_state (int | None) – Random state.

fit(x, y)[source]

Train the classifier.

Parameters:
  • x (ndarray) – Training cell features.

  • y (ndarray) – Training labels.

predict(x)[source]

Predict cell labels.

Parameters:

x (ndarray) – Samples to be predicted (samples x features).

Returns:

Predicted labels of the input samples.

Return type:

y

save(num, pred)[source]

Save the predictions.

Parameters:
  • num (int) – Test file name.

  • pred (dict) – Prediction labels.
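
A hedged usage sketch; the exact fields that args must carry are defined by the accompanying example script's argument parser (see its help document), so the empty Namespace below is only a placeholder:

    import argparse
    import numpy as np
    from dance.modules.single_modality.cell_type_annotation import SVM

    args = argparse.Namespace()  # placeholder; fill with the parser's arguments
    model = SVM(args, random_state=42)

    x = np.random.rand(150, 80)            # 150 cells x 80 features
    y = np.random.randint(0, 3, size=150)  # 3 cell types
    model.fit(x, y)
    pred = model.predict(x)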

class dance.modules.single_modality.cell_type_annotation.ScDeepSort(dim_in, dim_hid, num_layers, species, tissue, *, dropout=0, batch_size=500, device='cpu')[source]

The ScDeepSort cell-type annotation model.

Parameters:
  • dim_in (int) – Input dimension, i.e., the number of PCA components used for cell and gene features.

  • dim_hid (int) – Hidden dimension.

  • num_layers (int) – Number of convolution layers.

  • species (str) – Species name (only used for determining the read/write path).

  • tissue (str) – Tissue name (only used for determining the read/write path).

  • dropout (int) – Dropout rate.

  • batch_size (int) – Batch size.

  • device (str) – Computation device, e.g., ‘cpu’, ‘cuda’.

cal_loss(graph, idx)[source]

Calculate loss.

Parameters:
  • graph (DGLGraph) – Input cell-gene graph object.

  • idx (Tensor) – 1-D tensor containing the indices of the cell nodes for which to calculate the loss.

Returns:

Averaged loss over all batches.

Return type:

float

evaluate(graph, idx, unsure_rate=2.0)[source]

Evaluate the model on certain cell nodes.

Parameters:
  • graph (DGLGraph) – Input cell-gene graph.

  • idx (Tensor) – 1-D tensor containing the indices of the cell nodes to be evaluated.

  • unsure_rate (float) – Determines the threshold of the maximum predicted probability below which predictions are considered unsure.

Returns:

The total number of correct predictions, the total number of unsure predictions, and the accuracy score.

Return type:

Tuple[int, int, float]

fit(graph, labels, epochs=300, lr=0.001, weight_decay=0, val_ratio=0.2)[source]

Train the scDeepSort model.

Parameters:
  • graph (DGLGraph) – Training graph.

  • labels (Tensor) – Node (cell, gene) labels, -1 for genes.

  • epochs (int) – Number of epochs to train the model.

  • lr (float) – Learning rate.

  • weight_decay (float) – Weight decay regularization strength.

  • val_ratio (float) – Ratio of the training data to hold out for validation.

load_model()[source]

Load the model from the model path.

predict(graph, unsure_rate=2.0, return_unsure=False)[source]

Perform prediction on all test datasets.

Parameters:
  • graph (DGLGraph) – Input cell-gene graph to be predicted.

  • unsure_rate (float) – Determines the threshold of the maximum predicted probability below which predictions are considered uncertain.

  • return_unsure (bool) – If set to True, also return an indicator array marking which predictions are uncertain.

predict_proba(graph)[source]

Perform inference on a test dataset.

Parameters:

graph (DGLGraph) – Input cell-gene graph to be predicted.

Returns:

2-D array of predicted probabilities of the cell-types, where rows are cells and columns are cell-types.

Return type:

np.ndarray

save_model()[source]

Save the model at the save_path.
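
A call-pattern sketch (not self-contained: graph and labels are assumed to be a preprocessed cell-gene DGLGraph and the per-node label tensor described in fit(), typically produced by the dance data pipeline):

    from dance.modules.single_modality.cell_type_annotation import ScDeepSort

    model = ScDeepSort(dim_in=400, dim_hid=200, num_layers=2,
                       species="mouse", tissue="Brain", device="cpu")
    model.fit(graph, labels, epochs=100)          # holds out 20% for validation
    pred = model.predict(graph, unsure_rate=2.0)  # labels for the cell nodes
    proba = model.predict_proba(graph)            # cells x cell-types probabilities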

class dance.modules.single_modality.cell_type_annotation.SingleCellNet(num_trees=100)[source]

The SingleCellNet model.

Parameters:

num_trees (int) – Number of trees in the random forest model.

fit(x, y, num_rand=100, stratify=True, random_state=100)[source]

Train the SingleCellNet random forest model.

Parameters:
  • x – Input features.

  • y – Labels.

  • stratify (bool) – Whether to use balanced class weights in the random forest model.

  • random_state (Optional[int]) – Random state.

  • num_rand (int) – Number of randomized profiles to generate for training.

predict(x)[source]

Predict cell type label.

Parameters:

x – Input features.

Returns:

The most likely cell-type label of each sample.

Return type:

np.ndarray

predict_proba(x)[source]

Calculate predicted probabilities.

Parameters:

x – Input features.

Returns:

Cell-type probability matrix where each row is a cell and each column is a cell-type. The values in the matrix indicate the predicted probability that the cell is a particular cell-type. The last column corresponds to the probability that the model could not confidently identify the cell type of the cell.

Return type:

np.ndarray

randomize(exp, num=50)[source]

Return randomized features.

Parameters:
  • exp – Data to be shuffled.

  • num (int) – Number of samples to return.
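
A minimal usage sketch on synthetic arrays (illustrative only):

    import numpy as np
    from dance.modules.single_modality.cell_type_annotation import SingleCellNet

    x = np.random.rand(300, 120)            # 300 cells x 120 features
    y = np.random.randint(0, 5, size=300)   # 5 cell types

    model = SingleCellNet(num_trees=100)
    model.fit(x, y, num_rand=100, stratify=True, random_state=100)
    pred = model.predict(x)          # most likely cell type per cell
    proba = model.predict_proba(x)   # last column: "unknown" probability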

Clustering

class dance.modules.single_modality.clustering.GraphSC(agg='sum', activation='relu', in_feats=50, n_hidden=1, hidden_dim=200, hidden_1=300, hidden_2=0, dropout=0.1, n_layers=1, hidden_relu=False, hidden_bn=False, n_clusters=10, cluster_method='kmeans', num_workers=1, device='auto')[source]

GraphSC class.

Parameters:
  • agg (str) – Aggregation layer.

  • activation (str) – Activation function.

  • in_feats (int) – Dimension of input features.

  • n_hidden (int) – Number of hidden layers.

  • hidden_dim (int) – Input dimension of hidden layer 1.

  • hidden_1 (int) – Output dimension of hidden layer 1.

  • hidden_2 (int) – Output dimension of hidden layer 2.

  • dropout (float) – Dropout rate.

  • n_layers (int) – Number of graph convolutional layers.

  • hidden_relu (bool) – Use relu activation in hidden layers or not.

  • hidden_bn (bool) – Use batch norm in hidden layers or not.

  • cluster_method (Literal['kmeans', 'leiden']) – Method for clustering.

  • num_workers (int) – Number of workers.

  • device (str) – Computation device to use.

  • n_clusters (int) – Number of clusters.

fit(g, y=None, *, epochs=100, lr=1e-05, batch_size=128, show_epoch_ari=False, eval_epoch=False)[source]

Train graph-sc.

Parameters:
  • g (DGLGraph) – Input cell-gene graph.

  • y (Optional[Any]) – Not used, for compatibility with the BaseClusteringMethod class.

  • epochs (int) – Number of epochs.

  • lr (float) – Learning rate.

  • batch_size (int) – Batch size.

  • show_epoch_ari (bool) – Show the ARI score for each epoch.

  • eval_epoch (bool) – Evaluate every epoch.

predict(x=None)[source]

Get predictions from the graph autoencoder model.

Parameters:

x (Optional[Any]) – Not used, for compatibility with BaseClusteringMethod class.

Returns:

Predictions from the selected clustering method.

Return type:

pred
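
A call-pattern sketch (not self-contained: g is assumed to be a preprocessed cell-gene DGLGraph, e.g. built by the dance graph-construction transforms):

    from dance.modules.single_modality.clustering import GraphSC

    model = GraphSC(in_feats=50, n_clusters=10, cluster_method="kmeans",
                    device="cpu")
    model.fit(g, epochs=100, lr=1e-5, batch_size=128)
    pred = model.predict()  # cluster assignment for each cell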

class dance.modules.single_modality.clustering.ScDCC(input_dim, z_dim, n_clusters, encodeLayer, decodeLayer, activation='relu', sigma=1.0, alpha=1.0, gamma=1.0, ml_weight=1.0, cl_weight=1.0, device='auto', pretrain_path=None)[source]

ScDCC class.

Parameters:
  • input_dim (int) – Dimension of encoder input.

  • z_dim (int) – Dimension of embedding.

  • n_clusters (int) – Number of clusters.

  • encodeLayer (List[int]) – Dimensions of encoder layers.

  • decodeLayer (List[int]) – Dimensions of decoder layers.

  • activation (str) – Activation function.

  • sigma (float) – Parameter of Gaussian noise.

  • alpha (float) – Parameter of soft assign.

  • gamma (float) – Parameter of cluster loss.

  • ml_weight (float) – Parameter of must-link loss.

  • cl_weight (float) – Parameter of cannot-link loss.

  • device (str) – Computation device.

  • pretrain_path (str | None) – Path to save the pretrained autoencoder weights. If not specified, do not save/load.

cluster_loss(p, q)[source]

Calculate cluster loss.

Parameters:
  • p – Target distribution.

  • q – Soft label.

Returns:

Cluster loss.

Return type:

loss

encodeBatch(X, batch_size=256)[source]

Batch encoder.

Parameters:
  • X – Input features.

  • batch_size – Size of batch.

Returns:

Embedding.

Return type:

encoded

fit(inputs, y=None, ml_ind1=array([], dtype=float64), ml_ind2=array([], dtype=float64), cl_ind1=array([], dtype=float64), cl_ind2=array([], dtype=float64), ml_p=1.0, cl_p=1.0, lr=1.0, batch_size=256, epochs=10, update_interval=1, tol=0.001, pt_batch_size=256, pt_lr=0.001, pt_epochs=400)[source]

Train model.

Parameters:
  • inputs (Tuple[ndarray, ndarray, ndarray]) – A tuple containing (1) the input features, (2) the raw input features, and (3) the total counts per cell.

  • y (Optional[ndarray]) – True label. Used for model selection.

  • ml_ind1 (ndarray) – Index 1 of must-link pairs.

  • ml_ind2 (ndarray) – Index 2 of must-link pairs.

  • cl_ind1 (ndarray) – Index 1 of cannot-link pairs.

  • cl_ind2 (ndarray) – Index 2 of cannot-link pairs.

  • ml_p (float) – Parameter of must-link loss.

  • cl_p (float) – Parameter of cannot-link loss.

  • lr (float) – Learning rate.

  • batch_size (int) – Size of batch.

  • epochs (int) – Number of epochs.

  • update_interval (int) – Update interval of soft label and target distribution.

  • tol (float) – Tolerance for training loss.

  • pt_batch_size (int) – Pretrain batch size.

  • pt_lr (float) – Pretrain learning rate.

  • pt_epochs (int) – Pretrain epochs.
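
A minimal fit/predict sketch on synthetic count data, illustrating the (features, raw features, per-cell totals) input tuple; constraint pairs are omitted, so the must-link/cannot-link defaults (empty arrays) apply:

    import numpy as np
    from dance.modules.single_modality.clustering import ScDCC

    x_raw = np.random.poisson(1.0, size=(200, 100)).astype(np.float32)
    x = np.log1p(x_raw)       # stand-in for normalized features
    n_counts = x_raw.sum(1)   # total counts per cell

    model = ScDCC(input_dim=100, z_dim=32, n_clusters=5,
                  encodeLayer=[256, 64], decodeLayer=[64, 256], device="cpu")
    model.fit((x, x_raw, n_counts), epochs=5, pt_epochs=5)
    pred = model.predict()  # cluster assignment for each cell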

forward(x)[source]

Forward propagation.

Parameters:

x – Input features.

Returns:

  • z0 – Embedding.

  • q – Soft label.

  • _mean – Data mean from ZINB.

  • _disp – Data dispersion from ZINB.

  • _pi – Data dropout probability from ZINB.

pairwise_loss(p1, p2, cons_type)[source]

Calculate pairwise loss.

Parameters:
  • p1 – Distribution 1.

  • p2 – Distribution 2.

  • cons_type – Type of loss.

Returns:

Pairwise loss.

Return type:

loss

predict(x=None)[source]

Get predictions from the trained model.

Parameters:

x (Optional[Any]) – Not used, for compatibility with the BaseClusteringMethod class.

Returns:

Predicted clustering assignment for each cell.

Return type:

pred

predict_proba(x=None)[source]

Get the predicted probabilities for each cell.

Parameters:

x (Optional[Any]) – Not used, for compatibility with the BaseClusteringMethod class.

Returns:

Predicted probability for each cell.

Return type:

pred_prop

pretrain(x, X_raw, n_counts, batch_size=256, lr=0.001, epochs=400)[source]

Pretrain autoencoder.

Parameters:
  • x – Input features.

  • X_raw – Raw input features.

  • n_counts – Total counts for each cell.

  • batch_size – Size of batch.

  • lr – Learning rate.

  • epochs – Number of epochs.

soft_assign(z)[source]

Soft assign q with z.

Parameters:

z – Embedding.

Returns:

Soft label.

Return type:

q

target_distribution(q)[source]

Calculate auxiliary target distribution p with q.

Parameters:

q – Soft label.

Returns:

Target distribution.

Return type:

p
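
For reference, soft_assign and target_distribution follow the standard DEC-style formulation (a sketch, assuming embedding z_i, cluster center mu_j, and the soft-assignment parameter alpha playing the role of the Student's t degrees of freedom):

    q_{ij} = \frac{\left(1 + \lVert z_i - \mu_j \rVert^2 / \alpha\right)^{-\frac{\alpha+1}{2}}}
                  {\sum_{j'} \left(1 + \lVert z_i - \mu_{j'} \rVert^2 / \alpha\right)^{-\frac{\alpha+1}{2}}},
    \qquad
    p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}},
    \quad f_j = \sum_i q_{ij}

cluster_loss then penalizes the Kullback-Leibler divergence KL(P || Q), weighted by gamma, which sharpens the soft assignments toward high-confidence clusters.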

class dance.modules.single_modality.clustering.ScDSC(pretrain_path, sigma=1, n_enc_1=512, n_enc_2=256, n_enc_3=256, n_dec_1=256, n_dec_2=256, n_dec_3=512, n_z1=256, n_z2=128, n_z3=32, n_clusters=100, n_input=10, v=1, device='auto')[source]

ScDSC wrapper class.

Parameters:
  • pretrain_path (str) – Path of saved autoencoder weights.

  • sigma (float) – Balance parameter.

  • n_enc_1 (int) – Output dimension of encoder layer 1.

  • n_enc_2 (int) – Output dimension of encoder layer 2.

  • n_enc_3 (int) – Output dimension of encoder layer 3.

  • n_dec_1 (int) – Output dimension of decoder layer 1.

  • n_dec_2 (int) – Output dimension of decoder layer 2.

  • n_dec_3 (int) – Output dimension of decoder layer 3.

  • n_z1 (int) – Output dimension of hidden layer 1.

  • n_z2 (int) – Output dimension of hidden layer 2.

  • n_z3 (int) – Output dimension of hidden layer 3.

  • n_clusters (int) – Number of clusters.

  • n_input (int) – Input feature dimension.

  • v (float) – Parameter of soft assignment.

  • device (str) – Computing device.

fit(inputs, y, lr=0.001, epochs=300, bcl=0.1, cl=0.01, rl=1, zl=0.1, pt_epochs=200, pt_batch_size=256, pt_lr=0.001)[source]

Train model.

Parameters:
  • inputs (Tuple[spmatrix, ndarray, ndarray, Series]) – A tuple containing (1) the adjacency matrix, (2) the input features, (3) the raw input features, and (4) the total counts for each cell.

  • y (ndarray) – Label.

  • lr (float) – Learning rate.

  • epochs (int) – Number of epochs.

  • bcl (float) – Parameter of binary crossentropy loss.

  • cl (float) – Parameter of Kullback–Leibler divergence loss.

  • rl (float) – Parameter of reconstruction loss.

  • zl (float) – Parameter of ZINB loss.

  • pt_epochs (int) – Pretraining epochs.

  • pt_batch_size (int) – Pretraining batch size.

  • pt_lr (float) – Pretraining learning rate.

predict(x=None)[source]

Get predictions from the trained model.

Parameters:

x (Optional[Any]) – Not used, for compatibility with the BaseClusteringMethod class.

Returns:

Predicted clustering assignment for each cell.

Return type:

pred

predict_proba(x=None)[source]

Get the predicted probabilities for each cell.

Parameters:

x (Optional[Any]) – Not used, for compatibility with the BaseClusteringMethod class.

Returns:

Predicted probability for each cell.

Return type:

pred_prop

pretrain(x, batch_size=256, epochs=200, lr=0.001)[source]

Pretrain autoencoder.

Parameters:
  • x – Input features.

  • batch_size – Size of batch.

  • epochs – Number of epochs.

  • lr – Learning rate.

target_distribution(q)[source]

Calculate auxiliary target distribution p with q.

Parameters:

q – Soft label.

Returns:

Target distribution.

Return type:

p
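
A minimal fit/predict sketch on synthetic data, illustrating the (adjacency, features, raw features, per-cell totals) input tuple; the checkpoint path and hyperparameters are placeholders:

    import numpy as np
    import pandas as pd
    import scipy.sparse as sp
    from dance.modules.single_modality.clustering import ScDSC

    x_raw = np.random.poisson(1.0, size=(200, 100)).astype(np.float32)
    x = np.log1p(x_raw)
    adj = sp.random(200, 200, density=0.05, format="csr")  # toy cell graph
    n_counts = pd.Series(x_raw.sum(1))
    y = np.random.randint(0, 5, size=200)  # labels, used for model selection

    model = ScDSC(pretrain_path="scdsc_pretrain.pt", n_input=100,
                  n_clusters=5, device="cpu")
    model.fit((adj, x, x_raw, n_counts), y, epochs=5, pt_epochs=5)
    pred = model.predict()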

class dance.modules.single_modality.clustering.ScDeepCluster(input_dim, z_dim, encodeLayer=[], decodeLayer=[], activation='relu', sigma=1.0, alpha=1.0, gamma=1.0, device='cuda', pretrain_path=None)[source]

ScDeepCluster class.

Parameters:
  • input_dim – Dimension of encoder input.

  • z_dim – Dimension of embedding.

  • encodeLayer – Dimensions of encoder layers.

  • decodeLayer – Dimensions of decoder layers.

  • activation – Activation function.

  • sigma – Parameter of Gaussian noise.

  • alpha – Parameter of soft assign.

  • gamma – Parameter of cluster loss.

  • device – Computing device.

  • pretrain_path (Optional[str]) – Path to pretrained weights.

cluster_loss(p, q)[source]

Calculate cluster loss.

Parameters:
  • p – Target distribution.

  • q – Soft label.

Returns:

Cluster loss.

Return type:

loss

encodeBatch(x, batch_size=256)[source]

Batch encoder.

Parameters:
  • x – Input features.

  • batch_size – Size of batch.

Returns:

Embedding.

Return type:

encoded

fit(inputs, y, n_clusters=10, init_centroid=None, y_pred_init=None, lr=1, batch_size=256, epochs=10, update_interval=1, tol=0.001, pt_batch_size=256, pt_lr=0.001, pt_epochs=400)[source]

Train model.

Parameters:
  • inputs (Tuple[ndarray, ndarray, ndarray]) – A tuple containing (1) the input features, (2) the raw input features, and (3) the total counts per cell.

  • y (ndarray) – True label. Used for model selection.

  • n_clusters (int) – Number of clusters.

  • init_centroid (Optional[List[int]]) – Initialization of centroids. If None, perform kmeans to initialize cluster centers.

  • y_pred_init (Optional[List[int]]) – Predicted label for initialization.

  • lr (float) – Learning rate.

  • batch_size (int) – Size of batch.

  • epochs (int) – Number of epochs.

  • update_interval (int) – Update interval of soft label and target distribution.

  • tol (float) – Tolerance for training loss.

  • pt_batch_size (int) – Pretraining batch size.

  • pt_lr (float) – Pretraining learning rate.

  • pt_epochs (int) – Pretraining epochs.

forward(x)[source]

Forward propagation.

Parameters:

x – Input features.

Returns:

  • z0 – Embedding.

  • q – Soft label.

  • _mean – Data mean from ZINB.

  • _disp – Data dispersion from ZINB.

  • _pi – Data dropout probability from ZINB.

forwardAE(x)[source]

Forward propagation of autoencoder.

Parameters:

x – Input features.

Returns:

  • z0 – Embedding.

  • _mean – Data mean from ZINB.

  • _disp – Data dispersion from ZINB.

  • _pi – Data dropout probability from ZINB.

load_model(path)[source]

Load model from path.

Parameters:

path – Path to load model.

predict(x=None)[source]

Get predictions from the trained model.

Parameters:

x (Optional[Any]) – Not used, for compatibility with the BaseClusteringMethod class.

Returns:

Predicted clustering assignment for each cell.

Return type:

pred

predict_proba(x=None)[source]

Get the predicted probabilities for each cell.

Parameters:

x (Optional[Any]) – Not used, for compatibility with the BaseClusteringMethod class.

Returns:

Predicted probability for each cell.

Return type:

pred_prop

pretrain(x, x_raw, n_counts, batch_size=256, lr=0.001, epochs=400)[source]

Pretrain autoencoder.

Parameters:
  • x – Input features.

  • x_raw – Raw input features.

  • n_counts – Total counts for each cell.

  • batch_size – Size of batch.

  • lr – Learning rate.

  • epochs – Number of epochs.

save_model(path)[source]

Save model to path.

Parameters:

path – Path to save model.

soft_assign(z)[source]

Soft assign q with z.

Parameters:

z – Embedding.

Returns:

Soft label.

Return type:

q

target_distribution(q)[source]

Calculate auxiliary target distribution p with q.

Parameters:

q – Soft label.

Returns:

Target distribution.

Return type:

p
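
A minimal sketch showing training followed by checkpointing via save_model (synthetic data, illustrative settings; the checkpoint path is a placeholder):

    import numpy as np
    from dance.modules.single_modality.clustering import ScDeepCluster

    x_raw = np.random.poisson(1.0, size=(200, 100)).astype(np.float32)
    x = np.log1p(x_raw)
    n_counts = x_raw.sum(1)
    y = np.random.randint(0, 5, size=200)  # labels, used for model selection

    model = ScDeepCluster(input_dim=100, z_dim=32, encodeLayer=[256, 64],
                          decodeLayer=[64, 256], device="cpu")
    model.fit((x, x_raw, n_counts), y, n_clusters=5, epochs=5, pt_epochs=5)
    model.save_model("scdeepcluster.pt")  # reload later with load_model(...)
    pred = model.predict()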

class dance.modules.single_modality.clustering.ScTAG(n_clusters, k=3, hidden_dim=128, latent_dim=15, dec_dim=None, dropout=0.2, device='cuda', alpha=1.0, pretrain_path=None)[source]

The scTAG clustering model.

Parameters:
  • n_clusters (int) – Number of clusters.

  • k (int) – Number of hops of TAG convolutional layer.

  • hidden_dim (int) – Dimension of hidden layer.

  • latent_dim (int) – Dimension of latent embedding.

  • dec_dim (Optional[int]) – Dimensions of decoder layers.

  • dropout (float) – Dropout rate.

  • device (str) – Computing device.

  • alpha (float) – Parameter of soft assign.

  • pretrain_path (Optional[str]) – Path to save the pretrained autoencoder. If not specified, then do not save/load.

fit(inputs, y, *, epochs=300, pretrain_epochs=200, lr=0.0005, w_a=0.3, w_x=1, w_c=1.5, w_d=0, info_step=1, max_dist=20, min_dist=0.5, force_pretrain=False)[source]

Train the model.

Parameters:
  • inputs (Tuple[ndarray, ndarray, ndarray, ndarray]) – A tuple containing the adjacency matrix, the input feature, the raw input feature, and the total counts per cell array.

  • epochs (int) – Number of epochs.

  • lr (float) – Learning rate.

  • w_a (float) – Parameter of reconstruction loss.

  • w_x (float) – Parameter of ZINB loss.

  • w_c (float) – Parameter of clustering loss.

  • w_d (float) – Parameter of pairwise distance loss.

  • info_step (int) – Interval of showing pretraining loss.

  • min_dist (float) – Minimum distance of pairwise distance loss.

  • max_dist (float) – Maximum distance of pairwise distance loss.

  • force_pretrain (bool) – If set to True, pre-train the model even if pre-training has already been done or a pre-trained model file is available to load.

  • y (ndarray) – True label. Used for model selection.

  • pretrain_epochs (int) – Number of pretraining epochs.

forward(g, x_input)[source]

Forward propagation.

Parameters:
  • g – Input graph.

  • x_input – Input features.

Returns:

  • adj_out – Reconstructed adjacency matrix.

  • z – Embedding.

  • q – Soft label.

  • _mean – Data mean from ZINB.

  • _disp – Data dispersion from ZINB.

  • _pi – Data dropout probability from ZINB.

init_model(adj, x)[source]

Initialize model.

Parameters:
  • adj (ndarray) – Adjacency matrix.

  • x (ndarray) – Input features.

predict(x=None)[source]

Get predictions from the trained model.

Parameters:

x (Optional[Any]) – Not used, for compatibility with the base module class.

Returns:

Predicted clustering assignment for each cell.

Return type:

pred

predict_proba(x=None)[source]

Get predicted probabilities for each cell.

Parameters:

x (Optional[Any]) – Not used, for compatibility with the base module class.

Returns:

Predicted probabilities for each cell.

Return type:

pred_prob

pretrain(adj, x, x_raw, n_counts, *, epochs=1000, info_step=10, lr=0.0005, w_a=0.3, w_x=1, w_d=0, min_dist=0.5, max_dist=20, force_pretrain=False)[source]

Pretrain autoencoder.

Parameters:
  • adj – Adjacency matrix.

  • x – Input features.

  • x_raw – Raw input features.

  • n_counts – Total counts for each cell.

  • epochs (int) – Number of epochs.

  • info_step (int) – Interval of showing pretraining loss.

  • lr (float) – Learning rate.

  • w_a (float) – Parameter of reconstruction loss.

  • w_x (float) – Parameter of ZINB loss.

  • w_d (float) – Parameter of pairwise distance loss.

  • min_dist (float) – Minimum distance of pairwise distance loss.

  • max_dist (float) – Maximum distance of pairwise distance loss.

  • force_pretrain (bool) – If set to True, pre-train the model even if pre-training has already been done or a pre-trained model file is available to load.

soft_assign(z)[source]

Soft assign q with z.

Parameters:

z – Embedding.

Returns:

Soft label.

Return type:

q

target_distribution(q)[source]

Calculate auxiliary target distribution p with q.

Parameters:

q – Soft label.

Returns:

Target distribution.

Return type:

p
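
A minimal fit/predict sketch on synthetic arrays, illustrating the (adjacency, features, raw features, per-cell totals) input tuple; since pretrain_path is left unset, the pretrained autoencoder is neither saved nor loaded:

    import numpy as np
    from dance.modules.single_modality.clustering import ScTAG

    x_raw = np.random.poisson(1.0, size=(200, 100)).astype(np.float32)
    x = np.log1p(x_raw)
    adj = (np.random.rand(200, 200) < 0.05).astype(np.float32)  # toy graph
    n_counts = x_raw.sum(1)
    y = np.random.randint(0, 5, size=200)  # labels, used for model selection

    model = ScTAG(n_clusters=5, k=3, latent_dim=15, device="cpu")
    model.fit((adj, x, x_raw, n_counts), y, epochs=5, pretrain_epochs=5)
    pred = model.predict()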

Imputation

class dance.modules.single_modality.imputation.DeepImpute(predictors, targets, dataset, sub_outputdim=512, hidden_dim=256, dropout=0.2, seed=1, gpu=-1)[source]

DeepImpute class.

Parameters:
  • learning_rate (float optional) – Learning rate.

  • batch_size (int optional) – Batch size.

  • max_epochs (int optional) – Maximum number of epochs.

  • patience (int optional) – Number of epochs to wait before stopping once the loss stops improving.

  • gpu (int optional) – Option to use GPU.

  • loss (string optional) – Loss function.

  • output_prefix (string optional) – Directory to save outputs.

  • sub_outputdim (int optional) – Output dimension of each subnetwork.

  • hidden_dim (int optional) – Dimension of the dense layer in each subnetwork.

  • verbose (int optional) – Verbosity option.

  • seed (int optional) – Random seed.

  • architecture (optional) – Network architecture.

  • imputed_only (boolean optional) – Whether to return imputed genes only.

  • policy (string optional) – Imputation policy.

build(inputdims, outputdims, device='cpu')[source]

Build model.

Parameters:
  • inputdims (int) – Number of input neurons in the first layer of each subnetwork.

  • outputdims (int) – Number of output neurons in each subnetwork.

  • device (str) – Computation device.

Returns:

models – Array of subnetworks.

Return type:

array

fit(X, Y, mask=None, batch_size=64, lr=0.001, n_epochs=100, patience=5, train_idx=None)[source]

Train model.

Parameters:
  • X – Training data containing the input (predictor) genes.

  • Y – Training data containing the target genes to be imputed.

  • mask (optional) – Mask selecting the entries used for training.

  • batch_size (int optional) – Batch size.

  • lr (float optional) – Learning rate.

  • n_epochs (int optional) – Maximum number of training epochs.

  • patience (int optional) – Number of epochs to wait before early stopping once the loss stops improving.

  • train_idx (optional) – Indices of the training cells.

Returns:

Return type:

None

load_model(model, i)[source]

Load model.

Parameters:
  • model – Model to be loaded.

  • i (int) – Index of the subnetwork to be loaded.

Returns:

Loaded model.

Return type:

model

predict(X_test, mask=None, test_idx=None, predict_raw=False)[source]

Get predictions from the trained model.

Parameters:

  • X_test – Test data containing the predictor genes.

  • mask (optional) – Mask selecting the entries to impute.

  • test_idx (optional) – Indices of the testing cells.

  • predict_raw (bool optional) – If set to True, return the raw model predictions without merging them back into the original expression values.

Returns:

imputed – Imputed gene expression.

Return type:

DataFrame

save_model(model, optimizer, i)[source]

Save model.

Parameters:
  • model – Model to be saved.

  • optimizer – Optimizer.

  • i (int) – Index of the subnetwork to be saved.

Returns:

Return type:

None

score(true_expr, imputed_expr, mask=None, metric='MSE', test_idx=None)[source]

Scoring function of model.

Parameters:
  • true_expr – True underlying expression values.

  • imputed_expr – Imputed expression values.

  • test_idx – Indices of the testing cells.

  • metric – Choice of scoring metric: 'RMSE' or 'ARI'.

Returns:

Evaluation score.

Return type:

score

wMSE(y_true, y_pred, binary=False)[source]

Weighted MSE.

Parameters:
  • y_true (array) – true expression

  • y_pred (array) – Predicted expression.

  • binary (boolean optional) – whether to use binary weights

Returns:

val – weighted MSE

Return type:

float
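
A hedged sketch of the fit/predict cycle; how predictors and targets are constructed per subnetwork is dataset-dependent, so the even split below is only a placeholder assumption:

    import numpy as np
    from dance.modules.single_modality.imputation import DeepImpute

    X = np.random.rand(200, 500)  # 200 cells x 500 genes
    # Placeholder: two subnetworks, each imputing one half of the genes
    # from the other half (real pipelines pick predictors by correlation).
    targets = [np.arange(0, 250), np.arange(250, 500)]
    predictors = [np.arange(250, 500), np.arange(0, 250)]

    model = DeepImpute(predictors, targets, dataset="synthetic",
                       sub_outputdim=250, hidden_dim=128, seed=1, gpu=-1)
    model.fit(X, X, batch_size=64, lr=1e-3, n_epochs=5)
    imputed = model.predict(X)  # DataFrame of imputed expression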

class dance.modules.single_modality.imputation.GraphSCI(num_cells, num_genes, dataset, dropout=0.1, gpu=-1, seed=1)[source]

GraphSCI model, a combination of an autoencoder (AE) and a graph neural network (GNN).

Parameters:
  • num_cells (int) – Number of cells in the expression data.

  • num_genes (int) – Number of genes in the expression data.

  • dataset (str) – Name of the training dataset.

  • n_epochs (int optional) – Number of training epochs.

  • lr (float optional) – Learning rate.

  • weight_decay (float optional) – Weight decay rate.

  • dropout (float optional) – Probability of weight dropout during training.

  • gpu (int optional) – Index of the computing device; -1 for CPU.

evaluate(features, features_raw, graph, mask=None, le=1, la=1, ke=1, ka=1)[source]

Evaluate the model; returns the loss and the reconstructions of the expression and adjacency matrices.

Parameters:
  • features – Input features.

  • features_raw – Raw input features.

  • graph – Gene graph, carrying the normalized and original adjacency matrices.

  • mask (optional) – Mask selecting the entries to evaluate.

  • le (float optional) – Parameter of the expression loss.

  • la (float optional) – Parameter of the adjacency loss.

  • ke (float optional) – Parameter of the KL divergence of the expression.

  • ka (float optional) – Parameter of the KL divergence of the adjacency.

fit(train_data, train_data_raw, graph, mask=None, le=1, la=1, ke=1, ka=1, n_epochs=100, lr=0.001, weight_decay=1e-05, train_idx=None)[source]

Fit the model to the training data.

Parameters:
  • train_data – Input training features.

  • train_data_raw – Raw input training features.

  • graph – Gene graph, carrying the normalized and original adjacency matrices.

  • mask (optional) – Mask selecting the entries used for training.

  • le (float optional) – Parameter of the expression loss.

  • la (float optional) – Parameter of the adjacency loss.

  • ke (float optional) – Parameter of the KL divergence of the expression.

  • ka (float optional) – Parameter of the KL divergence of the adjacency.

  • n_epochs (int optional) – Number of training epochs.

  • lr (float optional) – Learning rate.

  • weight_decay (float optional) – Weight decay rate.

  • train_idx (optional) – Indices of the training cells.

Returns:

Return type:

None

get_loss(batch, adj_orig, z_adj, z_adj_log_std, z_adj_mean, z_exp, mean, disp, pi, mask, le=1, la=1, ke=1, ka=1)[source]

Loss function for GraphSCI.

Parameters:
  • batch – Batch features.

  • adj_orig – Original adjacency matrix of the gene graph.

  • z_adj – Reconstructed adjacency matrix.

  • z_adj_log_std – Log standard deviation of the distribution of z_adj.

  • z_adj_mean – Mean of the distribution of z_adj.

  • z_exp – Reconstruction of the expression values.

  • mean – Mean parameter of the ZINB distribution of z_exp.

  • disp – Dispersion parameter of the ZINB distribution of z_exp.

  • pi – Dropout parameter of the ZINB distribution of z_exp.

  • mask (optional) – Mask selecting the entries used in the loss.

  • le (float optional) – Parameter of the expression loss.

  • la (float optional) – Parameter of the adjacency loss.

  • ke (float optional) – Parameter of the KL divergence of the expression.

  • ka (float optional) – Parameter of the KL divergence of the adjacency.

Returns:

  • loss_adj (float) – Loss of the adjacency reconstruction.

  • loss_exp (float) – Loss of the expression reconstruction.

  • log_lik (float) – Log-likelihood loss value.

  • kl (float) – Kullback-Leibler loss.

  • loss (float) – log_lik - kl.

load_model()[source]

Load the model.

predict(data, data_raw, graph, mask=None)[source]

Predict the imputed expression values.

Parameters:
  • data – Input true expression data.

  • data_raw – Raw input expression data.

  • graph – Gene graph, carrying the normalized and original adjacency matrices.

  • mask (optional) – Mask selecting the entries to impute.

Returns:

Reconstructed expression data.

Return type:

z_exp

save_model()[source]

Save the model; saves both the AE and the GNN.

score(true_expr, imputed_expr, mask=None, metric='MSE', log1p=True, test_idx=None)[source]

Scoring function of model.

Parameters:
  • true_expr – True underlying expression values.

  • imputed_expr – Imputed expression values.

  • test_idx – Indices of the testing cells.

  • metric – Choice of scoring metric: 'RMSE' or 'ARI'.

  • log1p – Whether to log1p-transform the expression values before scoring.

Returns:

Evaluation score.

Return type:

score

train(train_data, train_data_raw, graph, train_mask, valid_mask, le=1, la=1, ke=1, ka=1)[source]

Training step: computes the loss and performs an optimization step.

Parameters:
  • train_data – Input training features.

  • train_data_raw – Raw input training features.

  • graph – Gene graph, carrying the normalized and original adjacency matrices.

  • train_mask – Mask selecting the training entries.

  • valid_mask – Mask selecting the validation entries.

  • le (float optional) – Parameter of the expression loss.

  • la (float optional) – Parameter of the adjacency loss.

  • ke (float optional) – Parameter of the KL divergence of the expression.

  • ka (float optional) – Parameter of the KL divergence of the adjacency.

Returns:

total_loss – Loss value of the training loop.

Return type:

float
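
A call-pattern sketch (not self-contained: data, data_raw, and the gene graph are assumed to be prepared by the dance preprocessing pipeline):

    from dance.modules.single_modality.imputation import GraphSCI

    model = GraphSCI(num_cells=200, num_genes=500, dataset="synthetic", gpu=-1)
    model.fit(data, data_raw, graph, n_epochs=50, lr=1e-3)
    imputed = model.predict(data, data_raw, graph)
    mse = model.score(data_raw, imputed, metric="MSE")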