Single modality tasks

Cell type annotation

class dance.modules.single_modality.cell_type_annotation.ACTINN(*, hidden_dims=(100, 50, 25), lambd=0.01, device='cpu', random_seed=None)[source]

The ACTINN cell-type classification model.

Parameters:
  • hidden_dims (Tuple[int, ...]) – Hidden layer dimensions.

  • lambd (float) – Regularization parameter.

  • device (str) – Training device.

  • random_seed (int | None) – Random seed; if set to None, no fixed seed is used.

compute_loss(z, y)[source]

Compute loss function.

Parameters:
  • z (Tensor) – Output of forward propagation (cells by cell-types).

  • y (Tensor) – Cell labels (cells).

Returns:

Loss.

Return type:

torch.Tensor

fit(x_train, y_train, *, batch_size=128, lr=0.01, num_epochs=50, print_cost=False, seed=None)[source]

Fit the classifier.

Parameters:
  • x_train (Tensor) – Training data (cells by genes).

  • y_train (Tensor) – Training labels (cells by cell-types).

  • batch_size (int) – Training batch size.

  • lr (float) – Initial learning rate.

  • num_epochs (int) – Number of epochs to run.

  • print_cost (bool) – Print training loss if set to True.

  • seed (Optional[int]) – Random seed; if set to None, no fixed seed is used.

predict(x)[source]

Predict cell labels.

Parameters:

x (Tensor) – Gene expression input features (cells by genes).

Returns:

Predicted cell-label indices.

Return type:

torch.Tensor

random_batches(x, y, batch_size=32, seed=None)[source]

Shuffle data and split into batches.

Parameters:
  • x (Tensor) – Training data (cells by genes).

  • y (Tensor) – True labels (cells by cell-types).

  • batch_size (int) – Batch size.

  • seed (int | None) – Random seed; if set to None, no fixed seed is used.

Yields:

Tuple[torch.Tensor, torch.Tensor] – Batch of training data (x, y).
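
A minimal usage sketch on synthetic tensors (shapes, label encoding, and hyperparameters are illustrative only, not recommended settings):

    import torch
    from dance.modules.single_modality.cell_type_annotation import ACTINN

    # Synthetic data: 100 cells x 50 genes, 3 cell types (one-hot labels).
    x = torch.rand(100, 50)
    y = torch.nn.functional.one_hot(torch.randint(0, 3, (100,)), num_classes=3).float()

    model = ACTINN(hidden_dims=(64, 32), lambd=0.01, device="cpu")
    model.fit(x, y, batch_size=32, lr=0.01, num_epochs=10)
    pred = model.predict(x)  # tensor of predicted cell-label indices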

class dance.modules.single_modality.cell_type_annotation.Celltypist(majority_voting=False, clf=None, scaler=None, description=None)[source]

The CellTypist cell annotation method.

Parameters:

majority_voting (bool) – Whether to refine the predicted labels by running the majority voting classifier after over-clustering.

fit(indata, labels=None, C=1.0, solver=None, max_iter=1000, n_jobs=None, use_SGD=False, alpha=0.0001, mini_batch=False, batch_number=100, batch_size=1000, epochs=10, balance_cell_type=False, feature_selection=False, top_genes=300, **kwargs)[source]

Train a celltypist model using a logistic classifier, optionally with mini-batch training, with either a global solver or stochastic gradient descent (SGD) learning.

Parameters:
  • indata (np.ndarray) – Input gene expression matrix (cell x gene).

  • labels (np.array) – 1-D numpy array indicating the cell-type identity of each cell (as indices into the cell types).

  • C (float optional) – Inverse of L2 regularization strength for the traditional logistic classifier. A smaller value can possibly improve model generalization at the cost of decreased accuracy. This argument is ignored if SGD learning is enabled (use_SGD = True). (Default: 1.0)

  • solver (str optional) – Algorithm to use in the optimization problem for traditional logistic classifier. The default behavior is to choose the solver according to the size of the input data. This argument is ignored if SGD learning is enabled (use_SGD = True).

  • max_iter (int optional) – Maximum number of iterations before reaching the minimum of the cost function. Try to decrease max_iter if the cost function does not converge for a long time. This argument is for both traditional and SGD logistic classifiers, and will be ignored if mini-batch SGD training is conducted (use_SGD = True and mini_batch = True). (Default: 1000)

  • n_jobs (int optional) – Number of CPUs used. Defaults to one CPU. -1 means all CPUs are used. This argument is for both traditional and SGD logistic classifiers.

  • use_SGD (bool optional) – Whether to implement SGD learning for the logistic classifier. (Default: False)

  • alpha (float optional) – L2 regularization strength for the SGD logistic classifier. A larger value can possibly improve model generalization at the cost of decreased accuracy. This argument is ignored if SGD learning is disabled (use_SGD = False). (Default: 0.0001)

  • mini_batch (bool optional) – Whether to implement mini-batch training for the SGD logistic classifier. Setting to True may improve the training efficiency for large datasets (for example, >100k cells). This argument is ignored if SGD learning is disabled (use_SGD = False). (Default: False)

  • batch_number (int optional) – The number of batches used for training in each epoch. Each batch contains batch_size cells. For datasets which cannot be binned into batch_number batches, all batches will be used. This argument is relevant only if mini-batch SGD training is conducted (use_SGD = True and mini_batch = True). (Default: 100)

  • batch_size (int optional) – The number of cells within each batch. This argument is relevant only if mini-batch SGD training is conducted (use_SGD = True and mini_batch = True). (Default: 1000)

  • epochs (int optional) – The number of epochs for the mini-batch training procedure. The default values of batch_number, batch_size, and epochs together allow observing ~10^6 training cells. This argument is relevant only if mini-batch SGD training is conducted (use_SGD = True and mini_batch = True). (Default: 10)

  • balance_cell_type (bool optional) – Whether to balance the cell type frequencies in mini-batches during each epoch. Setting to True will sample rare cell types with a higher probability, ensuring close-to-even cell type distributions in mini-batches. This argument is relevant only if mini-batch SGD training is conducted (use_SGD = True and mini_batch = True). (Default: False)

  • feature_selection (bool optional) – Whether to perform two-pass data training where the first round is used for selecting important features/genes using SGD learning. If True, the training time will be longer. (Default: False)

  • top_genes (int optional) – The number of top genes selected from each class/cell-type based on their absolute regression coefficients. The final feature set is combined across all classes (i.e., union). (Default: 300)

  • **kwargs – Other keyword arguments passed to LogisticRegression (use_SGD = False) or SGDClassifier (use_SGD = True).

Returns:

An instance of the Model trained by celltypist.

Return type:

Model

predict(x, as_obj=False, over_clustering=None, min_prop=0)[source]

Run the prediction and (optional) majority voting to annotate the input dataset.

Parameters:
  • x (np.ndarray) – Input expression matrix (cell x gene).

  • as_obj (bool) – If set to True, return the prediction results as an AnnotationResult object. Otherwise, return the predicted cell-label indices as a 1-D numpy array. (Default: False)

  • over_clustering (Union[str, list, tuple, np.ndarray, pd.Series, pd.Index] optional) – This argument can be provided in several ways: 1) an input plain file with the over-clustering result of one cell per line; 2) a string key specifying an existing metadata column in the AnnData (pre-created by the user); 3) a Python list, tuple, numpy array, pandas series, or index representing the over-clustering result of the input cells; 4) if none of the above is provided, a heuristic over-clustering approach will be used according to the size of the input data. Ignored if majority_voting is set to False.

  • min_prop (float optional) – For the dominant cell type within a subcluster, the minimum proportion of cells required to support naming of the subcluster by this cell type. Ignored if majority_voting is set to False. Subclusters that fail to pass this proportion threshold will be assigned 'Heterogeneous'. (Default: 0)

Return type:

Union[ndarray, AnnotationResult]
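
A minimal usage sketch on synthetic arrays (illustrative only; majority voting is left disabled so no over-clustering input is needed):

    import numpy as np
    from dance.modules.single_modality.cell_type_annotation import Celltypist

    # Synthetic data: 200 cells x 100 genes, 4 cell types as integer indices.
    x = np.random.rand(200, 100)
    labels = np.random.randint(0, 4, size=200)

    model = Celltypist(majority_voting=False)
    model.fit(x, labels=labels, max_iter=100)
    pred = model.predict(x)  # 1-D array of predicted cell-label indices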

class dance.modules.single_modality.cell_type_annotation.SVM(args, prj_path='./', random_state=None)[source]

The SVM cell-type classification model.

Parameters:
  • args (argparse.Namespace) – A Namespace containing the arguments of SVM. See the parser help document for more info.

  • prj_path (str) – Project path.

  • random_state (int | None) – Random state.

fit(x, y)[source]

Train the classifier.

Parameters:
  • x (ndarray) – Training cell features.

  • y (ndarray) – Training labels.

predict(x)[source]

Predict cell labels.

Parameters:

x (ndarray) – Samples to be predicted (samples x features).

Returns:

Predicted labels of the input samples.

Return type:

y

save(num, pred)[source]

Save the predictions.

Parameters:
  • num (int) – Test file name.

  • pred (dict) – Prediction labels.
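
A hedged usage sketch; the exact fields that args must carry are defined by the accompanying example script's argument parser (see its help document), so the empty Namespace below is only a placeholder:

    import argparse
    import numpy as np
    from dance.modules.single_modality.cell_type_annotation import SVM

    args = argparse.Namespace()  # placeholder; fill with the parser's arguments
    model = SVM(args, random_state=42)

    x = np.random.rand(150, 80)            # 150 cells x 80 features
    y = np.random.randint(0, 3, size=150)  # 3 cell types
    model.fit(x, y)
    pred = model.predict(x)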

class dance.modules.single_modality.cell_type_annotation.ScDeepSort(dim_in, dim_hid, num_layers, species, tissue, *, dropout=0, batch_size=500, device='cpu')[source]

The ScDeepSort cell-type annotation model.

Parameters:
  • dim_in (int) – Input dimension, i.e., the number of PCA components used for cell and gene features.

  • dim_hid (int) – Hidden dimension.

  • num_layers (int) – Number of convolution layers.

  • species (str) – Species name (only used for determining the read/write path).

  • tissue (str) – Tissue name (only used for determining the read/write path).

  • dropout (int) – Dropout rate.

  • batch_size (int) – Batch size.

  • device (str) – Computation device, e.g., ‘cpu’, ‘cuda’.

cal_loss(graph, idx)[source]

Calculate loss.

Parameters:
  • graph (DGLGraph) – Input cell-gene graph object.

  • idx (Tensor) – 1-D tensor containing the indices of the cell nodes for which to calculate the loss.

Returns:

Averaged loss over all batches.

Return type:

float

evaluate(graph, idx, unsure_rate=2.0)[source]

Evaluate the model on certain cell nodes.

Parameters:
  • graph (DGLGraph) – Input cell-gene graph.

  • idx (Tensor) – 1-D tensor containing the indices of the cell nodes to be evaluated.

  • unsure_rate (float) – Determines the threshold of the maximum predicted probability below which predictions are considered unsure.

Returns:

The total number of correct predictions, the total number of unsure predictions, and the accuracy score.

Return type:

Tuple[int, int, float]

fit(graph, labels, epochs=300, lr=0.001, weight_decay=0, val_ratio=0.2)[source]

Train the scDeepSort model.

Parameters:
  • graph (DGLGraph) – Training graph.

  • labels (Tensor) – Node (cell, gene) labels, -1 for genes.

  • epochs (int) – Number of epochs to train the model.

  • lr (float) – Learning rate.

  • weight_decay (float) – Weight decay regularization strength.

  • val_ratio (float) – Ratio of the training data to hold out for validation.

load_model()[source]

Load the model from the model path.

predict(graph, unsure_rate=2.0, return_unsure=False)[source]

Perform prediction on all test datasets.

Parameters:
  • graph (DGLGraph) – Input cell-gene graph to be predicted.

  • unsure_rate (float) – Determines the threshold of the maximum predicted probability below which predictions are considered uncertain.

  • return_unsure (bool) – If set to True, also return an indicator array marking which predictions are uncertain.

predict_proba(graph)[source]

Perform inference on a test dataset.

Parameters:

graph (DGLGraph) – Input cell-gene graph to be predicted.

Returns:

2-D array of predicted probabilities of the cell-types, where rows are cells and columns are cell-types.

Return type:

np.ndarray

save_model()[source]

Save the model at the save_path.
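
A call-pattern sketch (not self-contained: graph and labels are assumed to be a preprocessed cell-gene DGLGraph and the per-node label tensor described in fit(), typically produced by the dance data pipeline):

    from dance.modules.single_modality.cell_type_annotation import ScDeepSort

    model = ScDeepSort(dim_in=400, dim_hid=200, num_layers=2,
                       species="mouse", tissue="Brain", device="cpu")
    model.fit(graph, labels, epochs=100)          # holds out 20% for validation
    pred = model.predict(graph, unsure_rate=2.0)  # labels for the cell nodes
    proba = model.predict_proba(graph)            # cells x cell-types probabilities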

class dance.modules.single_modality.cell_type_annotation.SingleCellNet(num_trees=100)[source]

The SingleCellNet model.

Parameters:

num_trees (int) – Number of trees in the random forest model.

fit(x, y, num_rand=100, stratify=True, random_state=100)[source]

Train the SingleCellNet random forest model.

Parameters:
  • x – Input features.

  • y – Labels.

  • stratify (bool) – Whether to use balanced class weights in the random forest model.

  • random_state (Optional[int]) – Random state.

  • num_rand (int) – Number of randomized profiles to generate for training.

predict(x)[source]

Predict cell type label.

Parameters:

x – Input features.

Returns:

The most likely cell-type label of each sample.

Return type:

np.ndarray

predict_proba(x)[source]

Calculate predicted probabilities.

Parameters:

x – Input features.

Returns:

Cell-type probability matrix where each row is a cell and each column is a cell-type. The values in the matrix indicate the predicted probability that the cell is a particular cell-type. The last column corresponds to the probability that the model could not confidently identify the cell type of the cell.

Return type:

np.ndarray

randomize(exp, num=50)[source]

Return randomized features.

Parameters:
  • exp – Data to be shuffled.

  • num (int) – Number of samples to return.
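
A minimal usage sketch on synthetic arrays (illustrative only):

    import numpy as np
    from dance.modules.single_modality.cell_type_annotation import SingleCellNet

    x = np.random.rand(300, 120)            # 300 cells x 120 features
    y = np.random.randint(0, 5, size=300)   # 5 cell types

    model = SingleCellNet(num_trees=100)
    model.fit(x, y, num_rand=100, stratify=True, random_state=100)
    pred = model.predict(x)          # most likely cell type per cell
    proba = model.predict_proba(x)   # last column: "unknown" probability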

Clustering

class dance.modules.single_modality.clustering.GraphSC(agg='sum', activation='relu', in_feats=50, n_hidden=1, hidden_dim=200, hidden_1=300, hidden_2=0, dropout=0.1, n_layers=1, hidden_relu=False, hidden_bn=False, n_clusters=10, cluster_method='kmeans', num_workers=1, device='auto')[source]

GraphSC class.

Parameters:
  • agg (str) – Aggregation layer.

  • activation (str) – Activation function.

  • in_feats (int) – Dimension of input features.

  • n_hidden (int) – Number of hidden layers.

  • hidden_dim (int) – Input dimension of hidden layer 1.

  • hidden_1 (int) – Output dimension of hidden layer 1.

  • hidden_2 (int) – Output dimension of hidden layer 2.

  • dropout (float) – Dropout rate.

  • n_layers (int) – Number of graph convolutional layers.

  • hidden_relu (bool) – Use relu activation in hidden layers or not.

  • hidden_bn (bool) – Use batch norm in hidden layers or not.

  • cluster_method (Literal['kmeans', 'leiden']) – Method for clustering.

  • num_workers (int) – Number of workers.

  • device (str) – Computation device to use.

  • n_clusters (int) – Number of clusters.

fit(g, y=None, *, epochs=100, lr=1e-05, batch_size=128, show_epoch_ari=False, eval_epoch=False)[source]

Train graph-sc.

Parameters:
  • g (DGLGraph) – Input cell-gene graph.

  • y (Optional[Any]) – Not used, for compatibility with the BaseClusteringMethod class.

  • epochs (int) – Number of epochs.

  • lr (float) – Learning rate.

  • batch_size (int) – Batch size.

  • show_epoch_ari (bool) – Show the ARI score for each epoch.

  • eval_epoch (bool) – Evaluate every epoch.

predict(x=None)[source]

Get predictions from the graph autoencoder model.

Parameters:

x (Optional[Any]) – Not used, for compatibility with BaseClusteringMethod class.

Returns:

Predictions from the selected clustering method.

Return type:

pred
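
A call-pattern sketch (not self-contained: g is assumed to be a preprocessed cell-gene DGLGraph, e.g. built by the dance graph-construction transforms):

    from dance.modules.single_modality.clustering import GraphSC

    model = GraphSC(in_feats=50, n_clusters=10, cluster_method="kmeans",
                    device="cpu")
    model.fit(g, epochs=100, lr=1e-5, batch_size=128)
    pred = model.predict()  # cluster assignment for each cell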

class dance.modules.single_modality.clustering.ScDCC(input_dim, z_dim, n_clusters, encodeLayer, decodeLayer, activation='relu', sigma=1.0, alpha=1.0, gamma=1.0, ml_weight=1.0, cl_weight=1.0, device='auto', pretrain_path=None)[source]

ScDCC class.

Parameters:
  • input_dim (int) – Dimension of encoder input.

  • z_dim (int) – Dimension of embedding.

  • n_clusters (int) – Number of clusters.

  • encodeLayer (List[int]) – Dimensions of encoder layers.

  • decodeLayer (List[int]) – Dimensions of decoder layers.

  • activation (str) – Activation function.

  • sigma (float) – Parameter of Gaussian noise.

  • alpha (float) – Parameter of soft assign.

  • gamma (float) – Parameter of cluster loss.

  • ml_weight (float) – Parameter of must-link loss.

  • cl_weight (float) – Parameter of cannot-link loss.

  • device (str) – Computation device.

  • pretrain_path (str | None) – Path to save the pretrained autoencoder weights. If not specified, do not save/load.

cluster_loss(p, q)[source]

Calculate cluster loss.

Parameters:
  • p – Target distribution.

  • q – Soft label.

Returns:

Cluster loss.

Return type:

loss

encodeBatch(X, batch_size=256)[source]

Batch encoder.

Parameters:
  • X – Input features.

  • batch_size – Size of batch.

Returns:

Embedding.

Return type:

encoded

fit(inputs, y=None, ml_ind1=array([], dtype=float64), ml_ind2=array([], dtype=float64), cl_ind1=array([], dtype=float64), cl_ind2=array([], dtype=float64), ml_p=1.0, cl_p=1.0, lr=1.0, batch_size=256, epochs=10, update_interval=1, tol=0.001, pt_batch_size=256, pt_lr=0.001, pt_epochs=400)[source]

Train model.

Parameters:
  • inputs (Tuple[ndarray, ndarray, ndarray]) – A tuple containing (1) the input features, (2) the raw input features, and (3) the total counts per cell.

  • y (Optional[ndarray]) – True label. Used for model selection.

  • ml_ind1 (ndarray) – Index 1 of must-link pairs.

  • ml_ind2 (ndarray) – Index 2 of must-link pairs.

  • cl_ind1 (ndarray) – Index 1 of cannot-link pairs.

  • cl_ind2 (ndarray) – Index 2 of cannot-link pairs.

  • ml_p (float) – Parameter of must-link loss.

  • cl_p (float) – Parameter of cannot-link loss.

  • lr (float) – Learning rate.

  • batch_size (int) – Size of batch.

  • epochs (int) – Number of epochs.

  • update_interval (int) – Update interval of soft label and target distribution.

  • tol (float) – Tolerance for training loss.

  • pt_batch_size (int) – Pretrain batch size.

  • pt_lr (float) – Pretrain learning rate.

  • pt_epochs (int) – Pretrain epochs.
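
A minimal fit/predict sketch on synthetic count data, illustrating the (features, raw features, per-cell totals) input tuple; constraint pairs are omitted, so the must-link/cannot-link defaults (empty arrays) apply:

    import numpy as np
    from dance.modules.single_modality.clustering import ScDCC

    x_raw = np.random.poisson(1.0, size=(200, 100)).astype(np.float32)
    x = np.log1p(x_raw)       # stand-in for normalized features
    n_counts = x_raw.sum(1)   # total counts per cell

    model = ScDCC(input_dim=100, z_dim=32, n_clusters=5,
                  encodeLayer=[256, 64], decodeLayer=[64, 256], device="cpu")
    model.fit((x, x_raw, n_counts), epochs=5, pt_epochs=5)
    pred = model.predict()  # cluster assignment for each cell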

forward(x)[source]

Forward propagation.

Parameters:

x – Input features.

Returns:

  • z0 – Embedding.

  • q – Soft label.

  • _mean – Data mean from ZINB.

  • _disp – Data dispersion from ZINB.

  • _pi – Data dropout probability from ZINB.

pairwise_loss(p1, p2, cons_type)[source]

Calculate pairwise loss.

Parameters:
  • p1 – Distribution 1.

  • p2 – Distribution 2.

  • cons_type – Type of loss.

Returns:

Pairwise loss.

Return type:

loss

predict(x=None)[source]

Get predictions from the trained model.

Parameters:

x (Optional[Any]) – Not used, for compatibility with the BaseClusteringMethod class.

Returns:

Predicted clustering assignment for each cell.

Return type:

pred

predict_proba(x=None)[source]

Get the predicted probabilities for each cell.

Parameters:

x (Optional[Any]) – Not used, for compatibility with the BaseClusteringMethod class.

Returns:

Predicted probability for each cell.

Return type:

pred_prop

pretrain(x, X_raw, n_counts, batch_size=256, lr=0.001, epochs=400)[source]

Pretrain autoencoder.

Parameters:
  • x – Input features.

  • X_raw – Raw input features.

  • n_counts – Total counts for each cell.

  • batch_size – Size of batch.

  • lr – Learning rate.

  • epochs – Number of epochs.

soft_assign(z)[source]

Soft assign q with z.

Parameters:

z – Embedding.

Returns:

Soft label.

Return type:

q

target_distribution(q)[source]

Calculate auxiliary target distribution p with q.

Parameters:

q – Soft label.

Returns:

Target distribution.

Return type:

p
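
For reference, soft_assign and target_distribution follow the standard DEC-style formulation (a sketch, assuming embedding z_i, cluster center mu_j, and the soft-assignment parameter alpha playing the role of the Student's t degrees of freedom):

    q_{ij} = \frac{\left(1 + \lVert z_i - \mu_j \rVert^2 / \alpha\right)^{-\frac{\alpha+1}{2}}}
                  {\sum_{j'} \left(1 + \lVert z_i - \mu_{j'} \rVert^2 / \alpha\right)^{-\frac{\alpha+1}{2}}},
    \qquad
    p_{ij} = \frac{q_{ij}^2 / f_j}{\sum_{j'} q_{ij'}^2 / f_{j'}},
    \quad f_j = \sum_i q_{ij}

cluster_loss then penalizes the Kullback-Leibler divergence KL(P || Q), weighted by gamma, which sharpens the soft assignments toward high-confidence clusters.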

class dance.modules.single_modality.clustering.ScDSC(pretrain_path, sigma=1, n_enc_1=512, n_enc_2=256, n_enc_3=256, n_dec_1=256, n_dec_2=256, n_dec_3=512, n_z1=256, n_z2=128, n_z3=32, n_clusters=100, n_input=10, v=1, device='auto')[source]

ScDSC wrapper class.

Parameters:
  • pretrain_path (str) – Path of saved autoencoder weights.

  • sigma (float) – Balance parameter.

  • n_enc_1 (int) – Output dimension of encoder layer 1.

  • n_enc_2 (int) – Output dimension of encoder layer 2.

  • n_enc_3 (int) – Output dimension of encoder layer 3.

  • n_dec_1 (int) – Output dimension of decoder layer 1.

  • n_dec_2 (int) – Output dimension of decoder layer 2.

  • n_dec_3 (int) – Output dimension of decoder layer 3.

  • n_z1 (int) – Output dimension of hidden layer 1.

  • n_z2 (int) – Output dimension of hidden layer 2.

  • n_z3 (int) – Output dimension of hidden layer 3.

  • n_clusters (int) – Number of clusters.

  • n_input (int) – Input feature dimension.

  • v (float) – Parameter of soft assignment.

  • device (str) – Computing device.

fit(inputs, y, lr=0.001, epochs=300, bcl=0.1, cl=0.01, rl=1, zl=0.1, pt_epochs=200, pt_batch_size=256, pt_lr=0.001)[source]

Train model.

Parameters:
  • inputs (Tuple[spmatrix, ndarray, ndarray, Series]) – A tuple containing (1) the adjacency matrix, (2) the input features, (3) the raw input features, and (4) the total counts for each cell.

  • y (ndarray) – Label.

  • lr (float) – Learning rate.

  • epochs (int) – Number of epochs.

  • bcl (float) – Parameter of binary crossentropy loss.

  • cl (float) – Parameter of Kullback–Leibler divergence loss.

  • rl (float) – Parameter of reconstruction loss.

  • zl (float) – Parameter of ZINB loss.

  • pt_epochs (int) – Pretraining epochs.

  • pt_batch_size (int) – Pretraining batch size.

  • pt_lr (float) – Pretraining learning rate.

predict(x=None)[source]

Get predictions from the trained model.

Parameters:

x (Optional[Any]) – Not used, for compatibility with the BaseClusteringMethod class.

Returns:

Predicted clustering assignment for each cell.

Return type:

pred

predict_proba(x=None)[source]

Get the predicted probabilities for each cell.

Parameters:

x (Optional[Any]) – Not used, for compatibility with the BaseClusteringMethod class.

Returns:

Predicted probability for each cell.

Return type:

pred_prop

pretrain(x, batch_size=256, epochs=200, lr=0.001)[source]

Pretrain autoencoder.

Parameters:
  • x – Input features.

  • batch_size – Size of batch.

  • epochs – Number of epochs.

  • lr – Learning rate.

target_distribution(q)[source]

Calculate auxiliary target distribution p with q.

Parameters:

q – Soft label.

Returns:

Target distribution.

Return type:

p
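
A minimal fit/predict sketch on synthetic data, illustrating the (adjacency, features, raw features, per-cell totals) input tuple; the checkpoint path and hyperparameters are placeholders:

    import numpy as np
    import pandas as pd
    import scipy.sparse as sp
    from dance.modules.single_modality.clustering import ScDSC

    x_raw = np.random.poisson(1.0, size=(200, 100)).astype(np.float32)
    x = np.log1p(x_raw)
    adj = sp.random(200, 200, density=0.05, format="csr")  # toy cell graph
    n_counts = pd.Series(x_raw.sum(1))
    y = np.random.randint(0, 5, size=200)  # labels, used for model selection

    model = ScDSC(pretrain_path="scdsc_pretrain.pt", n_input=100,
                  n_clusters=5, device="cpu")
    model.fit((adj, x, x_raw, n_counts), y, epochs=5, pt_epochs=5)
    pred = model.predict()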

class dance.modules.single_modality.clustering.ScDeepCluster(input_dim, z_dim, encodeLayer=[], decodeLayer=[], activation='relu', sigma=1.0, alpha=1.0, gamma=1.0, device='cuda', pretrain_path=None)[source]

ScDeepCluster class.

Parameters:
  • input_dim – Dimension of encoder input.

  • z_dim – Dimension of embedding.

  • encodeLayer – Dimensions of encoder layers.

  • decodeLayer – Dimensions of decoder layers.

  • activation – Activation function.

  • sigma – Parameter of Gaussian noise.

  • alpha – Parameter of soft assign.

  • gamma – Parameter of cluster loss.

  • device – Computing device.

  • pretrain_path (Optional[str]) – Path to pretrained weights.

cluster_loss(p, q)[source]

Calculate cluster loss.

Parameters:
  • p – Target distribution.

  • q – Soft label.

Returns:

Cluster loss.

Return type:

loss

encodeBatch(x, batch_size=256)[source]

Batch encoder.

Parameters:
  • x – Input features.

  • batch_size – Size of batch.

Returns:

Embedding.

Return type:

encoded

fit(inputs, y, n_clusters=10, init_centroid=None, y_pred_init=None, lr=1, batch_size=256, epochs=10, update_interval=1, tol=0.001, pt_batch_size=256, pt_lr=0.001, pt_epochs=400)[source]

Train model.

Parameters:
  • inputs (Tuple[ndarray, ndarray, ndarray]) – A tuple containing (1) the input features, (2) the raw input features, and (3) the total counts per cell.

  • y (ndarray) – True label. Used for model selection.

  • n_clusters (int) – Number of clusters.

  • init_centroid (Optional[List[int]]) – Initialization of centroids. If None, perform kmeans to initialize cluster centers.

  • y_pred_init (Optional[List[int]]) – Predicted label for initialization.

  • lr (float) – Learning rate.

  • batch_size (int) – Size of batch.

  • epochs (int) – Number of epochs.

  • update_interval (int) – Update interval of soft label and target distribution.

  • tol (float) – Tolerance for training loss.

  • pt_batch_size (int) – Pretraining batch size.

  • pt_lr (float) – Pretraining learning rate.

  • pt_epochs (int) – Pretraining epochs.

forward(x)[source]

Forward propagation.

Parameters:

x – Input features.

Returns:

  • z0 – Embedding.

  • q – Soft label.

  • _mean – Data mean from ZINB.

  • _disp – Data dispersion from ZINB.

  • _pi – Data dropout probability from ZINB.

forwardAE(x)[source]

Forward propagation of autoencoder.

Parameters:

x – Input features.

Returns:

  • z0 – Embedding.

  • _mean – Data mean from ZINB.

  • _disp – Data dispersion from ZINB.

  • _pi – Data dropout probability from ZINB.

load_model(path)[source]

Load model from path.

Parameters:

path – Path to load model.

predict(x=None)[source]

Get predictions from the trained model.

Parameters:

x (Optional[Any]) – Not used, for compatibility with the BaseClusteringMethod class.

Returns:

Predicted clustering assignment for each cell.

Return type:

pred

predict_proba(x=None)[source]

Get the predicted probabilities for each cell.

Parameters:

x (Optional[Any]) – Not used, for compatibility with the BaseClusteringMethod class.

Returns:

Predicted probability for each cell.

Return type:

pred_prop

pretrain(x, x_raw, n_counts, batch_size=256, lr=0.001, epochs=400)[source]

Pretrain autoencoder.

Parameters:
  • x – Input features.

  • x_raw – Raw input features.

  • n_counts – Total counts for each cell.

  • batch_size – Size of batch.

  • lr – Learning rate.

  • epochs – Number of epochs.

save_model(path)[source]

Save model to path.

Parameters:

path – Path to save model.

soft_assign(z)[source]

Soft assign q with z.

Parameters:

z – Embedding.

Returns:

Soft label.

Return type:

q

target_distribution(q)[source]

Calculate auxiliary target distribution p with q.

Parameters:

q – Soft label.

Returns:

Target distribution.

Return type:

p
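
A minimal sketch showing training followed by checkpointing via save_model (synthetic data, illustrative settings; the checkpoint path is a placeholder):

    import numpy as np
    from dance.modules.single_modality.clustering import ScDeepCluster

    x_raw = np.random.poisson(1.0, size=(200, 100)).astype(np.float32)
    x = np.log1p(x_raw)
    n_counts = x_raw.sum(1)
    y = np.random.randint(0, 5, size=200)  # labels, used for model selection

    model = ScDeepCluster(input_dim=100, z_dim=32, encodeLayer=[256, 64],
                          decodeLayer=[64, 256], device="cpu")
    model.fit((x, x_raw, n_counts), y, n_clusters=5, epochs=5, pt_epochs=5)
    model.save_model("scdeepcluster.pt")  # reload later with load_model(...)
    pred = model.predict()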

class dance.modules.single_modality.clustering.ScTAG(n_clusters, k=3, hidden_dim=128, latent_dim=15, dec_dim=None, dropout=0.2, device='cuda', alpha=1.0, pretrain_path=None)[source]

The scTAG clustering model.

Parameters:
  • n_clusters (int) – Number of clusters.

  • k (int) – Number of hops of TAG convolutional layer.

  • hidden_dim (int) – Dimension of hidden layer.

  • latent_dim (int) – Dimension of latent embedding.

  • dec_dim (Optional[int]) – Dimensions of decoder layers.

  • dropout (float) – Dropout rate.

  • device (str) – Computing device.

  • alpha (float) – Parameter of soft assign.

  • pretrain_path (Optional[str]) – Path to save the pretrained autoencoder. If not specified, then do not save/load.

fit(inputs, y, *, epochs=300, pretrain_epochs=200, lr=0.0005, w_a=0.3, w_x=1, w_c=1.5, w_d=0, info_step=1, max_dist=20, min_dist=0.5, force_pretrain=False)[source]

Train the model.

Parameters:
  • inputs (Tuple[ndarray, ndarray, ndarray, ndarray]) – A tuple containing the adjacency matrix, the input feature, the raw input feature, and the total counts per cell array.

  • epochs (int) – Number of epochs.

  • lr (float) – Learning rate.

  • w_a (float) – Parameter of reconstruction loss.

  • w_x (float) – Parameter of ZINB loss.

  • w_c (float) – Parameter of clustering loss.

  • w_d (float) – Parameter of pairwise distance loss.

  • info_step (int) – Interval of showing pretraining loss.

  • min_dist (float) – Minimum distance of pairwise distance loss.

  • max_dist (float) – Maximum distance of pairwise distance loss.

  • force_pretrain (bool) – If set to True, pre-train the model even if pre-training has already been done or a pre-trained model file is available to load.

  • y (ndarray) – True label. Used for model selection.

  • pretrain_epochs (int) – Number of pretraining epochs.

forward(g, x_input)[source]

Forward propagation.

Parameters:
  • g – Input graph.

  • x_input – Input features.

Returns:

  • adj_out – Reconstructed adjacency matrix.

  • z – Embedding.

  • q – Soft label.

  • _mean – Data mean from ZINB.

  • _disp – Data dispersion from ZINB.

  • _pi – Data dropout probability from ZINB.

init_model(adj, x)[source]

Initialize model.

Parameters:
  • adj (ndarray) – Adjacency matrix.

  • x (ndarray) – Input features.

predict(x=None)[source]

Get predictions from the trained model.

Parameters:

x (Optional[Any]) – Not used, for compatibility with the base module class.

Returns:

Predicted clustering assignment for each cell.

Return type:

pred

predict_proba(x=None)[source]

Get predicted probabilities for each cell.

Parameters:

x (Optional[Any]) – Not used, for compatibility with the base module class.

Returns:

Predicted probabilities for each cell.

Return type:

pred_prob

pretrain(adj, x, x_raw, n_counts, *, epochs=1000, info_step=10, lr=0.0005, w_a=0.3, w_x=1, w_d=0, min_dist=0.5, max_dist=20, force_pretrain=False)[source]

Pretrain autoencoder.

Parameters:
  • adj – Adjacency matrix.

  • x – Input features.

  • x_raw – Raw input features.

  • n_counts – Total counts for each cell.

  • epochs (int) – Number of epochs.

  • info_step (int) – Interval of showing pretraining loss.

  • lr (float) – Learning rate.

  • w_a (float) – Parameter of reconstruction loss.

  • w_x (float) – Parameter of ZINB loss.

  • w_d (float) – Parameter of pairwise distance loss.

  • min_dist (float) – Minimum distance of pairwise distance loss.

  • max_dist (float) – Maximum distance of pairwise distance loss.

  • force_pretrain (bool) – If set to True, pre-train the model even if pre-training has already been done or a pre-trained model file is available to load.

soft_assign(z)[source]

Soft assign q with z.

Parameters:

z – Embedding.

Returns:

Soft label.

Return type:

q

target_distribution(q)[source]

Calculate auxiliary target distribution p with q.

Parameters:

q – Soft label.

Returns:

Target distribution.

Return type:

p
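
A minimal fit/predict sketch on synthetic arrays, illustrating the (adjacency, features, raw features, per-cell totals) input tuple; since pretrain_path is left unset, the pretrained autoencoder is neither saved nor loaded:

    import numpy as np
    from dance.modules.single_modality.clustering import ScTAG

    x_raw = np.random.poisson(1.0, size=(200, 100)).astype(np.float32)
    x = np.log1p(x_raw)
    adj = (np.random.rand(200, 200) < 0.05).astype(np.float32)  # toy graph
    n_counts = x_raw.sum(1)
    y = np.random.randint(0, 5, size=200)  # labels, used for model selection

    model = ScTAG(n_clusters=5, k=3, latent_dim=15, device="cpu")
    model.fit((adj, x, x_raw, n_counts), y, epochs=5, pretrain_epochs=5)
    pred = model.predict()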

Imputation

class dance.modules.single_modality.imputation.DeepImpute(predictors, targets, dataset, sub_outputdim=512, hidden_dim=256, dropout=0.2, seed=1, gpu=-1)[source]

DeepImpute class.

Parameters:
  • learning_rate (float optional) – Learning rate.

  • batch_size (int optional) – Batch size.

  • max_epochs (int optional) – Maximum number of epochs.

  • patience (int optional) – Number of epochs to wait before stopping once the loss stops improving.

  • gpu (int optional) – Option to use GPU.

  • loss (string optional) – Loss function.

  • output_prefix (string optional) – Directory to save outputs.

  • sub_outputdim (int optional) – Output dimension of each subnetwork.

  • hidden_dim (int optional) – Dimension of the dense layer in each subnetwork.

  • verbose (int optional) – Verbosity option.

  • seed (int optional) – Random seed.

  • architecture (optional) – Network architecture.

  • imputed_only (boolean optional) – Whether to return imputed genes only.

  • policy (string optional) – Imputation policy.

build(inputdims, outputdims, device='cpu')[source]

Build model.

Parameters:
  • inputdims (int) – Number of input neurons in the first layer of each subnetwork.

  • outputdims (int) – Number of output neurons in each subnetwork.

  • device (str) – Computation device.

Returns:

models – Array of subnetworks.

Return type:

array

fit(X, Y, mask=None, batch_size=64, lr=0.001, n_epochs=100, patience=5, train_idx=None)[source]

Train model.

Parameters:
  • X – Training data containing the input (predictor) genes.

  • Y – Training data containing the target genes to be imputed.

  • mask (optional) – Mask selecting the entries used for training.

  • batch_size (int optional) – Batch size.

  • lr (float optional) – Learning rate.

  • n_epochs (int optional) – Maximum number of training epochs.

  • patience (int optional) – Number of epochs to wait before early stopping once the loss stops improving.

  • train_idx (optional) – Indices of the training cells.

Returns:

Return type:

None

load_model(model, i)[source]

Load model.

Parameters:
  • model – Model to be loaded.

  • i (int) – Index of the subnetwork to be loaded.

Returns:

Loaded model.

Return type:

model

predict(X_test, mask=None, test_idx=None, predict_raw=False)[source]

Get predictions from the trained model.

Parameters:

  • X_test – Test data containing the predictor genes.

  • mask (optional) – Mask selecting the entries to impute.

  • test_idx (optional) – Indices of the testing cells.

  • predict_raw (bool optional) – If set to True, return the raw model predictions without merging them back into the original expression values.

Returns:

imputed – Imputed gene expression.

Return type:

DataFrame

save_model(model, optimizer, i)[source]

Save model.

Parameters:
  • model – Model to be saved.

  • optimizer – Optimizer.

  • i (int) – Index of the subnetwork to be saved.

Returns:

Return type:

None

score(true_expr, imputed_expr, mask=None, metric='MSE', test_idx=None)[source]

Scoring function of model.

Parameters:
  • true_expr – True underlying expression values.

  • imputed_expr – Imputed expression values.

  • test_idx – Indices of the testing cells.

  • metric – Choice of scoring metric: 'RMSE' or 'ARI'.

Returns:

Evaluation score.

Return type:

score

wMSE(y_true, y_pred, binary=False)[source]

Weighted MSE.

Parameters:
  • y_true (array) – true expression

  • y_pred (array) – Predicted expression.

  • binary (boolean optional) – whether to use binary weights

Returns:

val – weighted MSE

Return type:

float
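
A hedged sketch of the fit/predict cycle; how predictors and targets are constructed per subnetwork is dataset-dependent, so the even split below is only a placeholder assumption:

    import numpy as np
    from dance.modules.single_modality.imputation import DeepImpute

    X = np.random.rand(200, 500)  # 200 cells x 500 genes
    # Placeholder: two subnetworks, each imputing one half of the genes
    # from the other half (real pipelines pick predictors by correlation).
    targets = [np.arange(0, 250), np.arange(250, 500)]
    predictors = [np.arange(250, 500), np.arange(0, 250)]

    model = DeepImpute(predictors, targets, dataset="synthetic",
                       sub_outputdim=250, hidden_dim=128, seed=1, gpu=-1)
    model.fit(X, X, batch_size=64, lr=1e-3, n_epochs=5)
    imputed = model.predict(X)  # DataFrame of imputed expression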

class dance.modules.single_modality.imputation.GraphSCI(num_cells, num_genes, dataset, dropout=0.1, gpu=-1, seed=1)[source]

GraphSCI model, a combination of an autoencoder (AE) and a graph neural network (GNN).

Parameters:
  • num_cells (int) – Number of cells in the expression data.

  • num_genes (int) – Number of genes in the expression data.

  • dataset (str) – Name of the training dataset.

  • n_epochs (int optional) – Number of training epochs.

  • lr (float optional) – Learning rate.

  • weight_decay (float optional) – Weight decay rate.

  • dropout (float optional) – Probability of weight dropout during training.

  • gpu (int optional) – Index of the computing device; -1 for CPU.

evaluate(features, features_raw, graph, mask=None, le=1, la=1, ke=1, ka=1)[source]

Evaluate the model; returns the loss and the reconstructions of the expression and adjacency matrices.

Parameters:
  • features – Input features.

  • features_raw – Raw input features.

  • graph – Gene graph, carrying the normalized and original adjacency matrices.

  • mask (optional) – Mask selecting the entries to evaluate.

  • le (float optional) – Parameter of the expression loss.

  • la (float optional) – Parameter of the adjacency loss.

  • ke (float optional) – Parameter of the KL divergence of the expression.

  • ka (float optional) – Parameter of the KL divergence of the adjacency.

fit(train_data, train_data_raw, graph, mask=None, le=1, la=1, ke=1, ka=1, n_epochs=100, lr=0.001, weight_decay=1e-05, train_idx=None)[source]

Fit the model to the training data.

Parameters:
  • train_data – Input training features.

  • train_data_raw – Raw input training features.

  • graph – Gene graph, carrying the normalized and original adjacency matrices.

  • mask (optional) – Mask selecting the entries used for training.

  • le (float optional) – Parameter of the expression loss.

  • la (float optional) – Parameter of the adjacency loss.

  • ke (float optional) – Parameter of the KL divergence of the expression.

  • ka (float optional) – Parameter of the KL divergence of the adjacency.

  • n_epochs (int optional) – Number of training epochs.

  • lr (float optional) – Learning rate.

  • weight_decay (float optional) – Weight decay rate.

  • train_idx (optional) – Indices of the training cells.

Returns:

Return type:

None

get_loss(batch, adj_orig, z_adj, z_adj_log_std, z_adj_mean, z_exp, mean, disp, pi, mask, le=1, la=1, ke=1, ka=1)[source]

Loss function for GraphSCI.

Parameters:
  • batch – Batch features.

  • adj_orig – Original adjacency matrix of the gene graph.

  • z_adj – Reconstructed adjacency matrix.

  • z_adj_log_std – Log standard deviation of the distribution of z_adj.

  • z_adj_mean – Mean of the distribution of z_adj.

  • z_exp – Reconstruction of the expression values.

  • mean – Mean parameter of the ZINB distribution of z_exp.

  • disp – Dispersion parameter of the ZINB distribution of z_exp.

  • pi – Dropout parameter of the ZINB distribution of z_exp.

  • mask (optional) – Mask selecting the entries used in the loss.

  • le (float optional) – Parameter of the expression loss.

  • la (float optional) – Parameter of the adjacency loss.

  • ke (float optional) – Parameter of the KL divergence of the expression.

  • ka (float optional) – Parameter of the KL divergence of the adjacency.

Returns:

  • loss_adj (float) – Loss of the adjacency reconstruction.

  • loss_exp (float) – Loss of the expression reconstruction.

  • log_lik (float) – Log-likelihood loss value.

  • kl (float) – Kullback-Leibler loss.

  • loss (float) – log_lik - kl.

load_model()[source]

Load the model.

predict(data, data_raw, graph, mask=None)[source]

Predict the imputed expression values.

Parameters:
  • data – Input true expression data.

  • data_raw – Raw input expression data.

  • graph – Gene graph, carrying the normalized and original adjacency matrices.

  • mask (optional) – Mask selecting the entries to impute.

Returns:

Reconstructed expression data.

Return type:

z_exp

save_model()[source]

Save the model; saves both the AE and the GNN.

score(true_expr, imputed_expr, mask=None, metric='MSE', log1p=True, test_idx=None)[source]

Scoring function of model.

Parameters:
  • true_expr – True underlying expression values.

  • imputed_expr – Imputed expression values.

  • test_idx – Indices of the testing cells.

  • metric – Choice of scoring metric: 'RMSE' or 'ARI'.

  • log1p – Whether to log1p-transform the expression values before scoring.

Returns:

Evaluation score.

Return type:

score

train(train_data, train_data_raw, graph, train_mask, valid_mask, le=1, la=1, ke=1, ka=1)[source]

Training step: computes the loss and performs an optimization step.

Parameters:
  • train_data – Input training features.

  • train_data_raw – Raw input training features.

  • graph – Gene graph, carrying the normalized and original adjacency matrices.

  • train_mask – Mask selecting the training entries.

  • valid_mask – Mask selecting the validation entries.

  • le (float optional) – Parameter of the expression loss.

  • la (float optional) – Parameter of the adjacency loss.

  • ke (float optional) – Parameter of the KL divergence of the expression.

  • ka (float optional) – Parameter of the KL divergence of the adjacency.

Returns:

total_loss – Loss value of the training loop.

Return type:

float
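
A call-pattern sketch (not self-contained: data, data_raw, and the gene graph are assumed to be prepared by the dance preprocessing pipeline):

    from dance.modules.single_modality.imputation import GraphSCI

    model = GraphSCI(num_cells=200, num_genes=500, dataset="synthetic", gpu=-1)
    model.fit(data, data_raw, graph, n_epochs=50, lr=1e-3)
    imputed = model.predict(data, data_raw, graph)
    mse = model.score(data_raw, imputed, metric="MSE")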