dance.transforms

class dance.transforms.base.BaseTransform(out=None, log_level='WARNING')[source]

BaseTransform abstract object.

Parameters:

log_level (Literal['NOTSET', 'DEBUG', 'INFO', 'WARNING', 'ERROR']) – Logging level.
out (Optional[str]) – Name of the obsm channel or layer where the transformed features will be saved. Use the current transformation name if it is not set.

hexdigest()[source]

Return MD5 hash using the representation of the transform object.

Return type: str
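As a rough illustration of the idea (not the dance implementation), an MD5 digest can be derived from an object's repr() string. The ToyTransform class and its repr format below are hypothetical and exist only for this sketch.

```python
import hashlib


class ToyTransform:
    """Hypothetical stand-in for a transform object; not part of dance."""

    def __init__(self, out=None, log_level="WARNING"):
        self.out = out
        self.log_level = log_level

    def __repr__(self):
        # Representation built from the class name and its options.
        return f"{type(self).__name__}(out={self.out!r}, log_level={self.log_level!r})"

    def hexdigest(self):
        # MD5 of the UTF-8 encoded representation string.
        return hashlib.md5(repr(self).encode("utf-8")).hexdigest()


a = ToyTransform(out="feat")
b = ToyTransform(out="feat")
c = ToyTransform(out="other")
```

Under this assumption, two objects configured identically share a digest, while any change in the options yields a different one.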

class dance.transforms.AnnDataTransform(func, **kwargs)[source]

AnnData transformation interface object.

This object provides an interface to any function that applies an in-place transformation to an AnnData object.

Example

Any one of the scanpy.pp functions should be supported. For example, we can use the scanpy.pp.normalize_total() function on the dance data object as follows:

>>> AnnDataTransform(scanpy.pp.normalize_total, target_sum=10000)(data)

where data is a dance data object, e.g., dance.data.Data. Calling the above function is effectively equivalent to calling:

>>> scanpy.pp.normalize_total(data.data, target_sum=10000)

Parameters:

func (Callable | str) –

__init__(func, **kwargs)[source]

Initialize the AnnDataTransform object.

Parameters:

func (Union[Callable, str]) – In-place AnnData transformation function, e.g., any one of the scanpy.pp functions.
**kwargs – Keyword arguments for the transformation function.
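The calling convention described above can be sketched without scanpy or dance. ToyAnnDataTransform, the toy normalize_total function, and the SimpleNamespace container are all hypothetical stand-ins; the only behavior taken from the text is that the wrapped function receives data.data together with the stored keyword arguments.

```python
from types import SimpleNamespace


class ToyAnnDataTransform:
    """Hypothetical wrapper mimicking the calling convention described above."""

    def __init__(self, func, **kwargs):
        self.func = func
        self.kwargs = kwargs

    def __call__(self, data):
        # Forward the wrapped container (data.data) and the stored keyword arguments.
        self.func(data.data, **self.kwargs)


def normalize_total(counts, target_sum):
    """Toy in-place function: rescale each row so that it sums to target_sum."""
    for row in counts:
        total = sum(row)
        if total > 0:
            row[:] = [value * target_sum / total for value in row]


# A minimal object exposing a .data attribute, standing in for a dance data object.
data = SimpleNamespace(data=[[1.0, 3.0], [2.0, 2.0]])
ToyAnnDataTransform(normalize_total, target_sum=100)(data)
```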

class dance.transforms.BatchFeature(*, channel=None, mod=None, **kwargs)[source]

Assign statistical batch features for each cell.

Parameters:

channel (str | None) –
mod (str | None) –

class dance.transforms.CellGiottoTopicProfile(*, ct_select='auto', ct_key='cellType', split_name=None, channel=None, channel_type='X', detection_threshold=-1, **kwargs)[source]

Giotto cell topic profile.

Reference

https://rubd.github.io/Giotto_site/reference/findGiniMarkers_one_vs_all.html

Parameters:

ct_select (Literal['auto'] | ~typing.List[str]) –
ct_key (str) –
split_name (str | None) –
channel (str | None) –
channel_type (str) –
detection_threshold (float) –

class dance.transforms.CellPCA(n_components=400, *, channel=None, mod=None, **kwargs)[source]

Reduce cell feature matrix with PCA.

Parameters:

n_components (int) – Number of PCA components to use.
channel (str | None) –
mod (str | None) –

class dance.transforms.CellSVD(n_components=400, *, channel=None, mod=None, **kwargs)[source]

Reduce cell feature matrix with SVD.

Parameters:

n_components (int) – Number of SVD components to take.
channel (str | None) –
mod (str | None) –

class dance.transforms.CellTopicProfile(*, ct_select='auto', ct_key='cellType', batch_key=None, split_name=None, channel=None, channel_type='X', method='median', **kwargs)[source]

Cell topic profile.

Parameters:

ct_select (Literal['auto'] | ~typing.List[str]) –
ct_key (str) –
batch_key (str | None) –
split_name (str | None) –
channel (str | None) –
channel_type (str) –
method (Literal['median', 'mean']) –

class dance.transforms.CellwiseMaskData(distr='exp', mask_rate=0.1, seed=None, min_gene_counts=5, **kwargs)[source]

Randomly mask data in a cell-wise approach.

For every cell that has more than min_gene_counts (5 by default) positive counts, mask positive counts according to the masking rate and the probability generated from the chosen distribution.

Parameters:

distr (Optional[Literal['exp', 'uniform']]) – Distribution to generate masks.
mask_rate (Optional[float]) – Masking rate.
seed (Optional[int]) – Random seed.
min_gene_counts (int) – Minimum number of genes expressed within a cell, below which we do not mask that cell.
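A minimal sketch of cell-wise masking, assuming uniform selection among the positive entries of each eligible cell. The function name cellwise_mask and the list-of-lists layout are hypothetical, and the non-uniform "exp" option is not reproduced here.

```python
import random


def cellwise_mask(counts, mask_rate=0.1, min_gene_counts=5, seed=None):
    """Return a boolean mask (True = masked) with the same shape as counts."""
    rng = random.Random(seed)
    mask = [[False] * len(row) for row in counts]
    for i, row in enumerate(counts):
        positive = [j for j, value in enumerate(row) if value > 0]
        # Skip cells that do not have more than min_gene_counts positive entries.
        if len(positive) <= min_gene_counts:
            continue
        n_mask = int(len(positive) * mask_rate)
        # Uniform sampling without replacement among the positive entries only.
        for j in rng.sample(positive, n_mask):
            mask[i][j] = True
    return mask


counts = [[1] * 20, [1, 1, 0, 0, 0, 0]]
mask = cellwise_mask(counts, mask_rate=0.25, min_gene_counts=5, seed=0)
```

In this toy input the first cell has 20 positive counts, so a quarter of them are masked; the second cell has only two and is left untouched.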

class dance.transforms.Compose(*transforms, use_master_log_level=True, **kwargs)[source]

Compose transformation by combining several transformation objects.

Parameters:

transforms (Tuple[BaseTransform, ...]) – Transformation objects.
use_master_log_level (bool) – If set to True, then reset all transforms’ loggers to use the log_level option passed to this Compose object.

Notes

The order in which the transform objects are passed will be exactly the order in which they will be applied to the data object.
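The ordering note can be illustrated with a toy composition. ToyCompose, add_one, and double are hypothetical names used only for this sketch; they operate on a plain list rather than a dance data object.

```python
class ToyCompose:
    """Hypothetical composition that applies callables in the order given."""

    def __init__(self, *transforms):
        self.transforms = transforms

    def __call__(self, data):
        for transform in self.transforms:
            transform(data)


def add_one(values):
    # In-place: add 1 to every element.
    values[:] = [v + 1 for v in values]


def double(values):
    # In-place: multiply every element by 2.
    values[:] = [v * 2 for v in values]


first = [1, 2]
ToyCompose(add_one, double)(first)   # computes (x + 1) * 2

second = [1, 2]
ToyCompose(double, add_one)(second)  # computes x * 2 + 1
```

Swapping the two steps changes the result, which is why the order of the arguments matters.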

hexdigest()[source]

Return MD5 hash using the representation of the transform object.

Return type: str

class dance.transforms.FilterCellsScanpy(min_counts=None, min_genes=None, max_counts=None, max_genes=None, split_name=None, channel=None, channel_type='X', **kwargs)[source]

Scanpy filtering cell transformation with additional options.

Allows passing gene counts as a ratio.

Parameters:

min_counts (Optional[int]) – Minimum number of counts required for a cell to be kept.
min_genes (Union[float, int, None]) – Minimum number (or ratio) of genes required for a cell to be kept.
max_counts (Optional[int]) – Maximum number of counts required for a cell to be kept.
max_genes (Union[float, int, None]) – Maximum number (or ratio) of genes required for a cell to be kept.
split_name (Optional[str]) – Which split to be used for filtering.
channel (Optional[str]) – Channel to be used for filtering.
channel_type (Optional[str]) – Channel type to be used for filtering.
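One way to read the "number (or ratio)" thresholds is sketched below, assuming a float strictly between 0 and 1 is interpreted as a fraction of the total number of genes. That interpretation, and the helper names resolve_threshold and keep_cells, are assumptions of this sketch rather than documented behavior.

```python
def resolve_threshold(value, total):
    """Convert a ratio threshold into an absolute count; pass other values through."""
    if value is None:
        return None
    if isinstance(value, float) and 0 < value < 1:
        return value * total
    return value


def keep_cells(counts, min_genes=None, max_genes=None):
    """Return indices of cells whose number of detected genes is within bounds."""
    n_genes = len(counts[0])
    low = resolve_threshold(min_genes, n_genes)
    high = resolve_threshold(max_genes, n_genes)
    kept = []
    for i, row in enumerate(counts):
        detected = sum(1 for value in row if value > 0)
        if low is not None and detected < low:
            continue
        if high is not None and detected > high:
            continue
        kept.append(i)
    return kept


counts = [
    [1, 0, 0, 0],  # 1 detected gene
    [1, 1, 0, 0],  # 2 detected genes
    [1, 1, 1, 1],  # 4 detected genes
]
by_ratio = keep_cells(counts, min_genes=0.5)  # at least half of the 4 genes
by_count = keep_cells(counts, max_genes=3)    # at most 3 genes
```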

class dance.transforms.FilterGenesCommon(batch_key=None, split_keys=None, **kwargs)[source]

Filter genes by taking the common genes across batches or splits.

Parameters:

batch_key (Optional[str]) – Which column in the .obs table to be used to distinguish batches.
split_keys (Optional[List[str]]) – A list of split names, e.g., ‘train’, to be used to find common genes.

Note

One and only one of batch_key or split_keys can be specified.
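A minimal sketch of taking common genes across batches, assuming "common" means a gene has at least one positive count in every batch. That definition and the function name common_genes are assumptions of this sketch.

```python
def common_genes(counts, gene_names, batches):
    """Return the genes that are expressed in every batch, in their original order."""
    expressed_per_batch = {}
    for row, batch in zip(counts, batches):
        expressed = expressed_per_batch.setdefault(batch, set())
        expressed.update(name for name, value in zip(gene_names, row) if value > 0)
    # Keep only genes found in all of the per-batch sets.
    common = set.intersection(*expressed_per_batch.values())
    return [name for name in gene_names if name in common]


gene_names = ["A", "B", "C"]
counts = [
    [1, 0, 2],  # batch b1
    [0, 0, 1],  # batch b1
    [3, 1, 0],  # batch b2
    [0, 0, 4],  # batch b2
]
batches = ["b1", "b1", "b2", "b2"]
shared = common_genes(counts, gene_names, batches)
```

Gene B is dropped here because it is never detected in batch b1.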

class dance.transforms.FilterGenesMarker(*, ct_profile_channel='CellTopicProfile', subset=True, label=None, threshold=1.25, eps=1e-06, **kwargs)[source]

Select marker genes based on log fold-change.

Parameters:

ct_profile_channel (str) – Name of the .varm channel that contains the cell-topic profile which will be used to compute the log fold-changes for each cell-topic (e.g., cell type).
subset (bool) – If set to True, then inplace subset the variables to only contain the markers.
label (Optional[str]) – If set, e.g., to 'marker', then save the marker indicator to the obs column named as marker.
threshold (float) – Threshold value of the log fold-change above which the gene will be considered as a marker.
eps (float) – A small value that prevents taking log of zeros.
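The roles of threshold and eps can be sketched as follows, assuming a base-2 log fold-change of each cell type against the mean of the remaining cell types. The log base, the one-vs-rest comparison, and the function name marker_genes are assumptions of this sketch.

```python
import math


def marker_genes(profile, threshold=1.25, eps=1e-6):
    """Return sorted gene indices that pass the threshold for at least one cell type.

    profile maps each cell type to a list of per-gene profile values.
    """
    cell_types = list(profile)
    n_genes = len(profile[cell_types[0]])
    markers = set()
    for cell_type in cell_types:
        others = [c for c in cell_types if c != cell_type]
        for gene in range(n_genes):
            rest = sum(profile[c][gene] for c in others) / len(others)
            # eps keeps both the ratio and the log finite when a value is zero.
            lfc = math.log2((profile[cell_type][gene] + eps) / (rest + eps))
            if lfc > threshold:
                markers.add(gene)
    return sorted(markers)


profile = {"T": [8.0, 1.0, 2.0], "B": [1.0, 1.0, 8.0]}
markers = marker_genes(profile)
```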

class dance.transforms.FilterGenesMarkerGini(*, ct_profile_channel='CellGiottoTopicProfile', ct_profile_detection_channel='CellGiottoDetectionTopicProfile', subset=True, label=None, **kwargs)[source]

Select marker genes based on Gini coefficient.

Identify marker genes for all clusters in a one-vs-all manner based on Gini coefficients, a measure of inequality.

Parameters:

ct_profile_channel (str) – Name of the .varm channel that contains the cell-topic profile which will be used to compute the log fold-changes for each cell-topic (e.g., cell type).
ct_profile_detection_channel (str) – Name of the .varm channel that contains the cell-topic detection profile (the counts of values greater than some threshold), which will be used to compute the log fold-changes for each cell-topic (e.g., cell type).
subset (bool) – If set to True, then inplace subset the variables to only contain the markers.
label (Optional[str]) – If set, e.g., to 'marker', then save the marker indicator to the obs column named as marker.

Reference

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1010-4?ref=https://githubhelp.com
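For readers unfamiliar with the Gini coefficient, the standard rank-based formula for non-negative values is sketched below. This shows the general statistic only; how the transform applies it to the cell-topic profiles is not covered here.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly equal)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending values, with ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


even = gini([1, 1, 1, 1])      # expression spread evenly across clusters
specific = gini([0, 0, 0, 1])  # expression concentrated in one cluster
```

A gene expressed in a single cluster scores much higher than one expressed evenly, which is what makes the statistic useful for marker selection.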

class dance.transforms.FilterGenesMatch(prefixes=None, suffixes=None, case_sensitive=False, **kwargs)[source]

Filter genes based on prefixes and suffixes.

Parameters:

prefixes (Optional[List[str]]) – List of prefixes to remove.
suffixes (Optional[List[str]]) – List of suffixes to remove.
case_sensitive (bool) –

class dance.transforms.FilterGenesPercentile(min_val=1, max_val=99, *, mode='sum', channel=None, channel_type=None, whitelist_indicators=None, **kwargs)[source]

Filter genes based on percentiles of the summarized gene expressions.

Parameters:

min_val (Optional[float]) – Minimum percentile of the summarized expression value below which the genes will be discarded.
max_val (Optional[float]) – Maximum percentile of the summarized expression value above which the genes will be discarded.
mode (Literal['sum', 'cv', 'rv', 'var']) – Summarization mode. Available options are [sum|var|cv|rv]. sum calculates the sum of expression values, var calculates the variance of the expression values, cv uses the coefficient of variation (std / mean), and rv uses the relative variance (var / mean).
channel (Optional[str]) – Which channel, more specifically, layers, to use. Use the default .X if not set. If channel is specified, then need to specify channel_type to be layers as well.
channel_type (Optional[str]) – Type of channels specified. Only allow None (the default setting) or layers (when channel is specified).
whitelist_indicators (Union[List[str], str, None]) – A list of (or a single) var columns that indicates the genes to be excluded from the filtering process. Note that these genes will still be used in the summary stats computation, and thus will still contribute to the threshold percentile. If not set, then no genes will be excluded from the filtering process.
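The four summarization modes and the percentile cut-offs can be sketched in plain Python. This sketch assumes population variance, linear interpolation between ranks for the percentiles, and genes with a non-zero mean; the helper names are hypothetical and the whitelist option is omitted.

```python
import statistics


def summarize(column, mode):
    """Summarize one gene's expression values across cells."""
    if mode == "sum":
        return sum(column)
    mean = statistics.fmean(column)
    var = statistics.pvariance(column)  # population variance (an assumption)
    if mode == "var":
        return var
    if mode == "cv":
        return var ** 0.5 / mean  # std / mean; assumes mean != 0
    if mode == "rv":
        return var / mean         # var / mean; assumes mean != 0
    raise ValueError(f"Unknown mode: {mode!r}")


def percentile(values, q):
    """q-th percentile (0 to 100) with linear interpolation between ranks."""
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def filter_genes_percentile(counts, min_val=1, max_val=99, mode="sum"):
    """Return indices of genes whose summary lies within the percentile range."""
    columns = list(zip(*counts))  # one tuple of values per gene
    stats = [summarize(col, mode) for col in columns]
    low = percentile(stats, min_val)
    high = percentile(stats, max_val)
    return [j for j, s in enumerate(stats) if low <= s <= high]


counts = [[1, 5, 10, 15, 500], [0, 5, 10, 15, 500]]
kept = filter_genes_percentile(counts, min_val=10, max_val=90, mode="sum")
```

With these toy counts the lowest and highest genes by total expression fall outside the 10th to 90th percentile range and are dropped.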

class dance.transforms.FilterGenesRegression(method, num_genes=400, *, channel=None, mod=None, skip_count_check=False, **kwargs)[source]

Select genes based on regression.

Parameters:

method (str) – Which regression-based gene selection method to use. Supported options are: "enclasc", "seurat3", and "scmap".
num_genes (int) – Number of genes to select.
channel (str | None) –
mod (str | None) –
skip_count_check (bool) –

Note

The implementation is adapted from the EnClaSC GitHub repo: https://github.com/xy-chen16/EnClaSC

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03679-z

class dance.transforms.FilterGenesScanpy(min_counts=None, min_cells=None, max_counts=None, max_cells=None, split_name=None, channel=None, channel_type='X', **kwargs)[source]

Scanpy filtering gene transformation with additional options.

Parameters:

min_counts (Optional[int]) – Minimum number of counts required for a gene to be kept.
min_cells (Union[float, int, None]) – Minimum number (or ratio) of cells required for a gene to be kept.
max_counts (Optional[int]) – Maximum number of counts required for a gene to be kept.
max_cells (Union[float, int, None]) – Maximum number (or ratio) of cells required for a gene to be kept.
split_name (Optional[str]) – Which split to be used for filtering.
channel (Optional[str]) – Channel to be used for filtering.
channel_type (Optional[str]) – Channel type to be used for filtering.

class dance.transforms.FilterGenesTopK(num_genes, top=True, *, mode='cv', channel=None, channel_type='X', whitelist_indicators=None, **kwargs)[source]

Select top/bottom genes based on the summarized gene expressions.

Parameters:

num_genes (int) – Number of genes to be selected.
top (bool) – If set to True, then use the genes with highest values of the specified gene summary stats.
mode (Literal['sum', 'cv', 'rv', 'var']) – Summarization mode. Available options are [sum|var|cv|rv]. sum calculates the sum of expression values, var calculates the variance of the expression values, cv uses the coefficient of variation (std / mean ), and rv uses the relative variance (var / mean).
channel (Optional[str]) – Which channel, more specifically, layers, to use. Use the default .X if not set. If channel is specified, then need to specify channel_type to be layers as well.
channel_type (Optional[str]) – Type of channels specified. Only allow None (the default setting) or layers (when channel is specified).
whitelist_indicators (Union[List[str], str, None]) – A list of (or a single) var columns that indicates the genes to be excluded from the filtering process. Note that these genes will still be used in the summary stats computation, and thus will still contribute to the threshold percentile. If not set, then no genes will be excluded from the filtering process.

class dance.transforms.FilterScanpy(min_counts=None, min_genes_or_cells=None, max_counts=None, max_genes_or_cells=None, split_name=None, channel=None, channel_type='X', **kwargs)[source]

Scanpy filtering transformation with additional options.

Parameters:

min_counts (int | None) –
min_genes_or_cells (float | int | None) –
max_counts (int | None) –
max_genes_or_cells (float | int | None) –
split_name (str | None) –
channel (str | None) –
channel_type (str | None) –

class dance.transforms.GeneHoldout(n_top=5, batch_size=512, random_state=None, **kwargs)[source]

Progressively hold out genes for DeepImpute.

Split genes into target batches. For every target gene in one batch, refer to the genes that are not in this batch and select predictor genes with high covariance with target gene.

Parameters:

n_top (int) – Number of predictor genes per target gene.
batch_size (int) – Target batch size.
random_state (Optional[int]) – Random state.
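The predictor selection step can be sketched as a covariance ranking. This sketch assumes sample covariance and a plain descending sort; batching of targets, random shuffling, and tie handling in the actual transform are not reproduced, and the function names are hypothetical.

```python
def covariance(x, y):
    """Sample covariance of two equally long sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


def select_predictors(counts, target_batch, n_top):
    """For each target gene, pick the n_top genes outside the batch with highest covariance."""
    columns = list(zip(*counts))  # one tuple of values per gene
    candidates = [j for j in range(len(columns)) if j not in target_batch]
    predictors = {}
    for target in target_batch:
        ranked = sorted(
            candidates,
            key=lambda j: covariance(columns[target], columns[j]),
            reverse=True,
        )
        predictors[target] = ranked[:n_top]
    return predictors


# Four cells (rows) by four genes (columns).
counts = [
    [1, 2, 4, 1],
    [2, 4, 3, 1],
    [3, 6, 2, 1],
    [4, 8, 1, 1],
]
predictors = select_predictors(counts, target_batch=[0], n_top=2)
```

Gene 1 rises with the target gene 0 and ranks first, the constant gene 3 has zero covariance, and the anti-correlated gene 2 ranks last.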

class dance.transforms.GeneStats(genestats_select='all', *, fill_na=None, threshold=0, pseudo=False, split_name='train', channel=None, channel_type=None, **kwargs)[source]

Gene statistics computation.

Parameters:

genestats_select (Union[str, List[str]]) – List of names of the gene stats functions to use. If set to "all" (by default), then use all available gene stats functions.
fill_na (Optional[float]) – If not set (default), then do not fill nans. Otherwise, fill nans with the specified value.
threshold (float) – Threshold value for filtering gene expression when computing stats, e.g., mean expression values.
pseudo (bool) – If set to True, then add 1 to the numerator and denominator when computing the ratio (alpha) for which the gene expression values are above the specified threshold.
split_name (Optional[str]) – Which split to compute the gene stats on.
channel (str | None) –
channel_type (str | None) –

class dance.transforms.MaskData(mask_rate=0.1, seed=None, **kwargs)[source]

Randomly mask data.

Randomly mask positive counts according to masking rate.

Parameters:

mask_rate (Optional[float]) – Masking rate.
seed (Optional[int]) – Random seed.

class dance.transforms.MorphologyFeatureCNN(*, model_name='resnet50', n_components=50, random_state=0, crop_size=20, target_size=299, device='cpu', channels=('spatial_pixel', 'image'), channel_types=('obsm', 'uns'), **kwargs)[source]

Cell morphological features extracted from CNN.

Parameters:

model_name (str) – Pretrained CNN name: "resnet50", "inceptron_v3", "xception", "vgg16".
n_components (int) – Number of feature dimensions.
crop_size (int) – Cell image cropping size (cropped as square centered around the target cell).
target_size (int) – Target patch size.
random_state (int) –
device (str) –
channels (Sequence[str]) –
channel_types (Sequence[str]) –

Reference

https://doi.org/10.1101/2020.05.31.125658

class dance.transforms.PseudoMixture(*, n_pseudo=1000, nc_min=2, nc_max=10, ct_select='auto', ct_key='cellType', channel=None, channel_type='X', random_state=0, prefix='ps_mix_', in_split_name='ref', out_split_name='pseudo', label_batch=False, **kwargs)[source]

Pseudo mixture generation.

Parameters:

n_pseudo (int) –
nc_min (int) –
nc_max (int) –
ct_select (Literal['auto'] | ~typing.List[str]) –
ct_key (str) –
channel (str | None) –
channel_type (str | None) –
random_state (int | None) –
prefix (str) –
in_split_name (str) –
out_split_name (str | None) –
label_batch (bool) –

class dance.transforms.RemoveSplit(*, split_name, **kwargs)[source]

Remove a particular split from the data.

Parameters:

split_name (str) –

class dance.transforms.SCNFeature(num_top_genes=10, alpha1=0.05, alpha2=0.001, mu=2, num_top_gene_pairs=25, max_gene_per_ct=3, *, split_name='train', channel=None, channel_type=None, **kwargs)[source]

Differential gene-pair feature used in SingleCellNet.

Parameters:

num_top_genes (int) –
alpha1 (float) –
alpha2 (float) –
mu (float) –
num_top_gene_pairs (int) –
max_gene_per_ct (int) –
split_name (str | None) –
channel (str | None) –
channel_type (str | None) –

class dance.transforms.SMEFeature(n_neighbors=3, n_components=50, random_state=0, *, channels=(None, 'SMEGraph'), channel_types=(None, 'obsp'), **kwargs)[source]

Spatial Morphological gene Expression normalization feature from stLearn.

Parameters:

n_neighbors (int) – Number of spatial spots neighbors to consider.
n_components (int) – Number of feature dimensions.
random_state (int) –
channels (Sequence[str | None]) –
channel_types (Sequence[str | None]) –

Reference

https://doi.org/10.1101/2020.05.31.125658

class dance.transforms.SaveRaw(exist_ok=False, **kwargs)[source]

Save raw data.

See anndata.AnnData.raw() for more information.

Parameters:

exist_ok (bool) – If set to False, then raise an exception if the raw attribute is already set.

class dance.transforms.ScTransform(split_names=None, batch_key=None, min_cells=5, gmean_eps=1, n_genes=2000, n_cells=None, bin_size=500, bw_adjust=3, **kwargs)[source]

ScTransform normalization and variance stabilization.

Note

This is a Python implementation adapted from https://github.com/atarashansky/SCTransformPy

Parameters:

split_names (Union[Literal['ALL'], List[str], None]) – Which split(s) to apply the transformation.
batch_key (Optional[str]) – Key for batch information.
min_cells (int) – Minimum number of cells the gene has to be expressed in, below which that gene will be discarded.
gmean_eps (int) – Pseudocount.
n_genes (Optional[int]) – Maximum number of genes to use. Use all if set to None.
n_cells (Optional[int]) – Maximum number of cells to use. Use all if set to None.
bin_size (int) – Number of genes a single bin contains.
bw_adjust (float) – Bandwidth adjusting parameter.

Reference

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1874-1

class dance.transforms.ScaleFeature(*, axis=0, split_names=None, batch_key=None, mode='normalize', eps=-1, **kwargs)[source]

Scale the feature matrix in the AnnData object.

This is an extension of scanpy.pp.scale(), allowing split- or batch-wide scaling.

Parameters:

axis (int) – Axis along which the scaling is performed.
split_names (Union[Literal['ALL'], List[str], None]) – Indicate which splits to perform the scaling independently. If set to ‘ALL’, then go through all splits available in the data.
batch_key (Optional[str]) – Indicate which column in .obs to use as the batch index to guide the batch-wide scaling.
mode (Literal['normalize', 'standardize', 'minmax', 'l2']) – Scaling mode, see dance.utils.matrix.normalize() for more information.
eps (float) – Correction factor, see dance.utils.matrix.normalize() for more information.

Note

The order of checking split- or batch-wide scaling mode is: batch_key > split_names > None (i.e., all).
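The precedence rule can be sketched as a small grouping helper that decides which cells are scaled together. The function name resolve_groups and the index-based inputs are hypothetical; only the batch_key > split_names > None ordering is taken from the note above.

```python
def resolve_groups(n_cells, batch_labels=None, splits=None):
    """Return lists of cell indices that would be scaled together."""
    if batch_labels is not None:
        # Batch-wide scaling takes priority: group cells by batch label.
        groups = {}
        for index, label in enumerate(batch_labels):
            groups.setdefault(label, []).append(index)
        return list(groups.values())
    if splits is not None:
        # Otherwise, scale each split independently.
        return [list(indices) for indices in splits.values()]
    # Fall back to scaling all cells together.
    return [list(range(n_cells))]


splits = {"train": [0, 1], "test": [2, 3]}
all_cells = resolve_groups(4)
by_split = resolve_groups(4, splits=splits)
by_batch = resolve_groups(4, batch_labels=["a", "b", "a", "b"], splits=splits)
```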

class dance.transforms.SetConfig(config_dict, **kwargs)[source]

Set configuration options of a dance data object.

Parameters:

config_dict (Dict[str, Any]) – Dance data object configuration dictionary. See set_config_from_dict().

class dance.transforms.SpatialIDEFeature(channels=(None, 'spatial'), channel_types=(None, 'obsm'), **kwargs)[source]

Spatial IDE feature.

The SpatialDE model is based on the assumption of normally distributed residual noise and independent observations across cells. There are two normalization steps:

Variance-stabilizing transformation for negative-binomial-distributed data (Anscombe’s transformation).

Regress log total count values out from the Anscombe-transformed expression values.

Reference

https://www.nature.com/articles/nmeth.4636#Sec2

regress_out(sample_info, expression_matrix, covariate_formula, design_formula='1', rcond=-1)[source]

Implementation of limma’s removeBatchEffect function.

stabilize(expression_matrix)[source]

Use Anscombe’s approximation to variance-stabilize negative binomial data.

See https://f1000research.com/posters/4-1041 for motivation.

Assumes columns are samples and rows are genes.
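A minimal sketch of the variance-stabilizing step, assuming the log form of Anscombe's approximation, log(x + 1 / (2 * phi)), with the overdispersion phi supplied by the caller. Estimating phi from the data is not shown, and the function name anscombe_nb is hypothetical.

```python
import math


def anscombe_nb(expression_matrix, phi):
    """Apply log(x + 1 / (2 * phi)) element-wise to a genes x samples matrix."""
    offset = 1.0 / (2.0 * phi)
    return [[math.log(value + offset) for value in row] for row in expression_matrix]


# phi = 0.5 gives an offset of 1, so the transform reduces to log(x + 1).
stabilized = anscombe_nb([[0.0, 1.0], [3.0, 7.0]], phi=0.5)
```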

Parameters:

channels (Sequence[str | None]) –
channel_types (Sequence[str | None]) –

class dance.transforms.WeightedFeaturePCA(n_components=400, split_name=None, feat_norm_mode=None, feat_norm_axis=0, **kwargs)[source]

Compute the weighted gene PCA as cell features.

Given a gene expression matrix of dimension (cell x gene), the gene PCA is first computed. Then, the representation of each cell is computed by taking the weighted sum of the gene PCAs based on that cell’s gene expression values.

Parameters:

n_components (int) – Number of PCs to use.
split_name (Optional[str]) – Which split to use to compute the gene PCA. If not set, use all data.
feat_norm_mode (str | None) –
feat_norm_axis (int) –
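The weighted-sum step can be sketched on its own, assuming the gene PCA has already been computed and is given as a (gene x component) matrix. The function name weighted_gene_features is hypothetical, and the PCA itself and any feature normalization are left out.

```python
def weighted_gene_features(expression, gene_pcs):
    """Combine gene PCs into cell features, weighting by each cell's expression.

    expression: cells x genes; gene_pcs: genes x components.
    """
    n_components = len(gene_pcs[0])
    return [
        [sum(weight * pcs[k] for weight, pcs in zip(cell, gene_pcs)) for k in range(n_components)]
        for cell in expression
    ]


expression = [[1, 0], [1, 2]]          # two cells, two genes
gene_pcs = [[1.0, 0.0], [0.5, -1.0]]   # two genes, two components
cell_features = weighted_gene_features(expression, gene_pcs)
```

This is equivalent to the matrix product of the expression matrix with the gene PC matrix.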