dance.transforms

class dance.transforms.base.BaseTransform(out=None, log_level='WARNING')[source]

BaseTransform abstract object.

Parameters:
  • log_level (Literal['NOTSET', 'DEBUG', 'INFO', 'WARNING', 'ERROR']) – Logging level.

  • out (Optional[str]) – Name of the obsm channel or layer where the transformed features will be saved. Use the current transformation name if it is not set.

hexdigest()[source]

Return MD5 hash using the representation of the transform object.

Return type:

str
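The idea of hashing an object's representation can be sketched as follows. This is only an illustration under assumptions, not the library's code: the `ToyTransform` class and the format of its `__repr__` are hypothetical.

```python
import hashlib


class ToyTransform:
    """Hypothetical stand-in for a transform; not part of dance."""

    def __init__(self, out=None, log_level="WARNING"):
        self.out = out
        self.log_level = log_level

    def __repr__(self):
        # The representation encodes the class name and its settings.
        return f"{type(self).__name__}(out={self.out!r}, log_level={self.log_level!r})"

    def hexdigest(self):
        # MD5 of the UTF-8 encoded representation string.
        return hashlib.md5(repr(self).encode("utf-8")).hexdigest()


a = ToyTransform(out="feat")
b = ToyTransform(out="feat")
c = ToyTransform(out="other")
print(a.hexdigest() == b.hexdigest())  # same settings, same digest
print(a.hexdigest() == c.hexdigest())  # different settings, different digest
```

Because the digest depends only on the representation, two objects configured identically produce the same hash, which makes it usable as a cache key.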

class dance.transforms.AlignMod(**kwargs)[source]

Aligning mods and metadata in multimodal data.

class dance.transforms.AnnDataAdaptor(transform, **data_init_kwargs)[source]

Adaptor for transforming AnnData instead of dance data object.

Example

Modify an AnnData object in place:

>>> AnnDataAdaptor(FilterGenes(mode="sum"))(adata)
class dance.transforms.AnnDataTransform(func, **kwargs)[source]

AnnData transformation interface object.

This object provides an interface to any function that applies an in-place transformation to an AnnData object.

Example

Any one of the scanpy.pp functions should be supported. For example, we can use the scanpy.pp.normalize_total() function on the dance data object as follows:

>>> AnnDataTransform(scanpy.pp.normalize_total, target_sum=10000)(data)

where data is a dance data object, e.g., dance.data.Data. Calling the above function is effectively equivalent to calling

>>> scanpy.pp.normalize_total(data.data, target_sum=10000)
Parameters:

func (Callable | str) –

__init__(func, **kwargs)[source]

Initialize the AnnDataTransform object.

Parameters:
  • func (Union[Callable, str]) – In-place AnnData transformation function, e.g., any one of the scanpy.pp functions.

  • **kwargs – Keyword arguments for the transformation function.
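The forwarding behavior described above (store a function and its keyword arguments, then apply it to the wrapped object) can be sketched in plain Python. The names `ToyData`, `ToyInPlaceTransform`, and `scale_in_place` are hypothetical and only mimic the pattern; they are not dance or scanpy APIs.

```python
class ToyData:
    """Hypothetical container exposing the wrapped object as ``.data``."""

    def __init__(self, inner):
        self.data = inner


class ToyInPlaceTransform:
    """Hypothetical wrapper: stores a function and its keyword arguments."""

    def __init__(self, func, **kwargs):
        self.func = func
        self.kwargs = kwargs

    def __call__(self, data):
        # Forward the inner object and the stored keyword arguments.
        self.func(data.data, **self.kwargs)


def scale_in_place(values, factor=1.0):
    # Example in-place function: multiplies every entry of a list.
    for i, v in enumerate(values):
        values[i] = v * factor


data = ToyData([1.0, 2.0, 3.0])
ToyInPlaceTransform(scale_in_place, factor=2.0)(data)
print(data.data)
```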

class dance.transforms.BaseTransform(out=None, log_level='WARNING')[source]

BaseTransform abstract object.

Parameters:
  • log_level (Literal['NOTSET', 'DEBUG', 'INFO', 'WARNING', 'ERROR']) – Logging level.

  • out (Optional[str]) – Name of the obsm channel or layer where the transformed features will be saved. Use the current transformation name if it is not set.

hexdigest()[source]

Return MD5 hash using the representation of the transform object.

Return type:

str

class dance.transforms.BatchFeature(*, channel=None, mod=None, **kwargs)[source]

Assign statistical batch features for each cell.

Parameters:
  • channel (str | None) –

  • mod (str | None) –

class dance.transforms.CellGiottoTopicProfile(*, ct_select='auto', ct_key='cellType', split_name=None, channel=None, channel_type='X', detection_threshold=-1, **kwargs)[source]

Giotto cell topic profile.

References

https://rubd.github.io/Giotto_site/reference/findGiniMarkers_one_vs_all.html

Parameters:
  • ct_select (Literal['auto'] | ~typing.List[str]) –

  • ct_key (str) –

  • split_name (str | None) –

  • channel (str | None) –

  • channel_type (str) –

  • detection_threshold (float) –

class dance.transforms.CellPCA(n_components=400, *, channel=None, mod=None, save_info=False, svd_solver='auto', **kwargs)[source]

Reduce cell feature matrix with PCA.

Parameters:
  • n_components (Union[float, int]) – Number of PCA components to use.

  • channel (str | None) –

  • mod (str | None) –

  • save_info (bool) –

  • svd_solver (Literal['auto', 'full', 'arpack', 'randomized']) –

class dance.transforms.CellSVD(n_components=400, *, channel=None, mod=None, algorithm='randomized', save_info=True, **kwargs)[source]

Reduce cell feature matrix with SVD.

Parameters:
  • n_components (Union[float, int]) – Number of SVD components to take.

  • channel (str | None) –

  • mod (str | None) –

  • algorithm (Literal['arpack', 'randomized']) –

class dance.transforms.CellSparsePCA(n_components=400, *, channel=None, mod=None, **kwargs)[source]

Reduce cell feature matrix with SparsePCA.

Parameters:
  • n_components (Union[float, int]) – Number of SparsePCA components to use.

  • channel (str | None) –

  • mod (str | None) –

class dance.transforms.CellTopicProfile(*, ct_select='auto', ct_key='cellType', batch_key=None, split_name=None, channel=None, channel_type='X', method='median', **kwargs)[source]

Cell topic profile.

Parameters:
  • ct_select (Literal['auto'] | ~typing.List[str]) –

  • ct_key (str) –

  • batch_key (str | None) –

  • split_name (str | None) –

  • channel (str | None) –

  • channel_type (str) –

  • method (Literal['median', 'mean']) –

class dance.transforms.CellwiseMaskData(distr='exp', mask_rate=0.1, seed=None, min_gene_counts=5, add_test_mask=False, **kwargs)[source]

Randomly mask data in a cell-wise approach.

For every cell that has more than min_gene_counts positive counts, mask positive counts according to mask_rate and probability generated from the specified distribution.

The masked entries are assigned to validation and optionally test masks.

Parameters:
  • distr (Optional[Literal['exp', 'uniform']]) – Distribution to generate probabilities for masking counts. Higher counts might have different probabilities depending on the distribution.

  • mask_rate (Optional[float]) – Overall masking rate (proportion of positive counts to mask per cell).

  • seed (Optional[int]) – Random seed for reproducibility.

  • min_gene_counts (int) – Minimum number of positive counts within a cell below which we do not mask that cell.

  • add_test_mask (bool) – If True, the masked entries (determined by mask_rate) are further split into validation and test sets. Approximately 10% of the masked entries go to valid_mask, and the remaining 90% go to test_mask. If False, all masked entries go to valid_mask, and test_mask will be empty (all False).

  • **kwargs – Additional keyword arguments passed to the base class.
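A minimal sketch of the cell-wise masking logic, assuming plain Python lists for the count matrix. It simplifies the description above in one respect: masked entries are drawn uniformly, so the `distr` weighting is not modeled. The function name `cellwise_mask` and the rounding choices are assumptions.

```python
import random


def cellwise_mask(counts, mask_rate=0.1, min_gene_counts=5, add_test_mask=False, seed=None):
    """Return (valid_mask, test_mask) as per-cell sets of masked column indices."""
    rng = random.Random(seed)
    valid_mask, test_mask = [], []
    for row in counts:
        positive = [j for j, v in enumerate(row) if v > 0]
        valid, test = set(), set()
        # Only mask cells with more than min_gene_counts positive counts.
        if len(positive) > min_gene_counts:
            n_mask = int(len(positive) * mask_rate)
            masked = rng.sample(positive, n_mask)  # uniform choice (assumption)
            if add_test_mask:
                n_valid = int(round(0.1 * n_mask))  # roughly 10% go to validation
                valid, test = set(masked[:n_valid]), set(masked[n_valid:])
            else:
                valid = set(masked)
        valid_mask.append(valid)
        test_mask.append(test)
    return valid_mask, test_mask


counts = [
    [1] * 20,             # 20 positive counts: eligible for masking
    [1, 0, 0, 2, 0] * 4,  # 8 positive counts: eligible
    [3, 0, 0, 0, 0] * 4,  # 4 positive counts: skipped
]
valid, test = cellwise_mask(counts, mask_rate=0.5, seed=0)
print([len(v) for v in valid], [len(t) for t in test])
```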

class dance.transforms.ColumnSumNormalize(*, axis=0, split_names=None, batch_key=None, mode='normalize', eps=-1, **kwargs)[source]

Scale the feature matrix in the AnnData object.

This is an extension of scanpy.pp.scale(), allowing split- or batch-wide scaling.

Parameters:
  • axis (int) – Axis along which the scaling is performed.

  • split_names (Union[Literal['ALL'], List[str], None]) – Indicates the splits on which to perform the scaling independently. If set to ‘ALL’, then go through all splits available in the data.

  • batch_key (Optional[str]) – Indicate which column in .obs to use as the batch index to guide the batch-wide scaling.

  • mode (Literal['normalize', 'standardize', 'minmax', 'l2']) – Scaling mode, see dance.utils.matrix.normalize() for more information.

  • eps (float) – Correction factor, see dance.utils.matrix.normalize() for more information.

Note

The order of checking split- or batch-wide scaling mode is: batch_key > split_names > None (i.e., all).
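The precedence in the note can be written as a small decision function. This is a sketch only; `resolve_scaling_scope` is a hypothetical helper, not a dance function.

```python
def resolve_scaling_scope(batch_key=None, split_names=None):
    """Return which grouping would be used, following the documented order."""
    if batch_key is not None:
        return ("batch", batch_key)    # batch-wide scaling takes priority
    if split_names is not None:
        return ("split", split_names)  # otherwise scale per split
    return ("all", None)               # otherwise scale over all data


print(resolve_scaling_scope(batch_key="batch", split_names="ALL"))
print(resolve_scaling_scope(split_names=["train", "test"]))
print(resolve_scaling_scope())
```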

class dance.transforms.Compose(*transforms, use_master_log_level=True, **kwargs)[source]

Compose a transformation by combining several transformation objects.

Parameters:
  • transforms (Tuple[BaseTransform, ...]) – Transformation objects.

  • use_master_log_level (bool) – If set to True, then reset all transforms’ loggers to use the log_level option passed to this Compose object.

Notes

The order in which the transform objects are passed will be exactly the order in which they will be applied to the data object.

hexdigest()[source]

Return MD5 hash using the representation of the transform object.

Return type:

str

transform_with_history(data)[source]

Apply all transformations sequentially and record intermediate results.

Parameters:

data (Data) – The data object to be transformed.

Returns:

A dictionary containing the original data and results after each transformation. Keys are transformation names or indices, values are the transformed data.

Return type:

Dict[str, Data]
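Sequential application with recorded intermediates can be sketched as below. This is an illustration with plain callables and lists; the key format (`"original"`, then `"<index>_<name>"`) is an assumption, since the description only states that keys are transformation names or indices.

```python
import copy


def transform_with_history(data, transforms):
    """Apply callables in order and keep a copy of every intermediate result."""
    history = {"original": copy.deepcopy(data)}
    current = data
    for idx, transform in enumerate(transforms):
        current = transform(current)
        # Key by position and name so repeated transforms do not collide.
        history[f"{idx}_{transform.__name__}"] = copy.deepcopy(current)
    return history


def add_one(values):
    return [v + 1 for v in values]


def double(values):
    return [v * 2 for v in values]


history = transform_with_history([1, 2, 3], [add_one, double])
for key, value in history.items():
    print(key, value)
```

Swapping the two callables changes the final result, which reflects the note that transforms are applied in the order they are passed.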

class dance.transforms.FeatureCellPlaceHolder(n_components=400, *, channel=None, mod=None, **kwargs)[source]

Used as a placeholder to skip the process.

Parameters:
  • n_components (int) – This value is not used.

  • channel (str | None) –

  • mod (str | None) –

class dance.transforms.FilterCellTransform(species='human', image_save_path=None, **kwargs)[source]
Parameters:
  • species (Literal['human', 'mouse']) –

  • image_save_path (str) –

class dance.transforms.FilterCellsCommonMod(mod1, mod2, sol=None, **kwargs)[source]

Initialize the FilterCellsCommonMod class.

Parameters:
  • mod1 (str) – Name of the first modality in the single-cell dataset.

  • mod2 (str) – Name of the second modality in the single-cell dataset.

  • sol (Optional[str], default=None) – Name of the optional solution dataset containing cell labels or annotations.

  • **kwargs (dict) – Additional keyword arguments passed to the base transformation class.

class dance.transforms.FilterCellsPlaceHolder(split_name=None, channel=None, channel_type='X', add_n_counts=True, add_n_genes=True, inplace=True, **kwargs)[source]

Used as a placeholder to skip the process.

Parameters:
  • split_name (str | None) –

  • channel (str | None) –

  • channel_type (str | None) –

class dance.transforms.FilterCellsScanpy(min_counts=None, min_genes=None, max_counts=None, max_genes=None, split_name=None, channel=None, channel_type='X', key_n_counts=None, key_n_genes=None, inplace=True, **kwargs)[source]

Scanpy filtering cell transformation with additional options.

Allows passing gene counts as a ratio.

Parameters:
  • min_counts (Union[float, int, None]) – Minimum number of counts required for a cell to be kept.

  • min_genes (Union[float, int, None]) – Minimum number (or ratio) of genes required for a cell to be kept.

  • max_counts (Union[float, int, None]) – Maximum number of counts required for a cell to be kept.

  • max_genes (Union[float, int, None]) – Maximum number (or ratio) of genes required for a cell to be kept.

  • split_name (Optional[str]) – Which split to be used for filtering.

  • channel (Optional[str]) – Channel to be used for filtering.

  • channel_type (Optional[str]) – Channel type to be used for filtering.

  • key_n_counts (Optional[str]) – The location to add n_counts (the total counts for each cell). If it is None, it will not be added.

  • key_n_genes (Optional[str]) – The location to add n_genes (the number of genes expressed for each cell). If it is None, it will not be added.

  • inplace – If inplace is True, the original data is replaced with the filtered data. If inplace is False, the filtered data is stored in varm.

class dance.transforms.FilterCellsScanpyOrder(order=None, min_counts=None, min_genes=None, max_counts=None, max_genes=None, split_name=None, channel=None, channel_type='X', add_n_counts=True, add_n_genes=True, inplace=True, **kwargs)[source]

Scanpy filtering cell transformation with additional options.

Allows passing gene counts as a ratio.

Parameters:
  • order (Optional[List[str]]) – Order of (min_counts, min_genes, max_counts, max_genes). For example, ["min_counts", "min_genes", "max_counts", "max_genes"] or ["max_counts", "min_genes"]. If not set, will be set by default to ["min_counts", "min_genes", "max_counts", "max_genes"].

  • min_counts (Union[float, int, None]) – Minimum number of counts required for a cell to be kept.

  • min_genes (Union[float, int, None]) – Minimum number (or ratio) of genes required for a cell to be kept.

  • max_counts (Union[float, int, None]) – Maximum number of counts required for a cell to be kept.

  • max_genes (Union[float, int, None]) – Maximum number (or ratio) of genes required for a cell to be kept.

  • split_name (Optional[str]) – Which split to be used for filtering.

  • channel (Optional[str]) – Channel to be used for filtering.

  • channel_type (Optional[str]) – Channel type to be used for filtering.

  • add_n_counts – Whether to add n_counts, the total counts for each cell.

  • add_n_genes – Whether to add n_genes, the number of genes expressed for each cell.

  • inplace – If inplace is True, the original data is replaced with the filtered data. If inplace is False, the filtered data is stored in varm.

class dance.transforms.FilterCellsType(cell_type_threshold=10, **kwargs)[source]

Filter cell types based on the threshold.

class dance.transforms.FilterGenes(*, mode='sum', channel=None, channel_type=None, whitelist_indicators=None, add_n_counts=True, add_n_cells=True, inplace=True, **kwargs)[source]

Filter genes based on the summarized gene expressions.

Parameters:
  • mode (Literal['sum', 'cv', 'rv', 'var']) –

  • channel (str | None) –

  • channel_type (str | None) –

  • whitelist_indicators (List[str] | str | None) –

class dance.transforms.FilterGenesCommon(batch_key=None, split_keys=None, **kwargs)[source]

Filter genes by taking the common genes across batches or splits.

Parameters:
  • batch_key (Optional[str]) – Which column in the .obs table to use to distinguish batches.

  • split_keys (Optional[List[str]]) – A list of split names, e.g., ‘train’, to be used to find common genes.

Note

One and only one of batch_key or split_keys can be specified.

class dance.transforms.FilterGenesMarker(*, ct_profile_channel='CellTopicProfile', subset=True, label=None, threshold=1.25, eps=1e-06, **kwargs)[source]

Select marker genes based on log fold-change.

Parameters:
  • ct_profile_channel (str) – Name of the .varm channel that contains the cell-topic profile which will be used to compute the log fold-changes for each cell-topic (e.g., cell type).

  • subset (bool) – If set to True, then inplace subset the variables to only contain the markers.

  • label (Optional[str]) – If set, e.g., to 'marker', then save the marker indicator to the obs column named as marker.

  • threshold (float) – Threshold value of the log fold-change above which the gene will be considered as a marker.

  • eps (float) – A small value that prevents taking log of zeros.

class dance.transforms.FilterGenesMarkerGini(*, ct_profile_channel='CellGiottoTopicProfile', ct_profile_detection_channel='CellGiottoDetectionTopicProfile', subset=True, label=None, **kwargs)[source]

Select marker genes based on Gini coefficient.

Identify marker genes for all clusters in a one-vs-all manner based on Gini coefficients, a measure of inequality.

Parameters:
  • ct_profile_channel (str) – Name of the .varm channel that contains the cell-topic profile which will be used to compute the log fold-changes for each cell-topic (e.g., cell type).

  • ct_profile_detection_channel (str) – Name of the .varm channel that contains the cell-topic profile of counts that are greater than some value, which will be used to compute the log fold-changes for each cell-topic (e.g., cell type).

  • subset (bool) – If set to True, then inplace subset the variables to only contain the markers.

  • label (Optional[str]) – If set, e.g., to 'marker', then save the marker indicator to the obs column named as marker.

References

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1010-4?ref=https://githubhelp.com
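For orientation, the Gini coefficient itself can be computed as in the sketch below, using the standard sorted-values formula. How dance combines the expression and detection profiles into a marker ranking is not shown here; the `profile` data is made up for illustration.

```python
def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly even)."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum with 1-based ranks over the sorted values.
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2.0 * weighted / (n * total) - (n + 1) / n


# Hypothetical per-cell-type mean expression for two genes.
profile = {
    "housekeeping": [5.0, 5.0, 5.0, 5.0],  # evenly expressed across cell types
    "marker_like": [0.0, 0.0, 0.0, 8.0],   # concentrated in one cell type
}
scores = {gene: gini(values) for gene, values in profile.items()}
print(scores)
```

A gene expressed in only one cell type gets a high coefficient, which is why the measure is useful for one-vs-all marker selection.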

class dance.transforms.FilterGenesMatch(prefixes=None, suffixes=None, case_sensitive=False, **kwargs)[source]

Filter genes based on prefixes and suffixes.

Parameters:
  • prefixes (Optional[List[str]]) – List of prefixes to remove.

  • suffixes (Optional[List[str]]) – List of suffixes to remove.

  • case_sensitive (bool) –
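Prefix and suffix based removal can be sketched as follows. The function `filter_genes_match` is a hypothetical stand-alone helper, and handling case-insensitivity by upper-casing both sides is an assumption.

```python
def filter_genes_match(genes, prefixes=None, suffixes=None, case_sensitive=False):
    """Return the gene names that match none of the prefixes or suffixes."""
    prefixes = tuple(prefixes or ())
    suffixes = tuple(suffixes or ())
    if not case_sensitive:
        prefixes = tuple(p.upper() for p in prefixes)
        suffixes = tuple(s.upper() for s in suffixes)
    kept = []
    for gene in genes:
        name = gene if case_sensitive else gene.upper()
        # startswith/endswith accept a tuple; an empty tuple never matches.
        if name.startswith(prefixes) or name.endswith(suffixes):
            continue
        kept.append(gene)
    return kept


genes = ["MT-CO1", "mt-Nd1", "ACTB", "RPL3"]
print(filter_genes_match(genes, prefixes=["MT-"]))
print(filter_genes_match(genes, prefixes=["MT-"], case_sensitive=True))
```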

class dance.transforms.FilterGenesNumberPlaceHolder(channel=None, channel_type=None, **kwargs)[source]
class dance.transforms.FilterGenesPercentile(min_val=1, max_val=99, *, mode='sum', channel=None, channel_type=None, whitelist_indicators=None, add_n_counts=True, add_n_cells=True, inplace=True, **kwargs)[source]

Filter genes based on percentiles of the summarized gene expressions.

Parameters:
  • min_val (Optional[float]) – Minimum percentile of the summarized expression value below which the genes will be discarded.

  • max_val (Optional[float]) – Maximum percentile of the summarized expression value above which the genes will be discarded.

  • mode (Literal['sum', 'cv', 'rv', 'var']) – Summarization mode. Available options are [sum|var|cv|rv]. sum calculates the sum of expression values, var calculates the variance of the expression values, cv uses the coefficient of variation (std / mean), and rv uses the relative variance (var / mean).

  • channel (Optional[str]) – Which channel, more specifically, layers, to use. Use the default .X if not set. If channel is specified, then need to specify channel_type to be layers as well.

  • channel_type (Optional[str]) – Type of channels specified. Only allow None (the default setting) or layers (when channel is specified).

  • whitelist_indicators (Union[List[str], str, None]) – A list of (or a single) var columns that indicates the genes to be excluded from the filtering process. Note that these genes will still be used in the summary stats computation, and thus will still contribute to the threshold percentile. If not set, then no genes will be excluded from the filtering process.

  • add_n_counts – Whether to add n_counts, the total counts for each gene.

  • add_n_cells – Whether to add n_cells, the number of cells expressed for each gene.

  • inplace – If inplace is True, the original data is replaced with the filtered data. If inplace is False, the filtered data is stored in obsm.
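The four summarization modes and the percentile cut can be sketched in plain Python. Several details are assumptions rather than documented behavior: population variance, linearly interpolated percentiles, inclusive bounds, and returning 0 when the mean is 0. Whitelisting is omitted.

```python
import statistics


def summarize(values, mode="sum"):
    """Summarize one gene's expression values across cells."""
    if mode == "sum":
        return sum(values)
    var = statistics.pvariance(values)  # population variance (assumption)
    if mode == "var":
        return var
    mean = statistics.fmean(values)
    if mean == 0:
        return 0.0  # guard against division by zero (assumption)
    if mode == "cv":
        return var ** 0.5 / mean  # std / mean
    if mode == "rv":
        return var / mean  # var / mean
    raise ValueError(f"Unknown mode: {mode!r}")


def percentile(sorted_values, q):
    """Linearly interpolated percentile of an already sorted list."""
    pos = (len(sorted_values) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (pos - lo)


def filter_genes_percentile(expr, min_val=1, max_val=99, mode="sum"):
    """Keep genes whose summary statistic lies within the percentile range."""
    stats = {gene: summarize(values, mode) for gene, values in expr.items()}
    ordered = sorted(stats.values())
    low, high = percentile(ordered, min_val), percentile(ordered, max_val)
    return [gene for gene, stat in stats.items() if low <= stat <= high]


# Hypothetical expression values (two cells per gene).
expr = {
    "g0": [0, 0],
    "g1": [4, 6],
    "g2": [10, 10],
    "g3": [5, 25],
    "g4": [400, 600],
}
kept = filter_genes_percentile(expr, min_val=10, max_val=90)
print(kept)
```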

class dance.transforms.FilterGenesPlaceHolder(split_name=None, channel=None, channel_type='X', add_n_counts=True, add_n_cells=True, inplace=True, **kwargs)[source]

Used as a placeholder to skip the process.

Parameters:
  • split_name (str | None) –

  • channel (str | None) –

  • channel_type (str | None) –

class dance.transforms.FilterGenesRegression(method='enclasc', num_genes=1000, *, channel=None, channel_type=None, mod=None, skip_count_check=False, inplace=True, **kwargs)[source]

Select genes based on regression.

Parameters:
  • method (str) – What regression based gene selection method to use. Supported options are: "enclasc", "seurat3", and "scmap".

  • num_genes (int) – Number of genes to select.

  • inplace – If inplace is True, the original data is replaced with the filtered data. If inplace is False, the filtered data is stored in obsm.

  • channel (str | None) –

  • channel_type (str | None) –

  • mod (str | None) –

  • skip_count_check (bool) –

Note

The implementation is adapted from the EnClaSC GitHub repo: https://github.com/xy-chen16/EnClaSC

References

https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03679-z

class dance.transforms.FilterGenesScanpy(min_counts=None, min_cells=None, max_counts=None, max_cells=None, split_name=None, channel=None, channel_type='X', key_n_counts=None, key_n_cells=None, inplace=True, **kwargs)[source]

Scanpy filtering gene transformation with additional options.

Parameters:
  • min_counts (Union[float, int, None]) – Minimum number of counts required for a gene to be kept.

  • min_cells (Union[float, int, None]) – Minimum number (or ratio) of cells required for a gene to be kept.

  • max_counts (Union[float, int, None]) – Maximum number of counts required for a gene to be kept.

  • max_cells (Union[float, int, None]) – Maximum number (or ratio) of cells required for a gene to be kept.

  • split_name (Optional[str]) – Which split to be used for filtering.

  • channel (Optional[str]) – Channel to be used for filtering.

  • channel_type (Optional[str]) – Channel type to be used for filtering.

  • key_n_counts (Optional[str]) – The location to add n_counts (the total counts for each gene). If it is None, it will not be added.

  • key_n_cells (Optional[str]) – The location to add n_cells (the number of cells expressed for each gene). If it is None, it will not be added.

  • inplace – If inplace is True, the original data is replaced with the filtered data. If inplace is False, the filtered data is stored in obsm.

class dance.transforms.FilterGenesScanpyOrder(order=None, min_counts=None, min_cells=None, max_counts=None, max_cells=None, split_name=None, channel=None, channel_type='X', add_n_counts=True, add_n_cells=True, inplace=True, params_dict=None, **kwargs)[source]

Scanpy filtering gene transformation with additional options.

Parameters:
  • order (Optional[List[str]]) – Order of (min_counts, min_cells, max_counts, max_cells). For example, ["min_counts", "min_cells", "max_counts", "max_cells"] or ["max_counts", "min_cells"]. If not set, will be set by default to ["min_counts", "min_cells", "max_counts", "max_cells"].

  • min_counts (Union[float, int, None]) – Minimum number of counts required for a gene to be kept.

  • min_cells (Union[float, int, None]) – Minimum number (or ratio) of cells required for a gene to be kept.

  • max_counts (Union[float, int, None]) – Maximum number of counts required for a gene to be kept.

  • max_cells (Union[float, int, None]) – Maximum number (or ratio) of cells required for a gene to be kept.

  • split_name (Optional[str]) – Which split to be used for filtering.

  • channel (Optional[str]) – Channel to be used for filtering.

  • channel_type (Optional[str]) – Channel type to be used for filtering.

  • add_n_counts – Whether to add n_counts, the total counts for each gene.

  • add_n_cells – Whether to add n_cells, the number of cells expressed for each gene.

  • inplace – If inplace is True, the original data is replaced with the filtered data. If inplace is False, the filtered data is stored in obsm.

class dance.transforms.FilterGenesTopK(num_genes=1000, top=True, *, mode='cv', channel=None, channel_type='X', whitelist_indicators=None, add_n_counts=False, add_n_cells=False, inplace=True, **kwargs)[source]

Select top/bottom genes based on the summarized gene expressions.

Parameters:
  • num_genes (int) – Number of genes to be selected.

  • top (bool) – If set to True, then use the genes with highest values of the specified gene summary stats.

  • mode (Literal['sum', 'cv', 'rv', 'var']) – Summarization mode. Available options are [sum|var|cv|rv]. sum calculates the sum of expression values, var calculates the variance of the expression values, cv uses the coefficient of variation (std / mean), and rv uses the relative variance (var / mean).

  • channel (Optional[str]) – Which channel, more specifically, layers, to use. Use the default .X if not set. If channel is specified, then need to specify channel_type to be layers as well.

  • channel_type (Optional[str]) – Type of channels specified. Only allow None (the default setting) or layers (when channel is specified).

  • whitelist_indicators (Union[List[str], str, None]) – A list of (or a single) var columns that indicates the genes to be excluded from the filtering process. Note that these genes will still be used in the summary stats computation, and thus will still contribute to the threshold percentile. If not set, then no genes will be excluded from the filtering process.

  • add_n_counts – Whether to add n_counts, the total counts for each gene.

  • add_n_cells – Whether to add n_cells, the number of cells expressed for each gene.

  • inplace – If inplace is True, the original data is replaced with the filtered data. If inplace is False, the filtered data is stored in obsm.

class dance.transforms.FilterScanpy(min_counts=None, min_genes_or_cells=None, max_counts=None, max_genes_or_cells=None, split_name=None, channel=None, channel_type='X', key_n_counts=None, key_n_genes_or_cells=None, inplace=True, **kwargs)[source]

Scanpy filtering transformation with additional options.

Parameters:
  • min_counts (float | int | None) –

  • min_genes_or_cells (float | int | None) –

  • max_counts (float | int | None) –

  • max_genes_or_cells (float | int | None) –

  • split_name (str | None) –

  • channel (str | None) –

  • channel_type (str | None) –

  • key_n_counts (str | None) –

  • key_n_genes_or_cells (str | None) –

class dance.transforms.GaussRandProjFeature(n_components=400, eps=0.1, **kwargs)[source]

Custom preprocessing to extract cell feature via Gaussian random projection.

Parameters:
  • n_components (int) –

  • eps (float) –

class dance.transforms.GeneHoldout(n_top=5, batch_size=512, random_state=None, **kwargs)[source]

Progressively hold out genes for DeepImpute.

Split genes into target batches. For every target gene in one batch, refer to the genes that are not in this batch and select predictor genes with high covariance with target gene.

Parameters:
  • n_top (int) – Number of predictor genes per target gene.

  • batch_size (int) – Target batch size.

  • random_state (Optional[int]) – Random state.
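The hold-out scheme described above can be sketched as follows, assuming a dictionary that maps gene names to per-cell expression lists. Ranking by signed sample covariance and shuffling genes before batching are assumptions; `gene_holdout` is a hypothetical helper.

```python
import random


def covariance(x, y):
    """Sample covariance of two equally long sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def gene_holdout(expr, n_top=5, batch_size=512, random_state=None):
    """Split genes into target batches and pick predictors outside each batch."""
    genes = list(expr)
    random.Random(random_state).shuffle(genes)
    batches = [genes[i:i + batch_size] for i in range(0, len(genes), batch_size)]
    predictors = {}
    for batch in batches:
        # Predictor candidates are all genes that are not targets in this batch.
        candidates = [g for g in genes if g not in batch]
        for target in batch:
            ranked = sorted(
                candidates,
                key=lambda g: covariance(expr[g], expr[target]),
                reverse=True,
            )
            predictors[target] = ranked[:n_top]
    return batches, predictors


# Hypothetical expression values for four genes across five cells.
expr = {
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],
    "c": [5, 4, 3, 2, 1],
    "d": [1, 1, 2, 1, 1],
}
batches, predictors = gene_holdout(expr, n_top=1, batch_size=2, random_state=0)
print(batches)
print(predictors)
```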

class dance.transforms.GeneStats(genestats_select='all', *, fill_na=None, threshold=0, pseudo=False, split_name='train', channel=None, channel_type=None, **kwargs)[source]

Gene statistics computation.

Parameters:
  • genestats_select (Union[str, List[str]]) – List of names of the gene stats functions to use. If set to "all" (by default), then use all available gene stats functions.

  • fill_na (Optional[float]) – If not set (default), then do not fill nans. Otherwise, fill nans with the specified value.

  • threshold (float) – Threshold value for filtering gene expression when computing stats, e.g., mean expression values.

  • pseudo (bool) – If set to True, then add 1 to the numerator and denominator when computing the ratio (alpha) for which the gene expression values are above the specified threshold.

  • split_name (Optional[str]) – Which split to compute the gene stats on.

  • channel (str | None) –

  • channel_type (str | None) –
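The effect of the pseudo option on the above-threshold ratio can be sketched as below. The function name `alpha` and the strict `>` comparison are assumptions made for illustration.

```python
def alpha(values, threshold=0, pseudo=False):
    """Fraction of values above the threshold, optionally with a pseudo-count."""
    above = sum(1 for v in values if v > threshold)
    total = len(values)
    if pseudo:
        # Add 1 to both numerator and denominator.
        above, total = above + 1, total + 1
    return above / total


print(alpha([0, 0, 1, 2]))               # 2 of 4 values are above 0
print(alpha([0, 0, 1, 2], pseudo=True))  # (2 + 1) / (4 + 1)
print(alpha([0, 0, 0], pseudo=True))     # never exactly zero with the pseudo-count
```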

class dance.transforms.HighlyVariableGenesLogarithmizedByMeanAndDisp(channel=None, channel_type=None, min_disp=0.5, max_disp=inf, min_mean=0.0125, max_mean=3, n_bins=20, flavor='seurat', subset=True, inplace=True, batch_key=None, **kwargs)[source]

Filter for highly variable genes based on mean and dispersion.

Parameters:
  • layer – If provided, then use data.data.layers[layer] for expression values instead of the default data.data.X.

  • min_mean (Optional[float]) – min_mean

  • max_mean (Optional[float]) – max_mean

  • min_disp (Optional[float]) – min_disp

  • max_disp (Optional[float]) – max_disp

  • n_bins (int) – Number of bins for binning the mean gene expression. Normalization is done with respect to each bin. If just a single gene falls into a bin, the normalized dispersion is artificially set to 1. You’ll be informed about this if you set settings.verbosity = 4.

  • flavor (Literal['seurat', 'cell_ranger']) – Choose the flavor for identifying highly variable genes. For the dispersion based methods in their default workflows, Seurat passes the cutoffs whereas Cell Ranger passes n_top_genes.

  • subset (bool) – Inplace subset to highly-variable genes if True otherwise merely indicate highly variable genes.

  • inplace (bool) – Whether to place calculated metrics in .var or return them.

  • batch_key (Optional[str]) – If specified, highly-variable genes are selected within each batch separately and merged. This simple process avoids the selection of batch-specific genes and acts as a lightweight batch correction method. For all flavors, genes are first sorted by how many batches they are a HVG. For dispersion-based flavors ties are broken by normalized dispersion. If flavor = “seurat_v3”, ties are broken by the median (across batches) rank based on within-batch normalized variance.

  • channel (str | None) –

  • channel_type (str | None) –

See also

https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.highly_variable_genes.html

class dance.transforms.HighlyVariableGenesLogarithmizedByTopGenes(channel=None, channel_type=None, n_top_genes=1000, n_bins=20, flavor='seurat', subset=True, inplace=True, batch_key=None, **kwargs)[source]

Filter for highly variable genes based on top genes.

Parameters:
  • layer – If provided, then use data.data.layers[layer] for expression values instead of the default data.data.X.

  • n_top_genes (Optional[int]) – Number of highly-variable genes to keep.

  • n_bins (int) – Number of bins for binning the mean gene expression. Normalization is done with respect to each bin. If just a single gene falls into a bin, the normalized dispersion is artificially set to 1. You’ll be informed about this if you set settings.verbosity = 4.

  • flavor (Literal['seurat', 'cell_ranger']) – Choose the flavor for identifying highly variable genes. For the dispersion based methods in their default workflows, Seurat passes the cutoffs whereas Cell Ranger passes n_top_genes.

  • subset (bool) – Inplace subset to highly-variable genes if True otherwise merely indicate highly variable genes.

  • inplace (bool) – Whether to place calculated metrics in .var or return them.

  • batch_key (Optional[str]) – If specified, highly-variable genes are selected within each batch separately and merged. This simple process avoids the selection of batch-specific genes and acts as a lightweight batch correction method. For all flavors, genes are first sorted by how many batches they are a HVG. For dispersion-based flavors ties are broken by normalized dispersion. If flavor = “seurat_v3”, ties are broken by the median (across batches) rank based on within-batch normalized variance.

  • channel (str | None) –

  • channel_type (str | None) –

See also

https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.highly_variable_genes.html

class dance.transforms.HighlyVariableGenesRawCount(channel=None, channel_type=None, n_top_genes=1000, span=0.3, subset=True, inplace=True, batch_key=None, check_values=True, **kwargs)[source]

Filter for highly variable genes using raw count matrix.

Parameters:
  • layer – If provided, then use data.data.layers[layer] for expression values instead of the default data.data.X.

  • n_top_genes (Optional[int]) – Number of highly-variable genes to keep.

  • span (Optional[float]) – The fraction of the data (cells) used when estimating the variance in the loess model fit if flavor=”seurat_v3”.

  • subset (bool) – Inplace subset to highly-variable genes if True otherwise merely indicate highly variable genes.

  • inplace (bool) – Whether to place calculated metrics in .var or return them.

  • batch_key (Optional[str]) – If specified, highly-variable genes are selected within each batch separately and merged. This simple process avoids the selection of batch-specific genes and acts as a lightweight batch correction method. For all flavors, genes are first sorted by how many batches they are a HVG. For dispersion-based flavors ties are broken by normalized dispersion. If flavor = “seurat_v3”, ties are broken by the median (across batches) rank based on within-batch normalized variance.

  • check_values (bool) – Check if counts in selected layer are integers. A Warning is returned if set to True. Only used if flavor=”seurat_v3”.

  • channel (str | None) –

  • channel_type (str | None) –

See also

https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.highly_variable_genes.html

class dance.transforms.Log1P(base=None, copy=False, chunked=None, chunk_size=None, layer=None, obsm=None, **kwargs)[source]

Logarithmize the data matrix.

Computes \(X = \log(X + 1)\), where \(\log\) denotes the natural logarithm unless a different base is given.

Parameters:
  • base (Optional[Number]) – Base of the logarithm. Natural logarithm is used by default.

  • copy (bool) – If an AnnData is passed, determines whether a copy is returned.

  • chunked (Optional[bool]) – Process the data matrix in chunks, which will save memory. Applies only to AnnData.

  • chunk_size (Optional[int]) – n_obs of the chunks to process the data in.

  • layer (Optional[str]) – Entry of layers to transform.

  • obsm (Optional[str]) – Entry of obsm to transform.
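For a single value, the Log1P computation with an optional base can be sketched with the standard library, using the change-of-base identity log_b(x + 1) = ln(x + 1) / ln(b). The helper name `log1p_with_base` is hypothetical.

```python
import math


def log1p_with_base(x, base=None):
    """Compute log(x + 1); natural logarithm unless a base is given."""
    value = math.log1p(x)
    # Change of base: divide the natural logarithm by ln(base).
    return value if base is None else value / math.log(base)


print(log1p_with_base(0))           # zero counts stay at zero
print(log1p_with_base(9, base=10))  # log10(10)
print(log1p_with_base(3, base=2))   # log2(4)
```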

See also

https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.log1p.html

class dance.transforms.MaskData(mask_rate=0.1, seed=None, **kwargs)[source]

Randomly mask data.

Randomly mask positive counts according to masking rate.

Parameters:
  • mask_rate (Optional[float]) – Masking rate.

  • seed (Optional[int]) – Random seed.
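As a rough illustration of masking positive counts at a given rate, the sketch below selects a fraction of the positive entries with a seeded random generator. Setting the selected entries to zero and returning the chosen positions are assumptions made for this example, and `mask_positive_counts` is a hypothetical helper, not part of dance.

```python
import random

def mask_positive_counts(rows, mask_rate=0.1, seed=None):
    """Return a masked copy of `rows` and the set of masked (row, col) positions."""
    rng = random.Random(seed)
    # Only strictly positive entries are candidates for masking.
    positive = [(i, j) for i, row in enumerate(rows)
                for j, x in enumerate(row) if x > 0]
    n_mask = int(len(positive) * mask_rate)
    chosen = set(rng.sample(positive, n_mask))
    masked = [[0 if (i, j) in chosen else x for j, x in enumerate(row)]
              for i, row in enumerate(rows)]
    return masked, chosen

counts = [[0, 3, 1, 0, 2], [5, 0, 0, 4, 1], [2, 2, 0, 0, 7], [1, 0, 6, 3, 0]]
masked, chosen = mask_positive_counts(counts, mask_rate=0.25, seed=0)
```

Passing the same seed reproduces the same mask, which matters when the masked entries are later used as held-out targets.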

class dance.transforms.MorphologyFeatureCNN(*, model_name='resnet50', n_components=50, random_state=0, crop_size=20, target_size=299, device='cpu', channels=('spatial_pixel', 'image'), channel_types=('obsm', 'uns'), **kwargs)[source]

Cell morphological features extracted from CNN.

Parameters:
  • model_name (str) – Pretrained CNN name: "resnet50", "inceptron_v3", "xception", "vgg16".

  • n_components (int) – Number of feature dimensions.

  • crop_size (int) – Cell image cropping size (cropped as square centered around the target cell).

  • target_size (int) – Target patch size.

  • random_state (int) –

  • device (str) –

  • channels (Sequence[str]) –

  • channel_types (Sequence[str]) –

References

https://doi.org/10.1101/2020.05.31.125658

class dance.transforms.NormalizePlaceHolder(**kwargs)[source]

Used as a placeholder to skip the process.

class dance.transforms.NormalizeTotal(target_sum=None, max_fraction=0.05, key_added=None, layer=None, layers=None, layer_norm=None, inplace=True, copy=False, **kwargs)[source]

Normalize counts per cell.

Normalize each cell by total counts over all genes, so that every cell has the same total count after normalization. If choosing target_sum=1e6, this is CPM normalization.

If max_fraction is less than 1.0, very highly expressed genes are excluded from the computation of the normalization factor (size factor) for each cell. This is meaningful as these can strongly influence the resulting normalized values for all other genes.

Params

target_sum

If None, after normalization, each observation (cell) has a total count equal to the median of total counts for observations (cells) before normalization.

max_fraction

Genes that account for more than max_fraction of the total counts in at least one cell are considered highly expressed and are excluded when computing the normalization factor (size factor) for each cell. The remaining genes sum to target_sum. Setting max_fraction to 1.0 is equivalent to setting exclude_highly_expressed=False.

key_added

Name of the field in adata.obs where the normalization factor is stored.

layer

Layer to normalize instead of X. If None, X is normalized.

inplace

Whether to update adata or return dictionary with normalized copies of adata.X and adata.layers.

copy

Whether to modify copied input object. Not compatible with inplace=False.

See also

https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.normalize_total.html

Parameters:
  • target_sum (float | None) –

  • max_fraction (float) –

  • key_added (str | None) –

  • layer (str | None) –

  • layers (Literal['all'] | ~typing.Iterable[str]) –

  • layer_norm (str | None) –

  • inplace (bool) –

  • copy (bool) –
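The normalization described for NormalizeTotal can be sketched in plain Python as follows. This is an illustrative approximation, not the scanpy or dance code: the helper name `normalize_total_sketch` is hypothetical, and leaving all-zero cells unchanged is an assumption made here.

```python
from statistics import median

def normalize_total_sketch(rows, target_sum=None, max_fraction=1.0):
    """Scale each cell (row) so that its non-excluded genes sum to target_sum."""
    totals = [sum(row) for row in rows]
    excluded = set()
    if max_fraction < 1.0:
        # A gene is excluded if it exceeds max_fraction of the total in any cell.
        for row, total in zip(rows, totals):
            for j, x in enumerate(row):
                if x > max_fraction * total:
                    excluded.add(j)
    # Size factor per cell: counts summed over the non-excluded genes only.
    size = [sum(x for j, x in enumerate(row) if j not in excluded) for row in rows]
    if target_sum is None:
        target_sum = median([s for s in size if s > 0])
    out = [[x * target_sum / s if s > 0 else x for x in row]
           for row, s in zip(rows, size)]
    return out, excluded

counts = [[1, 1, 2], [2, 2, 4], [90, 5, 5]]
cpm, _ = normalize_total_sketch(counts, target_sum=1e6)
norm, excluded = normalize_total_sketch(counts, max_fraction=0.5)
```

In the second call, the first gene holds 90 of the 100 counts in the third cell, so it is left out of the size factors. The remaining two genes then sum to the same value in every cell.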

class dance.transforms.NormalizeTotalLog1P(base=None, target_sum=None, max_fraction=0.05, **kwargs)[source]

Normalize total counts followed by log1p transformation.

See dance.transforms.normalize.NormalizeTotal and dance.transforms.normalize.Log1P.

class dance.transforms.PseudoMixture(*, n_pseudo=1000, nc_min=2, nc_max=10, ct_select='auto', ct_key='cellType', channel=None, channel_type='X', random_state=0, prefix='ps_mix_', in_split_name='ref', out_split_name='pseudo', label_batch=False, **kwargs)[source]

Pseudo mixture generation.

Parameters:
  • n_pseudo (int) –

  • nc_min (int) –

  • nc_max (int) –

  • ct_select (Literal['auto'] | ~typing.List[str]) –

  • ct_key (str) –

  • channel (str | None) –

  • channel_type (str | None) –

  • random_state (int | None) –

  • prefix (str) –

  • in_split_name (str) –

  • out_split_name (str | None) –

  • label_batch (bool) –

class dance.transforms.RemoveSplit(*, split_name, **kwargs)[source]

Remove a particular split from the data.

Parameters:

split_name (str) –

class dance.transforms.SCNFeature(num_top_genes=10, alpha1=0.05, alpha2=0.001, mu=2, num_top_gene_pairs=25, max_gene_per_ct=3, *, split_name='train', channel=None, channel_type=None, **kwargs)[source]

Differential gene-pair feature used in SingleCellNet.

Parameters:
  • num_top_genes (int) –

  • alpha1 (float) –

  • alpha2 (float) –

  • mu (float) –

  • num_top_gene_pairs (int) –

  • max_gene_per_ct (int) –

  • split_name (str | None) –

  • channel (str | None) –

  • channel_type (str | None) –

class dance.transforms.SMEFeature(n_neighbors=3, n_components=50, random_state=0, *, channels=(None, 'SMEGraph'), channel_types=(None, 'obsp'), **kwargs)[source]

Spatial Morphological gene Expression normalization feature from stLearn.

Parameters:
  • n_neighbors (int) – Number of spatial spots neighbors to consider.

  • n_components (int) – Number of feature dimensions.

  • random_state (int) –

  • channels (Sequence[str | None]) –

  • channel_types (Sequence[str | None]) –

References

https://doi.org/10.1101/2020.05.31.125658

class dance.transforms.SaveRaw(exist_ok=False, **kwargs)[source]

Save raw data.

See anndata.AnnData.raw() for more information.

Parameters:

exist_ok (bool) – If set to False, then raise an exception if the raw attribute is already set.

class dance.transforms.ScTransform(split_names=None, batch_key=None, min_cells=5, gmean_eps=1, n_genes=2000, n_cells=None, bin_size=500, bw_adjust=3, processes_num=2, **kwargs)[source]

ScTransform normalization and variance stabilization.

Note

This is a Python implementation adapted from https://github.com/atarashansky/SCTransformPy

Parameters:
  • split_names (Union[Literal['ALL'], List[str], None]) – Which split(s) to apply the transformation.

  • batch_key (Optional[str]) – Key for batch information.

  • min_cells (int) – Minimum number of cells in which a gene must be expressed; genes below this threshold are discarded.

  • gmean_eps (int) – Pseudocount.

  • n_genes (Optional[int]) – Maximum number of genes to use. Use all if set to None.

  • n_cells (Optional[int]) – Maximum number of cells to use. Use all if set to None.

  • bin_size (int) – Number of genes a single bin contains.

  • bw_adjust (float) – Bandwidth adjusting parameter.

  • processes_num (int) – Number of processes to use. The signature default is 2.

References

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1874-1

class dance.transforms.ScTransformR(min_cells=5, mirror_index=-1, **kwargs)[source]

ScTransform normalization and variance stabilization.

Note

This is a wrapper for the original R implementation.

Parameters:

min_cells (int) – Minimum number of cells in which a gene must be expressed; genes below this threshold are discarded.

References

https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1874-1

class dance.transforms.SetConfig(config_dict, dummy_params=10, **kwargs)[source]

Set configuration options of a dance data object.

Parameters:
  • config_dict (Dict[str, Any]) – Dance data object configuration dictionary. See set_config_from_dict().

  • dummy_params – When the search space is empty, use this parameter to verify through wandb.

class dance.transforms.SpatialIDEFeature(channels=(None, 'spatial'), channel_types=(None, 'obsm'), **kwargs)[source]

Spatial IDE feature.

The SpatialDE model is based on the assumption of normally distributed residual noise and independent observations across cells. There are two normalization steps:

  1. Variance-stabilizing transformation for negative-binomial-distributed data (Anscombe’s transformation).

  2. Regress log total count values out from the Anscombe-transformed expression values.

References

https://www.nature.com/articles/nmeth.4636#Sec2

Parameters:
  • channels (Sequence[str | None]) –

  • channel_types (Sequence[str | None]) –

regress_out(sample_info, expression_matrix, covariate_formula, design_formula='1', rcond=-1)[source]

Implementation of limma’s removeBatchEffect function.

stabilize(expression_matrix)[source]

Use Anscombe’s approximation to variance-stabilize negative binomial data.

See https://f1000research.com/posters/4-1041 for motivation.

Assumes columns are samples and rows are genes.
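One common form of this stabilization can be sketched as below. It assumes the negative binomial mean-variance relation var = mu + phi * mu^2 and the approximate transform log(x + 1 / (2 * phi)). The closed-form least-squares estimate of phi and the helper name `stabilize_sketch` are choices made for this example and may differ from what the library does.

```python
import math

def stabilize_sketch(expression):
    """expression: list of gene rows, each a list of per-sample counts."""
    means, variances = [], []
    for gene in expression:
        n = len(gene)
        mu = sum(gene) / n
        means.append(mu)
        variances.append(sum((x - mu) ** 2 for x in gene) / n)
    # One-parameter least-squares fit of (var - mu) = phi * mu^2.
    num = sum((v - m) * m ** 2 for m, v in zip(means, variances))
    den = sum(m ** 4 for m in means)
    phi = num / den
    if phi <= 0:
        raise ValueError("data are not overdispersed; phi must be positive")
    shift = 1.0 / (2.0 * phi)
    return [[math.log(x + shift) for x in gene] for gene in expression], phi

# Rows are genes, columns are samples, matching the assumption stated above.
expression = [[0, 4, 8, 0], [1, 1, 3, 15], [10, 30, 2, 6]]
stabilized, phi = stabilize_sketch(expression)
```

The shift 1 / (2 * phi) keeps the logarithm finite at zero counts, provided the fitted overdispersion phi is positive.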

class dance.transforms.UpdateRaw(**kwargs)[source]

Update raw data.

Some pipelines select genes again after normalization. In that case, the original raw data must be updated to match the dimensions of the filtered data. See anndata.AnnData.raw() for more information.

class dance.transforms.UpdateSizeFactors(**kwargs)[source]

Update size factors.

class dance.transforms.WeightedFeaturePCA(n_components=400, split_name=None, feat_norm_mode=None, feat_norm_axis=0, save_info=False, **kwargs)[source]

Compute the weighted gene PCA as cell features.

Given a gene expression matrix of dimension (cell x gene), the gene PCA is first computed. Then, the representation of each cell is computed by taking the weighted sum of the gene PCAs based on that cell’s gene expression values.

Parameters:
  • n_components (Union[float, int]) – Number of PCs to use.

  • split_name (Optional[str]) – Which split to use to compute the gene PCA. If not set, use all data.

  • feat_norm_mode (Optional[str]) – Feature normalization mode, see dance.utils.matrix.normalize(). If set to None, then do not perform feature normalization before reduction.

  • feat_norm_axis (int) –
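The weighted-sum step described for WeightedFeaturePCA can be sketched as a matrix product between the expression matrix and a per-gene embedding. The sketch assumes the gene embedding (genes x components) has already been computed by PCA. The values below are made up for illustration, and `weighted_gene_features` is a hypothetical helper rather than the dance implementation.

```python
def weighted_gene_features(expression, gene_embedding):
    """expression: cells x genes; gene_embedding: genes x components."""
    n_comp = len(gene_embedding[0])
    # Each cell feature is the expression-weighted sum of the gene embeddings.
    return [
        [sum(x * gene_embedding[g][k] for g, x in enumerate(cell))
         for k in range(n_comp)]
        for cell in expression
    ]

expression = [[1, 0, 2], [0, 3, 1]]                      # 2 cells x 3 genes
gene_embedding = [[0.5, -1.0], [1.0, 0.0], [-0.5, 2.0]]  # 3 genes x 2 components
cell_features = weighted_gene_features(expression, gene_embedding)
```

Each output row has one entry per component, so the cell representation has the same dimensionality as the gene embedding.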

class dance.transforms.WeightedFeatureSVD(n_components=400, split_name=None, feat_norm_mode=None, feat_norm_axis=0, save_info=False, **kwargs)[source]

Compute the weighted gene SVD as cell features.

Given a gene expression matrix of dimension (cell x gene), the gene SVD is first computed. Then, the representation of each cell is computed by taking the weighted sum of the gene SVDs based on that cell’s gene expression values.

Parameters:
  • n_components (Union[float, int]) – Desired dimensionality of output data.

  • split_name (Optional[str]) – Which split to use to compute the gene SVD. If not set, use all data.

  • feat_norm_mode (Optional[str]) – Feature normalization mode, see dance.utils.matrix.normalize(). If set to None, then do not perform feature normalization before reduction.

  • feat_norm_axis (int) –

  • save_info (bool) –

class dance.transforms.tfidfTransform(**kwargs)[source]