dance.transforms
- class dance.transforms.base.BaseTransform(out=None, log_level='WARNING')[source]
BaseTransform abstract object.
- Parameters:
log_level (Literal['NOTSET','DEBUG','INFO','WARNING','ERROR']) – Logging level.
out (Optional[str]) – Name of the obsm channel or layer where the transformed features will be saved. Use the current transformation name if it is not set.
- class dance.transforms.AnnDataAdaptor(transform, **data_init_kwargs)[source]
Adaptor for transforming AnnData instead of dance data object.
Example
Modify an AnnData object in place:
>>> AnnDataAdaptor(FilterGenes(mode="sum"))(adata)
- class dance.transforms.AnnDataTransform(func, **kwargs)[source]
AnnData transformation interface object.
This object provides an interface to any function that applies an in-place transformation to an AnnData object.
Example
Any one of the scanpy.pp functions should be supported. For example, we can use the scanpy.pp.normalize_total() function on the dance data object as follows:
>>> AnnDataTransform(scanpy.pp.normalize_total, target_sum=10000)(data)
where data is a dance data object, e.g., dance.data.Data. Calling the above function is effectively equivalent to calling
>>> scanpy.pp.normalize_total(data.data, target_sum=10000)
- Parameters:
func (Callable | str) –
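The equivalence described above can be sketched with plain Python objects. This is only an illustration of the forwarding pattern: ToyData, ToyAnnDataTransform, and scale_in_place are hypothetical names, and a dict stands in for the AnnData object.

```python
from typing import Any, Callable


class ToyData:
    """Hypothetical stand-in for a dance data object; the wrapped object lives in `.data`."""

    def __init__(self, inner: Any) -> None:
        self.data = inner


class ToyAnnDataTransform:
    """Illustrative wrapper that stores a function and its keyword arguments."""

    def __init__(self, func: Callable[..., None], **kwargs: Any) -> None:
        self.func = func
        self.kwargs = kwargs

    def __call__(self, data: ToyData) -> None:
        # Forward the wrapped object, mirroring func(data.data, **kwargs).
        self.func(data.data, **self.kwargs)


def scale_in_place(inner: dict, factor: float = 1.0) -> None:
    """Hypothetical in-place function acting on a dict-based stand-in for AnnData."""
    inner["X"] = [value * factor for value in inner["X"]]


toy = ToyData({"X": [1.0, 2.0, 3.0]})
ToyAnnDataTransform(scale_in_place, factor=2.0)(toy)
print(toy.data["X"])
```

The wrapper returns nothing because the wrapped function is expected to modify the underlying object in place.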
- class dance.transforms.BaseTransform(out=None, log_level='WARNING')[source]
BaseTransform abstract object.
- Parameters:
log_level (Literal['NOTSET','DEBUG','INFO','WARNING','ERROR']) – Logging level.
out (Optional[str]) – Name of the obsm channel or layer where the transformed features will be saved. Use the current transformation name if it is not set.
- class dance.transforms.BatchFeature(*, channel=None, mod=None, **kwargs)[source]
Assign statistical batch features for each cell.
- Parameters:
channel (str | None) –
mod (str | None) –
- class dance.transforms.CellGiottoTopicProfile(*, ct_select='auto', ct_key='cellType', split_name=None, channel=None, channel_type='X', detection_threshold=-1, **kwargs)[source]
Giotto cell topic profile.
References
https://rubd.github.io/Giotto_site/reference/findGiniMarkers_one_vs_all.html
- Parameters:
ct_select (Literal['auto'] | ~typing.List[str]) –
ct_key (str) –
split_name (str | None) –
channel (str | None) –
channel_type (str) –
detection_threshold (float) –
- class dance.transforms.CellPCA(n_components=400, *, channel=None, mod=None, save_info=False, svd_solver='auto', **kwargs)[source]
Reduce cell feature matrix with PCA.
- Parameters:
n_components (Union[float,int]) – Number of PCA components to use.
channel (str | None) –
mod (str | None) –
save_info (bool) –
svd_solver (Literal['auto', 'full', 'arpack', 'randomized']) –
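As a rough illustration of this kind of reduction, the sketch below computes PCA scores with NumPy via an SVD of the centered matrix. It assumes an integer n_components; pca_reduce is a hypothetical helper, not the dance implementation, which may differ in solver choice and in how a float n_components is interpreted.

```python
import numpy as np


def pca_reduce(x, n_components=2):
    """Return the leading principal component scores of a cells-by-features matrix."""
    x = np.asarray(x, dtype=float)
    # Center each feature so the SVD captures variance rather than the mean.
    centered = x - x.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    # Scores are the left singular vectors scaled by their singular values.
    return u[:, :n_components] * s[:n_components]


rng = np.random.default_rng(0)
features = rng.normal(size=(20, 5))
scores = pca_reduce(features, n_components=3)
print(scores.shape)
```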
- class dance.transforms.CellSVD(n_components=400, *, channel=None, mod=None, algorithm='randomized', save_info=True, **kwargs)[source]
Reduce cell feature matrix with SVD.
- Parameters:
n_components (Union[float,int]) – Number of SVD components to take.
channel (str | None) –
mod (str | None) –
algorithm (Literal['arpack', 'randomized']) –
- class dance.transforms.CellSparsePCA(n_components=400, *, channel=None, mod=None, **kwargs)[source]
Reduce cell feature matrix with SparsePCA.
- Parameters:
n_components (Union[float,int]) – Number of SparsePCA components to use.
channel (str | None) –
mod (str | None) –
- class dance.transforms.CellTopicProfile(*, ct_select='auto', ct_key='cellType', batch_key=None, split_name=None, channel=None, channel_type='X', method='median', **kwargs)[source]
Cell topic profile.
- Parameters:
ct_select (Literal['auto'] | ~typing.List[str]) –
ct_key (str) –
batch_key (str | None) –
split_name (str | None) –
channel (str | None) –
channel_type (str) –
method (Literal['median', 'mean']) –
- class dance.transforms.CellwiseMaskData(distr='exp', mask_rate=0.1, seed=None, min_gene_counts=5, add_test_mask=False, **kwargs)[source]
Randomly mask data in a cell-wise approach.
For every cell that has more than min_gene_counts positive counts, mask positive counts according to mask_rate and probability generated from the specified distribution.
The masked entries are assigned to validation and optionally test masks.
- Parameters:
distr (Optional[Literal['exp','uniform']]) – Distribution to generate probabilities for masking counts. Higher counts might have different probabilities depending on the distribution.
mask_rate (Optional[float]) – Overall masking rate (proportion of positive counts to mask per cell).
seed (Optional[int]) – Random seed for reproducibility.
min_gene_counts (int) – Minimum number of positive counts within a cell below which we do not mask that cell.
add_test_mask (bool) – If True, the masked entries (determined by mask_rate) are further split into validation and test sets. Approximately 10% of the masked entries go to valid_mask, and the remaining 90% go to test_mask. If False, all masked entries go to valid_mask, and test_mask will be empty (all False).
**kwargs – Additional keyword arguments passed to the base class.
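The masking procedure described above can be sketched as follows. This is a simplified illustration that assumes uniform sampling of positive entries (the 'exp' weighting is not reproduced here); cellwise_mask is a hypothetical helper and not the dance implementation.

```python
import numpy as np


def cellwise_mask(counts, mask_rate=0.1, min_gene_counts=5, add_test_mask=False, seed=None):
    """Build boolean validation and test masks, one cell (row) at a time."""
    rng = np.random.default_rng(seed)
    counts = np.asarray(counts)
    valid_mask = np.zeros(counts.shape, dtype=bool)
    test_mask = np.zeros(counts.shape, dtype=bool)
    for i, row in enumerate(counts):
        positive = np.flatnonzero(row > 0)
        if positive.size <= min_gene_counts:
            continue  # Too few positive counts: leave this cell unmasked.
        n_mask = int(np.floor(mask_rate * positive.size))
        if n_mask == 0:
            continue
        # Assumption: uniform sampling without replacement among positive entries.
        chosen = rng.choice(positive, size=n_mask, replace=False)
        if add_test_mask:
            # Roughly 10% of the masked entries go to validation, the rest to test.
            n_valid = int(np.ceil(0.1 * n_mask))
            valid_mask[i, chosen[:n_valid]] = True
            test_mask[i, chosen[n_valid:]] = True
        else:
            valid_mask[i, chosen] = True
    return valid_mask, test_mask


counts = np.arange(1, 41).reshape(4, 10)
counts[0, 3:] = 0  # The first cell keeps only three positive counts and is skipped.
valid_mask, test_mask = cellwise_mask(counts, mask_rate=0.5, add_test_mask=True, seed=0)
print(valid_mask.sum(), test_mask.sum())
```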
- class dance.transforms.ColumnSumNormalize(*, axis=0, split_names=None, batch_key=None, mode='normalize', eps=-1, **kwargs)[source]
Scale the feature matrix in the AnnData object.
This is an extension of scanpy.pp.scale(), allowing split- or batch-wide scaling.
- Parameters:
axis (int) – Axis along which the scaling is performed.
split_names (Union[Literal['ALL'],List[str],None]) – Indicates which splits to scale independently. If set to 'ALL', then go through all splits available in the data.
batch_key (Optional[str]) – Indicates which column in .obs to use as the batch index to guide the batch-wide scaling.
mode (Literal['normalize','standardize','minmax','l2']) – Scaling mode, see dance.utils.matrix.normalize() for more information.
eps (float) – Correction factor, see dance.utils.matrix.normalize() for more information.
Note
The order of checking split- or batch-wide scaling mode is: batch_key > split_names > None (i.e., all).
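A minimal sketch of group-wise scaling follows. It assumes that the 'normalize' mode divides each column by its sum within the group, which is a guess rather than something the text above confirms. The group labels would come from batch_key if set, otherwise from the splits, otherwise all rows form one group. column_sum_normalize and its eps handling are hypothetical.

```python
import numpy as np


def column_sum_normalize(x, group_labels=None, eps=1e-12):
    """Divide every column by its sum, independently within each group of rows."""
    x = np.asarray(x, dtype=float)
    if group_labels is None:
        # No batch or split information: treat all rows as a single group.
        group_labels = np.zeros(x.shape[0], dtype=int)
    group_labels = np.asarray(group_labels)
    out = np.empty_like(x)
    for label in np.unique(group_labels):
        rows = group_labels == label
        col_sums = x[rows].sum(axis=0, keepdims=True)
        # eps guards against division by zero for all-zero columns.
        out[rows] = x[rows] / (col_sums + eps)
    return out


matrix = np.array([[1.0, 2.0], [3.0, 2.0], [2.0, 1.0], [2.0, 3.0]])
batches = np.array(["a", "a", "b", "b"])
scaled = column_sum_normalize(matrix, batches)
print(scaled)
```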
- class dance.transforms.Compose(*transforms, use_master_log_level=True, **kwargs)[source]
Compose transformation by combining several transformation objects.
- Parameters:
transforms (Tuple[BaseTransform,...]) – Transformation objects.
use_master_log_level (bool) – If set to True, then reset all transforms' loggers to use the log_level option passed to this Compose object.
Notes
The order in which the transform objects are passed will be exactly the order in which they will be applied to the data object.
- hexdigest()[source]
Return MD5 hash using the representation of the transform object.
- Return type:
str
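The ordering rule and the representation-based hash can be illustrated with a toy composition. ToyCompose, AddOne, and Double are hypothetical classes; unlike dance transforms, which act on a data object, they return new lists to keep the sketch short.

```python
import hashlib


class AddOne:
    def __call__(self, values):
        return [v + 1 for v in values]

    def __repr__(self):
        return "AddOne()"


class Double:
    def __call__(self, values):
        return [v * 2 for v in values]

    def __repr__(self):
        return "Double()"


class ToyCompose:
    """Apply the given transforms one after another, in the order they were passed."""

    def __init__(self, *transforms):
        self.transforms = transforms

    def __call__(self, values):
        for transform in self.transforms:
            values = transform(values)
        return values

    def __repr__(self):
        inner = ", ".join(repr(t) for t in self.transforms)
        return f"ToyCompose({inner})"

    def hexdigest(self):
        # MD5 of the textual representation, so reordering changes the hash.
        return hashlib.md5(repr(self).encode()).hexdigest()


forward = ToyCompose(AddOne(), Double())
backward = ToyCompose(Double(), AddOne())
print(forward([1, 2]), backward([1, 2]))
```

Because the hash is derived from the representation, two compositions with the same steps in a different order yield different digests.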
- class dance.transforms.FeatureCellPlaceHolder(n_components=400, *, channel=None, mod=None, **kwargs)[source]
Used as a placeholder to skip the process.
- Parameters:
n_components (int) – Not used.
channel (str | None) –
mod (str | None) –
- class dance.transforms.FilterCellTransform(species='human', image_save_path=None, **kwargs)[source]
- Parameters:
species (Literal['human', 'mouse']) –
image_save_path (str) –
- class dance.transforms.FilterCellsCommonMod(mod1, mod2, sol=None, **kwargs)[source]
Initialize the FilterCellsCommonMod class.
- Parameters:
mod1 (str) – Name of the first modality in the single-cell dataset.
mod2 (str) – Name of the second modality in the single-cell dataset.
sol (Optional[str], default=None) – Name of the optional solution dataset containing cell labels or annotations.
**kwargs (dict) – Additional keyword arguments passed to the base transformation class.
- class dance.transforms.FilterCellsPlaceHolder(split_name=None, channel=None, channel_type='X', add_n_counts=True, add_n_genes=True, inplace=True, **kwargs)[source]
Used as a placeholder to skip the process.
- Parameters:
split_name (str | None) –
channel (str | None) –
channel_type (str | None) –
- class dance.transforms.FilterCellsScanpy(min_counts=None, min_genes=None, max_counts=None, max_genes=None, split_name=None, channel=None, channel_type='X', key_n_counts=None, key_n_genes=None, inplace=True, **kwargs)[source]
Scanpy filtering cell transformation with additional options.
Allows passing gene counts as a ratio.
- Parameters:
min_counts (Union[float,int,None]) – Minimum number of counts required for a cell to be kept.
min_genes (Union[float,int,None]) – Minimum number (or ratio) of genes required for a cell to be kept.
max_counts (Union[float,int,None]) – Maximum number of counts required for a cell to be kept.
max_genes (Union[float,int,None]) – Maximum number (or ratio) of genes required for a cell to be kept.
split_name (Optional[str]) – Which split to be used for filtering.
channel (Optional[str]) – Channel to be used for filtering.
channel_type (Optional[str]) – Channel type to be used for filtering.
key_n_counts (Optional[str]) – The location to add n_counts (the total counts for each cell). If it is None, it will not be added.
key_n_genes (Optional[str]) – The location to add n_genes (the number of genes expressed for each cell). If it is None, it will not be added.
inplace – If inplace is True, the original data is replaced with the filtered data. If inplace is False, the filtered data is stored in varm.
- class dance.transforms.FilterCellsScanpyOrder(order=None, min_counts=None, min_genes=None, max_counts=None, max_genes=None, split_name=None, channel=None, channel_type='X', add_n_counts=True, add_n_genes=True, inplace=True, **kwargs)[source]
Scanpy filtering cell transformation with additional options.
Allows passing gene counts as a ratio.
- Parameters:
order (Optional[List[str]]) – Order of (min_counts, min_genes, max_counts, max_genes). For example, ["min_counts", "min_genes", "max_counts", "max_genes"] or ["max_counts", "min_genes"]. If not set, will be set by default to ["min_counts", "min_genes", "max_counts", "max_genes"].
min_counts (Union[float,int,None]) – Minimum number of counts required for a cell to be kept.
min_genes (Union[float,int,None]) – Minimum number (or ratio) of genes required for a cell to be kept.
max_counts (Union[float,int,None]) – Maximum number of counts required for a cell to be kept.
max_genes (Union[float,int,None]) – Maximum number (or ratio) of genes required for a cell to be kept.
split_name (Optional[str]) – Which split to be used for filtering.
channel (Optional[str]) – Channel to be used for filtering.
channel_type (Optional[str]) – Channel type to be used for filtering.
add_n_counts – Whether to add n_counts, the total counts for each cell.
add_n_genes – Whether to add n_genes, the number of genes expressed for each cell.
inplace – If inplace is True, the original data is replaced with the filtered data. If inplace is False, the filtered data is stored in varm.
- class dance.transforms.FilterCellsType(cell_type_threshold=10, **kwargs)[source]
Filter cell types based on the threshold.
- class dance.transforms.FilterGenes(*, mode='sum', channel=None, channel_type=None, whitelist_indicators=None, add_n_counts=True, add_n_cells=True, inplace=True, **kwargs)[source]
Filter genes based on the summarized gene expressions.
- Parameters:
mode (Literal['sum', 'cv', 'rv', 'var']) –
channel (str | None) –
channel_type (str | None) –
whitelist_indicators (List[str] | str | None) –
- class dance.transforms.FilterGenesCommon(batch_key=None, split_keys=None, **kwargs)[source]
Filter genes by taking the common genes across batches or splits.
- Parameters:
batch_key (Optional[str]) – Which column in the .obs table to use to distinguish batches.
split_keys (Optional[List[str]]) – A list of split names, e.g., 'train', to be used to find common genes.
Note
One and only one of batch_key or split_keys can be specified.
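One way to picture the selection is the sketch below, which assumes that a gene counts as "common" when it has a nonzero total in every batch. That interpretation, and the helper common_gene_mask, are assumptions made for illustration only.

```python
import numpy as np


def common_gene_mask(x, batch_labels):
    """Mark genes (columns) that are detected in every batch of cells (rows)."""
    x = np.asarray(x)
    batch_labels = np.asarray(batch_labels)
    mask = np.ones(x.shape[1], dtype=bool)
    for label in np.unique(batch_labels):
        # A gene is detected in a batch if its total count there is positive.
        detected = x[batch_labels == label].sum(axis=0) > 0
        mask &= detected
    return mask


expression = np.array([[1, 0, 2], [0, 0, 3], [4, 5, 0], [1, 1, 1]])
batch = np.array(["a", "a", "b", "b"])
keep = common_gene_mask(expression, batch)
print(keep)
```

The same loop applies to splits by passing split membership as the labels.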
- class dance.transforms.FilterGenesMarker(*, ct_profile_channel='CellTopicProfile', subset=True, label=None, threshold=1.25, eps=1e-06, **kwargs)[source]
Select marker genes based on log fold-change.
- Parameters:
ct_profile_channel (str) – Name of the .varm channel that contains the cell-topic profile which will be used to compute the log fold-changes for each cell-topic (e.g., cell type).
subset (bool) – If set to True, then inplace subset the variables to only contain the markers.
label (Optional[str]) – If set, e.g., to 'marker', then save the marker indicator to the obs column named as marker.
threshold (float) – Threshold value of the log fold-change above which the gene will be considered as a marker.
eps (float) – A small value that prevents taking log of zeros.
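The role of threshold and eps can be sketched as follows. The exact fold-change definition is an assumption: here each cell-topic is compared against the mean of the remaining cell-topics using natural logarithms. marker_mask is a hypothetical helper, not the dance implementation.

```python
import numpy as np


def marker_mask(ct_profile, threshold=1.25, eps=1e-6):
    """Flag genes whose log fold-change exceeds the threshold for at least one cell-topic."""
    profile = np.asarray(ct_profile, dtype=float)  # shape: genes x cell-topics
    is_marker = np.zeros(profile.shape[0], dtype=bool)
    for i in range(profile.shape[1]):
        # Assumption: the reference is the mean profile of all other cell-topics.
        others = np.delete(profile, i, axis=1).mean(axis=1)
        # eps keeps the logarithm finite when a profile value is zero.
        log_fc = np.log(profile[:, i] + eps) - np.log(others + eps)
        is_marker |= log_fc > threshold
    return is_marker


profile = np.array([[10.0, 1.0], [1.0, 1.0], [1.0, 10.0]])
markers = marker_mask(profile)
print(markers)
```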
- class dance.transforms.FilterGenesMarkerGini(*, ct_profile_channel='CellGiottoTopicProfile', ct_profile_detection_channel='CellGiottoDetectionTopicProfile', subset=True, label=None, **kwargs)[source]
Select marker genes based on Gini coefficient.
Identify marker genes for all clusters in a one-vs-all manner based on Gini coefficients, a measure of inequality.
- Parameters:
ct_profile_channel (str) – Name of the .varm channel that contains the cell-topic profile which will be used to compute the log fold-changes for each cell-topic (e.g., cell type).
ct_profile_detection_channel (str) – Name of the .varm channel that contains the cell-topic detection profile (the number of values greater than some threshold) which will be used to compute the log fold-changes for each cell-topic (e.g., cell type).
subset (bool) – If set to True, then inplace subset the variables to only contain the markers.
label (Optional[str]) – If set, e.g., to 'marker', then save the marker indicator to the obs column named as marker.
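For reference, the Gini coefficient of a set of non-negative values can be computed as in the sketch below. It shows only the coefficient itself, applied per gene across a toy cell-topic profile. The helper gini is hypothetical, and how the coefficients are ranked and combined to pick markers is not covered.

```python
import numpy as np


def gini(values):
    """Gini coefficient of non-negative values: 0 means equal, larger means more unequal."""
    x = np.sort(np.asarray(values, dtype=float))
    n = x.size
    total = x.sum()
    if total == 0:
        return 0.0  # Convention chosen here for an all-zero vector.
    ranks = np.arange(1, n + 1)
    return float(2.0 * np.sum(ranks * x) / (n * total) - (n + 1) / n)


# Rows are genes, columns are cell-topics (toy values).
toy_profile = np.array([[5.0, 5.0, 5.0], [0.0, 0.0, 9.0]])
gene_gini = [gini(row) for row in toy_profile]
print(gene_gini)
```

A gene expressed evenly across cell-topics scores near zero, while a gene concentrated in one cell-topic scores high, which is what makes the coefficient useful for one-vs-all marker detection.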
- class dance.transforms.FilterGenesMatch(prefixes=None, suffixes=None, case_sensitive=False, **kwargs)[source]
Filter genes based on prefixes and suffixes.
- Parameters:
prefixes (Optional[List[str]]) – List of prefixes to remove.
suffixes (Optional[List[str]]) – List of suffixes to remove.
case_sensitive (bool) –
- class dance.transforms.FilterGenesNumberPlaceHolder(channel=None, channel_type=None, **kwargs)[source]
- class dance.transforms.FilterGenesPercentile(min_val=1, max_val=99, *, mode='sum', channel=None, channel_type=None, whitelist_indicators=None, add_n_counts=True, add_n_cells=True, inplace=True, **kwargs)[source]
Filter genes based on percentiles of the summarized gene expressions.
- Parameters:
min_val (Optional[float]) – Minimum percentile of the summarized expression value below which the genes will be discarded.
max_val (Optional[float]) – Maximum percentile of the summarized expression value above which the genes will be discarded.
mode (Literal['sum','cv','rv','var']) – Summarization mode. Available options are [sum|var|cv|rv]. sum calculates the sum of expression values, var calculates the variance of the expression values, cv uses the coefficient of variation (std / mean), and rv uses the relative variance (var / mean).
channel (Optional[str]) – Which channel, more specifically, layers, to use. Use the default .X if not set. If channel is specified, then channel_type needs to be set to layers as well.
channel_type (Optional[str]) – Type of channels specified. Only allow None (the default setting) or layers (when channel is specified).
whitelist_indicators (Union[List[str],str,None]) – A list of (or a single) var columns that indicates the genes to be excluded from the filtering process. Note that these genes will still be used in the summary stats computation, and thus will still contribute to the threshold percentile. If not set, then no genes will be excluded from the filtering process.
add_n_counts – Whether to add n_counts, the total counts for each gene.
add_n_cells – Whether to add n_cells, the number of cells expressed for each gene.
inplace – If inplace is True, the original data is replaced with the filtered data. If inplace is False, the filtered data is stored in obsm.
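The summarization modes and the percentile cut can be sketched with NumPy. The helpers summarize and percentile_keep_mask are hypothetical, and the inclusive bounds and the use of population variance are assumptions, not details confirmed above.

```python
import numpy as np


def summarize(x, mode="sum"):
    """Per-gene summary statistic of a cells-by-genes matrix."""
    x = np.asarray(x, dtype=float)
    if mode == "sum":
        return x.sum(axis=0)
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    if mode == "var":
        return var
    if mode == "cv":
        return np.sqrt(var) / mean  # std / mean; undefined for zero-mean genes
    if mode == "rv":
        return var / mean  # var / mean; undefined for zero-mean genes
    raise ValueError(f"Unknown mode: {mode!r}")


def percentile_keep_mask(x, min_val=1, max_val=99, mode="sum"):
    """Keep genes whose summary statistic lies between the two percentiles."""
    stats = summarize(x, mode)
    low, high = np.percentile(stats, [min_val, max_val])
    return (stats >= low) & (stats <= high)


# Eleven genes whose column sums are 3, 6, ..., 33.
toy_counts = np.outer(np.ones(3), np.arange(1, 12))
keep_genes = percentile_keep_mask(toy_counts, min_val=5, max_val=95, mode="sum")
print(keep_genes)
```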
- class dance.transforms.FilterGenesPlaceHolder(split_name=None, channel=None, channel_type='X', add_n_counts=True, add_n_cells=True, inplace=True, **kwargs)[source]
Used as a placeholder to skip the process.
- Parameters:
split_name (str | None) –
channel (str | None) –
channel_type (str | None) –
- class dance.transforms.FilterGenesRegression(method='enclasc', num_genes=1000, *, channel=None, channel_type=None, mod=None, skip_count_check=False, inplace=True, **kwargs)[source]
Select genes based on regression.
- Parameters:
method (str) – Which regression-based gene selection method to use. Supported options are: "enclasc", "seurat3", and "scmap".
num_genes (int) – Number of genes to select.
inplace – If inplace is True, the original data is replaced with the filtered data. If inplace is False, the filtered data is stored in obsm.
channel (str | None) –
channel_type (str | None) –
mod (str | None) –
skip_count_check (bool) –
Note
The implementation is adapted from the EnClaSC GitHub repo: https://github.com/xy-chen16/EnClaSC
References
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03679-z
- class dance.transforms.FilterGenesScanpy(min_counts=None, min_cells=None, max_counts=None, max_cells=None, split_name=None, channel=None, channel_type='X', key_n_counts=None, key_n_cells=None, inplace=True, **kwargs)[source]
Scanpy filtering gene transformation with additional options.
- Parameters:
min_counts (Union[float,int,None]) – Minimum number of counts required for a gene to be kept.
min_cells (Union[float,int,None]) – Minimum number (or ratio) of cells required for a gene to be kept.
max_counts (Union[float,int,None]) – Maximum number of counts required for a gene to be kept.
max_cells (Union[float,int,None]) – Maximum number (or ratio) of cells required for a gene to be kept.
split_name (Optional[str]) – Which split to be used for filtering.
channel (Optional[str]) – Channel to be used for filtering.
channel_type (Optional[str]) – Channel type to be used for filtering.
key_n_counts (Optional[str]) – The location to add n_counts (the total counts for each gene). If it is None, it will not be added.
key_n_cells (Optional[str]) – The location to add n_cells (the number of cells expressed for each gene). If it is None, it will not be added.
inplace – If inplace is True, the original data is replaced with the filtered data. If inplace is False, the filtered data is stored in obsm.
- class dance.transforms.FilterGenesScanpyOrder(order=None, min_counts=None, min_cells=None, max_counts=None, max_cells=None, split_name=None, channel=None, channel_type='X', add_n_counts=True, add_n_cells=True, inplace=True, params_dict=None, **kwargs)[source]
Scanpy filtering gene transformation with additional options.
- Parameters:
order (Optional[List[str]]) – Order of (min_counts, min_cells, max_counts, max_cells). For example, ["min_counts", "min_cells", "max_counts", "max_cells"] or ["max_counts", "min_cells"]. If not set, will be set by default to ["min_counts", "min_cells", "max_counts", "max_cells"].
min_counts (Union[float,int,None]) – Minimum number of counts required for a gene to be kept.
min_cells (Union[float,int,None]) – Minimum number (or ratio) of cells required for a gene to be kept.
max_counts (Union[float,int,None]) – Maximum number of counts required for a gene to be kept.
max_cells (Union[float,int,None]) – Maximum number (or ratio) of cells required for a gene to be kept.
split_name (Optional[str]) – Which split to be used for filtering.
channel (Optional[str]) – Channel to be used for filtering.
channel_type (Optional[str]) – Channel type to be used for filtering.
add_n_counts – Whether to add n_counts, the total counts for each gene.
add_n_cells – Whether to add n_cells, the number of cells expressed for each gene.
inplace – If inplace is True, the original data is replaced with the filtered data. If inplace is False, the filtered data is stored in obsm.
- class dance.transforms.FilterGenesTopK(num_genes=1000, top=True, *, mode='cv', channel=None, channel_type='X', whitelist_indicators=None, add_n_counts=False, add_n_cells=False, inplace=True, **kwargs)[source]
Select top/bottom genes based on the summarized gene expressions.
- Parameters:
num_genes (int) – Number of genes to be selected.
top (bool) – If set to True, then use the genes with highest values of the specified gene summary stats.
mode (Literal['sum','cv','rv','var']) – Summarization mode. Available options are [sum|var|cv|rv]. sum calculates the sum of expression values, var calculates the variance of the expression values, cv uses the coefficient of variation (std / mean), and rv uses the relative variance (var / mean).
channel (Optional[str]) – Which channel, more specifically, layers, to use. Use the default .X if not set. If channel is specified, then channel_type needs to be set to layers as well.
channel_type (Optional[str]) – Type of channels specified. Only allow None (the default setting) or layers (when channel is specified).
whitelist_indicators (Union[List[str],str,None]) – A list of (or a single) var columns that indicates the genes to be excluded from the filtering process. Note that these genes will still be used in the summary stats computation, and thus will still contribute to the threshold percentile. If not set, then no genes will be excluded from the filtering process.
add_n_counts – Whether to add n_counts, the total counts for each gene.
add_n_cells – Whether to add n_cells, the number of cells expressed for each gene.
inplace – If inplace is True, the original data is replaced with the filtered data. If inplace is False, the filtered data is stored in obsm.
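Top or bottom selection by a summary statistic can be sketched as below, using the coefficient of variation (the default mode). top_k_gene_indices is a hypothetical helper; whitelist handling and tie-breaking are left out.

```python
import numpy as np


def top_k_gene_indices(x, num_genes=2, top=True):
    """Return sorted column indices of the genes with the highest (or lowest) CV."""
    x = np.asarray(x, dtype=float)
    cv = x.std(axis=0) / x.mean(axis=0)  # Assumes every gene has a nonzero mean.
    order = np.argsort(cv)  # Ascending order of the summary statistic.
    selected = order[-num_genes:] if top else order[:num_genes]
    return np.sort(selected).tolist()


toy_matrix = np.array([[1.0, 1.0, 1.0], [1.0, 2.0, 5.0], [1.0, 3.0, 9.0]])
highest = top_k_gene_indices(toy_matrix, num_genes=2, top=True)
lowest = top_k_gene_indices(toy_matrix, num_genes=1, top=False)
print(highest, lowest)
```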
- class dance.transforms.FilterScanpy(min_counts=None, min_genes_or_cells=None, max_counts=None, max_genes_or_cells=None, split_name=None, channel=None, channel_type='X', key_n_counts=None, key_n_genes_or_cells=None, inplace=True, **kwargs)[source]
Scanpy filtering transformation with additional options.
- Parameters:
min_counts (float | int | None) –
min_genes_or_cells (float | int | None) –
max_counts (float | int | None) –
max_genes_or_cells (float | int | None) –
split_name (str | None) –
channel (str | None) –
channel_type (str | None) –
key_n_counts (str | None) –
key_n_genes_or_cells (str | None) –
- class dance.transforms.GaussRandProjFeature(n_components=400, eps=0.1, **kwargs)[source]
Custom preprocessing to extract cell feature via Gaussian random projection.
- Parameters:
n_components (int) –
eps (float) –
- class dance.transforms.GeneHoldout(n_top=5, batch_size=512, random_state=None, **kwargs)[source]
Progressively hold out genes for DeepImpute.
Split genes into target batches. For every target gene in one batch, refer to the genes that are not in this batch and select predictor genes with high covariance with target gene.
- Parameters:
n_top (int) – Number of predictor genes per target gene.
batch_size (int) – Target batch size.
random_state (Optional[int]) – Random state.
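The predictor selection step can be sketched as follows for a single target batch. Ranking by absolute covariance is an assumption, and select_predictors is a hypothetical helper that does not reproduce how genes are split into batches.

```python
import numpy as np


def select_predictors(x, target_idx, n_top=5):
    """For each target gene, pick the n_top genes outside the batch with the strongest covariance."""
    x = np.asarray(x, dtype=float)
    cov = np.cov(x, rowvar=False)  # Gene-by-gene covariance matrix.
    targets = np.asarray(target_idx)
    # Candidate predictors are all genes that are not in the target batch.
    candidates = np.setdiff1d(np.arange(x.shape[1]), targets)
    predictors = {}
    for t in targets:
        strength = np.abs(cov[t, candidates])
        best = candidates[np.argsort(strength)[::-1][:n_top]]
        predictors[int(t)] = best.tolist()
    return predictors


toy_expr = np.array([
    [1.0, 2.0, 1.0, 0.1],
    [2.0, 4.0, 1.0, 0.2],
    [3.0, 6.0, 1.0, 0.3],
    [4.0, 8.0, 1.0, 0.4],
])
chosen = select_predictors(toy_expr, target_idx=[0], n_top=2)
print(chosen)
```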
- class dance.transforms.GeneStats(genestats_select='all', *, fill_na=None, threshold=0, pseudo=False, split_name='train', channel=None, channel_type=None, **kwargs)[source]
Gene statistics computation.
- Parameters:
genestats_select (Union[str,List[str]]) – List of names of the gene stats functions to use. If set to "all" (by default), then use all available gene stats functions.
fill_na (Optional[float]) – If not set (default), then do not fill nans. Otherwise, fill nans with the specified value.
threshold (float) – Threshold value for filtering gene expression when computing stats, e.g., mean expression values.
pseudo (bool) – If set to True, then add 1 to the numerator and denominator when computing the ratio (alpha) for which the gene expression values are above the specified threshold.
split_name (Optional[str]) – Which split to compute the gene stats on.
channel (str | None) –
channel_type (str | None) –
- class dance.transforms.HighlyVariableGenesLogarithmizedByMeanAndDisp(channel=None, channel_type=None, min_disp=0.5, max_disp=inf, min_mean=0.0125, max_mean=3, n_bins=20, flavor='seurat', subset=True, inplace=True, batch_key=None, **kwargs)[source]
Filter for highly variable genes based on mean and dispersion.
- Parameters:
layer – If provided, then use data.data.layers[layer] for expression values instead of the default data.data.X.
min_mean (Optional[float]) – min_mean
max_mean (Optional[float]) – max_mean
min_disp (Optional[float]) – min_disp
max_disp (Optional[float]) – max_disp
n_bins (int) – Number of bins for binning the mean gene expression. Normalization is done with respect to each bin. If just a single gene falls into a bin, the normalized dispersion is artificially set to 1. You'll be informed about this if you set settings.verbosity = 4.
flavor (Literal['seurat','cell_ranger']) – Choose the flavor for identifying highly variable genes. For the dispersion based methods in their default workflows, Seurat passes the cutoffs whereas Cell Ranger passes n_top_genes.
subset (bool) – Inplace subset to highly-variable genes if True otherwise merely indicate highly variable genes.
inplace (bool) – Whether to place calculated metrics in .var or return them.
batch_key (Optional[str]) – If specified, highly-variable genes are selected within each batch separately and merged. This simple process avoids the selection of batch-specific genes and acts as a lightweight batch correction method. For all flavors, genes are first sorted by how many batches they are a HVG. For dispersion-based flavors ties are broken by normalized dispersion. If flavor = "seurat_v3", ties are broken by the median (across batches) rank based on within-batch normalized variance.
channel (str | None) –
channel_type (str | None) –
See also
https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.highly_variable_genes.html
- class dance.transforms.HighlyVariableGenesLogarithmizedByTopGenes(channel=None, channel_type=None, n_top_genes=1000, n_bins=20, flavor='seurat', subset=True, inplace=True, batch_key=None, **kwargs)[source]
Filter for highly variable genes based on top genes.
- Parameters:
layer – If provided, then use data.data.layers[layer] for expression values instead of the default data.data.X.
n_top_genes (Optional[int]) – Number of highly-variable genes to keep.
n_bins (int) – Number of bins for binning the mean gene expression. Normalization is done with respect to each bin. If just a single gene falls into a bin, the normalized dispersion is artificially set to 1. You'll be informed about this if you set settings.verbosity = 4.
flavor (Literal['seurat','cell_ranger']) – Choose the flavor for identifying highly variable genes. For the dispersion based methods in their default workflows, Seurat passes the cutoffs whereas Cell Ranger passes n_top_genes.
subset (bool) – Inplace subset to highly-variable genes if True otherwise merely indicate highly variable genes.
inplace (bool) – Whether to place calculated metrics in .var or return them.
batch_key (Optional[str]) – If specified, highly-variable genes are selected within each batch separately and merged. This simple process avoids the selection of batch-specific genes and acts as a lightweight batch correction method. For all flavors, genes are first sorted by how many batches they are a HVG. For dispersion-based flavors ties are broken by normalized dispersion. If flavor = "seurat_v3", ties are broken by the median (across batches) rank based on within-batch normalized variance.
channel (str | None) –
channel_type (str | None) –
See also
https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.highly_variable_genes.html
- class dance.transforms.HighlyVariableGenesRawCount(channel=None, channel_type=None, n_top_genes=1000, span=0.3, subset=True, inplace=True, batch_key=None, check_values=True, **kwargs)[source]
Filter for highly variable genes using raw count matrix.
- Parameters:
layer – If provided, then use data.data.layers[layer] for expression values instead of the default data.data.X.
n_top_genes (Optional[int]) – Number of highly-variable genes to keep.
span (Optional[float]) – The fraction of the data (cells) used when estimating the variance in the loess model fit if flavor="seurat_v3".
subset (bool) – Inplace subset to highly-variable genes if True otherwise merely indicate highly variable genes.
inplace (bool) – Whether to place calculated metrics in .var or return them.
batch_key (Optional[str]) – If specified, highly-variable genes are selected within each batch separately and merged. This simple process avoids the selection of batch-specific genes and acts as a lightweight batch correction method. For all flavors, genes are first sorted by how many batches they are a HVG. For dispersion-based flavors ties are broken by normalized dispersion. If flavor = "seurat_v3", ties are broken by the median (across batches) rank based on within-batch normalized variance.
check_values (bool) – Check if counts in selected layer are integers. A Warning is returned if set to True. Only used if flavor="seurat_v3".
channel (str | None) –
channel_type (str | None) –
See also
https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.highly_variable_genes.html
- class dance.transforms.Log1P(base=None, copy=False, chunked=None, chunk_size=None, layer=None, obsm=None, **kwargs)[source]
Logarithmize the data matrix.
Computes \(X = \log(X + 1)\), where \(\log\) denotes the natural logarithm unless a different base is given.
- Parameters:
base (Optional[Number]) – Base of the logarithm. Natural logarithm is used by default.
copy (bool) – If an AnnData is passed, determines whether a copy is returned.
chunked (Optional[bool]) – Process the data matrix in chunks, which will save memory. Applies only to AnnData.
chunk_size (Optional[int]) – n_obs of the chunks to process the data in.
layer (Optional[str]) – Entry of layers to transform.
obsm (Optional[str]) – Entry of obsm to transform.
See also
https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.log1p.html
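The effect of the base parameter follows from the change-of-base rule, \(\log_b(X + 1) = \ln(X + 1) / \ln b\). A minimal NumPy sketch, where log1p_with_base is a hypothetical helper operating on a dense array:

```python
import math

import numpy as np


def log1p_with_base(x, base=None):
    """Compute log(x + 1), using the natural logarithm unless a base is given."""
    out = np.log1p(np.asarray(x, dtype=float))
    if base is not None:
        # Change of base: log_b(y) = ln(y) / ln(b).
        out = out / math.log(base)
    return out


natural = log1p_with_base([0.0, math.e - 1.0])
binary = log1p_with_base([0.0, 1.0, 3.0], base=2)
print(natural, binary)
```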
- class dance.transforms.MaskData(mask_rate=0.1, seed=None, **kwargs)[source]
Randomly mask data.
Randomly mask positive counts according to masking rate.
- Parameters:
mask_rate (Optional[float]) – Masking rate.
seed (Optional[int]) – Random seed.
- class dance.transforms.MorphologyFeatureCNN(*, model_name='resnet50', n_components=50, random_state=0, crop_size=20, target_size=299, device='cpu', channels=('spatial_pixel', 'image'), channel_types=('obsm', 'uns'), **kwargs)[source]
Cell morphological features extracted from CNN.
- Parameters:
model_name (str) – Pretrained CNN name: "resnet50", "inceptron_v3", "xception", "vgg16".
n_components (int) – Number of feature dimensions.
crop_size (int) – Cell image cropping size (cropped as square centered around the target cell).
target_size (int) – Target patch size.
random_state (int) –
device (str) –
channels (Sequence[str]) –
channel_types (Sequence[str]) –
- class dance.transforms.NormalizePlaceHolder(**kwargs)[source]
Used as a placeholder to skip the process.
- class dance.transforms.NormalizeTotal(target_sum=None, max_fraction=0.05, key_added=None, layer=None, layers=None, layer_norm=None, inplace=True, copy=False, **kwargs)[source]
Normalize counts per cell.
Normalize each cell by total counts over all genes, so that every cell has the same total count after normalization. If choosing target_sum=1e6, this is CPM normalization.
If max_fraction is less than 1.0, very highly expressed genes are excluded from the computation of the normalization factor (size factor) for each cell. This is meaningful as these can strongly influence the resulting normalized values for all other genes.
Params
- target_sum
If None, after normalization, each observation (cell) has a total count equal to the median of total counts for observations (cells) before normalization.
- max_fraction
Exclude (very) highly expressed genes from the computation of the normalization factor (size factor) for each cell. A gene is considered highly expressed if it has more than max_fraction of the total counts in at least one cell. The not-excluded genes will sum up to target_sum. When max_fraction is equal to 1.0, it is equivalent to setting exclude_highly_expressed=False.
- key_added
Name of the field in adata.obs where the normalization factor is stored.
- layer
Layer to normalize instead of X. If None, X is normalized.
- inplace
Whether to update adata in place or return a dictionary with normalized copies of adata.X and adata.layers.
- copy
Whether to modify a copy of the input object. Not compatible with inplace=False.
See also
https://scanpy.readthedocs.io/en/stable/generated/scanpy.pp.normalize_total.html
- Parameters:
target_sum (float | None) –
max_fraction (float) –
key_added (str | None) –
layer (str | None) –
layers (Literal['all'] | ~typing.Iterable[str]) –
layer_norm (str | None) –
inplace (bool) –
copy (bool) –
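The interaction between target_sum and max_fraction can be illustrated with a small standalone sketch. This is not the dance or scanpy implementation; the helper `normalize_total` below is a hypothetical pure-Python function written only to mirror the rules described above, and it assumes that the median is taken over the per-cell totals of the retained genes when target_sum is None.

```python
from statistics import median


def normalize_total(counts, target_sum=None, max_fraction=0.05):
    """Hypothetical helper: scale each cell (row) so retained genes sum to target_sum."""
    totals = [sum(row) for row in counts]

    # A gene is highly expressed if it exceeds max_fraction of the
    # total counts in at least one cell.
    excluded = set()
    if max_fraction < 1.0:
        for row, total in zip(counts, totals):
            for gene, value in enumerate(row):
                if value > max_fraction * total:
                    excluded.add(gene)

    # Size factors are computed from the retained genes only.
    kept_totals = [
        sum(v for g, v in enumerate(row) if g not in excluded) for row in counts
    ]
    if target_sum is None:
        # Assumption: use the median of the per-cell totals of retained genes.
        target_sum = median(kept_totals)

    normalized = []
    for row, kept in zip(counts, kept_totals):
        scale = target_sum / kept if kept > 0 else 0.0
        # Every gene is rescaled, including the excluded ones.
        normalized.append([v * scale for v in row])
    return normalized, sorted(excluded)


# Two cells, three genes; gene 0 dominates the first cell.
counts = [[90, 5, 5], [10, 20, 30]]
normalized, excluded = normalize_total(counts, target_sum=100, max_fraction=0.5)
```

In this toy input, gene 0 is excluded from the size factor, so only genes 1 and 2 sum to target_sum in each cell, while gene 0 is still rescaled by the same factor.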
- class dance.transforms.NormalizeTotalLog1P(base=None, target_sum=None, max_fraction=0.05, **kwargs)[source]
Normalize total counts followed by log1p transformation.
See
dance.transforms.normalize.NormalizeTotal and dance.transforms.normalize.Log1P.
- class dance.transforms.PseudoMixture(*, n_pseudo=1000, nc_min=2, nc_max=10, ct_select='auto', ct_key='cellType', channel=None, channel_type='X', random_state=0, prefix='ps_mix_', in_split_name='ref', out_split_name='pseudo', label_batch=False, **kwargs)[source]
Pseudo mixture generation.
- Parameters:
n_pseudo (int) –
nc_min (int) –
nc_max (int) –
ct_select (Literal['auto'] | ~typing.List[str]) –
ct_key (str) –
channel (str | None) –
channel_type (str | None) –
random_state (int | None) –
prefix (str) –
in_split_name (str) –
out_split_name (str | None) –
label_batch (bool) –
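The parameters above are not described in this reference, so the following is only a guess at the general idea, based on the parameter names. It assumes that each pseudo mixture is built by summing the expression of between nc_min and nc_max cells sampled from a reference split, and that the cell-type proportions of the sampled cells serve as labels. The function `make_pseudo_mixtures` and its arguments are hypothetical and are not part of dance.

```python
import random
from collections import Counter


def make_pseudo_mixtures(expr, cell_types, n_pseudo=1000, nc_min=2, nc_max=10,
                         random_state=0):
    """Hypothetical helper: sum random reference cells into pseudo mixtures."""
    rng = random.Random(random_state)
    n_cells, n_genes = len(expr), len(expr[0])
    ct_names = sorted(set(cell_types))

    mixtures, proportions = [], []
    for _ in range(n_pseudo):
        # Assumption: the number of cells per mixture is drawn uniformly
        # from [nc_min, nc_max], and cells are sampled with replacement.
        nc = rng.randint(nc_min, nc_max)
        picked = [rng.randrange(n_cells) for _ in range(nc)]

        # Pseudo expression is the sum over the sampled cells.
        mixtures.append([sum(expr[i][g] for i in picked) for g in range(n_genes)])

        # Label: fraction of sampled cells belonging to each cell type.
        ct_counts = Counter(cell_types[i] for i in picked)
        proportions.append([ct_counts[ct] / nc for ct in ct_names])
    return mixtures, proportions, ct_names


expr = [[1, 0], [0, 1], [2, 2]]          # three reference cells, two genes
cell_types = ["A", "B", "A"]
mixtures, proportions, ct_names = make_pseudo_mixtures(expr, cell_types, n_pseudo=5)
```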
- class dance.transforms.RemoveSplit(*, split_name, **kwargs)[source]
Remove a particular split from the data.
- Parameters:
split_name (str) –
- class dance.transforms.SCNFeature(num_top_genes=10, alpha1=0.05, alpha2=0.001, mu=2, num_top_gene_pairs=25, max_gene_per_ct=3, *, split_name='train', channel=None, channel_type=None, **kwargs)[source]
Differential gene-pair feature used in SingleCellNet.
- Parameters:
num_top_genes (int) –
alpha1 (float) –
alpha2 (float) –
mu (float) –
num_top_gene_pairs (int) –
max_gene_per_ct (int) –
split_name (str | None) –
channel (str | None) –
channel_type (str | None) –
- class dance.transforms.SMEFeature(n_neighbors=3, n_components=50, random_state=0, *, channels=(None, 'SMEGraph'), channel_types=(None, 'obsp'), **kwargs)[source]
Spatial Morphological gene Expression normalization feature from stLearn.
- Parameters:
n_neighbors (int) – Number of neighboring spatial spots to consider.
n_components (int) – Number of feature dimensions.
random_state (int) –
channels (Sequence[str | None]) –
channel_types (Sequence[str | None]) –
- class dance.transforms.SaveRaw(exist_ok=False, **kwargs)[source]
Save raw data.
See
anndata.AnnData.raw() for more information.
- Parameters:
exist_ok (bool) – If set to False, then raise an exception if the raw attribute is already set.
- class dance.transforms.ScTransform(split_names=None, batch_key=None, min_cells=5, gmean_eps=1, n_genes=2000, n_cells=None, bin_size=500, bw_adjust=3, processes_num=2, **kwargs)[source]
ScTransform normalization and variance stabilization.
Note
This is a Python implementation adapted from https://github.com/atarashansky/SCTransformPy
- Parameters:
split_names (Union[Literal['ALL'], List[str], None]) – Which split(s) to apply the transformation to.
batch_key (Optional[str]) – Key for batch information.
min_cells (int) – Minimum number of cells a gene must be expressed in; genes below this threshold are discarded.
gmean_eps (int) – Pseudocount.
n_genes (Optional[int]) – Maximum number of genes to use. Use all if set to None.
n_cells (Optional[int]) – Maximum number of cells to use. Use all if set to None.
bin_size (int) – Number of genes a single bin contains.
bw_adjust (float) – Bandwidth adjusting parameter.
processes_num (int) – Number of processes. Default to the total number of available processors.
References
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1874-1
- class dance.transforms.ScTransformR(min_cells=5, mirror_index=-1, **kwargs)[source]
ScTransform normalization and variance stabilization.
Note
This is a wrapper for the original R implementation.
- Parameters:
min_cells (int) – Minimum number of cells a gene must be expressed in; genes below this threshold are discarded.
References
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1874-1
- class dance.transforms.SetConfig(config_dict, dummy_params=10, **kwargs)[source]
Set configuration options of a dance data object.
- Parameters:
config_dict (Dict[str, Any]) – Dance data object configuration dictionary. See set_config_from_dict().
dummy_params – When the search space is empty, use this parameter to verify through wandb.
- class dance.transforms.SpatialIDEFeature(channels=(None, 'spatial'), channel_types=(None, 'obsm'), **kwargs)[source]
Spatial IDE feature.
The SpatialDE model is based on the assumption of normally distributed residual noise and independent observations across cells. There are two normalization steps:
Variance-stabilizing transformation for negative-binomial-distributed data (Anscombe’s transformation).
Regress log total count values out from the Anscombe-transformed expression values.
References
https://www.nature.com/articles/nmeth.4636#Sec2
- Parameters:
channels (Sequence[str | None]) –
channel_types (Sequence[str | None]) –
- regress_out(sample_info, expression_matrix, covariate_formula, design_formula='1', rcond=-1)[source]
Implementation of limma’s removeBatchEffect function.
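As a rough illustration of what removing a covariate effect in the style of limma's removeBatchEffect involves, the sketch below fits a linear model with both the design and the covariates, then subtracts only the fitted covariate contribution. It works on plain NumPy arrays instead of the formula strings accepted by regress_out, and the function `remove_covariate_effect` is a hypothetical name, not part of dance.

```python
import numpy as np


def remove_covariate_effect(expression, covariates, design=None):
    """Hypothetical helper.

    expression: (genes x samples) matrix.
    covariates: (samples x k) matrix of effects to remove.
    design:     (samples x p) matrix of effects to keep; intercept only by default.
    """
    n_samples = expression.shape[1]
    if design is None:
        design = np.ones((n_samples, 1))

    # Fit expression ~ design + covariates for all genes at once.
    full = np.hstack([design, covariates])
    coef, *_ = np.linalg.lstsq(full, expression.T, rcond=None)

    # Subtract only the part explained by the covariates.
    cov_coef = coef[design.shape[1]:]
    return expression - (covariates @ cov_coef).T


# Two genes whose expression is intercept + slope * covariate.
cov = np.array([[0.0], [1.0], [2.0], [3.0]])
intercepts = np.array([[1.0], [5.0]])
slopes = np.array([[2.0], [-1.0]])
expression = intercepts + slopes * cov.T
residual = remove_covariate_effect(expression, cov)
```

In this noise-free toy case, the covariate trend is removed and each gene is left with its intercept.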
- stabilize(expression_matrix)[source]
Use Anscombe’s approximation to variance-stabilize negative binomial data.
See https://f1000research.com/posters/4-1041 for motivation.
Assumes columns are samples, and rows are genes
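A minimal sketch of one common form of this stabilization follows. It assumes the negative binomial mean-variance relation var = mu + phi * mu^2 and the transform log(x + 1 / (2 * phi)). The dispersion phi is estimated here with a closed-form least-squares fit, which may differ from the fitting routine used by the actual method. The function name `stabilize_sketch` is hypothetical.

```python
import math
from statistics import mean, pvariance


def stabilize_sketch(expression_matrix):
    """Hypothetical helper. Rows are genes, columns are samples."""
    mus = [mean(row) for row in expression_matrix]
    variances = [pvariance(row) for row in expression_matrix]

    # Least-squares estimate of phi in var = mu + phi * mu**2:
    # minimize sum((var - mu - phi * mu**2)**2) over phi.
    numerator = sum((v - m) * m ** 2 for m, v in zip(mus, variances))
    denominator = sum(m ** 4 for m in mus)
    phi = numerator / denominator
    if phi <= 0:
        raise ValueError("Data are not overdispersed; phi must be positive.")

    shift = 1.0 / (2.0 * phi)
    transformed = [[math.log(x + shift) for x in row] for row in expression_matrix]
    return transformed, phi


genes_by_samples = [[0, 10, 2, 8], [1, 30, 5, 4], [3, 3, 20, 2]]
transformed, phi = stabilize_sketch(genes_by_samples)
```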
- class dance.transforms.UpdateRaw(**kwargs)[source]
Update raw data.
Some pipelines select genes again after normalization. In that case, the original raw data must be updated to match the dimensions of the filtered data. See
anndata.AnnData.raw() for more information.
- class dance.transforms.WeightedFeaturePCA(n_components=400, split_name=None, feat_norm_mode=None, feat_norm_axis=0, save_info=False, **kwargs)[source]
Compute the weighted gene PCA as cell features.
Given a gene expression matrix of dimension (cell x gene), the gene PCA is first computed. Then, the representation of each cell is computed by taking the weighted sum of the gene PCAs based on that cell’s gene expression values.
- Parameters:
n_components (Union[float, int]) – Number of PCs to use.
split_name (Optional[str]) – Which split to use to compute the gene PCA. If not set, use all data.
feat_norm_mode (Optional[str]) – Feature normalization mode, see dance.utils.matrix.normalize(). If set to None, then do not perform feature normalization before reduction.
feat_norm_axis (int) –
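The two-step computation described for this class can be sketched with NumPy as follows. This is an illustration under assumptions, not the dance implementation: it treats genes as samples when computing the PCA, obtains the principal component scores through an SVD of the centered matrix, and omits feature normalization and split handling. The function `weighted_gene_pca` is a hypothetical name.

```python
import numpy as np


def weighted_gene_pca(x, n_components=2):
    """Hypothetical helper. x is a (cell x gene) expression matrix."""
    # Step 1: gene PCA. Each gene is a sample described by its
    # expression across cells, so work on the (gene x cell) matrix.
    gene_by_cell = x.T
    centered = gene_by_cell - gene_by_cell.mean(axis=0, keepdims=True)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    gene_pcs = u[:, :n_components] * s[:n_components]  # (gene x n_components)

    # Step 2: each cell is the sum of the gene PC vectors,
    # weighted by that cell's expression values.
    cell_feats = x @ gene_pcs  # (cell x n_components)
    return cell_feats, gene_pcs


rng = np.random.default_rng(0)
x = rng.poisson(2.0, size=(6, 4)).astype(float)  # 6 cells, 4 genes
cell_feats, gene_pcs = weighted_gene_pca(x, n_components=2)
```

The matrix product in step 2 is the weighted sum: row i of cell_feats equals the sum over genes g of x[i, g] * gene_pcs[g].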
- class dance.transforms.WeightedFeatureSVD(n_components=400, split_name=None, feat_norm_mode=None, feat_norm_axis=0, save_info=False, **kwargs)[source]
Compute the weighted gene SVD as cell features.
Given a gene expression matrix of dimension (cell x gene), the gene SVD is first computed. Then, the representation of each cell is computed by taking the weighted sum of the gene PCAs based on that cell’s gene expression values.
- Parameters:
n_components (Union[float, int]) – Desired dimensionality of output data.
split_name (Optional[str]) – Which split to use to compute the gene SVD. If not set, use all data.
feat_norm_mode (Optional[str]) – Feature normalization mode, see dance.utils.matrix.normalize(). If set to None, then do not perform feature normalization before reduction.
feat_norm_axis (int) –
save_info (bool) –