dance.transforms
- class dance.transforms.base.BaseTransform(out=None, log_level='WARNING')[source]
BaseTransform abstract object.
- Parameters:
log_level (Literal['NOTSET', 'DEBUG', 'INFO', 'WARNING', 'ERROR']) – Logging level.
out (Optional[str]) – Name of the obsm channel or layer where the transformed features will be saved. Use the current transformation name if it is not set.
- class dance.transforms.AnnDataTransform(func, **kwargs)[source]
AnnData transformation interface object.
This object provides an interface to any function that applies an in-place transformation to an AnnData object.
Example
Any one of the scanpy.pp functions should be supported. For example, we can use the scanpy.pp.normalize_total() function on the dance data object as follows
>>> AnnDataTransform(scanpy.pp.normalize_total, target_sum=10000)(data)
where data is a dance data object, e.g., dance.data.Data. Calling the above function is effectively equivalent to calling
>>> scanpy.pp.normalize_total(data.data, target_sum=10000)
- Parameters:
func (Callable | str) –
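The equivalence described in the example can be sketched without installing dance or scanpy. The names `FakeData`, `scale_in_place`, and `InPlaceTransform` below are hypothetical stand-ins, not the library's implementation; the sketch only assumes that the wrapper stores the keyword arguments and calls the function on `data.data`.

```python
from typing import Any, Callable


class FakeData:
    """Hypothetical stand-in for a dance data object wrapping an AnnData-like payload."""

    def __init__(self, payload: dict):
        self.data = payload


def scale_in_place(adata: dict, factor: float = 1.0) -> None:
    """Hypothetical in-place function standing in for a scanpy.pp function."""
    adata["X"] = [[value * factor for value in row] for row in adata["X"]]


class InPlaceTransform:
    """Sketch of the wrapper pattern: keep func and kwargs, apply them to data.data."""

    def __init__(self, func: Callable[..., Any], **kwargs: Any):
        self.func = func
        self.kwargs = kwargs

    def __call__(self, data: FakeData) -> None:
        # Forward the wrapped payload and the stored keyword arguments.
        self.func(data.data, **self.kwargs)


wrapped = FakeData({"X": [[1.0, 2.0], [3.0, 4.0]]})
direct = FakeData({"X": [[1.0, 2.0], [3.0, 4.0]]})

InPlaceTransform(scale_in_place, factor=2.0)(wrapped)  # via the wrapper
scale_in_place(direct.data, factor=2.0)  # calling the function directly
```

Both objects end up with the same payload, which is the sense in which the wrapped call is "effectively equivalent" to the direct call.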
- class dance.transforms.BatchFeature(*, channel=None, mod=None, **kwargs)[source]
Assign statistical batch features for each cell.
- Parameters:
channel (str | None) –
mod (str | None) –
- class dance.transforms.CellGiottoTopicProfile(*, ct_select='auto', ct_key='cellType', split_name=None, channel=None, channel_type='X', detection_threshold=-1, **kwargs)[source]
Giotto cell topic profile.
Reference
https://rubd.github.io/Giotto_site/reference/findGiniMarkers_one_vs_all.html
- Parameters:
ct_select (Literal['auto'] | ~typing.List[str]) –
ct_key (str) –
split_name (str | None) –
channel (str | None) –
channel_type (str) –
detection_threshold (float) –
- class dance.transforms.CellPCA(n_components=400, *, channel=None, mod=None, **kwargs)[source]
Reduce cell feature matrix with PCA.
- Parameters:
n_components (int) – Number of PCA components to use.
channel (str | None) –
mod (str | None) –
- class dance.transforms.CellSVD(n_components=400, *, channel=None, mod=None, **kwargs)[source]
Reduce cell feature matrix with SVD.
- Parameters:
n_components (int) – Number of SVD components to take.
channel (str | None) –
mod (str | None) –
- class dance.transforms.CellTopicProfile(*, ct_select='auto', ct_key='cellType', batch_key=None, split_name=None, channel=None, channel_type='X', method='median', **kwargs)[source]
Cell topic profile.
- Parameters:
ct_select (Literal['auto'] | ~typing.List[str]) –
ct_key (str) –
batch_key (str | None) –
split_name (str | None) –
channel (str | None) –
channel_type (str) –
method (Literal['median', 'mean']) –
- class dance.transforms.CellwiseMaskData(distr='exp', mask_rate=0.1, seed=None, min_gene_counts=5, **kwargs)[source]
Randomly mask data in a cell-wise approach.
For every cell that has more than 5 positive counts, mask positive counts according to the masking rate and the probability generated from the distribution.
- Parameters:
distr (Optional[Literal['exp', 'uniform']]) – Distribution to generate masks.
mask_rate (Optional[float]) – Masking rate.
seed (Optional[int]) – Random seed.
min_gene_counts (int) – Minimum number of genes expressed within a cell, below which we do not mask that cell.
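A minimal sketch of cell-wise masking, under stated assumptions: positions are drawn uniformly among a cell's positive counts (the 'exp' weighting is not reproduced), cells with fewer than `min_gene_counts` positive entries are skipped, and the function name `cellwise_mask` is hypothetical.

```python
import random
from typing import List


def cellwise_mask(counts: List[List[int]], mask_rate: float = 0.1,
                  min_gene_counts: int = 5, seed: int = 0) -> List[List[bool]]:
    """Return a boolean mask; True marks a positive count selected for masking."""
    rng = random.Random(seed)
    mask = [[False] * len(row) for row in counts]
    for i, row in enumerate(counts):
        positive = [j for j, value in enumerate(row) if value > 0]
        if len(positive) < min_gene_counts:
            continue  # too few expressed genes: leave this cell untouched
        n_mask = int(len(positive) * mask_rate)
        for j in rng.sample(positive, n_mask):  # uniform choice (assumption)
            mask[i][j] = True
    return mask


counts = [
    [3, 0, 1, 2, 5, 4, 1, 2, 7, 1, 2, 0],  # 10 positive counts
    [0, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 0],  # 2 positive counts, skipped
]
mask = cellwise_mask(counts, mask_rate=0.2, min_gene_counts=5, seed=0)
```

Only the first cell receives masked entries, and every masked entry sits on a positive count.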
- class dance.transforms.Compose(*transforms, use_master_log_level=True, **kwargs)[source]
Compose transformation by combining several transformation objects.
- Parameters:
transforms (Tuple[BaseTransform, ...]) – Transformation objects.
use_master_log_level (bool) – If set to True, then reset all transforms’ loggers to use the log_level option passed to this Compose object.
Notes
The order in which the transform objects are passed will be exactly the order in which they will be applied to the data object.
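The ordering rule in the note can be illustrated with plain functions. This is a sketch of the composition pattern only: `compose`, `add_one`, and `double` are hypothetical, and they return new lists rather than modifying a data object in place.

```python
from typing import Callable, List

Transform = Callable[[List[float]], List[float]]


def compose(*transforms: Transform) -> Transform:
    """Chain transforms so that they run in the order they were passed."""

    def apply(values: List[float]) -> List[float]:
        for transform in transforms:
            values = transform(values)
        return values

    return apply


def add_one(values: List[float]) -> List[float]:
    return [value + 1 for value in values]


def double(values: List[float]) -> List[float]:
    return [value * 2 for value in values]


add_then_double = compose(add_one, double)([1, 2])
double_then_add = compose(double, add_one)([1, 2])
```

Swapping the argument order changes the result, which is why the order of the transforms matters.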
- class dance.transforms.FilterCellsScanpy(min_counts=None, min_genes=None, max_counts=None, max_genes=None, split_name=None, channel=None, channel_type='X', **kwargs)[source]
Scanpy filtering cell transformation with additional options.
Allows passing gene counts as a ratio.
- Parameters:
min_counts (Optional[int]) – Minimum number of counts required for a cell to be kept.
min_genes (Union[float, int, None]) – Minimum number (or ratio) of genes required for a cell to be kept.
max_counts (Optional[int]) – Maximum number of counts required for a cell to be kept.
max_genes (Union[float, int, None]) – Maximum number (or ratio) of genes required for a cell to be kept.
split_name (Optional[str]) – Which split to be used for filtering.
channel (Optional[str]) – Channel to be used for filtering.
channel_type (Optional[str]) – Channel type to be used for filtering.
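One way to read "number (or ratio)" is sketched below. It assumes that a float strictly between 0 and 1 is a fraction of the total number of genes and that anything else is an absolute count; this interpretation and the helper names `resolve_threshold` and `keep_cells` are assumptions for illustration, not taken from the library.

```python
from typing import List, Optional, Union

Threshold = Union[float, int, None]


def resolve_threshold(value: Threshold, total: int) -> Optional[int]:
    """Convert a ratio in (0, 1) to an absolute count; pass counts through."""
    if value is None:
        return None
    if isinstance(value, float) and 0 < value < 1:
        return int(value * total)  # ratio of the total (assumption)
    return int(value)


def keep_cells(counts: List[List[int]], min_genes: Threshold = None,
               max_genes: Threshold = None) -> List[int]:
    """Return indices of cells whose number of expressed genes is within bounds."""
    n_genes = len(counts[0])
    low = resolve_threshold(min_genes, n_genes)
    high = resolve_threshold(max_genes, n_genes)
    kept = []
    for index, row in enumerate(counts):
        expressed = sum(1 for value in row if value > 0)
        if low is not None and expressed < low:
            continue
        if high is not None and expressed > high:
            continue
        kept.append(index)
    return kept


counts = [[1, 0, 0, 0], [1, 2, 0, 0], [1, 2, 3, 4]]
kept_by_ratio = keep_cells(counts, min_genes=0.5)  # 0.5 * 4 genes -> at least 2
kept_by_both = keep_cells(counts, min_genes=0.5, max_genes=3)
```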
- class dance.transforms.FilterGenesCommon(batch_key=None, split_keys=None, **kwargs)[source]
Filter genes by taking the common genes across batches or splits.
- Parameters:
batch_key (Optional[str]) – Which column in the .obs table to be used to distinguish batches.
split_keys (Optional[List[str]]) – A list of split names, e.g., ‘train’, to be used to find common genes.
Note
One and only one of batch_key or split_keys can be specified.
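Taking the genes common to all batches or splits amounts to a set intersection. The sketch below assumes each group is already summarized as the list of genes it expresses; how that list is derived is not specified here, and `common_genes` is a hypothetical helper.

```python
from typing import Dict, List


def common_genes(expressed_by_group: Dict[str, List[str]]) -> List[str]:
    """Return the sorted genes that appear in every group (batch or split)."""
    gene_sets = [set(genes) for genes in expressed_by_group.values()]
    if not gene_sets:
        return []
    return sorted(set.intersection(*gene_sets))


shared = common_genes({
    "batch1": ["A", "B", "C"],
    "batch2": ["B", "C", "D"],
    "batch3": ["C", "B"],
})
```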
- class dance.transforms.FilterGenesMarker(*, ct_profile_channel='CellTopicProfile', subset=True, label=None, threshold=1.25, eps=1e-06, **kwargs)[source]
Select marker genes based on log fold-change.
- Parameters:
ct_profile_channel (str) – Name of the .varm channel that contains the cell-topic profile which will be used to compute the log fold-changes for each cell-topic (e.g., cell type).
subset (bool) – If set to True, then inplace subset the variables to only contain the markers.
label (Optional[str]) – If set, e.g., to 'marker', then save the marker indicator to the obs column named as marker.
threshold (float) – Threshold value of the log fold-change above which the gene will be considered as a marker.
eps (float) – A small value that prevents taking log of zeros.
- class dance.transforms.FilterGenesMarkerGini(*, ct_profile_channel='CellGiottoTopicProfile', ct_profile_detection_channel='CellGiottoDetectionTopicProfile', subset=True, label=None, **kwargs)[source]
Select marker genes based on Gini coefficient.
Identify marker genes for all clusters in a one-vs-all manner based on Gini coefficients, a measure of inequality.
- Parameters:
ct_profile_channel (str) – Name of the .varm channel that contains the cell-topic profile which will be used to compute the log fold-changes for each cell-topic (e.g., cell type).
ct_profile_detection_channel (str) – Name of the .varm channel that contains the cell-topic detection profile (the number of values greater than some threshold), which will be used to compute the log fold-changes for each cell-topic (e.g., cell type).
subset (bool) – If set to True, then inplace subset the variables to only contain the markers.
label (Optional[str]) – If set, e.g., to 'marker', then save the marker indicator to the obs column named as marker.
Reference
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-1010-4?ref=https://githubhelp.com
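For intuition about the inequality measure, the sketch below computes the standard Gini coefficient of one gene's profile across cell types. It shows only the coefficient itself; the full marker selection procedure involves further steps that are not reproduced, and the function name `gini` and the example profiles are made up.

```python
from typing import Sequence


def gini(values: Sequence[float]) -> float:
    """Gini coefficient of non-negative values (0 = perfectly equal)."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        return 0.0
    # Closed form on sorted data: G = 2 * sum(i * x_i) / (n * sum(x)) - (n + 1) / n
    weighted = sum((rank + 1) * value for rank, value in enumerate(ordered))
    return 2 * weighted / (n * total) - (n + 1) / n


marker_like = gini([0, 0, 0, 9])    # expressed in a single cell type
housekeeping = gini([5, 5, 5, 5])   # expressed equally everywhere
```

A gene concentrated in one cell type scores high, while a uniformly expressed gene scores zero.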
- class dance.transforms.FilterGenesMatch(prefixes=None, suffixes=None, case_sensitive=False, **kwargs)[source]
Filter genes based on prefixes and suffixes.
- Parameters:
prefixes (Optional[List[str]]) – List of prefixes to remove.
suffixes (Optional[List[str]]) – List of suffixes to remove.
case_sensitive (bool) –
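A minimal sketch of prefix and suffix matching on gene names, assuming that case-insensitive matching is done by upper-casing both the names and the patterns. The helper `remove_matching` and the gene names are illustrative only.

```python
from typing import List, Optional


def remove_matching(genes: List[str], prefixes: Optional[List[str]] = None,
                    suffixes: Optional[List[str]] = None,
                    case_sensitive: bool = False) -> List[str]:
    """Drop genes whose names start with a prefix or end with a suffix."""
    prefix_tuple = tuple(prefixes or ())
    suffix_tuple = tuple(suffixes or ())
    if not case_sensitive:
        prefix_tuple = tuple(p.upper() for p in prefix_tuple)
        suffix_tuple = tuple(s.upper() for s in suffix_tuple)
    kept = []
    for gene in genes:
        name = gene if case_sensitive else gene.upper()
        # str.startswith / str.endswith accept a tuple of candidates.
        if name.startswith(prefix_tuple) or name.endswith(suffix_tuple):
            continue
        kept.append(gene)
    return kept


genes = ["MT-CO1", "mt-Nd1", "ACTB", "GAPDH-AS"]
insensitive = remove_matching(genes, prefixes=["MT-"], suffixes=["-AS"])
sensitive = remove_matching(genes, prefixes=["MT-"], suffixes=["-AS"],
                            case_sensitive=True)
```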
- class dance.transforms.FilterGenesPercentile(min_val=1, max_val=99, *, mode='sum', channel=None, channel_type=None, whitelist_indicators=None, **kwargs)[source]
Filter genes based on percentiles of the summarized gene expressions.
- Parameters:
min_val (Optional[float]) – Minimum percentile of the summarized expression value below which the genes will be discarded.
max_val (Optional[float]) – Maximum percentile of the summarized expression value above which the genes will be discarded.
mode (Literal['sum', 'cv', 'rv', 'var']) – Summarization mode. Available options are [sum|var|cv|rv]. sum calculates the sum of expression values, var calculates the variance of the expression values, cv uses the coefficient of variation (std / mean), and rv uses the relative variance (var / mean).
channel (Optional[str]) – Which channel, more specifically, layers, to use. Use the default .X if not set. If channel is specified, then channel_type needs to be specified as layers as well.
channel_type (Optional[str]) – Type of channels specified. Only allows None (the default setting) or layers (when channel is specified).
whitelist_indicators (Union[List[str], str, None]) – A list of (or a single) var columns that indicate the genes to be excluded from the filtering process. Note that these genes will still be used in the summary stats computation, and thus will still contribute to the threshold percentile. If not set, then no genes will be excluded from the filtering process.
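The summarization modes and the whitelist behavior can be sketched in plain Python. The sketch assumes population variance and linearly interpolated percentiles; neither choice is confirmed by the text above, and the helper names and the toy expression values are hypothetical.

```python
import statistics
from typing import Collection, Dict, Iterable, List, Sequence


def summarize(values: Sequence[float], mode: str) -> float:
    """Summarize one gene's expression values across cells."""
    if mode == "sum":
        return float(sum(values))
    mean = statistics.fmean(values)
    var = statistics.pvariance(values)  # population variance (assumption)
    if mode == "var":
        return var
    if mode == "cv":
        return var ** 0.5 / mean  # std / mean; undefined when the mean is 0
    if mode == "rv":
        return var / mean  # var / mean; undefined when the mean is 0
    raise ValueError(f"Unknown mode: {mode!r}")


def percentile(values: Iterable[float], q: float) -> float:
    """Percentile with linear interpolation between neighboring ranks."""
    ordered = sorted(values)
    position = (len(ordered) - 1) * q / 100
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] * (1 - fraction) + ordered[upper] * fraction


def filter_by_percentile(expression: Dict[str, Sequence[float]], min_val: float = 1,
                         max_val: float = 99, mode: str = "sum",
                         whitelist: Collection[str] = ()) -> List[str]:
    """Keep genes whose summary lies within the percentile bounds."""
    scores = {gene: summarize(values, mode) for gene, values in expression.items()}
    # Whitelisted genes still contribute to the thresholds ...
    low = percentile(scores.values(), min_val)
    high = percentile(scores.values(), max_val)
    # ... but they are never discarded.
    return [gene for gene, score in scores.items()
            if gene in whitelist or low <= score <= high]


expression = {
    "g1": [0, 0, 1], "g2": [1, 1, 1], "g3": [2, 2, 1],
    "g4": [3, 2, 2], "g5": [30, 40, 30],
}
kept = filter_by_percentile(expression, min_val=10, max_val=90)
kept_with_whitelist = filter_by_percentile(expression, min_val=10, max_val=90,
                                           whitelist={"g5"})
```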
- class dance.transforms.FilterGenesRegression(method, num_genes=400, *, channel=None, mod=None, skip_count_check=False, **kwargs)[source]
Select genes based on regression.
- Parameters:
method (str) – Which regression-based gene selection method to use. Supported options are: "enclasc", "seurat3", and "scmap".
num_genes (int) – Number of genes to select.
channel (str | None) –
mod (str | None) –
skip_count_check (bool) –
Note
The implementation is adapted from the EnClaSC GitHub repo: https://github.com/xy-chen16/EnClaSC
https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-03679-z
- class dance.transforms.FilterGenesScanpy(min_counts=None, min_cells=None, max_counts=None, max_cells=None, split_name=None, channel=None, channel_type='X', **kwargs)[source]
Scanpy filtering gene transformation with additional options.
- Parameters:
min_counts (Optional[int]) – Minimum number of counts required for a gene to be kept.
min_cells (Union[float, int, None]) – Minimum number (or ratio) of cells required for a gene to be kept.
max_counts (Optional[int]) – Maximum number of counts required for a gene to be kept.
max_cells (Union[float, int, None]) – Maximum number (or ratio) of cells required for a gene to be kept.
split_name (Optional[str]) – Which split to be used for filtering.
channel (Optional[str]) – Channel to be used for filtering.
channel_type (Optional[str]) – Channel type to be used for filtering.
- class dance.transforms.FilterGenesTopK(num_genes, top=True, *, mode='cv', channel=None, channel_type='X', whitelist_indicators=None, **kwargs)[source]
Select top/bottom genes based on the summarized gene expressions.
- Parameters:
num_genes (int) – Number of genes to be selected.
top (bool) – If set to True, then use the genes with highest values of the specified gene summary stats.
mode (Literal['sum', 'cv', 'rv', 'var']) – Summarization mode. Available options are [sum|var|cv|rv]. sum calculates the sum of expression values, var calculates the variance of the expression values, cv uses the coefficient of variation (std / mean), and rv uses the relative variance (var / mean).
channel (Optional[str]) – Which channel, more specifically, layers, to use. Use the default .X if not set. If channel is specified, then channel_type needs to be specified as layers as well.
channel_type (Optional[str]) – Type of channels specified. Only allows None (the default setting) or layers (when channel is specified).
whitelist_indicators (Union[List[str], str, None]) – A list of (or a single) var columns that indicate the genes to be excluded from the filtering process. Note that these genes will still be used in the summary stats computation, and thus will still contribute to the threshold percentile. If not set, then no genes will be excluded from the filtering process.
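Selecting the top or bottom genes by a summary statistic reduces to a sort. The sketch below uses the coefficient of variation with population standard deviation, which is an assumption; `coefficient_of_variation` and `select_top_k` are hypothetical helpers and the expression values are made up.

```python
import statistics
from typing import Dict, List, Sequence


def coefficient_of_variation(values: Sequence[float]) -> float:
    """std / mean, using the population standard deviation (assumption)."""
    return statistics.pstdev(values) / statistics.fmean(values)


def select_top_k(scores: Dict[str, float], num_genes: int, top: bool = True) -> List[str]:
    """Return the genes with the highest (top=True) or lowest scores."""
    ranked = sorted(scores, key=scores.get, reverse=top)
    return ranked[:num_genes]


expression = {"a": [1, 1, 1], "b": [0, 3, 0], "c": [1, 2, 3]}
scores = {gene: coefficient_of_variation(values) for gene, values in expression.items()}

highest = select_top_k(scores, num_genes=2, top=True)
lowest = select_top_k(scores, num_genes=1, top=False)
```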
- class dance.transforms.FilterScanpy(min_counts=None, min_genes_or_cells=None, max_counts=None, max_genes_or_cells=None, split_name=None, channel=None, channel_type='X', **kwargs)[source]
Scanpy filtering transformation with additional options.
- Parameters:
min_counts (int | None) –
min_genes_or_cells (float | int | None) –
max_counts (int | None) –
max_genes_or_cells (float | int | None) –
split_name (str | None) –
channel (str | None) –
channel_type (str | None) –
- class dance.transforms.GeneHoldout(n_top=5, batch_size=512, random_state=None, **kwargs)[source]
Progressively hold out genes for DeepImpute.
Split genes into target batches. For every target gene in one batch, refer to the genes that are not in this batch and select predictor genes with high covariance with the target gene.
- Parameters:
n_top (int) – Number of predictor genes per target gene.
batch_size (int) – Target batch size.
random_state (Optional[int]) – Random state.
- class dance.transforms.GeneStats(genestats_select='all', *, fill_na=None, threshold=0, pseudo=False, split_name='train', channel=None, channel_type=None, **kwargs)[source]
Gene statistics computation.
- Parameters:
genestats_select (Union[str, List[str]]) – List of names of the gene stats functions to use. If set to "all" (by default), then use all available gene stats functions.
fill_na (Optional[float]) – If not set (default), then do not fill nans. Otherwise, fill nans with the specified value.
threshold (float) – Threshold value for filtering gene expression when computing stats, e.g., mean expression values.
pseudo (bool) – If set to True, then add 1 to the numerator and denominator when computing the ratio (alpha) for which the gene expression values are above the specified threshold.
split_name (Optional[str]) – Which split to compute the gene stats on.
channel (str | None) –
channel_type (str | None) –
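The effect of the pseudo option on the ratio alpha can be sketched as follows. The strict comparison `value > threshold` and the function name `alpha_ratio` are assumptions; only the "add 1 to the numerator and denominator" rule comes from the parameter description.

```python
from typing import Sequence


def alpha_ratio(values: Sequence[float], threshold: float = 0.0,
                pseudo: bool = False) -> float:
    """Fraction of expression values above the threshold for one gene."""
    above = sum(1 for value in values if value > threshold)
    total = len(values)
    if pseudo:
        # Pseudocount: add 1 to both the numerator and the denominator.
        above += 1
        total += 1
    return above / total


plain = alpha_ratio([0, 0, 2, 5])                  # 2 of 4 values are above 0
smoothed = alpha_ratio([0, 0, 2, 5], pseudo=True)  # (2 + 1) / (4 + 1)
```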
- class dance.transforms.MaskData(mask_rate=0.1, seed=None, **kwargs)[source]
Randomly mask data.
Randomly mask positive counts according to masking rate.
- Parameters:
mask_rate (Optional[float]) – Masking rate.
seed (Optional[int]) – Random seed.
- class dance.transforms.MorphologyFeatureCNN(*, model_name='resnet50', n_components=50, random_state=0, crop_size=20, target_size=299, device='cpu', channels=('spatial_pixel', 'image'), channel_types=('obsm', 'uns'), **kwargs)[source]
Cell morphological features extracted from CNN.
- Parameters:
model_name (str) – Pretrained CNN name: "resnet50", "inceptron_v3", "xception", "vgg16".
n_components (int) – Number of feature dimensions.
crop_size (int) – Cell image cropping size (cropped as a square centered around the target cell).
target_size (int) – Target patch size.
random_state (int) –
device (str) –
channels (Sequence[str]) –
channel_types (Sequence[str]) –
Reference
https://doi.org/10.1101/2020.05.31.125658
- class dance.transforms.PseudoMixture(*, n_pseudo=1000, nc_min=2, nc_max=10, ct_select='auto', ct_key='cellType', channel=None, channel_type='X', random_state=0, prefix='ps_mix_', in_split_name='ref', out_split_name='pseudo', label_batch=False, **kwargs)[source]
Pseudo mixture generation.
- Parameters:
n_pseudo (int) –
nc_min (int) –
nc_max (int) –
ct_select (Literal['auto'] | ~typing.List[str]) –
ct_key (str) –
channel (str | None) –
channel_type (str | None) –
random_state (int | None) –
prefix (str) –
in_split_name (str) –
out_split_name (str | None) –
label_batch (bool) –
- class dance.transforms.RemoveSplit(*, split_name, **kwargs)[source]
Remove a particular split from the data.
- Parameters:
split_name (str) –
- class dance.transforms.SCNFeature(num_top_genes=10, alpha1=0.05, alpha2=0.001, mu=2, num_top_gene_pairs=25, max_gene_per_ct=3, *, split_name='train', channel=None, channel_type=None, **kwargs)[source]
Differential gene-pair feature used in SingleCellNet.
- Parameters:
num_top_genes (int) –
alpha1 (float) –
alpha2 (float) –
mu (float) –
num_top_gene_pairs (int) –
max_gene_per_ct (int) –
split_name (str | None) –
channel (str | None) –
channel_type (str | None) –
- class dance.transforms.SMEFeature(n_neighbors=3, n_components=50, random_state=0, *, channels=(None, 'SMEGraph'), channel_types=(None, 'obsp'), **kwargs)[source]
Spatial Morphological gene Expression normalization feature from stLearn.
- Parameters:
n_neighbors (int) – Number of spatial spot neighbors to consider.
n_components (int) – Number of feature dimensions.
random_state (int) –
channels (Sequence[str | None]) –
channel_types (Sequence[str | None]) –
Reference
https://doi.org/10.1101/2020.05.31.125658
- class dance.transforms.SaveRaw(exist_ok=False, **kwargs)[source]
Save raw data.
See anndata.AnnData.raw() for more information.
- Parameters:
exist_ok (bool) – If set to False, then raise an exception if the raw attribute is already set.
- class dance.transforms.ScTransform(split_names=None, batch_key=None, min_cells=5, gmean_eps=1, n_genes=2000, n_cells=None, bin_size=500, bw_adjust=3, **kwargs)[source]
ScTransform normalization and variance stabilization.
Note
This is a Python implementation adapted from https://github.com/atarashansky/SCTransformPy
- Parameters:
split_names (Union[Literal['ALL'], List[str], None]) – Which split(s) to apply the transformation to.
batch_key (Optional[str]) – Key for batch information.
min_cells (int) – Minimum number of cells in which a gene must be expressed, below which that gene will be discarded.
gmean_eps (int) – Pseudocount.
n_genes (Optional[int]) – Maximum number of genes to use. Use all if set to None.
n_cells (Optional[int]) – Maximum number of cells to use. Use all if set to None.
bin_size (int) – Number of genes a single bin contains.
bw_adjust (float) – Bandwidth adjusting parameter.
Reference
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1874-1
- class dance.transforms.ScaleFeature(*, axis=0, split_names=None, batch_key=None, mode='normalize', eps=-1, **kwargs)[source]
Scale the feature matrix in the AnnData object.
This is an extension of scanpy.pp.scale(), allowing split- or batch-wide scaling.
- Parameters:
axis (int) – Axis along which the scaling is performed.
split_names (Union[Literal['ALL'], List[str], None]) – Indicate which splits to perform the scaling independently. If set to ‘ALL’, then go through all splits available in the data.
batch_key (Optional[str]) – Indicate which column in .obs to use as the batch index to guide the batch-wide scaling.
mode (Literal['normalize', 'standardize', 'minmax', 'l2']) – Scaling mode, see dance.utils.matrix.normalize() for more information.
eps (float) – Correction factor, see dance.utils.matrix.normalize() for more information.
Note
The order of checking split- or batch-wide scaling mode is: batch_key > split_names > None (i.e., all).
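The precedence in the note (batch_key > split_names > None) can be sketched as a choice of index groups that are then scaled independently. The helper names are hypothetical, and the min-max formula used here is a generic one, not necessarily what dance.utils.matrix.normalize() computes for 'minmax'.

```python
from typing import Dict, Hashable, List, Optional, Sequence


def scaling_groups(n_cells: int, batch_labels: Optional[Sequence[Hashable]] = None,
                   splits: Optional[Dict[str, List[int]]] = None) -> List[List[int]]:
    """Pick the cell groups to scale independently: batch > split > all."""
    if batch_labels is not None:
        groups: Dict[Hashable, List[int]] = {}
        for index, label in enumerate(batch_labels):
            groups.setdefault(label, []).append(index)
        return list(groups.values())
    if splits is not None:
        return [list(indices) for indices in splits.values()]
    return [list(range(n_cells))]


def minmax_within_groups(values: Sequence[float], groups: List[List[int]]) -> List[float]:
    """Generic min-max scaling applied separately inside each group."""
    scaled = [float(value) for value in values]
    for group in groups:
        low = min(values[i] for i in group)
        span = max(values[i] for i in group) - low
        for i in group:
            scaled[i] = (values[i] - low) / span if span else 0.0
    return scaled


values = [1, 3, 10, 30]
splits = {"train": [0, 1, 2], "test": [3]}
by_batch = scaling_groups(4, batch_labels=["a", "a", "b", "b"], splits=splits)
by_split = scaling_groups(4, splits=splits)
everything = scaling_groups(4)
scaled = minmax_within_groups(values, by_batch)
```

When both batch labels and splits are supplied, the batch labels determine the groups.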
- class dance.transforms.SetConfig(config_dict, **kwargs)[source]
Set configuration options of a dance data object.
- Parameters:
config_dict (Dict[str, Any]) – Dance data object configuration dictionary. See set_config_from_dict().
- class dance.transforms.SpatialIDEFeature(channels=(None, 'spatial'), channel_types=(None, 'obsm'), **kwargs)[source]
Spatial IDE feature.
The SpatialDE model is based on the assumption of normally distributed residual noise and independent observations across cells. There are two normalization steps:
Variance-stabilizing transformation for negative-binomial-distributed data (Anscombe’s transformation).
Regress log total count values out from the Anscombe-transformed expression values.
Reference
https://www.nature.com/articles/nmeth.4636#Sec2
- regress_out(sample_info, expression_matrix, covariate_formula, design_formula='1', rcond=-1)[source]
Implementation of limma’s removeBatchEffect function.
- stabilize(expression_matrix)[source]
Use Anscombe’s approximation to variance-stabilize negative-binomial data.
See https://f1000research.com/posters/4-1041 for motivation.
Assumes columns are samples and rows are genes.
- Parameters:
channels (Sequence[str | None]) –
channel_types (Sequence[str | None]) –
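The variance-stabilizing step of stabilize can be sketched in plain Python under two assumptions that the text above does not spell out: the negative-binomial mean-variance model var = mu + phi * mu^2, with phi fitted by least squares across genes, and the transform log(x + 1 / (2 * phi)). Population variance is also an assumption, and the toy matrix is made up.

```python
import math
import statistics
from typing import List, Tuple


def stabilize_sketch(expression: List[List[float]]) -> Tuple[List[List[float]], float]:
    """Anscombe-style transform; rows are genes, columns are samples."""
    means = [statistics.fmean(row) for row in expression]
    variances = [statistics.pvariance(row) for row in expression]
    # Least squares for var = mu + phi * mu**2 is linear in phi:
    # phi = sum((var - mu) * mu**2) / sum(mu**4)
    numerator = sum((var - mu) * mu ** 2 for mu, var in zip(means, variances))
    denominator = sum(mu ** 4 for mu in means)
    phi = numerator / denominator
    shift = 1.0 / (2.0 * phi)  # requires phi > 0, i.e. overdispersed data
    transformed = [[math.log(value + shift) for value in row] for row in expression]
    return transformed, phi


expression = [
    [0.0, 4.0, 8.0],   # gene 1 across three samples
    [1.0, 1.0, 10.0],  # gene 2 across three samples
]
stabilized, phi_hat = stabilize_sketch(expression)
```

Because the fit is linear in phi, the closed form above replaces an iterative curve fit for this sketch.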
- class dance.transforms.WeightedFeaturePCA(n_components=400, split_name=None, feat_norm_mode=None, feat_norm_axis=0, **kwargs)[source]
Compute the weighted gene PCA as cell features.
Given a gene expression matrix of dimension (cell x gene), the gene PCA is first computed. Then, the representation of each cell is computed by taking the weighted sum of the gene PCs based on that cell’s gene expression values.
- Parameters:
n_components (int) – Number of PCs to use.
split_name (Optional[str]) – Which split to use to compute the gene PCA. If not set, use all data.
feat_norm_mode (str | None) –
feat_norm_axis (int) –
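The weighted-sum step can be sketched as a matrix product between the expression matrix and the per-gene components. The gene components below are made-up numbers standing in for an actual PCA result, and `weighted_cell_features` is a hypothetical helper; the PCA itself and any feature normalization are not shown.

```python
from typing import List


def weighted_cell_features(expression: List[List[float]],
                           gene_components: List[List[float]]) -> List[List[float]]:
    """Each cell's feature is the expression-weighted sum of gene components.

    expression: (cell x gene); gene_components: (gene x component).
    """
    n_components = len(gene_components[0])
    features = []
    for cell in expression:
        row = [
            sum(weight * gene_components[g][k] for g, weight in enumerate(cell))
            for k in range(n_components)
        ]
        features.append(row)
    return features


expression = [[1, 0, 2], [0, 3, 0]]  # 2 cells x 3 genes
gene_components = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # hypothetical gene PCs
cell_features = weighted_cell_features(expression, gene_components)
```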