dance.data

class dance.data.base.BaseData(data, train_size=None, val_size=0, test_size=-1, split_index_range_dict=None, full_split_name=None)[source]

Base data object.

The dance data object is a wrapper of the AnnData object, with several utility methods that help retrieve data from specific splits in specific formats (see get_split_idx() and get_feature()). The AnnData object is saved in the attribute data and can be accessed directly.

Warning

Since the underlying data object is a reference to the input AnnData object, be careful *NOT* to initialize two different dance data objects from the same AnnData object! If you are unsure, we recommend always initializing the dance data object with a copy of the input AnnData object, e.g.,

>>> adata = anndata.AnnData(...)
>>> ddata = dance.data.Data(adata.copy())
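The risk described in the warning can be illustrated without dance itself. The sketch below is not the library's code: TinyWrapper is a hypothetical stand-in for a dance data object, a SimpleNamespace with a uns dictionary stands in for AnnData, copy.deepcopy stands in for adata.copy(), and label_channel is an illustrative config key. It only shows how two wrappers around the same object can silently overwrite each other's state.

```python
import copy
from types import SimpleNamespace


class TinyWrapper:
    """Hypothetical stand-in for a dance data object; keeps a reference, not a copy."""

    def __init__(self, data):
        self.data = data  # same object as the caller's, no copy is made

    def set_config(self, **kwargs):
        # Mirrors the documented storage location: .uns["dance_config"]
        self.data.uns.setdefault("dance_config", {}).update(kwargs)

    @property
    def config(self):
        return self.data.uns.get("dance_config", {})


# Stand-in for an AnnData object: only the .uns dictionary matters here.
adata = SimpleNamespace(uns={})

first = TinyWrapper(adata)
second = TinyWrapper(adata)  # shares the same underlying object
first.set_config(label_channel="cell_type")
second.set_config(label_channel="batch")
shared_result = first.config["label_channel"]  # changed by `second`

# Wrapping a copy keeps the new wrapper independent of the others.
safe = TinyWrapper(copy.deepcopy(adata))
safe.set_config(label_channel="tissue")
independent_result = first.config["label_channel"]  # unaffected by `safe`
```

Here shared_result no longer holds the value that first set, while changes made through safe do not leak back into first.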

Note

You can directly access some main properties of AnnData (or MuData, depending on which type of data you passed in), such as X, obs, and var.

Parameters:
  • data (Union[AnnData, MuData]) – Cell data.

  • train_size (Optional[int]) – Number of cells to be used for training. If not specified, no splits will be generated.

  • val_size (int) – Number of cells to be used for validation. If set to -1, use what’s left from training and testing.

  • test_size (int) – Number of cells to be used for testing. If set to -1, use what’s left from training and validation.

  • split_index_range_dict (Dict[str, Tuple[int, int]] | None) –

  • full_split_name (str | None) –
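The meaning of -1 in the size arguments can be sketched as simple arithmetic. The function below is a hypothetical helper, not part of dance; it assumes that at most one size is -1 and that the sizes must not exceed the number of cells, neither of which the documentation above states.

```python
def resolve_split_sizes(n_cells, train_size=None, val_size=0, test_size=-1):
    """Illustrative only: turn the size arguments into concrete cell counts."""
    if train_size is None:
        return None  # documented behaviour: no splits are generated

    sizes = {"train": train_size, "val": val_size, "test": test_size}
    unknown = [name for name, size in sizes.items() if size == -1]
    if len(unknown) > 1:
        # Assumption of this sketch: only one size can be inferred.
        raise ValueError("At most one size can be -1")

    known_total = sum(size for size in sizes.values() if size != -1)
    if known_total > n_cells:
        raise ValueError("Split sizes exceed the number of cells")

    if unknown:
        # -1 means "use whatever is left after the other splits".
        sizes[unknown[0]] = n_cells - known_total
    return sizes


sizes = resolve_split_sizes(1000, train_size=700, val_size=100, test_size=-1)
# sizes == {"train": 700, "val": 100, "test": 200}
```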

append(data, *, mode='merge', rename_dict=None, new_split_name=None, label_batch=False, **concat_kwargs)[source]

Append another dance data object to the current data object.

Parameters:
  • data – New dance data object to be added.

  • mode (Optional[Literal['merge', 'rename', 'new_split']]) – How to combine the splits from the new data and the current data. (1) "merge": merge the splits from both data objects, e.g., the training indexes from both are used as the training indexes in the new combined data. (2) "rename": rename the splits of the new data and add them to the current split index dictionary, e.g., renaming ‘train’ to ‘ref’. Requires passing the rename_dict. Raises an error if the newly renamed key is already used in the current split index dictionary. (3) "new_split": assign the whole new data to a new split. Requires passing a new_split_name that is not already used as a split name in the current data. (4) None: do not assign split indices to the newly added data.

  • rename_dict (Optional[Dict[str, str]]) – Optional argument that is only used when mode="rename". A dictionary to map the split names in the new data to other names.

  • new_split_name (Optional[str]) – Optional argument that is only used when mode="new_split". Name of the split to assign to the new data.

  • label_batch (bool) – Add “batch” column to .obs when set to True.

  • **concat_kwargs – See anndata.concat().
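The four modes can be sketched on plain split index dictionaries. The helper below is hypothetical and is not dance's implementation. It assumes that appended cells are placed after the current ones (so their indices shift by the current cell count), that split names missing from rename_dict keep their name, and it picks KeyError and ValueError as error types, which the documentation does not specify.

```python
def combine_splits(current, new, n_current, n_new, mode="merge",
                   rename_dict=None, new_split_name=None):
    """Illustrative only: combine split index dictionaries as the modes describe."""
    combined = {name: list(idx) for name, idx in current.items()}
    # Assumption: new cells come after the current ones, so shift their indices.
    shifted = {name: [i + n_current for i in idx] for name, idx in new.items()}

    if mode is None:
        return combined  # the new cells get no split assignment
    if mode == "merge":
        for name, idx in shifted.items():
            combined.setdefault(name, []).extend(idx)
    elif mode == "rename":
        if rename_dict is None:
            raise ValueError("mode='rename' requires rename_dict")
        for name, idx in shifted.items():
            new_name = rename_dict.get(name, name)
            if new_name in combined:
                raise KeyError(f"Split name {new_name!r} is already used")
            combined[new_name] = idx
    elif mode == "new_split":
        if new_split_name is None or new_split_name in combined:
            raise ValueError("new_split_name must be set and unused")
        combined[new_split_name] = list(range(n_current, n_current + n_new))
    else:
        raise ValueError(f"Unknown mode: {mode!r}")
    return combined


current = {"train": [0, 1], "test": [2]}  # 3 cells in the current data
new = {"train": [0], "test": [1]}         # 2 cells in the appended data

merged = combine_splits(current, new, 3, 2, mode="merge")
renamed = combine_splits(current, new, 3, 2, mode="rename",
                         rename_dict={"train": "ref", "test": "query"})
extra = combine_splits(current, new, 3, 2, mode="new_split",
                       new_split_name="extra")
```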

property config: Dict[str, Any]

Return the dance data object configuration dict.

Notes

The configuration dictionary is saved in the data attribute, which is an AnnData object. In particular, the config will be saved in the .uns attribute with the key "dance_config".

get_feature(*, split_name=None, return_type='numpy', channel=None, channel_type='obsm', mod=None)[source]

Retrieve features from data.

Parameters:
  • split_name (Optional[str]) – Name of the split to retrieve. If not set, return all.

  • return_type (Literal['anndata', 'default', 'numpy', 'torch', 'sparse']) – How the features should be returned. sparse: return as a sparse matrix; numpy: return as a numpy array; torch: return as a torch tensor; anndata: return as an anndata object.

  • channel (Optional[str]) – Return a particular channel as features. If channel_type is X or raw_X, then return the .X or the .raw.X attribute from the AnnData directly. If channel_type is obs, return the column named by channel; similarly for var. Finally, if channel_type is obsm, obsp, varm, varp, layers, or uns, then return the value corresponding to the channel in the dictionary.

  • channel_type (Optional[str]) – Channel type to use; defaults to obsm (will be changed to X in the near future).

  • mod (Optional[str]) – Modality to use; defaults to None. Options other than None are only available when the underlying data object is MuData.
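The relationship between channel and channel_type can be sketched as a small dispatch. The function below is a hypothetical illustration, not the library's code: a SimpleNamespace holding plain dictionaries stands in for AnnData, the key names X_pca and cell_type are made up for the example, and the handling of channel=None for keyed channel types is not described above and is therefore not modeled.

```python
from types import SimpleNamespace

KEYED_TYPES = ("obsm", "obsp", "varm", "varp", "layers", "uns")


def select_channel(adata, channel=None, channel_type="obsm"):
    """Illustrative only: pick the object that get_feature() would read from."""
    if channel_type == "X":
        return adata.X  # channel is ignored
    if channel_type == "raw_X":
        return adata.raw.X  # channel is ignored
    if channel is None:
        # Not described in the documentation, so this sketch refuses to guess.
        raise ValueError("channel=None is not modeled for this channel_type")
    if channel_type in ("obs", "var"):
        return getattr(adata, channel_type)[channel]  # column named by channel
    if channel_type in KEYED_TYPES:
        return getattr(adata, channel_type)[channel]  # dictionary value
    raise ValueError(f"Unknown channel_type: {channel_type!r}")


# Minimal stand-in for an AnnData object with two cells.
adata = SimpleNamespace(
    X=[[1, 0], [0, 2]],
    raw=SimpleNamespace(X=[[5, 0], [0, 7]]),
    obs={"cell_type": ["B", "T"]},
    obsm={"X_pca": [[0.1], [0.2]]},
)

from_x = select_channel(adata, channel_type="X")
from_obs = select_channel(adata, "cell_type", channel_type="obs")
from_obsm = select_channel(adata, "X_pca")  # channel_type defaults to obsm
```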

get_split_data(split_name)[source]

Obtain the underlying data of a particular split.

Parameters:

split_name (str) – Name of the split to retrieve.

Return type:

Union[AnnData, MuData]

get_split_idx(split_name, error_on_miss=False)[source]

Obtain cell indices for a particular split.

Parameters:
  • split_name (str) – Name of the split to retrieve.

  • error_on_miss (bool) – If set to True, raise KeyError if the queried split does not exist; otherwise return None.

See also

get_split_mask()

get_split_mask(split_name, return_type='numpy')[source]

Obtain mask representation of a particular split.

Parameters:
  • split_name (str) – Name of the split to retrieve.

  • return_type (Literal['anndata', 'default', 'numpy', 'torch', 'sparse']) – Return numpy array if set to ‘numpy’, or torch Tensor if set to ‘torch’.

Return type:

Union[ndarray, Tensor]
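The index and mask representations of a split carry the same information. The sketch below shows the relationship with plain Python lists; lookup_split_idx and build_split_mask are hypothetical helpers rather than dance functions, and the library itself returns a numpy array or torch tensor for the mask, not a list.

```python
def lookup_split_idx(split_idx_dict, split_name, error_on_miss=False):
    """Illustrative only: mimic the documented error_on_miss behaviour."""
    if split_name in split_idx_dict:
        return split_idx_dict[split_name]
    if error_on_miss:
        raise KeyError(split_name)
    return None


def build_split_mask(split_idx_dict, split_name, n_cells):
    """Illustrative only: True at positions that belong to the split."""
    members = set(lookup_split_idx(split_idx_dict, split_name, error_on_miss=True))
    return [i in members for i in range(n_cells)]


splits = {"train": [0, 1, 2], "test": [4, 5]}  # 6 cells, cell 3 is unassigned

test_idx = lookup_split_idx(splits, "test")
missing = lookup_split_idx(splits, "val")       # None, since error_on_miss=False
test_mask = build_split_mask(splits, "test", n_cells=6)
# test_mask == [False, False, False, False, True, True]
```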

set_config(*, overwrite=False, **kwargs)[source]

Set dance data object configuration.

See set_config_from_dict().

Parameters:

overwrite (bool) –

set_config_from_dict(config_dict, *, overwrite=False)[source]

Set dance data object configuration from a config dict.

Parameters:
  • config_dict (Dict[str, Any]) – Configuration dictionary.

  • overwrite (bool) – Used to determine the behaviour of resolving config conflicts. In the case of a conflict, where the config dict passed contains a key with value that differs from an existing setting, if overwrite is set to False, then raise a KeyError. Otherwise, overwrite the configuration with the new values.
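The conflict rule for overwrite can be sketched on plain dictionaries. merge_config below is a hypothetical helper written for illustration and is not the library's implementation; label_channel and feature_channel are made-up config keys.

```python
def merge_config(existing, config_dict, overwrite=False):
    """Illustrative only: apply new settings using the documented conflict rule."""
    merged = dict(existing)
    for key, value in config_dict.items():
        conflict = key in merged and merged[key] != value
        if conflict and not overwrite:
            # Same key with a different value: refuse unless overwrite=True.
            raise KeyError(f"Config key {key!r} is already set to {merged[key]!r}")
        merged[key] = value
    return merged


existing = {"label_channel": "cell_type"}

# No conflict: an identical value and a new key are both accepted.
extended = merge_config(existing, {"label_channel": "cell_type",
                                   "feature_channel": "X_pca"})

# Conflict resolved in favour of the new value.
replaced = merge_config(existing, {"label_channel": "batch"}, overwrite=True)
```

Calling merge_config(existing, {"label_channel": "batch"}) without overwrite=True raises KeyError in this sketch, matching the behaviour described for overwrite=False.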

set_split_idx(split_name, split_idx)[source]

Set cell indices for a particular split.

Parameters:
  • split_name (str) – Name of the split to set.

  • split_idx (Sequence[int]) – Indices to be used in this split.

class dance.data.Data(data, train_size=None, val_size=0, test_size=-1, split_index_range_dict=None, full_split_name=None)[source]
Parameters:
  • data (AnnData | MuData) –

  • train_size (int | None) –

  • val_size (int) –

  • test_size (int) –

  • split_index_range_dict (Dict[str, Tuple[int, int]] | None) –

  • full_split_name (str | None) –

get_data(split_name=None, return_type='numpy', x_kwargs={}, y_kwargs={})[source]

Retrieve cell features and labels from a particular split.

Parameters:
  • split_name (Optional[str]) – Name of the split to retrieve. If not set, return all.

  • return_type (Literal['anndata', 'default', 'numpy', 'torch', 'sparse']) – How the features should be returned. numpy: return as a numpy array; torch: return as a torch tensor; anndata: return as an anndata object.

  • x_kwargs (Dict[str, Any]) –

  • y_kwargs (Dict[str, Any]) –

Return type:

Tuple[Any, Any]

get_test_data(return_type='numpy', x_kwargs={}, y_kwargs={})[source]

Retrieve cell features and labels from the ‘test’ split.

Return type:

Tuple[Any, Any]

Parameters:
  • return_type (Literal['anndata', 'default', 'numpy', 'torch', 'sparse']) –

  • x_kwargs (Dict[str, Any]) –

  • y_kwargs (Dict[str, Any]) –

get_train_data(return_type='numpy', x_kwargs={}, y_kwargs={})[source]

Retrieve cell features and labels from the ‘train’ split.

Return type:

Tuple[Any, Any]

Parameters:
  • return_type (Literal['anndata', 'default', 'numpy', 'torch', 'sparse']) –

  • x_kwargs (Dict[str, Any]) –

  • y_kwargs (Dict[str, Any]) –

get_val_data(return_type='numpy', x_kwargs={}, y_kwargs={})[source]

Retrieve cell features and labels from the ‘val’ split.

Return type:

Tuple[Any, Any]

Parameters:
  • return_type (Literal['anndata', 'default', 'numpy', 'torch', 'sparse']) –

  • x_kwargs (Dict[str, Any]) –

  • y_kwargs (Dict[str, Any]) –

get_x(split_name=None, return_type='numpy', **kwargs)[source]

Retrieve cell features from a particular split.

Return type:

Any

Parameters:
  • split_name (str | None) –

  • return_type (Literal['anndata', 'default', 'numpy', 'torch', 'sparse']) –

get_y(split_name=None, return_type='numpy', **kwargs)[source]

Retrieve cell labels from a particular split.

Return type:

Any

Parameters:
  • split_name (str | None) –

  • return_type (Literal['anndata', 'default', 'numpy', 'torch', 'sparse']) –