singler package

Submodules

singler.aggregate_reference module

singler.aggregate_reference.aggregate_reference(ref_data, ref_labels, ref_features, num_centers=None, power=0.5, num_top=1000, rank=20, assay_type='logcounts', subset_row=None, check_missing=True, num_threads=1)

Aggregate reference samples for a given label by using vector quantization to average their count profiles. The idea is to reduce the size of single-cell reference datasets so as to reduce the computation time of train_single(). We perform k-means clustering for all cells in each label and aggregate all cells within each k-means cluster. (More specifically, the clustering is done on the principal components generated from the highly variable genes to better capture the structure within each label.) This yields one or more profiles per label, reducing the number of separate observations while preserving some level of intra-label heterogeneity.

Parameters:
  • ref_data (Any) – Floating-point matrix of reference expression values, usually containing log-expression values. Alternatively, a SummarizedExperiment object containing such a matrix.

  • ref_labels (Sequence) – Array of length equal to the number of columns in ref_data, containing the labels for each cell.

  • ref_features (Sequence) – Sequence of identifiers for each feature, i.e., row in ref_data.

  • num_centers (Optional[int]) – Maximum number of aggregated profiles to produce for each label with cluster_kmeans(). If None, a suitable number of profiles is automatically chosen.

  • power (float) – Number between 0 and 1 indicating how much aggregation should be performed. Specifically, we set the number of clusters to X**power where X is the number of cells assigned to that label. Ignored if num_centers is not None.

  • num_top (int) – Number of highly variable genes to use for PCA prior to clustering, see choose_highly_variable_genes().

  • rank (int) – Number of principal components to use during clustering, see run_pca().

  • assay_type (Union[int, str]) – Integer or string specifying the assay of ref_data containing the relevant expression matrix, if ref_data is a SummarizedExperiment object.

  • subset_row (Optional[Sequence]) – Array of row indices specifying the rows of ref_data to use for clustering. If None, no additional filtering is performed. Note that even if subset_row is provided, aggregation is still performed on all genes.

  • check_missing (bool) – Whether to check for and remove rows with missing (NaN) values from ref_data.

  • num_threads (int) – Number of threads to use.

Return type:

SummarizedExperiment

Returns:

A SummarizedExperiment containing the aggregated values in its first assay. The label for each aggregated profile is stored in the column data.
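As a rough illustration of how power and num_centers interact, the sketch below estimates the number of aggregated profiles per label from the cell counts alone. The helper name centers_per_label is hypothetical, and both the rounding rule and the capping at the number of cells are assumptions; the documentation only states that the number of clusters is X**power.

```python
def centers_per_label(cell_counts, power=0.5, num_centers=None):
    """Estimate the number of aggregated profiles for each label (illustrative only)."""
    result = {}
    for label, n_cells in cell_counts.items():
        if num_centers is not None:
            # An explicit num_centers takes precedence over power.
            # Capping at the number of cells is an assumption of this sketch.
            k = min(num_centers, n_cells)
        else:
            # X**power, where X is the number of cells with this label.
            # Rounding to the nearest integer is an assumption of this sketch.
            k = max(1, round(n_cells ** power))
        result[label] = k
    return result


counts = {"B cell": 400, "T cell": 2500, "NK cell": 9}
print(centers_per_label(counts))                  # {'B cell': 20, 'T cell': 50, 'NK cell': 3}
print(centers_per_label(counts, num_centers=10))  # {'B cell': 10, 'T cell': 10, 'NK cell': 9}
```

With the default power of 0.5, a label with 2500 cells would be summarized by about 50 profiles, which is where the reduction in downstream computation comes from.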

singler.annotate_integrated module

singler.annotate_integrated.annotate_integrated(test_data, ref_data, ref_labels, test_features=None, ref_features=None, test_assay_type=0, ref_assay_type='logcounts', test_check_missing=False, ref_check_missing=True, train_single_args={}, classify_single_args={}, train_integrated_args={}, classify_integrated_args={}, num_threads=1)

Annotate a single-cell expression dataset based on the correlation of each cell to profiles in multiple labelled references, where the annotation from each reference is then integrated across references.

Parameters:
  • test_data (Any) –

    A matrix-like object representing the test dataset, where rows are features and columns are samples (usually cells). Entries should be expression values; only the ranking within each column is used.

    Alternatively, a SummarizedExperiment containing such a matrix in one of its assays.

  • ref_data (Sequence) –

    Sequence consisting of one or more of the following:

    • A matrix-like object representing the reference dataset, where rows are features and columns are samples. Entries should be expression values, usually log-transformed (see comments for the ref_data argument in train_single()).

    • A SummarizedExperiment object containing such a matrix in its assays.

  • ref_labels (list[Sequence]) – Sequence of the same length as ref_data. The i-th entry should be a sequence of length equal to the number of columns of ref_data[i], containing the label associated with each column.

  • test_features (Optional[Sequence]) – Sequence of length equal to the number of rows in test_data, containing the feature identifier for each row. Alternatively None, to use the row names of the experiment as features.

  • ref_features (Optional[list[Optional[Sequence]]]) –

    Sequence of the same length as ref_data. The i-th entry should be a sequence of length equal to the number of rows of ref_data[i], containing the feature identifier associated with each row. It can also be set to None to use the row names of the experiment as features.

    This can also be None to indicate that the row names should be used for all references, assuming ref_data only contains SummarizedExperiment objects.

  • test_assay_type (Union[str, int]) – Assay of test_data containing the expression matrix, if test_data is a SummarizedExperiment.

  • test_check_missing (bool) – Whether to check for and remove missing (i.e., NaN) values from the test dataset.

  • ref_assay_type (Union[str, int]) – Assay containing the expression matrix for any entry of ref_data that is a SummarizedExperiment.

  • ref_check_missing (bool) – Whether to check for and remove missing (i.e., NaN) values from the reference datasets.

  • train_single_args (dict) – Further arguments to pass to train_single().

  • classify_single_args (dict) – Further arguments to pass to classify_single().

  • train_integrated_args (dict) – Further arguments to pass to train_integrated().

  • classify_integrated_args (dict) – Further arguments to pass to classify_integrated().

  • num_threads (int) – Number of threads to use for the various steps.

Return type:

Tuple[list[BiocFrame], BiocFrame]

Returns:

Tuple where the first element contains per-reference results (i.e., a list of BiocFrame outputs, roughly equivalent to running annotate_single() on each reference) and the second element contains integrated results across references (i.e., a BiocFrame from classify_integrated()).
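The parallel-list layout of ref_data, ref_labels and ref_features is easy to get wrong, so a small pre-flight check can help. The sketch below is not part of singler; check_integrated_inputs is a hypothetical helper that works on (rows, columns) shape tuples rather than real matrices, purely to show which lengths are expected to agree.

```python
def check_integrated_inputs(ref_shapes, ref_labels, ref_features=None):
    """Check the parallel-list layout described above.

    ref_shapes is a list of (n_rows, n_cols) tuples, one per reference;
    it stands in for the matrices in ref_data (an assumption of this sketch).
    """
    n_refs = len(ref_shapes)
    if len(ref_labels) != n_refs:
        raise ValueError("ref_labels must have one entry per reference")
    if ref_features is None:
        # None means: use row names for every reference.
        ref_features = [None] * n_refs
    elif len(ref_features) != n_refs:
        raise ValueError("ref_features must have one entry per reference")

    for i, ((n_rows, n_cols), labels, features) in enumerate(
        zip(ref_shapes, ref_labels, ref_features)
    ):
        if len(labels) != n_cols:
            raise ValueError(f"reference {i}: expected {n_cols} labels, got {len(labels)}")
        # An individual entry may also be None to fall back to row names.
        if features is not None and len(features) != n_rows:
            raise ValueError(f"reference {i}: expected {n_rows} features, got {len(features)}")
    return True


shapes = [(3, 4), (3, 2)]
labels = [["B", "T", "B", "T"], ["NK", "B"]]
features = [["g1", "g2", "g3"], None]
print(check_integrated_inputs(shapes, labels, features))  # True
```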

singler.annotate_single module

singler.annotate_single.annotate_single(test_data, ref_data, ref_labels, test_features=None, ref_features=None, test_assay_type=0, ref_assay_type=0, test_check_missing=False, ref_check_missing=True, train_args={}, classify_args={}, num_threads=1)

Annotate a single-cell expression dataset based on the correlation of each cell to profiles in a labelled reference.

Parameters:
  • test_data (Any) –

    A matrix-like object representing the test dataset, where rows are features and columns are samples (usually cells). Entries should be expression values; only the ranking within each column will be used.

    Alternatively, a SummarizedExperiment containing such a matrix in one of its assays. A non-default assay can be specified with test_assay_type.

  • ref_data (Any) –

    A matrix-like object representing the reference dataset, where rows are features and columns are samples. Entries should be expression values, usually log-transformed (see comments for the ref_data argument in train_single()).

    Alternatively, a SummarizedExperiment containing such a matrix in one of its assays. A non-default assay can be specified with ref_assay_type.

  • ref_labels (Sequence) – Sequence of length equal to the number of columns of ref_data, containing the label associated with each column.

  • test_features (Optional[Sequence]) – Sequence of length equal to the number of rows in test_data, containing the feature identifier for each row. Alternatively None, to use the row names of the experiment as features.

  • ref_features (Optional[Sequence]) – Sequence of length equal to the number of rows of ref_data, containing the feature identifier for each row. Alternatively None, to use the row names of the experiment as features.

  • test_assay_type (Union[str, int]) – Assay containing the expression matrix, if test_data is a SummarizedExperiment.

  • ref_assay_type (Union[str, int]) – Assay containing the expression matrix, if ref_data is a SummarizedExperiment.

  • test_check_missing (bool) – Whether to check for and remove rows with missing (NaN) values from the test dataset.

  • ref_check_missing (bool) – Whether to check for and remove rows with missing (NaN) values from the reference dataset.

  • train_args (dict) – Further arguments to pass to train_single().

  • classify_args (dict) – Further arguments to pass to classify_single().

  • num_threads (int) – Number of threads to use for the various steps.

Return type:

BiocFrame

Returns:

A BiocFrame of labelling results, see classify_single() for details.
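The test_data description says that only the ranking within each column is used. The following pure-Python sketch shows why this makes the choice of normalization less critical for the test dataset: a strictly increasing transformation such as log1p leaves the within-column ranks unchanged. The helper column_ranks is illustrative and not part of singler.

```python
import math


def column_ranks(values):
    """Return the rank of each value within one column (0 = smallest)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order):
        ranks[index] = rank
    return ranks


# Raw counts for one cell across five features.
counts = [0, 12, 3, 150, 7]
# The same cell after a log(1 + x) transformation.
logged = [math.log1p(x) for x in counts]

print(column_ranks(counts))                          # [0, 3, 1, 4, 2]
print(column_ranks(counts) == column_ranks(logged))  # True
```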

singler.classify_integrated module

singler.classify_integrated.classify_integrated(test_data, results, integrated_prebuilt, assay_type=0, quantile=0.8, use_fine_tune=True, fine_tune_threshold=0.05, num_threads=1)

Integrate classification results across multiple references for a single test dataset.

Parameters:
  • test_data (Any) –

    A matrix-like object where each row is a feature and each column is a test sample (usually a single cell), containing expression values. Normalized and/or transformed expression values are also acceptable as only the ranking is used within this function.

    Alternatively, a SummarizedExperiment containing such a matrix in one of its assays.

  • results (list[BiocFrame]) – List of classification results generated by running classify_single() on test_data with each reference. References should be in the same order as that used to construct integrated_prebuilt.

  • integrated_prebuilt (TrainedIntegratedReferences) – Integrated reference object, constructed with train_integrated().

  • assay_type (Union[str, int]) – Assay containing the expression matrix, if test_data is a SummarizedExperiment.

  • quantile (float) – Quantile of the correlation distribution for computing the score for each label. Larger values increase sensitivity of matches at the expense of similarity to the average behavior of each label.

  • use_fine_tune (bool) – Whether fine-tuning should be performed. This improves accuracy for distinguishing between references with similar best labels but requires more computational work.

  • fine_tune_threshold (float) – Maximum difference from the maximum correlation to use in fine-tuning. All references whose scores lie within this difference of the maximum are used for another round of fine-tuning.

  • num_threads (int) – Number of threads to use during classification.

Return type:

BiocFrame

Returns:

A BiocFrame containing the best_label across all references, defined as the assigned label in the best reference; the identity of the best_reference, either as a name string or an integer index; the scores for the best label in each reference, as a nested BiocFrame; and the delta from the best to the second-best reference. Each row corresponds to a column of test_data.
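To make the relationship between best_label, best_reference and delta concrete, the sketch below reproduces that bookkeeping for a single cell. It is a simplified illustration that ignores fine-tuning; the function integrate_one_cell and the reference names are hypothetical, and the scores are made up for the example.

```python
def integrate_one_cell(scores_by_reference, labels_by_reference):
    """Pick the best reference for one cell from per-reference scores.

    scores_by_reference: {reference name: score of that reference's assigned label}
    labels_by_reference: {reference name: label assigned by that reference}
    """
    if len(scores_by_reference) < 2:
        raise ValueError("at least two references are needed to compute a delta")
    ranked = sorted(scores_by_reference.items(), key=lambda kv: kv[1], reverse=True)
    best_reference, best_score = ranked[0]
    second_score = ranked[1][1]
    return {
        # The best label is the label assigned in the best reference.
        "best_label": labels_by_reference[best_reference],
        "best_reference": best_reference,
        # Difference between the best and second-best reference scores.
        "delta": best_score - second_score,
    }


scores = {"atlas_a": 0.62, "atlas_b": 0.71}
labels = {"atlas_a": "T cell", "atlas_b": "CD4 T cell"}
result = integrate_one_cell(scores, labels)
print(result["best_reference"], result["best_label"])  # atlas_b CD4 T cell
```

A small delta suggests that two references explain the cell almost equally well, which is the situation fine-tuning is meant to resolve.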

singler.classify_single module

singler.classify_single.classify_single(test_data, ref_prebuilt, assay_type=0, quantile=0.8, use_fine_tune=True, fine_tune_threshold=0.05, num_threads=1)

Classify a test dataset against a reference by assigning labels from the latter to each column of the former using the SingleR algorithm.

Parameters:
  • test_data (Any) –

    A matrix-like object where each row is a feature and each column is a test sample (usually a single cell), containing expression values. Normalized and/or transformed expression values are also acceptable as only the ranking is used within this function.

    Alternatively, a SummarizedExperiment containing such a matrix in one of its assays.

  • ref_prebuilt (TrainedSingleReference) – A pre-built reference created with train_single().

  • assay_type (Union[str, int]) – Assay containing the expression matrix, if test_data is a SummarizedExperiment.

  • quantile (float) – Quantile of the correlation distribution for computing the score for each label. Larger values increase sensitivity of matches at the expense of similarity to the average behavior of each label.

  • use_fine_tune (bool) – Whether fine-tuning should be performed. This improves accuracy for distinguishing between similar labels but requires more computational work.

  • fine_tune_threshold (float) – Maximum difference from the maximum correlation to use in fine-tuning. All labels whose scores lie within this difference of the maximum are used for another round of fine-tuning.

  • num_threads (int) – Number of threads to use during classification.

Return type:

BiocFrame

Returns:

A BiocFrame containing the best label, the scores for each label as a nested BiocFrame, and the delta from the best to the second-best label. Each row corresponds to a column of test_data. The metadata contains markers, a list of the markers from each pairwise comparison between labels; and used, a list containing the union of markers from all comparisons.
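The quantile and fine_tune_threshold parameters can be illustrated with a toy calculation for one test cell. The sketch assumes that each label's score is a quantile of the correlations between the cell and that label's reference samples, and that labels within the threshold of the top score go forward to fine-tuning. The linear-interpolation quantile and the helper names are assumptions of this sketch, not a description of the library's internals.

```python
import math


def quantile(values, q):
    """Quantile with linear interpolation between order statistics."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower, upper = math.floor(position), math.ceil(position)
    fraction = position - lower
    return ordered[lower] * (1 - fraction) + ordered[upper] * fraction


def score_labels(correlations, q=0.8, fine_tune_threshold=0.05):
    """Score each label and list the labels that would enter fine-tuning."""
    scores = {label: quantile(values, q) for label, values in correlations.items()}
    top = max(scores.values())
    candidates = sorted(
        label for label, score in scores.items() if score >= top - fine_tune_threshold
    )
    return scores, candidates


# Made-up correlations between one test cell and the samples of each label.
correlations = {
    "B cell": [0.10, 0.20, 0.30, 0.40, 0.50],
    "T cell": [0.20, 0.30, 0.40, 0.45, 0.50],
    "NK cell": [0.00, 0.10, 0.10, 0.20, 0.20],
}
scores, candidates = score_labels(correlations)
print(max(scores, key=scores.get))  # T cell
print(candidates)                   # ['B cell', 'T cell']
```

Here the B cell and T cell scores (about 0.42 and 0.46) are within 0.05 of each other, so both would be retained for another round, while NK cell is dropped.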

singler.get_classic_markers module

singler.get_classic_markers.get_classic_markers(ref_data, ref_labels, ref_features, assay_type='logcounts', check_missing=True, num_de=None, num_threads=1)

Compute markers from a reference using the classic SingleR algorithm. This is typically done for reference datasets derived from replicated bulk transcriptomic experiments.

Parameters:
  • ref_data (Union[Any, list[Any]]) –

    A matrix-like object containing the log-normalized expression values of a reference dataset. Each column is a sample and each row is a feature.

    Alternatively, this can be a SummarizedExperiment containing a matrix-like object in one of its assays.

    Alternatively, a list of such matrices or SummarizedExperiment objects, typically for multiple batches of the same reference; it is assumed that different batches exhibit at least some overlap in their ref_features and ref_labels.

  • ref_labels (Union[Sequence, list[Sequence]]) –

    If ref_data is not a list, ref_labels should be a sequence of length equal to the number of columns of ref_data, containing a label (usually a string) for each column.

    If ref_data is a list, ref_labels should also be a list of the same length. Each entry should be a sequence of length equal to the number of columns of the corresponding entry of ref_data.

  • ref_features (Union[Sequence, list[Sequence]]) –

    If ref_data is not a list, ref_features should be a sequence of length equal to the number of rows of ref_data, containing the feature name (usually a string) for each row.

    If ref_data is a list, ref_features should also be a list of the same length. Each entry should be a sequence of length equal to the number of rows of the corresponding entry of ref_data.

  • assay_type (Union[str, int]) – Name or index of the assay of interest, if ref_data is or contains SummarizedExperiment objects.

  • check_missing (bool) – Whether to check for and remove rows with missing (NaN) values in the reference matrices. This can be set to False if it is known that no NaN values exist.

  • num_de (Optional[int]) – Number of differentially expressed genes to use as markers for each pairwise comparison between labels. If None, an appropriate number of genes is automatically determined.

  • num_threads (int) – Number of threads to use for the calculations.

Return type:

dict[Any, dict[Any, list]]

Returns:

A dictionary of dictionaries of lists containing the markers for each pairwise comparison between labels, i.e., markers[a][b] contains the upregulated markers for label a over label b.
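The nested markers[a][b] layout can be traversed with two loops. The sketch below collects the union of markers across all pairwise comparisons, optionally keeping only the first few entries of each list. The helper union_of_markers and the toy gene lists are illustrative; whether same-label entries such as markers[a][a] are present in real output is not assumed either way, so the code simply skips them.

```python
def union_of_markers(markers, top=None):
    """Collect unique markers over all pairwise comparisons, preserving first-seen order."""
    used = []
    seen = set()
    for label_a, inner in markers.items():
        for label_b, genes in inner.items():
            if label_a == label_b:
                continue  # a label has no markers against itself
            selected = genes if top is None else genes[:top]
            for gene in selected:
                if gene not in seen:
                    seen.add(gene)
                    used.append(gene)
    return used


# Toy marker lists in the markers[a][b] layout described above.
markers = {
    "B cell": {"B cell": [], "T cell": ["CD79A", "MS4A1"]},
    "T cell": {"B cell": ["CD3E", "CD3D"], "T cell": []},
}
print(union_of_markers(markers))         # ['CD79A', 'MS4A1', 'CD3E', 'CD3D']
print(union_of_markers(markers, top=1))  # ['CD79A', 'CD3E']
```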

singler.get_classic_markers.number_of_classic_markers(num_labels)

Compute the number of markers to detect for a given number of labels, using the classic SingleR marker detection algorithm.

Parameters:

num_labels (int) – Number of labels.

Returns:

Number of markers.

Return type:

int

singler.train_integrated module

class singler.train_integrated.TrainedIntegratedReferences(ptr, ref_labels)

Bases: object

Object containing integrated references, typically constructed by train_integrated().

property reference_labels: list

List of lists containing the names of the labels for each reference.

Each entry corresponds to a reference in reference_names, if reference_names is not None.

singler.train_integrated.train_integrated(test_features, ref_prebuilt, warn_lost=True, num_threads=1)

Build a set of integrated references for classification of a test dataset.

Parameters:
  • test_features (Sequence) – Sequence of features for the test dataset.

  • ref_prebuilt (list[TrainedSingleReference]) – List of prebuilt references, typically created by calling train_single().

  • warn_lost (bool) – Whether to emit a warning if the markers for each reference are not all present in all references.

  • num_threads (int) – Number of threads.

Return type:

TrainedIntegratedReferences

Returns:

Integrated references for classification with classify_integrated().
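The situation that warn_lost guards against can be pictured with plain sets: a marker chosen for one reference is "lost" if another reference does not measure that feature. The sketch below is one possible reading of that check, written without singler; lost_markers is a hypothetical helper, and the exact rule applied by train_integrated() may differ (for example, it may also take test_features into account).

```python
def lost_markers(reference_markers, reference_features):
    """Report, per reference, the markers that are absent from some other reference.

    reference_markers:  {reference name: iterable of marker identifiers}
    reference_features: {reference name: iterable of all feature identifiers}
    """
    lost = {}
    for name, markers in reference_markers.items():
        missing = set()
        for other, features in reference_features.items():
            if other == name:
                continue
            missing |= set(markers) - set(features)
        if missing:
            lost[name] = sorted(missing)
    return lost


reference_markers = {
    "atlas_a": ["CD3E", "CD79A"],
    "atlas_b": ["CD3E", "NKG7"],
}
reference_features = {
    "atlas_a": ["CD3E", "CD79A", "MS4A1"],
    "atlas_b": ["CD3E", "NKG7", "CD79A"],
}
print(lost_markers(reference_markers, reference_features))  # {'atlas_b': ['NKG7']}
```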

singler.train_single module

class singler.train_single.TrainedSingleReference(ptr, full_data, full_label_codes, labels, features, markers)

Bases: object

A prebuilt reference object, typically created by train_single(). This is intended for advanced users only and should not be serialized.

property features: list

The universe of features known to this reference.

property labels: Sequence

Unique labels in this reference.

marker_subset(indices_only=False)
Parameters:

indices_only (bool) – Whether to return the markers as indices into features, or as a list of feature identifiers.

Return type:

Union[ndarray, list]

Returns:

If indices_only = False, a list of feature identifiers for the markers.

If indices_only = True, a NumPy array containing the integer indices of features in features that were chosen as markers.

property markers: dict[Any, dict[Any, list]]

Markers for every pairwise comparison between labels.

num_labels()
Return type:

int

Returns:

Number of unique labels in this reference.

num_markers()
Return type:

int

Returns:

Number of markers to be used for classification. This is the same as the size of the array from marker_subset().

singler.train_single.train_single(ref_data, ref_labels, ref_features, test_features=None, assay_type='logcounts', restrict_to=None, check_missing=True, markers=None, marker_method='classic', num_de=None, marker_args={}, aggregate=False, aggregate_args={}, nn_parameters=<knncolle.vptree.VptreeParameters object>, num_threads=1)

Build a single reference dataset in preparation for classification.

Parameters:
  • ref_data (Any) –

    A matrix-like object where rows are features, columns are reference profiles, and each entry is the expression value. If markers is not provided, expression should be normalized and log-transformed in preparation for marker prioritization via differential expression analyses. Otherwise, any expression values are acceptable as only the ranking within each column is used.

    Alternatively, a SummarizedExperiment containing such a matrix in one of its assays.

  • ref_labels (Sequence) – Sequence of labels for each reference profile, i.e., column in ref_data.

  • ref_features (Sequence) – Sequence of identifiers for each feature, i.e., row in ref_data.

  • test_features (Optional[Sequence]) – Sequence of identifiers for each feature in the test dataset.

  • assay_type (Union[str, int]) – Assay containing the expression matrix, if ref_data is a SummarizedExperiment.

  • check_missing (bool) – Whether to check for and remove rows with missing (NaN) values from ref_data.

  • restrict_to (Union[set, dict, None]) – Subset of available features to restrict to. Only features in restrict_to will be used in the reference building. If None, no restriction is performed.

  • markers (Optional[dict[Any, dict[Any, Sequence]]]) – Upregulated markers for each pairwise comparison between labels. Specifically, markers[a][b] should be a sequence of features that are upregulated in a compared to b. All such features should be present in ref_features, and all labels in ref_labels should have keys in the inner and outer dictionaries.

  • marker_method (Literal['classic', 'auc', 'cohens_d']) – Method to identify markers from each pairwise comparison between labels in ref_data. If classic, we call get_classic_markers(). If auc or cohens_d, we call score_markers(). Only used if markers is not supplied.

  • num_de (Optional[int]) – Number of differentially expressed genes to use as markers for each pairwise comparison between labels. If None and marker_method = "classic", an appropriate number of genes is determined by get_classic_markers(). Otherwise, it is set to 10. Only used if markers is not supplied.

  • marker_args (dict) – Further arguments to pass to the chosen marker detection method. If marker_method = "classic", this is get_classic_markers(), otherwise it is score_markers(). Only used if markers is not supplied.

  • aggregate (bool) – Whether the reference dataset should be aggregated to pseudo-bulk samples for speed, see aggregate_reference() for details.

  • aggregate_args (dict) – Further arguments to pass to aggregate_reference() when aggregate = True.

  • nn_parameters (Optional[Parameters]) – Algorithm for constructing the neighbor search index, used to compute scores during classification.

  • num_threads (int) – Number of threads to use for reference building.

Return type:

TrainedSingleReference

Returns:

The pre-built reference, ready for use in downstream methods like classify_single().
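When supplying a custom markers dictionary, it may be worth checking it against the requirements stated for the markers parameter before training. The sketch below is a standalone check, not part of singler; check_markers is a hypothetical helper, and requiring an inner key for every label (including the label itself) is this sketch's reading of "keys in the inner and outer dictionaries".

```python
def check_markers(markers, ref_labels, ref_features):
    """Return a list of problems found in a custom markers dictionary (empty if none)."""
    labels = sorted(set(ref_labels))
    known = set(ref_features)
    problems = []
    for a in labels:
        if a not in markers:
            problems.append(f"no outer key for label {a!r}")
            continue
        for b in labels:
            if b not in markers[a]:
                problems.append(f"markers[{a!r}] has no inner key for label {b!r}")
                continue
            unknown = [gene for gene in markers[a][b] if gene not in known]
            if unknown:
                problems.append(
                    f"markers[{a!r}][{b!r}] has features not in ref_features: {unknown}"
                )
    return problems


ref_labels = ["B cell", "T cell", "B cell"]
ref_features = ["CD79A", "CD3E", "MS4A1"]

good = {
    "B cell": {"B cell": [], "T cell": ["CD79A"]},
    "T cell": {"B cell": ["CD3E"], "T cell": []},
}
bad = {"B cell": {"B cell": [], "T cell": ["CD19"]}}

print(check_markers(good, ref_labels, ref_features))  # []
for problem in check_markers(bad, ref_labels, ref_features):
    print(problem)
```

A dictionary that passes this check could then be passed as markers to train_single(), in which case marker_method, num_de and marker_args are not used.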