singler package

class singler.TrainedIntegratedReferences(ptr, ref_labels, ref_names)[source]

Bases: object

Integrated references, typically constructed by train_integrated(). This is intended for advanced users only and should not be serialized.

__annotate_func__ = None
__annotations_cache__ = {}
__firstlineno__ = 13
__static_attributes__ = ('_labels', '_names', '_ptr')
property reference_labels: list

List of lists containing the names of the labels for each reference.

property reference_names: Names | None

Names of the references, or None if they were unnamed.

class singler.TrainedSingleReference(ptr, full_data, full_label_codes, labels, features, markers)[source]

Bases: object

A prebuilt reference object, typically created by train_single(). This is intended for advanced users only and should not be serialized.

__annotate_func__ = None
__annotations_cache__ = {}
__firstlineno__ = 17
__static_attributes__ = ('_features', '_full_data', '_full_label_codes', '_labels', '_markers', '_ptr')
property features: list

The universe of features known to this reference.

property labels: Sequence

Unique labels in this reference.

marker_subset(indices_only=False)[source]
Parameters:

indices_only (bool) – Whether to return the markers as indices into features, or as a list of feature identifiers.

Return type:

ndarray | StringList

Returns:

If indices_only = False, a list of feature identifiers for the markers.

If indices_only = True, a NumPy array containing the integer indices of features in features that were chosen as markers.

property markers: dict[Any, dict[Any, StringList]]

Markers for every pairwise comparison between labels.

num_labels()[source]
Return type:

int

Returns:

Number of unique labels in this reference.

num_markers()[source]
Return type:

int

Returns:

Number of markers to be used for classification. This is the same as the size of the array from marker_subset().

singler.aggregate_reference(ref_data, ref_labels, ref_features, num_centers=None, power=0.5, num_top=1000, rank=20, assay_type='logcounts', subset_row=None, check_missing=True, num_threads=1)[source]

Aggregate reference samples for a given label by using vector quantization to average their count profiles. The idea is to reduce the size of single-cell reference datasets so as to reduce the computation time of train_single(). We perform k-means clustering for all cells in each label and aggregate all cells within each k-means cluster. (More specifically, the clustering is done on the principal components generated from the highly variable genes to better capture the structure within each label.) This yields one or more profiles per label, reducing the number of separate observations while preserving some level of intra-label heterogeneity.

Parameters:
  • ref_data (Any) – Floating-point matrix of reference expression values, usually containing log-expression values. Alternatively, a SummarizedExperiment object containing such a matrix.

  • ref_labels (Sequence) – Array of length equal to the number of columns in ref_data, containing the labels for each cell.

  • ref_features (Sequence) – Sequence of identifiers for each feature, i.e., row in ref_data.

  • num_centers (int | None) – Maximum number of aggregated profiles to produce for each label with cluster_kmeans(). If None, a suitable number of profiles is automatically chosen.

  • power (float) – Number between 0 and 1 indicating how much aggregation should be performed. Specifically, we set the number of clusters to X**power where X is the number of cells assigned to that label. Ignored if num_centers is not None.

  • num_top (int) – Number of highly variable genes to use for PCA prior to clustering, see choose_highly_variable_genes().

  • rank (int) – Number of principal components to use during clustering, see run_pca().

  • assay_type (int | str) – Integer or string specifying the assay of ref_data containing the relevant expression matrix, if ref is a SummarizedExperiment object.

  • subset_row (Sequence | None) – Array of row indices specifying the rows of ref_data to use for clustering. If None, no additional filtering is performed. Note that even if subset_row is provided, aggregation is still performed on all genes.

  • check_missing (bool) – Whether to check for and remove rows with missing (NaN) values from ref_data.

  • num_threads (int) – Number of threads to use.

Return type:

SummarizedExperiment

Returns:

A SummarizedExperiment containing the aggregated values in its first assay. The label for each aggregated profile is stored in the column data.

Examples

>>> # Mock up some log-expression data for a reference dataset.
>>> import singler
>>> ref = singler.mock_reference_data(num_replicates=50)
>>> labels = ref.get_column_data()["label"]
>>> import scranpy
>>> ref = scranpy.normalize_rna_counts_se(ref)
>>>
>>> # Aggregation at different resolutions:
>>> aggr0_5 = singler.aggregate_reference(ref, labels, ref.get_row_names(), power=0.5)
>>> print(aggr0_5)
>>> aggr0 = singler.aggregate_reference(ref, labels, ref.get_row_names(), power=0) # i.e., centroids only.
>>> print(aggr0)
>>> aggr1 = singler.aggregate_reference(ref, labels, ref.get_row_names(), power=1) # i.e., no aggregation.
>>> print(aggr1)
singler.annotate_integrated(test_data, ref_data, ref_labels, test_features=None, ref_features=None, test_assay_type=0, ref_assay_type='logcounts', test_check_missing=False, ref_check_missing=True, train_single_args={}, classify_single_args={}, train_integrated_args={}, classify_integrated_args={}, num_threads=1)[source]

Annotate a single-cell expression dataset based on the correlation of each cell to profiles in multiple labelled references. The results from each reference are then combined across references.

Parameters:
  • test_data (Any) –

    A matrix-like object representing the test dataset, where rows are features and columns are samples (usually cells). Entries may be expression values of any kind, only the ranking within each column is used.

    Alternatively, a SummarizedExperiment containing such a matrix in one of its assays.

  • ref_data (dict | Sequence | NamedList) –

    Sequence consisting of one or more of the following:

    • A matrix-like object representing the reference dataset, where rows are features and columns are samples. Entries should be expression values, usually log-transformed (see comments for the ref_data argument in train_single()).

    • A SummarizedExperiment object containing such a matrix in its assays.

    Alternatively, a dictionary where each value is a matrix-like object or SummarizedExperiment and each key is the name of the reference.

    Alternatively, a (possibly named) NamedList where each entry is a matrix-like object or SummarizedExperiment.

  • ref_labels (dict | Sequence | NamedList) –

    Sequence of the same length as ref_data. The i-th entry should be a sequence of length equal to the number of columns of ref_data[i], containing the label associated with each column.

    Alternatively, a dictionary where each value is a sequence of column labels and each key is the name of the reference.

    Alternatively, a NamedList where each entry is a sequence of column names. If the NamedList is named, the names should be the same as those of ref_data.

  • test_features (Sequence | None) – Sequence of length equal to the number of rows in test_data, containing the feature identifier for each row. Alternatively None, to use the row names of the experiment as features.

  • ref_features (dict | Sequence | NamedList | None) –

    Sequence of the same length as ref_data. The i-th entry should be a sequence of length equal to the number of rows of ref_data[i], containing the feature identifier associated with each row. Alternatively, the i-th entry may be set to None to use the row names of the experiment as features.

    Alternatively, a dictionary where each value is a sequence of feature names and each key is the name of the reference.

    Alternatively, a NamedList where each entry is a sequence of feature names. If the NamedList is named, the names should be the same as those of ref_data.

    If None, the row names are used for all references, assuming ref_data only contains SummarizedExperiment objects.

  • test_assay_type (str | int) – Assay of test_data containing the expression matrix, if test_data is a SummarizedExperiment.

  • test_check_missing (bool) – Whether to check for and remove missing (i.e., NaN) values from the test dataset.

  • ref_assay_type (str | int) – Assay containing the expression matrix for any entry of ref_data that is a SummarizedExperiment.

  • ref_check_missing (bool) – Whether to check for and remove missing (i.e., NaN) values from the reference datasets.

  • train_single_args (dict) – Further arguments to pass to train_single().

  • classify_single_args (dict) – Further arguments to pass to classify_single().

  • train_integrated_args (dict) – Further arguments to pass to train_integrated().

  • classify_integrated_args (dict) – Further arguments to pass to classify_integrated().

  • num_threads (int) – Number of threads to use for the various steps.

Return type:

NamedList

Returns:

A NamedList containing the following elements.

  • ”single”: a NamedList of BiocFrame objects, containing the per-reference classification results. Each entry is equivalent to running annotate_single() for test_data on the corresponding reference separately. If ref_data was named, the NamedList will also be named.

  • A BiocFrame from classify_integrated(), containing the integrated results across references.

Examples

>>> # Mocking up data.
>>> import singler
>>> ref = singler.mock_reference_data(num_replicates=8)
>>> ref1 = ref[:,[True, False] * int(ref.shape[1]/2)]
>>> ref2 = ref[:,[False, True] * int(ref.shape[1]/2)]
>>>
>>> cd2 = ref2.get_column_data()
>>> label2 = [l.lower() for l in cd2["label"]] # converting to lower-case for some variety.
>>> cd2.set_column("label", label2, in_place=True)
>>> ref2.set_column_data(cd2, in_place=True)
>>>
>>> import scranpy
>>> ref1 = scranpy.normalize_rna_counts_se(ref1)
>>> ref2 = scranpy.normalize_rna_counts_se(ref2)
>>>
>>> # Classifying within and across references.
>>> test = singler.mock_test_data(ref)
>>> full_res = singler.annotate_integrated(
>>>     test,
>>>     [ref1, ref2],
>>>     [ref1.get_column_data()["label"], ref2.get_column_data()["label"]]
>>> )
>>>
>>> print(full_res["single"][0]) # i.e., classification against ref1
>>> print(full_res["single"][1]) # i.e., classification against ref2
>>> print(full_res["integrated"]) # combined classification results
singler.annotate_single(test_data, ref_data, ref_labels, test_features=None, ref_features=None, test_assay_type=0, ref_assay_type=0, test_check_missing=False, ref_check_missing=True, train_args={}, classify_args={}, num_threads=1)[source]

Annotate a single-cell expression dataset based on the correlation of each cell to profiles in a labelled reference.

Parameters:
  • test_data (Any) –

    A matrix-like object representing the test dataset, where rows are features and columns are samples (usually cells). Entries can be expression values of any kind; only the ranking within each column will be used.

    Alternatively, a SummarizedExperiment containing such a matrix in one of its assays.

  • ref_data (Any) –

    A matrix-like object representing the reference dataset, where rows are features and columns are samples. Entries should be expression values, usually log-transformed (see comments for the ref argument in train_single()).

    Alternatively, a SummarizedExperiment containing such a matrix in one of its assays.

  • ref_labels (Sequence) – Sequence of length equal to the number of columns of ref_data, containing the label associated with each column.

  • test_features (Sequence | None) – Sequence of length equal to the number of rows in test_data, containing the feature identifier for each row. Alternatively None, to use the row names of the experiment as features.

  • ref_features (Sequence | None) – Sequence of length equal to the number of rows of ref_data, containing the feature identifier for each row. Alternatively None, to use the row names of the experiment as features.

  • test_assay_type (str | int) – Assay containing the expression matrix, if test_data is a SummarizedExperiment.

  • ref_assay_type (str | int) – Assay containing the expression matrix, if ref_data is a SummarizedExperiment.

  • test_assay_type – Whether to remove rows with missing values from the test dataset.

  • ref_assay_type – Whether to remove rows with missing values from the reference dataset.

  • train_args (dict) – Further arguments to pass to train_single().

  • classify_args (dict) – Further arguments to pass to classify_single().

  • num_threads (int) – Number of threads to use for the various steps.

Return type:

BiocFrame

Returns:

A BiocFrame of labelling results, see classify_single() for details.

Examples

>>> # Mocking up data with log-normalized expression values:
>>> import singler
>>> ref = singler.mock_reference_data()
>>> test = singler.mock_test_data(ref)
>>>
>>> import scranpy
>>> ref = scranpy.normalize_rna_counts_se(ref)
>>> test = scranpy.normalize_rna_counts_se(test)
>>>
>>> # Running the classification:
>>> pred = singler.annotate_single(test, ref, ref_labels=ref.get_column_data()["label"])
>>> print(pred)
>>> import collections
>>> print(collections.Counter(zip(pred["best"], test.get_column_data()["label"])))
singler.classify_integrated(test_data, results, integrated_prebuilt, assay_type=0, quantile=0.8, use_fine_tune=True, fine_tune_threshold=0.05, num_threads=1)[source]

Integrate classification results across multiple references for a single test dataset.

Parameters:
  • test_data (Any) –

    A matrix-like object where each row is a feature and each column is a test sample (usually a cell), containing expression values. Normalized and/or transformed expression values are also acceptable as only the ranking is used within this function.

    Alternatively, a SummarizedExperiment containing such a matrix in one of its assays.

  • results (dict | Sequence | NamedList) –

    Sequence of BiocFrame objects. Each BiocFrame should contain classification results generated by running classify_single() on test_data with one of the references. References should be in the same order as that used to construct integrated_prebuilt.

    Alternatively, a dictionary where each value is a BiocFrame and each key is the name of a reference. Names should be the same as those used to construct integrated_prebuilt.

    Alternatively, a NamedList where each entry is a BiocFrame. Names should be the same as those used to construct integrated_prebuilt.

  • integrated_prebuilt (TrainedIntegratedReferences) – Integrated reference object, constructed with train_integrated().

  • assay_type (str | int) – Assay containing the expression matrix, if test_data is a SummarizedExperiment.

  • quantile (float) – Quantile of the correlation distribution for computing the score for each label. Larger values increase sensitivity of matches at the expense of similarity to the average behavior of each label.

  • use_fine_tune (bool) – Whether fine-tuning should be performed. This improves accuracy for distinguishing between references with similar best labels but requires more computational work.

  • fine_tune_threshold (float) – Maximum difference from the maximum correlation to use in fine-tuning. All references above this threshold are used for another round of fine-tuning.

  • num_threads (int) – Number of threads to use during classification.

Return type:

BiocFrame

Returns:

A BiocFrame with one row per column in test_data. It contains the following columns:

  • best_label: list containing the assigned label in the best reference for each test sample.

  • best_reference: an integer NumPy array containing the index of the best reference for each test sample.

    Each index refers to a position in results or ref_prebuilt in train_integrated().

  • scores: a nested BiocFrame where each column corresponds to a reference. Each column is another nested BiocFrame and contains the scores across labels for its corresponding reference.

  • delta: a double-precision NumPy array containing the difference in scores between the best and second-best reference.

Examples

>>> # Mocking up data.
>>> import singler
>>> ref = singler.mock_reference_data(num_replicates=8)
>>> ref1 = ref[:,[True, False] * int(ref.shape[1]/2)]
>>> ref2 = ref[:,[False, True] * int(ref.shape[1]/2)]
>>>
>>> cd2 = ref2.get_column_data()
>>> label2 = [l.lower() for l in cd2["label"]] # converting to lower-case for some variety.
>>> cd2.set_column("label", label2, in_place=True)
>>> ref2.set_column_data(cd2, in_place=True)
>>>
>>> import scranpy
>>> ref1 = scranpy.normalize_rna_counts_se(ref1)
>>> ref2 = scranpy.normalize_rna_counts_se(ref2)
>>>
>>> # Performing classification within each reference.
>>> test = singler.mock_test_data(ref)
>>>
>>> built1 = singler.train_single(ref1, ref1.get_column_data()["label"], ref1.get_row_names())
>>> res1 = singler.classify_single(test, built1)
>>> built2 = singler.train_single(ref2, ref2.get_column_data()["label"], ref2.get_row_names())
>>> res2 = singler.classify_single(test, built2)
>>>
>>> # Combining results across references.
>>> in_built = singler.train_integrated(test.get_row_names(), [built1, built2])
>>> in_res = singler.classify_integrated(test, [res1, res2], in_built)
>>> print(in_res)
singler.classify_single(test_data, ref_prebuilt, assay_type=0, quantile=0.8, use_fine_tune=True, fine_tune_threshold=0.05, num_threads=1)[source]

Classify a test dataset against a reference by assigning labels from the latter to each column of the former using the SingleR algorithm.

Parameters:
  • test_data (Any) –

    A matrix-like object where each row is a feature and each column is a test sample (usually a single cell), containing expression values. Each entry may be an expression value of any kind; only the ranking within each column is used by this function.

    Alternatively, a SummarizedExperiment containing such a matrix in one of its assays.

  • ref_prebuilt (TrainedSingleReference) – A pre-built reference created with train_single().

  • assay_type (str | int) – Assay containing the expression matrix, if test_data is a SummarizedExperiment.

  • quantile (float) – Quantile of the correlation distribution for computing the score for each label. Larger values increase sensitivity of matches at the expense of similarity to the average behavior of each label.

  • use_fine_tune (bool) – Whether fine-tuning should be performed. This improves accuracy for distinguishing between similar labels but requires more computational work.

  • fine_tune_threshold (float) – Maximum difference from the maximum correlation to use in fine-tuning. All labels above this threshold are used for another round of fine-tuning.

  • num_threads (int) – Number of threads to use during classification.

Return type:

BiocFrame

Returns:

A BiocFrame with one row per column of test_data. This contains the following columns:

  • best: a list containing the assigned label for each sample in test_data.

  • scores: a nested BiocFrame where each column corresponds to a reference label and contains the score for that label across all test samples.

  • delta: a double-precision NumPy array containing the difference in scores between the best and second-best label.

The metadata contains markers, a list of the markers from each pairwise comparison between labels; and used, a list containing the union of markers from all comparisons.

Examples

>>> # Mocking up data.
>>> import singler
>>> ref = singler.mock_reference_data(num_replicates=8)
>>> test = singler.mock_test_data(ref)
>>>
>>> import scranpy
>>> ref = scranpy.normalize_rna_counts_se(ref)
>>>
>>> # Training and applying a classifier.
>>> built = singler.train_single(ref, ref.get_column_data()["label"], ref.get_row_names())
>>> res = singler.classify_single(test, built)
>>> print(res)
>>> import collections
>>> print(collections.Counter(zip(res["best"], test.get_column_data()["label"])))
singler.get_classic_markers(ref_data, ref_labels, ref_features, assay_type='logcounts', check_missing=True, num_de=None, num_threads=1)[source]

Compute markers from a reference using the classic SingleR algorithm. This is typically done for reference datasets derived from replicated bulk transcriptomic experiments.

Parameters:
  • ref_data (Any | list[Any]) –

    A matrix-like object containing the log-normalized expression values of a reference dataset. Each column is a sample and each row is a feature.

    Alternatively, this can be a SummarizedExperiment containing a matrix-like object in one of its assays.

    Alternatively, a list of such matrices or SummarizedExperiment objects, typically for multiple batches of the same reference. It is assumed that different batches exhibit at least some overlap in their ref_features and ref_labels.

  • ref_labels (Sequence | list[Sequence]) –

    If ref_data is not a list, ref_labels should be a sequence of length equal to the number of columns of ref_data. Each entry should be a label (usually a string) for each column of ref_data.

    If ref_data is a list, ref_labels should also be a list of the same length. Each entry should be a sequence of length equal to the number of columns of the corresponding entry of ref_data and should contain the column labels.

  • ref_features (Sequence | list[Sequence]) –

    If ref_data is not a list, ref_features should be a sequence of length equal to the number of rows of ref_data. Each entry should be a feature name (usually a string) for each row.

    If ref_data is a list, ref_features should also be a list of the same length. Each entry should be a sequence of length equal to the number of rows of the corresponding entry of ref and should containg the feature names.

  • assay_type (str | int) – Name or index of the assay of interest, if ref is or contains SummarizedExperiment objects.

  • check_missing (bool) – Whether to check for and remove rows with missing (NaN) values in the reference matrices. This can be set to False if it is known that no NaN values exist.

  • num_de (int | None) – Number of differentially expressed genes to use as markers for each pairwise comparison between labels. If None, an appropriate number of genes is automatically determined.

  • num_threads (int) – Number of threads to use for the calculations.

Return type:

dict[Any, dict[Any, list]]

Returns:

A dictionary of dictionary of lists containing the markers for each pairwise comparison between labels, i.e., markers[a][b] contains the upregulated markers for label a over label b.

Examples

>>> import singler
>>> ref = singler.mock_reference_data()
>>> import scranpy
>>> ref = scranpy.normalize_rna_counts_se(ref)
>>> classical = singler.get_classic_markers(ref, ref.get_column_data()["label"], ref.get_row_names())
>>> classical["A"]["B"]
singler.mock_reference_data(num_labels=5, num_replicates=4, num_genes=1000, prop_markers=0.5)[source]

Mock up some reference data for the various examples in the singler package. The simulated data is very simple and should not be used for performance comparisons.

Parameters:
  • num_labels (int) – Number of labels.

  • num_replicates (int) – Number of replicates per label.

  • num_genes (int) – Number of genes in the dataset.

  • prop_markers (float) – Proportion of genes that are markers, i.e., DE between labels.

Return type:

SummarizedExperiment

Returns:

A SummarizedExperiment containing a count matrix and per-column labels.

Examples

>>> import singler
>>> ref = singler.mock_reference_data()
>>> print(ref)
singler.mock_test_data(mock_ref, num_cells=100)[source]

Mock up some test data for the various examples in the singler package. The simulated data is very simple and should not be used for performance comparisons.

Parameters:
Return type:

SummarizedExperiment

Returns:

A SummarizedExperiment containing a count matrix and per-cell labels.

Examples

>>> import singler
>>> ref = singler.mock_reference_data()
>>> test = singler.mock_test_data(ref)
>>> print(test)
singler.number_of_classic_markers(num_labels)[source]

Compute the number of markers to detect for a given number of labels, using the classic SingleR marker detection algorithm.

Parameters:

num_labels (int) – Number of labels.

Return type:

int

Returns:

Number of markers.

Examples

>>> import singler
>>> singler.number_of_classic_markers(15)
singler.train_integrated(test_features, ref_prebuilt, warn_lost=True, num_threads=1)[source]

Build a set of integrated references for classification of a test dataset.

Parameters:
  • test_features (Sequence) – Sequence of features for the test dataset.

  • ref_prebuilt (dict | Sequence | NamedList) – List of prebuilt references, typically created by calling train_single().

  • warn_lost (bool) – Whether to emit a warning if the markers for each reference are not all present in all references.

  • num_threads (int) – Number of threads.

Return type:

TrainedIntegratedReferences

Returns:

An integrated reference object, for classification with classify_integrated().

Examples

>>> # Mocking up data.
>>> import singler
>>> ref = singler.mock_reference_data(num_replicates=8)
>>> ref1 = ref[:,[True, False] * int(ref.shape[1]/2)]
>>> ref2 = ref[:,[False, True] * int(ref.shape[1]/2)]
>>>
>>> cd2 = ref2.get_column_data()
>>> label2 = [l.lower() for l in cd2["label"]] # converting to lower case for some variety.
>>> cd2.set_column("label", label2, in_place=True)
>>> ref2.set_column_data(cd2, in_place=True)
>>>
>>> import scranpy
>>> ref1 = scranpy.normalize_rna_counts_se(ref1)
>>> ref2 = scranpy.normalize_rna_counts_se(ref2)
>>>
>>> # Building a classifier for each reference.
>>> test = singler.mock_test_data(ref)
>>> built1 = singler.train_single(ref1, ref1.get_column_data()["label"], ref1.get_row_names())
>>> built2 = singler.train_single(ref2, ref2.get_column_data()["label"], ref2.get_row_names())
>>>
>>> # Creating an integrated classifier across references.
>>> in_built = singler.train_integrated(test.get_row_names(), {"first": built1, "second": built2})
>>> in_built.reference_labels
>>> in_built.reference_names
singler.train_single(ref_data, ref_labels, ref_features, test_features=None, assay_type='logcounts', restrict_to=None, check_missing=True, markers=None, marker_method='classic', num_de=None, hint_sce=True, marker_args={}, aggregate=False, aggregate_args={}, nn_parameters=<knncolle.vptree.VptreeParameters object>, num_threads=1)[source]

Build a single reference dataset in preparation for classification.

Parameters:
  • ref_data (Any) –

    A matrix-like object where rows are features, columns are reference profiles, and each entry is the expression value. If markers is not provided, expression should be normalized and log-transformed in preparation for marker prioritization via differential expression analyses. Otherwise, any expression values are acceptable as only the ranking within each column is used.

    Alternatively, a SummarizedExperiment containing such a matrix in one of its assays.

  • ref_labels (Sequence) – Sequence of labels for each reference profile, i.e., column in ref_data.

  • ref_features (Sequence) – Sequence of identifiers for each feature, i.e., row in ref_data.

  • test_features (Sequence | None) – Sequence of identifiers for each feature in the test dataset.

  • assay_type (str | int) – Assay containing the expression matrix, if ref_data is a SummarizedExperiment.

  • check_missing (bool) – Whether to check for and remove rows with missing (NaN) values from ref_data.

  • restrict_to (set | dict | None) – Subset of available features to restrict to. Only features in restrict_to will be used in the reference building. If None, no restriction is performed.

  • markers (dict[Any, dict[Any, Sequence]] | None) – Upregulated markers for each pairwise comparison between labels. Specifically, markers[a][b] should be a sequence of features that are upregulated in a compared to b. All such features should be present in features, and all labels in labels should have keys in the inner and outer dictionaries.

  • marker_method (Literal['classic', 'auc', 'cohens_d']) – Method to identify markers from each pairwise comparisons between labels in ref_data. If classic, we call get_classic_markers(). If auc or cohens_d, we call score_markers(). Only used if markers is not supplied.

  • num_de (int | None) – Number of differentially expressed genes to use as markers for each pairwise comparison between labels. If None and marker_method = "classic", an appropriate number of genes is determined by get_classic_markers(). Otherwise, it is set to 10. Only used if markers is not supplied.

  • hint_sce (bool) – Whether to print a hint to change marker_method when ref_data is a SingleCellExperiment. It will also suggest setting aggregate = True for greater efficiency when ref_data contains 1000 cells or more.

  • marker_args (dict) – Further arguments to pass to the chosen marker detection method. If marker_method = "classic", this is get_classic_markers(), otherwise it is score_markers(). Only used if markers is not supplied.

  • aggregate (bool) – Whether the reference dataset should be aggregated to pseudo-bulk samples for speed, see aggregate_reference() for details.

  • aggregate_args (dict) – Further arguments to pass to aggregate_reference() when aggregate = True.

  • nn_parameters (Parameters | None) – Algorithm for constructing the neighbor search index, used to compute scores during classification.

  • num_threads (int) – Number of threads to use for reference building.

Return type:

TrainedSingleReference

Returns:

The pre-built reference, ready for use in downstream methods like classify_single().

Examples

>>> # Mocking up data.
>>> import singler
>>> ref = singler.mock_reference_data(num_replicates=8)
>>> test = singler.mock_test_data(ref)
>>>
>>> import scranpy
>>> ref = scranpy.normalize_rna_counts_se(ref)
>>>
>>> # Training a classifier.
>>> built = singler.train_single(ref, ref.get_column_data()["label"], ref.get_row_names())
>>> built.num_labels
>>> built.num_markers
>>> built.labels
>>> built.markers["A"]["B"] # markers for A over B.
>>> len(built.marker_subset())