singler package
Submodules
singler.aggregate_reference module
- singler.aggregate_reference.aggregate_reference(ref_data, ref_labels, ref_features, num_centers=None, power=0.5, num_top=1000, rank=20, assay_type='logcounts', subset_row=None, check_missing=True, num_threads=1)
Aggregate reference samples for a given label by using vector quantization to average their count profiles. The idea is to reduce the size of single-cell reference datasets so as to reduce the computation time of train_single(). We perform k-means clustering for all cells in each label and aggregate all cells within each k-means cluster. (More specifically, the clustering is done on the principal components generated from the highly variable genes to better capture the structure within each label.) This yields one or more profiles per label, reducing the number of separate observations while preserving some level of intra-label heterogeneity.
- Parameters:
ref_data (Any) – Floating-point matrix of reference expression values, usually containing log-expression values. Alternatively, a SummarizedExperiment object containing such a matrix.
ref_labels (Sequence) – Array of length equal to the number of columns in ref_data, containing the label for each cell.
ref_features (Sequence) – Sequence of identifiers for each feature, i.e., row in ref_data.
num_centers (Optional[int]) – Maximum number of aggregated profiles to produce for each label with cluster_kmeans(). If None, a suitable number of profiles is automatically chosen.
power (float) – Number between 0 and 1 indicating how much aggregation should be performed. Specifically, we set the number of clusters to X**power, where X is the number of cells assigned to that label. Ignored if num_centers is not None.
num_top (int) – Number of highly variable genes to use for PCA prior to clustering, see choose_highly_variable_genes().
rank (int) – Number of principal components to use during clustering, see run_pca().
assay_type (Union[int, str]) – Integer or string specifying the assay of ref_data containing the relevant expression matrix, if ref_data is a SummarizedExperiment object.
subset_row (Optional[Sequence]) – Array of row indices specifying the rows of ref_data to use for clustering. If None, no additional filtering is performed. Note that even if subset_row is provided, aggregation is still performed on all genes.
check_missing (bool) – Whether to check for and remove rows with missing (NaN) values from ref_data.
num_threads (int) – Number of threads to use.
- Return type:
- Returns:
A SummarizedExperiment containing the aggregated values in its first assay. The label for each aggregated profile is stored in the column data.
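The interaction between power and num_centers can be illustrated with a small sketch. The helper name clusters_for_label is hypothetical and not part of singler, and the rounding rule and the cap at the number of cells are assumptions; the library may round differently.

```python
def clusters_for_label(num_cells, power=0.5, num_centers=None):
    """Hypothetical helper: number of k-means clusters requested for one label."""
    if num_cells <= 0:
        return 0
    if num_centers is not None:
        # An explicit value takes precedence, so `power` is ignored.
        # Capping at the number of cells is an assumption of this sketch.
        return min(num_centers, num_cells)
    # X**power, where X is the number of cells with this label.
    # Rounding to the nearest integer is an assumption of this sketch.
    return max(1, min(num_cells, round(num_cells ** power)))


label_sizes = {"B cell": 400, "T cell": 2500, "NK cell": 9}
for label, size in label_sizes.items():
    print(label, size, "->", clusters_for_label(size))
```

With power=0.5, a label with 2500 cells is reduced to about 50 aggregated profiles, so large labels are compressed far more aggressively than small ones.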
singler.annotate_integrated module
- singler.annotate_integrated.annotate_integrated(test_data, ref_data, ref_labels, test_features=None, ref_features=None, test_assay_type=0, ref_assay_type='logcounts', test_check_missing=False, ref_check_missing=True, train_single_args={}, classify_single_args={}, train_integrated_args={}, classify_integrated_args={}, num_threads=1)
Annotate a single-cell expression dataset based on the correlation of each cell to profiles in multiple labelled references, where the annotation from each reference is then integrated across references.
- Parameters:
test_data (Any) – A matrix-like object representing the test dataset, where rows are features and columns are samples (usually cells). Entries should be expression values; only the ranking within each column is used. Alternatively, a SummarizedExperiment containing such a matrix in one of its assays.
ref_data (Sequence) – Sequence consisting of one or more of the following:
  - A matrix-like object representing the reference dataset, where rows are features and columns are samples. Entries should be expression values, usually log-transformed (see comments for the ref_data argument in train_single()).
  - A SummarizedExperiment object containing such a matrix in its assays.
ref_labels (list[Sequence]) – Sequence of the same length as ref_data. The i-th entry should be a sequence of length equal to the number of columns of ref_data[i], containing the label associated with each column.
test_features (Optional[Sequence]) – Sequence of length equal to the number of rows in test_data, containing the feature identifier for each row. Alternatively None, to use the row names of the experiment as features.
ref_features (Optional[list[Optional[Sequence]]]) – Sequence of the same length as ref_data. The i-th entry should be a sequence of length equal to the number of rows of ref_data[i], containing the feature identifier associated with each row. An individual entry can also be None to use the row names of that experiment as features. The whole argument can also be None to indicate that the row names should be used for all references, assuming ref_data only contains SummarizedExperiment objects.
test_assay_type (Union[str, int]) – Assay of test_data containing the expression matrix, if test_data is a SummarizedExperiment.
test_check_missing (bool) – Whether to check for and remove missing (i.e., NaN) values from the test dataset.
ref_assay_type (Union[str, int]) – Assay containing the expression matrix for any entry of ref_data that is a SummarizedExperiment.
ref_check_missing (bool) – Whether to check for and remove missing (i.e., NaN) values from the reference datasets.
train_single_args (dict) – Further arguments to pass to train_single().
classify_single_args (dict) – Further arguments to pass to classify_single().
train_integrated_args (dict) – Further arguments to pass to train_integrated().
classify_integrated_args (dict) – Further arguments to pass to classify_integrated().
num_threads (int) – Number of threads to use for the various steps.
- Return type:
- Returns:
Tuple where the first element contains per-reference results (i.e., a list of BiocFrame outputs, roughly equivalent to running annotate_single() on each reference) and the second element contains integrated results across references (i.e., a BiocFrame from classify_integrated()).
singler.annotate_single module
- singler.annotate_single.annotate_single(test_data, ref_data, ref_labels, test_features=None, ref_features=None, test_assay_type=0, ref_assay_type=0, test_check_missing=False, ref_check_missing=True, train_args={}, classify_args={}, num_threads=1)
Annotate a single-cell expression dataset based on the correlation of each cell to profiles in a labelled reference.
- Parameters:
test_data (Any) – A matrix-like object representing the test dataset, where rows are features and columns are samples (usually cells). Entries should be expression values; only the ranking within each column will be used. Alternatively, a SummarizedExperiment containing such a matrix in one of its assays. Non-default assay types can be specified with test_assay_type.
ref_data (Any) – A matrix-like object representing the reference dataset, where rows are features and columns are samples. Entries should be expression values, usually log-transformed (see comments for the ref_data argument in train_single()). Alternatively, a SummarizedExperiment containing such a matrix in one of its assays. Non-default assay types can be specified with ref_assay_type.
ref_labels (Sequence) – Sequence of length equal to the number of columns of ref_data, containing the label associated with each column.
test_features (Optional[Sequence]) – Sequence of length equal to the number of rows in test_data, containing the feature identifier for each row. Alternatively None, to use the row names of the experiment as features.
ref_features (Optional[Sequence]) – Sequence of length equal to the number of rows of ref_data, containing the feature identifier for each row. Alternatively None, to use the row names of the experiment as features.
test_assay_type (Union[str, int]) – Assay containing the expression matrix, if test_data is a SummarizedExperiment.
ref_assay_type (Union[str, int]) – Assay containing the expression matrix, if ref_data is a SummarizedExperiment.
test_check_missing (bool) – Whether to remove rows with missing values from the test dataset.
ref_check_missing (bool) – Whether to remove rows with missing values from the reference dataset.
train_args (dict) – Further arguments to pass to train_single().
classify_args (dict) – Further arguments to pass to classify_single().
num_threads (int) – Number of threads to use for the various steps.
- Return type:
- Returns:
A BiocFrame of labelling results, see classify_single() for details.
singler.classify_integrated module
- singler.classify_integrated.classify_integrated(test_data, results, integrated_prebuilt, assay_type=0, quantile=0.8, use_fine_tune=True, fine_tune_threshold=0.05, num_threads=1)
Integrate classification results across multiple references for a single test dataset.
- Parameters:
test_data (Any) – A matrix-like object where each row is a feature and each column is a test sample (usually a single cell), containing expression values. Normalized and/or transformed expression values are also acceptable as only the ranking is used within this function. Alternatively, a SummarizedExperiment containing such a matrix in one of its assays.
results (list[BiocFrame]) – List of classification results generated by running classify_single() on test_data with each reference. References should be in the same order as that used to construct integrated_prebuilt.
integrated_prebuilt (TrainedIntegratedReferences) – Integrated reference object, constructed with train_integrated().
assay_type (Union[str, int]) – Assay containing the expression matrix, if test_data is a SummarizedExperiment.
quantile (float) – Quantile of the correlation distribution for computing the score for each label. Larger values increase sensitivity of matches at the expense of similarity to the average behavior of each label.
use_fine_tune (bool) – Whether fine-tuning should be performed. This improves accuracy for distinguishing between references with similar best labels but requires more computational work.
fine_tune_threshold (float) – Maximum difference from the maximum correlation to use in fine-tuning. All references with scores within this difference of the maximum are used for another round of fine-tuning.
num_threads (int) – Number of threads to use during classification.
- Return type:
- Returns:
A BiocFrame containing the best_label across all references, defined as the assigned label in the best reference; the identity of the best_reference, either as a name string or an integer index; the scores for the best label in each reference, as a nested BiocFrame; and the delta from the best to the second-best reference. Each row corresponds to a column of test_data.
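The returned fields can be illustrated for a single test cell with a plain-Python sketch. The function integrate_one_cell is hypothetical and not part of singler. It assumes that the score of each reference's assigned label has already been recomputed on a common scale, and it leaves out fine-tuning entirely.

```python
def integrate_one_cell(assigned_labels, scores):
    """Hypothetical sketch: choose the best reference for one test cell.

    assigned_labels[i] is the label that reference i assigned to the cell.
    scores[i] is the score of that label, assumed comparable across references.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    best = order[0]
    # The delta is only defined when there are at least two references.
    delta = scores[best] - scores[order[1]] if len(order) > 1 else float("nan")
    return {
        "best_label": assigned_labels[best],
        "best_reference": best,
        "delta": delta,
    }


result = integrate_one_cell(["T cell", "CD4+ T cell"], [0.61, 0.70])
print(result["best_reference"], result["best_label"])
```

A small delta indicates that two references explain the cell almost equally well, which is the situation that fine-tuning is meant to resolve.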
singler.classify_single module
- singler.classify_single.classify_single(test_data, ref_prebuilt, assay_type=0, quantile=0.8, use_fine_tune=True, fine_tune_threshold=0.05, num_threads=1)
Classify a test dataset against a reference by assigning labels from the latter to each column of the former using the SingleR algorithm.
- Parameters:
test_data (Any) – A matrix-like object where each row is a feature and each column is a test sample (usually a single cell), containing expression values. Normalized and/or transformed expression values are also acceptable as only the ranking is used within this function. Alternatively, a SummarizedExperiment containing such a matrix in one of its assays.
ref_prebuilt (TrainedSingleReference) – A pre-built reference created with train_single().
assay_type (Union[str, int]) – Assay containing the expression matrix, if test_data is a SummarizedExperiment.
quantile (float) – Quantile of the correlation distribution for computing the score for each label. Larger values increase sensitivity of matches at the expense of similarity to the average behavior of each label.
use_fine_tune (bool) – Whether fine-tuning should be performed. This improves accuracy for distinguishing between similar labels but requires more computational work.
fine_tune_threshold (float) – Maximum difference from the maximum correlation to use in fine-tuning. All labels with scores within this difference of the maximum are used for another round of fine-tuning.
num_threads (int) – Number of threads to use during classification.
- Return type:
- Returns:
A BiocFrame containing the best label, the scores for each label as a nested BiocFrame, and the delta from the best to the second-best label. Each row corresponds to a column of test_data. The metadata contains markers, a list of the markers from each pairwise comparison between labels; and used, a list containing the union of markers from all comparisons.
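The roles of quantile and fine_tune_threshold can be sketched in plain Python for one test cell. This is an illustration under stated assumptions, not the library's implementation: all helper names are hypothetical, the quantile uses linear interpolation (one of several conventions), every feature is treated as a marker, and constant profiles are not handled.

```python
def ranks(values):
    """1-based ranks, with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out


def spearman(x, y):
    """Spearman correlation, i.e. Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5


def quantile(values, q):
    """Quantile by linear interpolation between order statistics."""
    s = sorted(values)
    pos = q * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def label_scores(cell, profiles, labels, q=0.8):
    """Score each label by a quantile of the cell's correlations to its profiles."""
    per_label = {}
    for profile, label in zip(profiles, labels):
        per_label.setdefault(label, []).append(spearman(cell, profile))
    return {label: quantile(cors, q) for label, cors in per_label.items()}


def fine_tune_candidates(scores, threshold=0.05):
    """Labels whose score lies within `threshold` of the best score."""
    top = max(scores.values())
    return [label for label, s in scores.items() if s >= top - threshold]


cell = [5.0, 3.0, 0.0, 1.0]
profiles = [[4.0, 2.0, 0.0, 1.0], [5.0, 1.0, 0.0, 3.0], [0.0, 1.0, 5.0, 3.0]]
labels = ["A", "A", "B"]
scores = label_scores(cell, profiles, labels)
print(max(scores, key=scores.get), fine_tune_candidates(scores))
```

Only the ranking of values within the cell and within each profile enters the score, which is why any monotonic transformation of the test data leaves the result unchanged.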
singler.get_classic_markers module
- singler.get_classic_markers.get_classic_markers(ref_data, ref_labels, ref_features, assay_type='logcounts', check_missing=True, num_de=None, num_threads=1)
Compute markers from a reference using the classic SingleR algorithm. This is typically done for reference datasets derived from replicated bulk transcriptomic experiments.
- Parameters:
ref_data (Union[Any, list[Any]]) – A matrix-like object containing the log-normalized expression values of a reference dataset. Each column is a sample and each row is a feature. Alternatively, this can be a SummarizedExperiment containing a matrix-like object in one of its assays. Alternatively, a list of such matrices or SummarizedExperiment objects, typically for multiple batches of the same reference; it is assumed that different batches exhibit at least some overlap in their ref_features and ref_labels.
ref_labels (Union[Sequence, list[Sequence]]) – If ref_data is not a list, ref_labels should be a sequence of length equal to the number of columns of ref_data, containing a label (usually a string) for each column. If ref_data is a list, ref_labels should also be a list of the same length. Each entry should be a sequence of length equal to the number of columns of the corresponding entry of ref_data.
ref_features (Union[Sequence, list[Sequence]]) – If ref_data is not a list, ref_features should be a sequence of length equal to the number of rows of ref_data, containing the feature name (usually a string) for each row. If ref_data is a list, ref_features should also be a list of the same length. Each entry should be a sequence of length equal to the number of rows of the corresponding entry of ref_data.
assay_type (Union[str, int]) – Name or index of the assay of interest, if ref_data is or contains SummarizedExperiment objects.
check_missing (bool) – Whether to check for and remove rows with missing (NaN) values in the reference matrices. This can be set to False if it is known that no NaN values exist.
num_de (Optional[int]) – Number of differentially expressed genes to use as markers for each pairwise comparison between labels. If None, an appropriate number of genes is automatically determined.
num_threads (int) – Number of threads to use for the calculations.
- Return type:
- Returns:
A dictionary of dictionaries of lists containing the markers for each pairwise comparison between labels, i.e., markers[a][b] contains the upregulated markers for label a over label b.
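The shape of the returned markers[a][b] structure can be illustrated with a simplified sketch for a single, unbatched reference. The function classic_markers_sketch is hypothetical. It assumes that markers are ranked by the difference in per-label median log-expression, as in the original SingleR method, that num_de is given explicitly, and that markers[a][a] is an empty list.

```python
from statistics import median


def classic_markers_sketch(ref_data, ref_labels, ref_features, num_de=2):
    """Hypothetical sketch of pairwise marker detection.

    ref_data is a list of rows (one per feature), each holding the
    log-expression values of that feature across the reference samples.
    """
    labels = sorted(set(ref_labels))
    # Median log-expression of every feature within every label.
    med = {
        lab: [median(v for v, l in zip(row, ref_labels) if l == lab) for row in ref_data]
        for lab in labels
    }
    markers = {}
    for a in labels:
        markers[a] = {}
        for b in labels:
            if a == b:
                markers[a][b] = []  # self-comparison kept as an empty list (assumption)
                continue
            # Features upregulated in `a` over `b`, largest difference first.
            diffs = [(med[a][i] - med[b][i], i) for i in range(len(ref_features))]
            up = sorted((d for d in diffs if d[0] > 0), key=lambda d: -d[0])
            markers[a][b] = [ref_features[i] for _, i in up[:num_de]]
    return markers


features = ["CD3E", "CD19", "NKG7"]
data = [
    [5.0, 6.0, 0.0, 1.0],  # CD3E
    [0.0, 0.0, 7.0, 6.0],  # CD19
    [2.0, 3.0, 1.0, 1.0],  # NKG7
]
sample_labels = ["T", "T", "B", "B"]
markers = classic_markers_sketch(data, sample_labels, features)
print(markers["T"]["B"], markers["B"]["T"])
```

Because only upregulated features are kept, markers[a][b] and markers[b][a] generally contain different genes.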
singler.train_integrated module
- class singler.train_integrated.TrainedIntegratedReferences(ptr, ref_labels)
Bases: object
Object containing integrated references, typically constructed by train_integrated().
- singler.train_integrated.train_integrated(test_features, ref_prebuilt, warn_lost=True, num_threads=1)
Build a set of integrated references for classification of a test dataset.
- Parameters:
test_features (Sequence) – Sequence of features for the test dataset.
ref_prebuilt (list[TrainedSingleReference]) – List of prebuilt references, typically created by calling train_single().
warn_lost (bool) – Whether to emit a warning if the markers for each reference are not all present in all references.
num_threads (int) – Number of threads.
- Return type:
- Returns:
Integrated references for classification with classify_integrated().
singler.train_single module
- class singler.train_single.TrainedSingleReference(ptr, full_data, full_label_codes, labels, features, markers)
Bases: object
A prebuilt reference object, typically created by train_single(). This is intended for advanced users only and should not be serialized.
- marker_subset(indices_only=False)
- Parameters:
indices_only (bool) – Whether to return the markers as indices into features, or as a list of feature identifiers.
- Return type:
- Returns:
If indices_only = False, a list of feature identifiers for the markers.
If indices_only = True, a NumPy array containing the integer indices into features of the markers that were chosen.
- num_markers()
- Return type:
- Returns:
Number of markers to be used for classification. This is the same as the size of the array from marker_subset().
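The relationship between the pairwise marker lists and the marker subset can be sketched as follows. The function marker_union is hypothetical. It assumes that the subset is the union of all pairwise marker lists, reported in the order of features, and it returns a plain list where the real method returns a NumPy array.

```python
def marker_union(markers, features, indices_only=False):
    """Hypothetical sketch: union of markers over all pairwise comparisons."""
    chosen = set()
    for inner in markers.values():
        for genes in inner.values():
            chosen.update(genes)
    # Report markers in the order in which they appear in `features` (assumption).
    indices = [i for i, f in enumerate(features) if f in chosen]
    if indices_only:
        return indices
    return [features[i] for i in indices]


example_features = ["ACTB", "CD3E", "CD19", "NKG7"]
example_markers = {
    "T": {"T": [], "B": ["CD3E", "NKG7"]},
    "B": {"T": ["CD19"], "B": []},
}
print(marker_union(example_markers, example_features))
print(marker_union(example_markers, example_features, indices_only=True))
```

The length of either result corresponds to what num_markers() reports in this simplified picture.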
- singler.train_single.train_single(ref_data, ref_labels, ref_features, test_features=None, assay_type='logcounts', restrict_to=None, check_missing=True, markers=None, marker_method='classic', num_de=None, marker_args={}, aggregate=False, aggregate_args={}, nn_parameters=<knncolle.vptree.VptreeParameters object>, num_threads=1)
Build a single reference dataset in preparation for classification.
- Parameters:
ref_data (Any) – A matrix-like object where rows are features, columns are reference profiles, and each entry is the expression value. If markers is not provided, expression should be normalized and log-transformed in preparation for marker prioritization via differential expression analyses. Otherwise, any expression values are acceptable as only the ranking within each column is used. Alternatively, a SummarizedExperiment containing such a matrix in one of its assays.
ref_labels (Sequence) – Sequence of labels for each reference profile, i.e., column in ref_data.
ref_features (Sequence) – Sequence of identifiers for each feature, i.e., row in ref_data.
test_features (Optional[Sequence]) – Sequence of identifiers for each feature in the test dataset.
assay_type (Union[str, int]) – Assay containing the expression matrix, if ref_data is a SummarizedExperiment.
check_missing (bool) – Whether to check for and remove rows with missing (NaN) values from ref_data.
restrict_to (Union[set, dict, None]) – Subset of available features to restrict to. Only features in restrict_to will be used in the reference building. If None, no restriction is performed.
markers (Optional[dict[Any, dict[Any, Sequence]]]) – Upregulated markers for each pairwise comparison between labels. Specifically, markers[a][b] should be a sequence of features that are upregulated in a compared to b. All such features should be present in ref_features, and all labels in ref_labels should have keys in the inner and outer dictionaries.
marker_method (Literal['classic', 'auc', 'cohens_d']) – Method to identify markers from each pairwise comparison between labels in ref_data. If classic, we call get_classic_markers(). If auc or cohens_d, we call score_markers(). Only used if markers is not supplied.
num_de (Optional[int]) – Number of differentially expressed genes to use as markers for each pairwise comparison between labels. If None and marker_method = "classic", an appropriate number of genes is determined by get_classic_markers(). Otherwise, it is set to 10. Only used if markers is not supplied.
marker_args (dict) – Further arguments to pass to the chosen marker detection method. If marker_method = "classic", this is get_classic_markers(), otherwise it is score_markers(). Only used if markers is not supplied.
aggregate (bool) – Whether the reference dataset should be aggregated to pseudo-bulk samples for speed, see aggregate_reference() for details.
aggregate_args (dict) – Further arguments to pass to aggregate_reference() when aggregate = True.
nn_parameters (Optional[Parameters]) – Algorithm for constructing the neighbor search index, used to compute scores during classification.
num_threads (int) – Number of threads to use for reference building.
- Return type:
- Returns:
The pre-built reference, ready for use in downstream methods like classify_single().
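The requirements on a user-supplied markers dictionary can be checked before training with a short sketch. The function check_markers is hypothetical and not part of singler. It only encodes the two conditions stated for the markers argument: every label needs a key in the outer and inner dictionaries, and every marker must be a known feature.

```python
def check_markers(markers, ref_labels, ref_features):
    """Hypothetical pre-flight check; returns a list of problems (empty if none)."""
    labels = sorted(set(ref_labels))
    known = set(ref_features)
    problems = []
    for a in labels:
        if a not in markers:
            problems.append(f"label {a!r} has no outer key")
            continue
        for b in labels:
            if b not in markers[a]:
                problems.append(f"markers[{a!r}] has no inner key {b!r}")
                continue
            unknown = [g for g in markers[a][b] if g not in known]
            if unknown:
                problems.append(f"markers[{a!r}][{b!r}] has unknown features {unknown}")
    return problems


candidate = {
    "T": {"T": [], "B": ["CD3E", "GZMK"]},
    "B": {"T": ["CD19"]},
}
for problem in check_markers(candidate, ["T", "T", "B"], ["CD3E", "CD19", "NKG7"]):
    print(problem)
```

Running such a check first gives a readable message for malformed input, whatever validation the library itself performs.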