singler package
Submodules
singler.aggregate_reference module
- singler.aggregate_reference.aggregate_reference(ref_data, ref_labels, ref_features, num_centers=None, power=0.5, num_top=1000, rank=20, assay_type='logcounts', subset_row=None, check_missing=True, num_threads=1)
Aggregate reference samples for a given label by using vector quantization to average their count profiles. The idea is to reduce the size of single-cell reference datasets so as to reduce the computation time of train_single(). We perform k-means clustering for all cells in each label and aggregate all cells within each k-means cluster. (More specifically, the clustering is done on the principal components generated from the highly variable genes to better capture the structure within each label.) This yields one or more profiles per label, reducing the number of separate observations while preserving some level of intra-label heterogeneity.
- Parameters:
ref_data (Any) – Floating-point matrix of reference expression values, usually containing log-expression values. Alternatively, a SummarizedExperiment object containing such a matrix.
ref_labels (Sequence) – Array of length equal to the number of columns in ref_data, containing the label for each cell.
ref_features (Sequence) – Sequence of identifiers for each feature, i.e., row in ref_data.
num_centers (Optional[int]) – Maximum number of aggregated profiles to produce for each label with cluster_kmeans(). If None, a suitable number of profiles is automatically chosen.
power (float) – Number between 0 and 1 indicating how much aggregation should be performed. Specifically, we set the number of clusters to X**power where X is the number of cells assigned to that label. Ignored if num_centers is not None.
num_top (int) – Number of highly variable genes to use for PCA prior to clustering, see choose_highly_variable_genes().
rank (int) – Number of principal components to use during clustering, see run_pca().
assay_type (Union[int, str]) – Integer or string specifying the assay of ref_data containing the relevant expression matrix, if ref_data is a SummarizedExperiment object.
subset_row (Optional[Sequence]) – Array of row indices specifying the rows of ref_data to use for clustering. If None, no additional filtering is performed. Note that even if subset_row is provided, aggregation is still performed on all genes.
check_missing (bool) – Whether to check for and remove rows with missing (NaN) values from ref_data.
num_threads (int) – Number of threads to use.
- Return type:
SummarizedExperiment
- Returns:
A SummarizedExperiment containing the aggregated values in its first assay. The label for each aggregated profile is stored in the column data.
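The relationship between power, num_centers and the number of aggregated profiles per label can be illustrated with a small standalone sketch. The helper clusters_for_label is hypothetical and not part of singler; the rounding rule and the cap at the number of cells are assumptions made only for illustration.

```python
import math


def clusters_for_label(num_cells, power=0.5, num_centers=None):
    """Hypothetical helper: how many k-means clusters one label would get."""
    if num_centers is not None:
        # An explicit num_centers takes precedence and power is ignored.
        # Capping at the number of cells is an assumption of this sketch.
        return min(num_centers, num_cells)
    # X**power, rounded up so that every label keeps at least one profile.
    # The rounding rule is an assumption, not taken from the library.
    return max(1, math.ceil(num_cells ** power))


for n in (1, 100, 2500):
    print(n, "cells ->", clusters_for_label(n), "profiles")
```

With the default power of 0.5, a label with 2500 cells is reduced to about 50 profiles, which is where the speed-up in train_single() comes from.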
singler.annotate_integrated module
- singler.annotate_integrated.annotate_integrated(test_data, ref_data, ref_labels, test_features=None, ref_features=None, test_assay_type=0, ref_assay_type='logcounts', test_check_missing=False, ref_check_missing=True, train_single_args={}, classify_single_args={}, train_integrated_args={}, classify_integrated_args={}, num_threads=1)
Annotate a single-cell expression dataset based on the correlation of each cell to profiles in multiple labelled references, where the annotation from each reference is then integrated across references.
- Parameters:
test_data (Any) – A matrix-like object representing the test dataset, where rows are features and columns are samples (usually cells). Entries should be expression values; only the ranking within each column is used. Alternatively, a SummarizedExperiment containing such a matrix in one of its assays.
ref_data (Sequence) – Sequence consisting of one or more of the following:
- A matrix-like object representing the reference dataset, where rows are features and columns are samples. Entries should be expression values, usually log-transformed (see comments for the ref_data argument in train_single()).
- A SummarizedExperiment object containing such a matrix in its assays.
ref_labels (list[Sequence]) – Sequence of the same length as ref_data. The i-th entry should be a sequence of length equal to the number of columns of ref_data[i], containing the label associated with each column.
test_features (Optional[Sequence]) – Sequence of length equal to the number of rows in test_data, containing the feature identifier for each row. Alternatively None, to use the row names of the experiment as features.
ref_features (Optional[list[Optional[Sequence]]]) – Sequence of the same length as ref_data. The i-th entry should be a sequence of length equal to the number of rows of ref_data[i], containing the feature identifier associated with each row. An individual entry can also be None to use the row names of that experiment as features. The whole argument can also be None to indicate that the row names should be used for all references, assuming ref_data only contains SummarizedExperiment objects.
test_assay_type (Union[str, int]) – Assay of test_data containing the expression matrix, if test_data is a SummarizedExperiment.
test_check_missing (bool) – Whether to check for and remove missing (i.e., NaN) values from the test dataset.
ref_assay_type (Union[str, int]) – Assay containing the expression matrix for any entry of ref_data that is a SummarizedExperiment.
ref_check_missing (bool) – Whether to check for and remove missing (i.e., NaN) values from the reference datasets.
train_single_args (dict) – Further arguments to pass to train_single().
classify_single_args (dict) – Further arguments to pass to classify_single().
train_integrated_args (dict) – Further arguments to pass to train_integrated().
classify_integrated_args (dict) – Further arguments to pass to classify_integrated().
num_threads (int) – Number of threads to use for the various steps.
- Return type:
Tuple[list[BiocFrame], BiocFrame]
- Returns:
Tuple where the first element contains per-reference results (i.e., a list of BiocFrame outputs, roughly equivalent to running annotate_single() on each reference) and the second element contains integrated results across references (i.e., a BiocFrame from classify_integrated()).
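The per-reference arguments are parallel sequences, so a shape mismatch is an easy mistake to make. The following is a minimal sketch of a pre-flight check, assuming each reference is a plain list of rows; check_references is a hypothetical helper and is not provided by singler.

```python
def check_references(ref_data, ref_labels, ref_features=None):
    """Hypothetical validator for the parallel per-reference inputs."""
    if len(ref_labels) != len(ref_data):
        raise ValueError("ref_labels needs one entry per reference")
    if ref_features is not None and len(ref_features) != len(ref_data):
        raise ValueError("ref_features needs one entry per reference")
    for i, (matrix, labels) in enumerate(zip(ref_data, ref_labels)):
        num_rows = len(matrix)
        num_cols = len(matrix[0]) if num_rows else 0
        # One label per column (sample) of the i-th reference.
        if len(labels) != num_cols:
            raise ValueError(f"reference {i}: expected {num_cols} labels")
        # One feature identifier per row, unless row names are used (None).
        features = None if ref_features is None else ref_features[i]
        if features is not None and len(features) != num_rows:
            raise ValueError(f"reference {i}: expected {num_rows} features")
    return True


# Two toy references with 2 features each and 3 or 2 samples.
ref_a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
ref_b = [[0.5, 1.5], [2.5, 3.5]]
print(check_references(
    [ref_a, ref_b],
    [["T", "T", "B"], ["NK", "B"]],
    [["g1", "g2"], None],
))
```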
singler.annotate_single module
- singler.annotate_single.annotate_single(test_data, ref_data, ref_labels, test_features=None, ref_features=None, test_assay_type=0, ref_assay_type=0, test_check_missing=False, ref_check_missing=True, train_args={}, classify_args={}, num_threads=1)
Annotate a single-cell expression dataset based on the correlation of each cell to profiles in a labelled reference.
- Parameters:
test_data (Any) – A matrix-like object representing the test dataset, where rows are features and columns are samples (usually cells). Entries should be expression values; only the ranking within each column will be used. Alternatively, a SummarizedExperiment containing such a matrix in one of its assays. Non-default assay types can be specified with test_assay_type.
ref_data (Any) – A matrix-like object representing the reference dataset, where rows are features and columns are samples. Entries should be expression values, usually log-transformed (see comments for the ref_data argument in train_single()). Alternatively, a SummarizedExperiment containing such a matrix in one of its assays. Non-default assay types can be specified with ref_assay_type.
ref_labels (Sequence) – Sequence of length equal to the number of columns of ref_data, containing the label associated with each column.
test_features (Optional[Sequence]) – Sequence of length equal to the number of rows in test_data, containing the feature identifier for each row. Alternatively None, to use the row names of the experiment as features.
ref_features (Optional[Sequence]) – Sequence of length equal to the number of rows of ref_data, containing the feature identifier for each row. Alternatively None, to use the row names of the experiment as features.
test_assay_type (Union[str, int]) – Assay containing the expression matrix, if test_data is a SummarizedExperiment.
ref_assay_type (Union[str, int]) – Assay containing the expression matrix, if ref_data is a SummarizedExperiment.
test_check_missing (bool) – Whether to remove rows with missing values from the test dataset.
ref_check_missing (bool) – Whether to remove rows with missing values from the reference dataset.
train_args (dict) – Further arguments to pass to train_single().
classify_args (dict) – Further arguments to pass to classify_single().
num_threads (int) – Number of threads to use for the various steps.
- Return type:
BiocFrame
- Returns:
A BiocFrame of labelling results, see classify_single() for details.
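Because only the ranking within each column of the test data is used, any strictly increasing transformation of a cell's expression values leaves the input to the classifier unchanged. A small standalone sketch of that property follows; the ranks helper is hypothetical and assumes there are no ties.

```python
import math


def ranks(values):
    """Zero-based rank of each value within one column (assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    out = [0] * len(values)
    for rank, index in enumerate(order):
        out[index] = rank
    return out


# One test cell: raw counts and their log-transformed counterpart.
counts = [0.0, 12.0, 3.0, 150.0, 7.0]
logged = [math.log1p(x) for x in counts]

print(ranks(counts))
print(ranks(logged))
print(ranks(counts) == ranks(logged))
```

This is why the test dataset does not need to be log-transformed, whereas the reference usually does when markers are derived from it.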
singler.classify_integrated module
- singler.classify_integrated.classify_integrated(test_data, results, integrated_prebuilt, assay_type=0, quantile=0.8, use_fine_tune=True, fine_tune_threshold=0.05, num_threads=1)
Integrate classification results across multiple references for a single test dataset.
- Parameters:
test_data (Any) – A matrix-like object where each row is a feature and each column is a test sample (usually a single cell), containing expression values. Normalized and/or transformed expression values are also acceptable as only the ranking is used within this function. Alternatively, a SummarizedExperiment containing such a matrix in one of its assays.
results (list[BiocFrame]) – List of classification results generated by running classify_single() on test_data with each reference. References should be in the same order as that used to construct integrated_prebuilt.
integrated_prebuilt (TrainedIntegratedReferences) – Integrated reference object, constructed with train_integrated().
assay_type (Union[str, int]) – Assay containing the expression matrix, if test_data is a SummarizedExperiment.
quantile (float) – Quantile of the correlation distribution for computing the score for each label. Larger values increase sensitivity of matches at the expense of similarity to the average behavior of each label.
use_fine_tune (bool) – Whether fine-tuning should be performed. This improves accuracy for distinguishing between references with similar best labels but requires more computational work.
fine_tune_threshold (float) – Maximum difference from the maximum correlation to use in fine-tuning. All references above this threshold are used for another round of fine-tuning.
num_threads (int) – Number of threads to use during classification.
- Return type:
BiocFrame
- Returns:
A BiocFrame containing the best_label across all references, defined as the assigned label in the best reference; the identity of the best_reference, either as a name string or an integer index; the scores for the best label in each reference, as a nested BiocFrame; and the delta from the best to the second-best reference. Each row corresponds to a column of test_data.
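The relationship between the returned columns can be sketched for a single test cell. This is only an illustration under simplifying assumptions: the per-reference scores are taken as given, fine-tuning is ignored, and integrate_cell is a hypothetical helper rather than a singler function.

```python
import math


def integrate_cell(best_labels, best_scores):
    """Hypothetical helper: combine per-reference calls for one test cell.

    best_labels maps each reference name to the label it assigned.
    best_scores maps each reference name to the score of that label.
    """
    ranked = sorted(best_scores, key=best_scores.get, reverse=True)
    best_reference = ranked[0]
    if len(ranked) > 1:
        delta = best_scores[ranked[0]] - best_scores[ranked[1]]
    else:
        delta = math.nan  # assumption: no runner-up means no delta
    return {
        "best_label": best_labels[best_reference],
        "best_reference": best_reference,
        "delta": delta,
    }


result = integrate_cell(
    best_labels={"atlas_a": "T cell", "atlas_b": "NK cell"},
    best_scores={"atlas_a": 0.7, "atlas_b": 0.5},
)
print(result)
```

The reference names and labels above are made up for the example.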
singler.classify_single module
- singler.classify_single.classify_single(test_data, ref_prebuilt, assay_type=0, quantile=0.8, use_fine_tune=True, fine_tune_threshold=0.05, num_threads=1)
Classify a test dataset against a reference by assigning labels from the latter to each column of the former using the SingleR algorithm.
- Parameters:
test_data (Any) – A matrix-like object where each row is a feature and each column is a test sample (usually a single cell), containing expression values. Normalized and transformed expression values are also acceptable as only the ranking is used within this function. Alternatively, a SummarizedExperiment containing such a matrix in one of its assays.
ref_prebuilt (TrainedSingleReference) – A pre-built reference created with train_single().
assay_type (Union[str, int]) – Assay containing the expression matrix, if test_data is a SummarizedExperiment.
quantile (float) – Quantile of the correlation distribution for computing the score for each label. Larger values increase sensitivity of matches at the expense of similarity to the average behavior of each label.
use_fine_tune (bool) – Whether fine-tuning should be performed. This improves accuracy for distinguishing between similar labels but requires more computational work.
fine_tune_threshold (float) – Maximum difference from the maximum correlation to use in fine-tuning. All labels above this threshold are used for another round of fine-tuning.
num_threads (int) – Number of threads to use during classification.
- Return type:
BiocFrame
- Returns:
A BiocFrame containing the best label, the scores for each label as a nested BiocFrame, and the delta from the best to the second-best label. Each row corresponds to a column of test_data. The metadata contains markers, a list of the markers from each pairwise comparison between labels; and used, a list containing the union of markers from all comparisons.
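The roles of quantile and fine_tune_threshold can be sketched for one test cell, given its correlations to the reference profiles of each label. This is a simplified illustration: the interpolation rule, the helper names, and the example numbers are assumptions, and the library's actual computation may differ in detail.

```python
def quantile(values, q):
    """Linear-interpolation quantile; the interpolation rule is an assumption."""
    ordered = sorted(values)
    position = q * (len(ordered) - 1)
    lower = int(position)
    upper = min(lower + 1, len(ordered) - 1)
    fraction = position - lower
    return ordered[lower] * (1 - fraction) + ordered[upper] * fraction


def score_labels(correlations, q=0.8):
    """Score per label: the q-th quantile of the cell's correlations."""
    return {label: quantile(vals, q) for label, vals in correlations.items()}


def fine_tune_candidates(scores, threshold=0.05):
    """Labels whose score lies within `threshold` of the best score."""
    cutoff = max(scores.values()) - threshold
    return [label for label, score in scores.items() if score >= cutoff]


# Correlations of one test cell to six reference profiles per label.
correlations = {
    "T": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "B": [0.05, 0.15, 0.25, 0.35, 0.47, 0.55],
    "NK": [0.0, 0.1, 0.1, 0.2, 0.2, 0.3],
}
scores = score_labels(correlations)
candidates = fine_tune_candidates(scores)
print(scores)
print(candidates)
```

Here "T" and "B" score within 0.05 of each other, so both would be carried into another round of fine-tuning, while "NK" is dropped.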
singler.get_classic_markers module
- singler.get_classic_markers.get_classic_markers(ref_data, ref_labels, ref_features, assay_type='logcounts', check_missing=True, num_de=None, num_threads=1)
Compute markers from a reference using the classic SingleR algorithm. This is typically done for reference datasets derived from replicated bulk transcriptomic experiments.
- Parameters:
ref_data (Union[Any, list[Any]]) – A matrix-like object containing the log-normalized expression values of a reference dataset. Each column is a sample and each row is a feature. Alternatively, this can be a SummarizedExperiment containing a matrix-like object in one of its assays. Alternatively, a list of such matrices or SummarizedExperiment objects, typically for multiple batches of the same reference; it is assumed that different batches exhibit at least some overlap in their ref_features and ref_labels.
ref_labels (Union[Sequence, list[Sequence]]) – If ref_data is not a list, ref_labels should be a sequence of length equal to the number of columns of ref_data, containing a label (usually a string) for each column. If ref_data is a list, ref_labels should also be a list of the same length. Each entry should be a sequence of length equal to the number of columns of the corresponding entry of ref_data.
ref_features (Union[Sequence, list[Sequence]]) – If ref_data is not a list, ref_features should be a sequence of length equal to the number of rows of ref_data, containing the feature name (usually a string) for each row. If ref_data is a list, ref_features should also be a list of the same length. Each entry should be a sequence of length equal to the number of rows of the corresponding entry of ref_data.
assay_type (Union[str, int]) – Name or index of the assay of interest, if ref_data is or contains SummarizedExperiment objects.
check_missing (bool) – Whether to check for and remove rows with missing (NaN) values in the reference matrices. This can be set to False if it is known that no NaN values exist.
num_de (Optional[int]) – Number of differentially expressed genes to use as markers for each pairwise comparison between labels. If None, an appropriate number of genes is automatically determined.
num_threads (int) – Number of threads to use for the calculations.
- Return type:
- Returns:
A dictionary of dictionaries of lists containing the markers for each pairwise comparison between labels, i.e., markers[a][b] contains the upregulated markers for label a over label b.
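The nested layout of the returned markers can be pictured with a hand-written example. The labels and gene names below are made up for illustration and are not output from get_classic_markers(); whether same-label entries such as markers["T"]["T"] are present is also an assumption.

```python
# markers[a][b] lists the genes upregulated in label a relative to label b.
markers = {
    "T": {"T": [], "B": ["CD3E", "CD2"]},
    "B": {"T": ["CD79A", "MS4A1"], "B": []},
}

# Genes that distinguish T cells from B cells.
print(markers["T"]["B"])

# Union of markers across all pairwise comparisons, i.e. the set of
# features that would actually be needed for classification.
used = sorted(
    {gene for inner in markers.values() for genes in inner.values() for gene in genes}
)
print(used)
```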
singler.train_integrated module
- class singler.train_integrated.TrainedIntegratedReferences(ptr, ref_labels)
Bases: object
Object containing integrated references, typically constructed by train_integrated().
- singler.train_integrated.train_integrated(test_features, ref_prebuilt, warn_lost=True, num_threads=1)
Build a set of integrated references for classification of a test dataset.
- Parameters:
test_features (Sequence) – Sequence of features for the test dataset.
ref_prebuilt (list[TrainedSingleReference]) – List of prebuilt references, typically created by calling train_single().
warn_lost (bool) – Whether to emit a warning if the markers for each reference are not all present in all references.
num_threads (int) – Number of threads.
- Return type:
TrainedIntegratedReferences
- Returns:
Integrated references for classification with classify_integrated().
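The situation that warn_lost guards against can be sketched with plain sets: a marker chosen for one reference may be absent from another reference's features. The helper find_lost_markers is hypothetical, and the exact condition checked by the library is an assumption here.

```python
def find_lost_markers(ref_markers, ref_features):
    """Hypothetical check: markers of reference i missing from reference j.

    ref_markers[i] holds the marker identifiers chosen for reference i.
    ref_features[j] holds all feature identifiers available in reference j.
    """
    lost = {}
    for i, markers in enumerate(ref_markers):
        for j, features in enumerate(ref_features):
            if i == j:
                continue
            missing = set(markers) - set(features)
            if missing:
                lost[(i, j)] = sorted(missing)
    return lost


ref_markers = [["g1", "g2"], ["g2", "g3"]]
ref_features = [["g1", "g2", "g3"], ["g2", "g3", "g4"]]
print(find_lost_markers(ref_markers, ref_features))
```

In this toy input, marker g1 of the first reference is not measured in the second reference, which is the kind of loss that would be reported.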
singler.train_single module
- class singler.train_single.TrainedSingleReference(ptr, full_data, full_label_codes, labels, features, markers)
Bases: object
A prebuilt reference object, typically created by train_single(). This is intended for advanced users only and should not be serialized.
- marker_subset(indices_only=False)
- Parameters:
indices_only (bool) – Whether to return the markers as indices into features, or as a list of feature identifiers.
- Return type:
- Returns:
If indices_only = False, a list of feature identifiers for the markers. If indices_only = True, a NumPy array containing the integer indices of features in features that were chosen as markers.
- num_markers()
- Return type:
- Returns:
Number of markers to be used for classification. This is the same as the size of the array from marker_subset().
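The two return modes of marker_subset() differ only in whether indices or identifiers are reported. A minimal standalone sketch of that relationship follows; marker_subset_sketch is a hypothetical function that uses plain lists instead of a NumPy array, and the feature names are made up.

```python
def marker_subset_sketch(features, marker_indices, indices_only=False):
    """Hypothetical stand-in for the indices/identifiers switch."""
    if indices_only:
        # Positions of the chosen markers within `features`.
        return list(marker_indices)
    # The same markers, reported as feature identifiers.
    return [features[i] for i in marker_indices]


features = ["GeneA", "GeneB", "GeneC", "GeneD"]
marker_indices = [0, 2]

print(marker_subset_sketch(features, marker_indices, indices_only=True))
print(marker_subset_sketch(features, marker_indices))
# The number of markers is the length of either result.
print(len(marker_subset_sketch(features, marker_indices)))
```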
- singler.train_single.train_single(ref_data, ref_labels, ref_features, test_features=None, assay_type='logcounts', restrict_to=None, check_missing=True, markers=None, marker_method='classic', num_de=None, marker_args={}, aggregate=False, aggregate_args={}, nn_parameters=<knncolle.vptree.VptreeParameters object>, num_threads=1)
Build a single reference dataset in preparation for classification.
- Parameters:
ref_data (Any) – A matrix-like object where rows are features, columns are reference profiles, and each entry is the expression value. If markers is not provided, expression should be normalized and log-transformed in preparation for marker prioritization via differential expression analyses. Otherwise, any expression values are acceptable as only the ranking within each column is used. Alternatively, a SummarizedExperiment containing such a matrix in one of its assays.
ref_labels (Sequence) – Sequence of labels for each reference profile, i.e., column in ref_data.
ref_features (Sequence) – Sequence of identifiers for each feature, i.e., row in ref_data.
test_features (Optional[Sequence]) – Sequence of identifiers for each feature in the test dataset.
assay_type (Union[str, int]) – Assay containing the expression matrix, if ref_data is a SummarizedExperiment.
check_missing (bool) – Whether to check for and remove rows with missing (NaN) values from ref_data.
restrict_to (Union[set, dict, None]) – Subset of available features to restrict to. Only features in restrict_to will be used in the reference building. If None, no restriction is performed.
markers (Optional[dict[Any, dict[Any, Sequence]]]) – Upregulated markers for each pairwise comparison between labels. Specifically, markers[a][b] should be a sequence of features that are upregulated in a compared to b. All such features should be present in ref_features, and all labels in ref_labels should have keys in the inner and outer dictionaries.
marker_method (Literal['classic', 'auc', 'cohens_d']) – Method to identify markers from each pairwise comparison between labels in ref_data. If classic, we call get_classic_markers(). If auc or cohens_d, we call score_markers(). Only used if markers is not supplied.
num_de (Optional[int]) – Number of differentially expressed genes to use as markers for each pairwise comparison between labels. If None and marker_method = "classic", an appropriate number of genes is determined by get_classic_markers(). Otherwise, it is set to 10. Only used if markers is not supplied.
marker_args (dict) – Further arguments to pass to the chosen marker detection method. If marker_method = "classic", this is get_classic_markers(), otherwise it is score_markers(). Only used if markers is not supplied.
aggregate (bool) – Whether the reference dataset should be aggregated to pseudo-bulk samples for speed, see aggregate_reference() for details.
aggregate_args (dict) – Further arguments to pass to aggregate_reference() when aggregate = True.
nn_parameters (Optional[Parameters]) – Algorithm for constructing the neighbor search index, used to compute scores during classification.
num_threads (int) – Number of threads to use for reference building.
- Return type:
TrainedSingleReference
- Returns:
The pre-built reference, ready for use in downstream methods like classify_single().
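When markers is supplied by hand, every label needs an entry in both the outer and inner dictionaries. A minimal sketch of a completeness check follows; missing_marker_pairs is a hypothetical helper, and skipping the same-label pairs is an assumption of this sketch.

```python
def missing_marker_pairs(markers, ref_labels):
    """Hypothetical check: list label pairs (a, b) lacking markers[a][b]."""
    labels = sorted(set(ref_labels))
    missing = []
    for a in labels:
        for b in labels:
            if a == b:
                continue  # assumption: same-label pairs are not required
            if a not in markers or b not in markers[a]:
                missing.append((a, b))
    return missing


ref_labels = ["T", "B", "T", "NK"]
markers = {
    "T": {"B": ["g1"], "NK": ["g2"]},
    "B": {"T": ["g3"]},  # the B-versus-NK comparison is absent
}
print(missing_marker_pairs(markers, ref_labels))
```

An empty result would indicate that the dictionary covers every ordered pair of distinct labels.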