singlepp
A C++ library for cell type classification
This repository contains a C++ port of the SingleR R package for automated cell type annotation. Given a test matrix of single-cell expression values, it compares each cell to a reference dataset with known cell type labels. Scoring is based on Spearman's rank correlation across the marker genes for each label, with additional fine-tuning to distinguish between closely related labels. singlepp returns these scores along with the best label for each cell in the test dataset. We provide methods for annotation based on a single reference as well as integration of labels across multiple references.
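To make the scoring idea concrete, here is a minimal, self-contained sketch of Spearman's rank correlation between one test cell and one reference profile, restricted to a set of marker rows. This is not singlepp's implementation: the function names `average_ranks` and `spearman_on_markers` are hypothetical, and the handling of ties (average ranks) and of constant inputs (returning 0) are assumptions made for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical helper: 1-based ranks, with tied values sharing their average rank.
std::vector<double> average_ranks(const std::vector<double>& values) {
    const std::size_t n = values.size();
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(),
              [&](std::size_t a, std::size_t b) { return values[a] < values[b]; });

    std::vector<double> ranks(n);
    std::size_t i = 0;
    while (i < n) {
        std::size_t j = i;
        while (j + 1 < n && values[order[j + 1]] == values[order[i]]) {
            ++j;
        }
        // Sorted positions i..j are tied; they share the mean of ranks (i+1)..(j+1).
        const double shared = static_cast<double>(i + j) / 2.0 + 1.0;
        for (std::size_t k = i; k <= j; ++k) {
            ranks[order[k]] = shared;
        }
        i = j + 1;
    }
    return ranks;
}

// Hypothetical helper: Spearman's rho computed only over the rows in 'marker_rows'.
double spearman_on_markers(const std::vector<double>& test_cell,
                           const std::vector<double>& ref_profile,
                           const std::vector<std::size_t>& marker_rows) {
    assert(test_cell.size() == ref_profile.size());
    std::vector<double> x, y;
    for (std::size_t row : marker_rows) {
        assert(row < test_cell.size());
        x.push_back(test_cell[row]);
        y.push_back(ref_profile[row]);
    }
    const std::vector<double> rx = average_ranks(x);
    const std::vector<double> ry = average_ranks(y);

    // The mean of the ranks 1..n is (n + 1) / 2, and average ranks preserve it.
    const double mean = (static_cast<double>(rx.size()) + 1.0) / 2.0;
    double cov = 0, var_x = 0, var_y = 0;
    for (std::size_t i = 0; i < rx.size(); ++i) {
        cov += (rx[i] - mean) * (ry[i] - mean);
        var_x += (rx[i] - mean) * (rx[i] - mean);
        var_y += (ry[i] - mean) * (ry[i] - mean);
    }
    if (var_x == 0 || var_y == 0) {
        return 0.0; // Correlation is undefined here; 0 is a choice made for this sketch.
    }
    return cov / std::sqrt(var_x * var_y);
}
```

Because only ranks enter the calculation, the score is unaffected by monotonic transformations of either profile, which is one reason a rank-based correlation is a natural fit when comparing test and reference data generated by different technologies.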
singlepp is a header-only library, so it can be used by simply including the relevant header files with #include. Assuming the reference matrix, labels and markers are available, we can run the classification:
See the reference documentation for more details.
Given a reference dataset, singlepp implements a simple method of identifying marker genes between labels. This is based on ranking the differences in median log-expression values between labels and is the "classic" method provided in the original SingleR package.
The classic_markers can then be directly used in train_single(). Of course, other marker detection schemes can be used, depending on the type of reference dataset. For single-cell references, users may be interested in some of the differential analysis methods in the libscran library.
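The idea behind the classic ranking can be sketched in a few lines. The code below is an illustration only and does not reproduce singlepp's own marker detection: the names `median`, `PairwiseMarkers` and `rank_by_median_difference` are hypothetical, the input is assumed to be a precomputed table of per-label medians, and keeping only genes with a positive difference is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Median of a non-empty vector. The argument is taken by value so that
// the caller's data is not reordered by std::nth_element.
double median(std::vector<double> values) {
    assert(!values.empty());
    const std::size_t mid = values.size() / 2;
    std::nth_element(values.begin(), values.begin() + mid, values.end());
    const double upper = values[mid];
    if (values.size() % 2 == 1) {
        return upper;
    }
    // Even length: average the two central values.
    const double lower = *std::max_element(values.begin(), values.begin() + mid);
    return (lower + upper) / 2.0;
}

// Hypothetical layout: markers[a][b] lists genes that are higher in label 'a'
// than in label 'b', ordered from the largest difference to the smallest.
using PairwiseMarkers = std::vector<std::vector<std::vector<std::size_t>>>;

// 'medians[label][gene]' holds the median log-expression of each gene in each label.
PairwiseMarkers rank_by_median_difference(const std::vector<std::vector<double>>& medians,
                                          std::size_t top) {
    const std::size_t nlabels = medians.size();
    PairwiseMarkers markers(nlabels, std::vector<std::vector<std::size_t>>(nlabels));

    for (std::size_t a = 0; a < nlabels; ++a) {
        for (std::size_t b = 0; b < nlabels; ++b) {
            if (a == b) {
                continue;
            }
            assert(medians[a].size() == medians[b].size());

            // Assumption of this sketch: only genes upregulated in 'a' are candidates.
            std::vector<std::size_t> genes;
            for (std::size_t g = 0; g < medians[a].size(); ++g) {
                if (medians[a][g] > medians[b][g]) {
                    genes.push_back(g);
                }
            }

            // Largest difference first; stable_sort keeps row order among ties.
            std::stable_sort(genes.begin(), genes.end(),
                             [&](std::size_t l, std::size_t r) {
                                 return (medians[a][l] - medians[b][l]) >
                                        (medians[a][r] - medians[b][r]);
                             });
            if (genes.size() > top) {
                genes.resize(top);
            }
            markers[a][b] = genes;
        }
    }
    return markers;
}
```

Each pairwise comparison gets its own ranked list, so a gene that separates two closely related labels can still be selected even if it is unremarkable in every other comparison.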
By default, it is expected that the markers supplied to train_single() have already been filtered to only the top markers for each pairwise comparison. However, in some cases, it might be more convenient for markers to contain a ranking of all genes such that the desired subset of top markers can be chosen later. This is achieved by setting TrainSingleOptions::top to the desired number of markers per comparison, e.g., for 20 markers:
Doing so is roughly equivalent to slicing each vector in markers to the top 20 entries before calling train_single(). In fact, setting TrainSingleOptions::top is the better approach when intersecting feature spaces (see below), as the top set will not be contaminated by genes that are not present in the test dataset.
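The difference between slicing up front and choosing the top set later can be shown with a small sketch. This is not singlepp code: `slice_top` and `top_available` are hypothetical helpers, and `in_test` is an assumed per-gene flag saying whether a reference row also exists in the test dataset.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Keep the first 'top' entries of a ranked marker list, regardless of
// whether those genes are available in the test dataset.
std::vector<std::size_t> slice_top(const std::vector<std::size_t>& ranked, std::size_t top) {
    std::vector<std::size_t> out(ranked);
    if (out.size() > top) {
        out.resize(top);
    }
    return out;
}

// Walk down the full ranking and keep the first 'top' genes that are
// actually present in the test dataset.
std::vector<std::size_t> top_available(const std::vector<std::size_t>& ranked,
                                       const std::vector<bool>& in_test,
                                       std::size_t top) {
    std::vector<std::size_t> out;
    for (std::size_t gene : ranked) {
        if (out.size() == top) {
            break;
        }
        assert(gene < in_test.size());
        if (in_test[gene]) {
            out.push_back(gene);
        }
    }
    return out;
}
```

With a ranking of `{7, 3, 9, 1, 4}` where genes 3 and 9 are missing from the test dataset, `slice_top(ranked, 2)` yields `{7, 3}`, of which only gene 7 is usable, whereas `top_available(ranked, in_test, 2)` yields `{7, 1}`. Supplying the full ranking and deferring the choice of the top set therefore keeps the requested number of usable markers whenever enough shared genes exist.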
Often the reference dataset will not have the same genes as the test dataset. To handle this case, users should call train_single_intersect() with the row identifiers of the reference and test matrices. Then, classify_single_intersect() will perform classification using only the intersection of genes between the two datasets:
The gene identifiers can be anything that can be hashed and compared. These are most commonly std::string values but can also be integers (e.g., for Entrez IDs).
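As an illustration of why identifiers only need to be hashable and comparable, the following sketch matches rows between two datasets using a hash map. It is not singlepp's implementation: `intersect_rows` is a hypothetical function, and keeping the first occurrence of a duplicated reference identifier is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Returns (test row, reference row) pairs for identifiers found in both datasets.
// 'Id' can be any type usable as a std::unordered_map key, e.g. std::string or int.
template<typename Id>
std::vector<std::pair<std::size_t, std::size_t>> intersect_rows(
    const std::vector<Id>& test_ids,
    const std::vector<Id>& ref_ids)
{
    // Map each reference identifier to its row. emplace() does not overwrite,
    // so the first occurrence of a duplicated identifier is the one kept.
    std::unordered_map<Id, std::size_t> ref_lookup;
    for (std::size_t r = 0; r < ref_ids.size(); ++r) {
        ref_lookup.emplace(ref_ids[r], r);
    }

    std::vector<std::pair<std::size_t, std::size_t>> shared;
    for (std::size_t t = 0; t < test_ids.size(); ++t) {
        auto it = ref_lookup.find(test_ids[t]);
        if (it != ref_lookup.end()) {
            shared.emplace_back(t, it->second);
        }
    }

    // Each test row contributes at most one pair.
    assert(shared.size() <= test_ids.size());
    return shared;
}
```

The same template works for gene symbols stored as strings and for integer identifiers, since both have `std::hash` specializations and `operator==` in the standard library.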
To combine results from multiple references, we first need to perform classification within each reference. Let's say we have two references A and B:
We build the integrated classifier:
And then we can finally run the scoring. For each cell in the test dataset, classify_integrated() picks the best label among the assignments from each individual reference.
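The final selection step can be pictured as choosing, for one cell, the winning assignment among the per-reference results. The sketch below shows only that selection: `ReferenceCall`, `IntegratedCall` and `pick_best` are hypothetical names, and the scores are assumed to be already comparable across references. How such scores are computed is not covered here.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-reference result for a single cell: the label assigned
// within that reference, and a score that is assumed to be comparable
// across references.
struct ReferenceCall {
    int label;
    double score;
};

// Hypothetical integrated result: which reference won, and its label.
struct IntegratedCall {
    std::size_t reference;
    int label;
};

// Pick the reference whose assigned label has the highest score.
// Ties are resolved in favor of the earlier reference.
IntegratedCall pick_best(const std::vector<ReferenceCall>& calls) {
    assert(!calls.empty());
    std::size_t best = 0;
    for (std::size_t r = 1; r < calls.size(); ++r) {
        if (calls[r].score > calls[best].score) {
            best = r;
        }
    }
    return IntegratedCall{best, calls[best].label};
}
```

For example, with reference A assigning label 2 at a score of 0.41 and reference B assigning label 5 at a score of 0.63, `pick_best` reports reference index 1 and label 5. Labels remain tied to the reference that produced them, so the references do not need to share a common label vocabulary.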
FetchContent
If you're using CMake, you just need to add something like this to your CMakeLists.txt:
Then you can link to singlepp to make the headers available during compilation:
find_package()
To install the library, use:
By default, this will use FetchContent to fetch all external dependencies. If you want to install them manually, use -DSINGLEPP_FETCH_EXTERN=OFF. See the tags in extern/CMakeLists.txt to find compatible versions of each dependency.
If you're not using CMake, the simple approach is to just copy the files in include/ (either directly or with Git submodules) and include their path during compilation with, e.g., GCC's -I. This assumes that the external dependencies listed in extern/CMakeLists.txt are available during compilation.
Aran D et al. (2019). Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat. Immunol. 20, 163-172