singlepp
A C++ library for cell type classification
C++ port of SingleR


Overview

This repository contains a C++ port of the SingleR R package for automated cell type annotation. Given a test matrix of single-cell expression values, it compares each cell to a reference dataset with known cell type labels. Scoring is based on Spearman's rank correlation across the marker genes for each label, with additional fine-tuning to distinguish between closely related labels. singlepp returns these scores along with the best label for each cell in the test dataset. We provide methods for annotation based on a single reference as well as integration of labels across multiple references.
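To make the scoring step concrete, here is a minimal, self-contained sketch of Spearman's rank correlation between one test cell and one reference profile, both already restricted to the same marker genes. It is an illustration only and not singlepp's implementation; the function names average_ranks and spearman are hypothetical, and the fine-tuning step is omitted.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

// Assign 1-based ranks to the values; tied values share the mean of their ranks.
std::vector<double> average_ranks(const std::vector<double>& values) {
    const std::size_t n = values.size();
    std::vector<std::size_t> order(n);
    std::iota(order.begin(), order.end(), std::size_t(0));
    std::sort(order.begin(), order.end(),
              [&values](std::size_t a, std::size_t b) { return values[a] < values[b]; });

    std::vector<double> ranks(n);
    std::size_t start = 0;
    while (start < n) {
        // Find the end of the current run of tied values.
        std::size_t end = start;
        while (end + 1 < n && values[order[end + 1]] == values[order[start]]) {
            ++end;
        }
        const double shared = (static_cast<double>(start) + static_cast<double>(end)) / 2.0 + 1.0;
        for (std::size_t k = start; k <= end; ++k) {
            ranks[order[k]] = shared;
        }
        start = end + 1;
    }
    return ranks;
}

// Spearman's rho is the Pearson correlation of the ranks.
// The result is NaN if all values in x (or in y) are identical.
double spearman(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size());
    assert(x.size() > 1);
    const std::vector<double> rx = average_ranks(x);
    const std::vector<double> ry = average_ranks(y);

    // The mean of n average ranks is always (n + 1) / 2, even with ties.
    const double mean = (static_cast<double>(x.size()) + 1.0) / 2.0;
    double sxy = 0.0, sxx = 0.0, syy = 0.0;
    for (std::size_t i = 0; i < rx.size(); ++i) {
        const double dx = rx[i] - mean;
        const double dy = ry[i] - mean;
        sxy += dx * dy;
        sxx += dx * dx;
        syy += dy * dy;
    }
    return sxy / std::sqrt(sxx * syy);
}
```

Because only the ranks matter, the score is unaffected by any monotonic transformation of the expression values, which is convenient when the test and reference datasets are on different scales.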

Quick start

singlepp is a header-only library, so it can be used by simply including the relevant header files with #include. Assuming the reference matrix, labels and markers are available, we can run the classification as follows:

#include "singlepp/singlepp.hpp"
// Prepare the reference matrix as a tatami::NumericMatrix.
ref_mat;
// Prepare a vector of labels, one per column of ref_mat.
ref_labels;
// Prepare a vector of vectors of markers for pairwise comparisons between labels.
ref_markers;
// Prepare the options: a TrainSingleOptions and a ClassifySingleOptions object.
train_opt;
class_opt;
// Building the classifier.
auto trained = singlepp::train_single(
    ref_mat,
    ref_labels.data(),
    ref_markers,
    train_opt
);
// Running the classification on the test matrix.
auto res = singlepp::classify_single(test_mat, trained, class_opt);

See the reference documentation for more details.

Identifying markers

Given a reference dataset, singlepp implements a simple method for identifying marker genes between labels. For each pair of labels, genes are ranked by the difference in median log-expression between the two labels; this is the "classic" method provided in the original SingleR package.

// Prepare the options, a ChooseClassicMarkersOptions object.
m_opt;
auto classic_markers = singlepp::choose_classic_markers(
    ref_mat.get(),
    ref_labels.data(),
    m_opt
);

The classic_markers can then be directly used in train_single(). Of course, other marker detection schemes can be used, depending on the type of reference dataset. For single-cell references, users may be interested in some of the differential analysis methods in the libscran library.
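The median-difference ranking behind the classic method can be sketched in a few lines of standalone C++. This is a simplified illustration under stated assumptions, not the code in choose_classic_markers(): the names median, median_by_label and pick_markers are hypothetical, the input is a dense gene-by-sample vector of log-expression values, every label is assumed to have at least one sample, and the top genes are kept regardless of the sign of the difference.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Median of a non-empty set of values (taken by copy so that it can be sorted).
double median(std::vector<double> values) {
    assert(!values.empty());
    std::sort(values.begin(), values.end());
    const std::size_t mid = values.size() / 2;
    return values.size() % 2 == 1 ? values[mid] : (values[mid - 1] + values[mid]) / 2.0;
}

// expr[g][s] is the log-expression of gene g in sample s; labels[s] lies in [0, num_labels).
// Returns medians[l][g], the median of gene g across the samples with label l.
std::vector<std::vector<double>> median_by_label(
    const std::vector<std::vector<double>>& expr,
    const std::vector<int>& labels,
    int num_labels)
{
    std::vector<std::vector<double>> medians(num_labels, std::vector<double>(expr.size()));
    for (std::size_t g = 0; g < expr.size(); ++g) {
        assert(expr[g].size() == labels.size());
        std::vector<std::vector<double>> groups(num_labels);
        for (std::size_t s = 0; s < labels.size(); ++s) {
            groups[labels[s]].push_back(expr[g][s]);
        }
        for (int l = 0; l < num_labels; ++l) {
            medians[l][g] = median(groups[l]);
        }
    }
    return medians;
}

// markers[a][b] holds the indices of up to `top` genes with the largest
// (median in label a) - (median in label b), in decreasing order of that difference.
std::vector<std::vector<std::vector<int>>> pick_markers(
    const std::vector<std::vector<double>>& medians,
    std::size_t top)
{
    const std::size_t num_labels = medians.size();
    std::vector<std::vector<std::vector<int>>> markers(
        num_labels, std::vector<std::vector<int>>(num_labels));
    for (std::size_t a = 0; a < num_labels; ++a) {
        for (std::size_t b = 0; b < num_labels; ++b) {
            if (a == b) {
                continue; // no markers for a label against itself
            }
            const std::size_t num_genes = medians[a].size();
            std::vector<int> order(num_genes);
            std::iota(order.begin(), order.end(), 0);
            // stable_sort breaks ties by gene index.
            std::stable_sort(order.begin(), order.end(), [&](int i, int j) {
                return medians[a][i] - medians[b][i] > medians[a][j] - medians[b][j];
            });
            order.resize(std::min(top, num_genes));
            markers[a][b] = order;
        }
    }
    return markers;
}
```

The nested result mirrors the "vector of vectors of markers for pairwise comparisons" layout described in the quick start, with one ranked gene list per ordered pair of labels.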

By default, it is expected that the markers supplied to train_single() have already been filtered to only the top markers for each pairwise comparison. However, in some cases, it might be more convenient for markers to contain a ranking of all genes such that the desired subset of top markers can be chosen later. This is achieved by setting TrainSingleOptions::top to the desired number of markers per comparison, e.g., for 20 markers:

train_opt.top = 20;
auto trained20 = singlepp::train_single(
    ref_mat,
    ref_labels.data(),
    ref_markers,
    train_opt
);

Doing so is roughly equivalent to slicing each vector in markers to the top 20 entries before calling train_single(). In fact, setting TrainSingleOptions::top is the better approach when intersecting feature spaces (see below), as the top set will not be contaminated by genes that are not present in the test dataset.
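The difference between the two orders of operation can be shown with a small standalone sketch. It assumes a ranked marker list of gene indices and a set of genes that are available in the test dataset; the function names slice_then_filter and filter_then_slice are hypothetical and do not correspond to anything in singlepp.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <unordered_set>
#include <vector>

// Slice to the top entries first, then drop genes that are absent from the test data.
// This can leave fewer than `top` usable markers.
std::vector<int> slice_then_filter(
    const std::vector<int>& ranked,
    const std::unordered_set<int>& available,
    std::size_t top)
{
    std::vector<int> kept;
    const std::size_t limit = std::min(top, ranked.size());
    for (std::size_t i = 0; i < limit; ++i) {
        if (available.count(ranked[i]) > 0) {
            kept.push_back(ranked[i]);
        }
    }
    return kept;
}

// Drop absent genes first, then take the top entries of what remains.
// This yields `top` markers whenever enough shared genes exist.
std::vector<int> filter_then_slice(
    const std::vector<int>& ranked,
    const std::unordered_set<int>& available,
    std::size_t top)
{
    std::vector<int> kept;
    for (int gene : ranked) {
        if (kept.size() == top) {
            break;
        }
        if (available.count(gene) > 0) {
            kept.push_back(gene);
        }
    }
    assert(kept.size() <= top);
    return kept;
}
```

With a ranking of genes 5, 3, 8, 1, a test dataset that lacks gene 5, and a request for two markers, the first function keeps only gene 3 while the second keeps genes 3 and 8.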

Intersecting feature sets

Often the reference dataset will not have the same genes as the test dataset. To handle this case, users should use train_single_intersect() with identifiers for the rows of the reference and test matrices.

test_names; // vector of feature IDs for the test data
ref_names; // vector of feature IDs for the reference data
auto trained_intersect = singlepp::train_single_intersect(
    test_mat.nrow(),
    test_names.data(),
    ref_mat,
    ref_names.data(),
    ref_labels.data(),
    ref_markers,
    train_opt
);

Then, classify_single_intersect() will perform classification using only the intersection of genes:

auto res_intersect = singlepp::classify_single_intersect(
    test_mat,
    trained_intersect,
    class_opt
);

The gene identifiers can be anything that can be hashed and compared. These are most commonly std::strings but can also be integers (e.g., for Entrez IDs).
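As a rough illustration of why hashing and comparison are all that is required of an identifier, the following sketch matches rows of a test matrix to rows of a reference matrix with a hash map. It is not singlepp's intersection code; the function name intersect_features is hypothetical, and the handling of duplicated identifiers (first occurrence wins, each reference row is matched at most once) is an assumption made for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Returns (test row, reference row) pairs for identifiers present in both vectors,
// ordered by test row. Id_ only needs std::hash and operator== support.
template <typename Id_>
std::vector<std::pair<std::size_t, std::size_t>> intersect_features(
    const std::vector<Id_>& test_ids,
    const std::vector<Id_>& ref_ids)
{
    std::unordered_map<Id_, std::size_t> ref_rows;
    ref_rows.reserve(ref_ids.size());
    for (std::size_t r = 0; r < ref_ids.size(); ++r) {
        ref_rows.emplace(ref_ids[r], r); // emplace keeps the first occurrence of a duplicate
    }

    std::vector<std::pair<std::size_t, std::size_t>> shared;
    for (std::size_t t = 0; t < test_ids.size(); ++t) {
        auto found = ref_rows.find(test_ids[t]);
        if (found != ref_rows.end()) {
            shared.emplace_back(t, found->second);
            ref_rows.erase(found); // match each reference row at most once
        }
    }
    assert(shared.size() <= ref_ids.size());
    return shared;
}
```

The same template works for std::string gene symbols and for integer identifiers such as Entrez IDs, since both have std::hash specializations.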

Integrating results across references

To combine results from multiple references, we first need to perform classification within each reference. Let's say we have two references A and B:

auto trainA = singlepp::train_single(refA_mat, refA_labels.data(), refA_markers, train_opt);
auto resA = singlepp::classify_single(test_mat, trainA, class_opt);
auto trainB = singlepp::train_single(refB_mat, refB_labels.data(), refB_markers, train_opt);
auto resB = singlepp::classify_single(test_mat, trainB, class_opt);

We build the integrated classifier:

// Prepare the options, a TrainIntegratedOptions object.
ti_opt;
std::vector<singlepp::TrainIntegratedInput<> > inputs;
inputs.push_back(singlepp::prepare_integrated_input(refA_mat, refA_labels.data(), trainA));
inputs.push_back(singlepp::prepare_integrated_input(refB_mat, refB_labels.data(), trainB));
auto train_integrated = singlepp::train_integrated(inputs, ti_opt);

And then we can finally run the scoring. For each cell in the test dataset, classify_integrated() picks the best label among the assignments from each individual reference.

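// Pointers to the per-reference assignments from classify_single(), assuming int labels.
std::vector<const int*> assigned{ resA.best.data(), resB.best.data() };
auto ires = singlepp::classify_integrated(test_mat, assigned, train_integrated, ci_opt);
ires.best; // index of the best reference.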
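The final selection step can be pictured as a per-cell argmax over references. The sketch below assumes that a comparable score has already been computed for each cell's assigned label in each reference; how singlepp derives those scores is not shown here. The function name pick_best_reference is hypothetical, and breaking ties in favor of the earlier reference is an assumption of this example.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// scores[r][c] is the score of cell c for its assigned label in reference r.
// Returns, for each cell, the index of the reference with the highest score.
std::vector<std::size_t> pick_best_reference(const std::vector<std::vector<double>>& scores) {
    assert(!scores.empty());
    const std::size_t num_cells = scores.front().size();
    std::vector<std::size_t> best(num_cells, 0);
    for (std::size_t r = 1; r < scores.size(); ++r) {
        assert(scores[r].size() == num_cells);
        for (std::size_t c = 0; c < num_cells; ++c) {
            // Strict comparison, so ties keep the earlier reference.
            if (scores[r][c] > scores[best[c]][c]) {
                best[c] = r;
            }
        }
    }
    return best;
}
```

The returned indices play the same role as ires.best above: each entry identifies the reference whose label is reported for that cell.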

Building projects

CMake with FetchContent

If you're using CMake, you just need to add something like this to your CMakeLists.txt:

include(FetchContent)
FetchContent_Declare(
singlepp
GIT_REPOSITORY https://github.com/singler-inc/singlepp
GIT_TAG master # or any version of interest
)
FetchContent_MakeAvailable(singlepp)

Then you can link to singlepp to make the headers available during compilation:

# For executables:
target_link_libraries(myexe singlepp)
# For libraries:
target_link_libraries(mylib INTERFACE singlepp)

CMake with find_package()

find_package(singler_singlepp CONFIG REQUIRED)
target_link_libraries(mylib INTERFACE singler::singlepp)

To install the library, use:

mkdir build && cd build
cmake .. -DSINGLEPP_TESTS=OFF
cmake --build . --target install

By default, this will use FetchContent to fetch all external dependencies. If you want to install them manually, use -DSINGLEPP_FETCH_EXTERN=OFF. See the tags in extern/CMakeLists.txt to find compatible versions of each dependency.

Manual

If you're not using CMake, the simple approach is to just copy the files in include/ - either directly or with Git submodules - and include their path during compilation with, e.g., GCC's -I. This requires the external dependencies listed in extern/CMakeLists.txt, which also need to be made available during compilation.

References

Aran D et al. (2019). Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat. Immunol. 20, 163-172