
SCANPY is benchmarked in comparisons with established packages

In a detailed clustering tutorial of 2700 peripheral blood mononuclear cells (PBMCs), adapted from one of SEURAT's tutorials ( ), all steps from raw count data to the identification of cell types are carried out, providing speedups between 5 and 90 times in each step ( ). Benchmarking against the more run-time-optimized CELL RANGER R kit, we demonstrate a speedup of 5 to 16 times for a data set of 68,579 PBMCs (Fig. ). Moreover, we demonstrate the feasibility of analyzing 1.3 million cells without subsampling in a few hours of computing time on eight cores of a small computing server (Fig. ). Thus, SCANPY provides tools with speedups that enable the analysis of data sets with more than one million cells, and interactive analysis with run times on the order of seconds for about 100,000 cells. In addition to these standard clustering-based analyses, we demonstrate the reconstruction of branching developmental processes via diffusion pseudotime as in the original paper ( ), the simulation of single cells using literature-curated gene regulatory networks based on the ideas of ( ), and the analysis of deep-learning results for single-cell imaging data ( ).

SCANPY introduces efficient modular implementation choices

With SCANPY, we introduce the class ANNDATA (with a corresponding package, ANNDATA), which stores a data matrix with the most general annotations possible: annotations of observations (samples, cells), annotations of variables (features, genes), and unstructured annotations. As SCANPY is built around this class, it is easy to add new functionality to the toolkit. All statistics and machine-learning tools extract information from the data matrix; this information can be added to an ANNDATA object while leaving the structure of ANNDATA unaffected. ANNDATA is similar to R's EXPRESSIONSET, but supports sparse data and allows HDF5-based backing of ANNDATA objects on disk, in a format independent of platform, framework, and language. This allows operating on an ANNDATA object without fully loading it into memory; the functionality is offered via ANNDATA's backed mode, as opposed to its memory mode. To simplify memory-efficient pipelines, SCANPY's functions operate in place by default but optionally allow the non-destructive transformation of objects. Pipelines written this way can then also be run in backed mode to exploit online-learning formulations of algorithms. Almost all of SCANPY's tools are parallelized.
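As a concrete illustration of this design, the following minimal sketch constructs an ANNDATA object with observation, variable, and unstructured annotations, stores a derived result in it, contrasts in-place with copying behavior, and reopens the written HDF5 file in backed mode. It assumes recent releases of the anndata and scanpy packages; the simulated matrix, the annotation keys, and the file name pbmc_example.h5ad are hypothetical.

```python
import numpy as np
import pandas as pd
import anndata as ad
import scanpy as sc
from scipy.sparse import csr_matrix

# Sparse data matrix: observations (cells) in rows, variables (genes) in columns.
X = csr_matrix(np.random.poisson(1.0, size=(100, 2000)).astype(np.float32))

adata = ad.AnnData(
    X,
    obs=pd.DataFrame(index=[f"cell_{i}" for i in range(100)]),   # annotations of observations
    var=pd.DataFrame(index=[f"gene_{j}" for j in range(2000)]),  # annotations of variables
)
adata.uns["experiment"] = "simulated"  # unstructured annotation (hypothetical key)

# A derived result is added as a further annotation; the structure of AnnData is unaffected.
adata.obs["n_counts"] = np.asarray(X.sum(axis=1)).ravel()

# Functions operate in place by default; copy=True requests a non-destructive transformation.
adata_norm = sc.pp.normalize_total(adata, target_sum=1e4, copy=True)  # adata stays untouched
sc.pp.log1p(adata_norm)                                               # modifies adata_norm in place

# HDF5-based backing: write to disk, then reopen in backed mode without loading X into memory.
adata_norm.write("pbmc_example.h5ad")
adata_backed = ad.read_h5ad("pbmc_example.h5ad", backed="r")
```

Opened with backed="r", the data matrix stays on disk and downstream steps load only the slices they access, which is what allows pipelines to scale beyond available memory.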
SCANPY introduces a class for representing a graph of neighborhood relations among data points. The computation of neighborhood relations is much faster than in the popular reference package. This is achieved by aggregating rows (observations) of the data matrix into submatrices and computing distances for each submatrix using fast parallelized matrix multiplication. Moreover, the class provides several functions for computing random-walk-based metrics that are not available in other graph software. Typically, SCANPY's tools reuse a single, once-computed graph representation of the data and hence avoid the use of different, potentially inconsistent, and computationally expensive representations, as illustrated in the sketch at the end of this section.

SCANPY's scalability directly addresses the strongly increasing need for aggregating larger and larger data sets across different experimental setups, for example within challenges such as the Human Cell Atlas. Moreover, being implemented in a highly modular fashion, SCANPY can easily be developed further and maintained by a community. The transfer of results obtained with different tools used within the community is simple, as SCANPY's data storage formats and objects are language independent and cross-platform. SCANPY integrates well into the existing Python ecosystem, in which no comparable toolkit yet exists. During the revision of this article, the loom file format ( ) was proposed for HDF5-based storage of annotated data.
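Returning to the neighborhood-graph class described above, the sketch below illustrates the reuse of a single graph representation: continuing from an AnnData object adata holding preprocessed expression data, the graph is computed once with sc.pp.neighbors and then shared by embedding, clustering, graph abstraction, and random-walk-based pseudotime. All parameter values are illustrative only, and sc.tl.louvain assumes the optional python-louvain dependency is installed.

```python
import scanpy as sc

# Compute the k-nearest-neighbor graph once; it is stored in the AnnData object
# and reused by all of the tools below instead of rebuilding a representation.
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=15, n_pcs=50)

sc.tl.umap(adata)                     # embedding on the shared graph
sc.tl.louvain(adata)                  # graph-based clustering
sc.tl.paga(adata, groups="louvain")   # coarse-grained abstraction of the graph

# Random-walk-based analysis: diffusion map and diffusion pseudotime from a chosen root cell.
sc.tl.diffmap(adata)
adata.uns["iroot"] = 0                # illustrative choice of root cell
sc.tl.dpt(adata)
```

Because each tool reads the stored graph rather than recomputing its own representation, the clustering, embedding, and pseudotime results remain mutually consistent and the expensive neighbor search is paid for only once.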
