Commit
* docs: readme rewrite
* docs: fix emails
* docs: editorial fixes
* Editorial changes, feedback has been incorporated
* Add sentence
* Address PR requests
* Fix typo
* Fix typo
* Remove duplicated word
* Remove info
* Fix link
Showing 1 changed file with 38 additions and 14 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,27 +1,51 @@ | ||
## Vision | ||
# SOMA | ||
|
||
Datasets generated by profiling single cells are rapidly increasing in size and complexity. This has resulted in a need for scalable solutions to accommodate data sizes that no longer fit in memory and flexibility to accommodate the diversity of data being produced. To address these emerging needs in the single cell ecosystem, CZI, in partnership with the Feature and Observation Matrix (FOM) Schema Working Group and TileDB, is launching three projects. | ||
SOMA – for “Stack Of Matrices, Annotated” – is a flexible, extensible, and open-source API enabling access to data in a variety of formats. SOMA is designed to be general-purpose for data that can be modeled as one or more sets of 2D annotated matrices with measurements of features across observations. | ||
The driving use case of SOMA is for single-cell data in the form of annotated matrices where observations are frequently cells and features are genes, proteins, or genomic regions. | ||
|
||
**1. SOMA.** | ||
|
||
SOMA, “stack of matrices, annotated,” is a flexible, extensible, and open-source API enabling access to data in a variety of formats. The vision for this API is that it will enable single cell datasets, including those with multiple modalities, to be stored in a cloud-friendly format and will be easily queryable, sliceable, and streamable without downloading or copying the full data. SOMA is designed to be general purpose and is grounded in the core assumption that the data can be modeled as a set of 2D annotated matrices that describe measurements of features across observations. | ||
|
||
The first implementation of the SOMA API is currently being built in partnership with [TileDB](https://tiledb.com) on top of their open-source (under the MIT License) [TileDB Embedded](https://tiledb.com/products/tiledb-embedded) storage engine. Both the Python and R APIs will be delivered incrementally in the upcoming months. | ||
## Motivation | ||
|
||
**2. Pilot project to offer CZ cellxgene’s Data Resource (30M+ cells and growing) via the SOMA API.** | ||
Datasets generated by profiling single cells are rapidly increasing in size and complexity. This has resulted in a need for scalable solutions to accommodate data sizes that no longer fit in memory and flexibility to accommodate the diversity of data being produced. | ||
|
||
CZI is working on a pilot project to offer the entirety of cellxgene’s standardized single cell data resource, containing over 30 million cells, as a set of SOMA-backed objects. This resource, paired with the Python and R SOMA APIs bindings, will enable scientists to query, slice, and stream a subset of the data for analysis in downstream single cell toolkits. | ||
To address these emerging needs in the single-cell ecosystem, the Chan Zuckerberg Initiative, in partnership with TileDB, is: | ||
|
||
Over the next few months, as the SOMA data model and API definition work mature, this resource will be offered to the public with accompanying notebooks demonstrating how to make use of the resource. | ||
|
||
**3. Supplementary domain specific libraries and schemas.** | ||
1. Driving the development of SOMA. | ||
2. Providing its first implementation, [TileDB-SOMA](https://github.com/single-cell-data/TileDB-SOMA), which uses the [TileDB Embedded](https://github.com/TileDB-Inc/TileDB) engine. | ||
3. Adopting TileDB-SOMA at [CZ CELLxGENE Discover](https://cellxgene.cziscience.com/) to build the [Cell Census](https://github.com/chanzuckerberg/cell-census/), which provides efficient access to, and querying of, a corpus of nearly 50 million cells compiled from 700+ datasets. | ||
|
||
While SOMA is intended to be general purpose, two efforts are underway to build on top of the core SOMA format to make its use more single cell domain adaptable. | ||
The `SOMA` specification and its `TileDB-SOMA` implementation provide the following capabilities for single-cell data: | ||
|
||
The first effort is a set of schemas defining how to capture both multimodal and unimodal single cell data, which will be used as the basis for cellxgene data available via the SOMA API. From the outset, SOMA’s APIs will be designed to interface with multimodal data; the API will enable querying and slicing multimodal datasets along any schema-defined axis. | ||
1. An abstract specification with flexibility for data from multiple modalities (e.g. RNA, spatial, epigenomics). | ||
1. A format to store and access datasets larger than memory, in contrast to the current paradigm of `.h5ad`/`.mtx`/`.tgz`/`.RData`/`.h5Seurat`, etc. | ||
1. Query-ready data management for reading and writing at low latency and cloud scale, which eliminates in-memory limitations. | ||
1. R and Python APIs, with the flexibility to expand to other languages. | ||
|
||
The second effort is a supplemental library called SOMA.io that will enable users to convert SOMA-backed objects to and from the two most popular domain-specific formats: anndata and Seurat. | ||
|
||
Both these efforts will be released in the upcoming months in parallel to the two projects above. | ||
## Developer information | ||
|
||
* [SOMA abstract specification](https://github.com/single-cell-data/SOMA/blob/main/abstract_specification.md) — language-agnostic SOMA API specification. | ||
* [Python SOMA specification](https://github.com/single-cell-data/SOMA/tree/main/python-spec) — persistence-layer–agnostic Python definition of SOMA core types. | ||
* [TileDB-SOMA](https://github.com/single-cell-data/TileDB-SOMA) — Python and R implementation of the SOMA specification using [TileDB Embedded](https://github.com/TileDB-Inc/TileDB). R coming soon. | ||
|
||
## Coming soon! | ||
|
||
* R SOMA specification and its implementation through TileDB-SOMA. | ||
* End-user documentation for both Python and R TileDB-SOMA APIs, including a getting-started guide, notebooks, and API reference. | ||
|
||
|
||
|
||
## Issues and contacts | ||
|
||
* We expect the TileDB-SOMA repository to be the front door for reporting and tracking implementation issues, at [https://github.com/single-cell-data/TileDB-SOMA/issues](https://github.com/single-cell-data/TileDB-SOMA/issues). For spec-related issues, please submit an issue at [https://github.com/single-cell-data/SOMA/issues](https://github.com/single-cell-data/SOMA/issues). | ||
* If you believe you have found a security issue, please disclose it responsibly by contacting [[email protected]](mailto:[email protected]) instead of filing an issue. | ||
* Feedback is appreciated, as this is a community-driven project. If you have well-scoped feature requests or discussion topics, please add them to [https://github.com/single-cell-data/SOMA/issues](https://github.com/single-cell-data/SOMA/issues). For any other inquiries, please reach out to [[email protected]](mailto:[email protected]). | ||
* If you would like to learn more about SOMA or would like to keep up to date with the latest developments, please join our mailing list [here](https://bit.ly/soma-signup). | ||
|
||
|
||
## Code of Conduct | ||
|
||
This project adheres to CZI's Contributor Covenant [code of conduct](https://github.com/chanzuckerberg/.github/blob/master/CODE_OF_CONDUCT.md). By participating, you are expected to uphold this code. Please report unacceptable behavior to <[email protected]>. | ||
|
||
CZI is actively soliciting input from the community to help us refine and extend our roadmap. Please reach out to us at [[email protected]](mailto:[email protected]) with your ideas! If you would like to learn more about SOMA or would like to keep up to date with the latest developments, please join our mailing list [here](https://bit.ly/soma-signup). |