
Python quick start

John Kerl edited this page May 24, 2024 · 6 revisions

Installation

TileDB-SOMA is available on PyPI and Conda, and can be installed via pip or mamba as indicated below. Full installation instructions can be found here.

python -m pip install tiledbsoma
mamba install -c conda-forge -c tiledb tiledbsoma-py
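
After installing, it can be useful to confirm that the package is visible to the Python interpreter you intend to use. The following is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and it assumes the distribution is registered under the name `tiledbsoma`, as in the pip command above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Assumption: the distribution name matches the pip package name used above.
version = installed_version("tiledbsoma")
print(version if version is not None else "tiledbsoma is not installed")
```

If this reports that the package is missing, the install most likely went into a different environment than the one running the check.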

If you see illegal-instruction errors on older architectures (e.g. Opteron or other non-AVX2 processors), the cause is that the pre-compiled binaries on Conda and PyPI are not built for every processor variant. In that case you can use

git clone https://github.com/single-cell-data/TileDB-SOMA.git
pip install -v -e TileDB-SOMA/apis/python

to build the package locally. This requires cmake on your system.
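
To judge whether a machine is likely to be affected, you can check whether its CPU advertises AVX2. The sketch below is one way to do that on Linux, where `/proc/cpuinfo` lists feature flags on x86 processors; `cpu_flags` is a hypothetical helper, and the approach does not apply to macOS or Windows.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set:
    """Collect the CPU feature flags listed in /proc/cpuinfo-style text."""
    flags = set()
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            flags.update(value.split())
    return flags


cpuinfo = Path("/proc/cpuinfo")  # Linux-only; absent on macOS and Windows
if cpuinfo.exists():
    print("AVX2 supported:", "avx2" in cpu_flags(cpuinfo.read_text()))
else:
    print("No /proc/cpuinfo here; check CPU features another way.")
```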

Python package documentation

https://tiledbsoma.readthedocs.io/en/latest/python-api.html

Usage examples

Building a SOMA object

SOMA objects can be created with their respective create() methods and then need to be populated in specific ways depending on their types.

However, a SOMAExperiment can be easily created from an AnnData object or a *.h5ad file. Here, one is created from a *.h5ad file.

import tiledbsoma.io

# Create and write a SOMA Experiment, source data https://github.com/chanzuckerberg/cellxgene/raw/main/example-dataset/pbmc3k.h5ad
pbmc3k_uri = tiledbsoma.io.from_h5ad("./pbmc3k", input_path = "pbmc3k.h5ad", measurement_name = "RNA")

Reading and querying SOMA objects

SOMA objects can be opened using tiledbsoma.open().

The contents of DataFrame, SparseNDArray and DenseNDArray can be accessed with their respective read() methods. For DataFrame and SparseNDArray the method returns an iterator useful for larger-than-memory operations.

For example, you can open the SOMAExperiment created above and then read the contents of obs, which is a SOMADataFrame.

In addition, this example shows how you can query for observations with louvain values of 'Megakaryocytes' or 'CD4 T cells', and n_genes greater than 500.

import tiledbsoma

with tiledbsoma.open(pbmc3k_uri) as pbmc3k_soma:
    pbmc3k_obs_slice = pbmc3k_soma.obs.read(
        value_filter="n_genes > 500 and louvain in ['Megakaryocytes', 'CD4 T cells']"
    )
    
    # Concatenate iterator to pyarrow.Table
    pbmc3k_obs_slice.concat()

The result is a pyarrow.Table containing a slice based on the specified filters.

pyarrow.Table
soma_joinid: int64
obs_id: large_string
n_genes: int64
percent_mito: float
n_counts: float
louvain: large_string
----
soma_joinid: [[0,2,8,11,12,...,2617,2621,2626,2631,2637]]
obs_id: [["AAACATACAACCAC-1","AAACATTGATCAGC-1","AAACGCTGTAGCCA-1","AAACTTGATCCAGA-1","AAAGAGACGAGATA-1",...,"TTGTAGCTAGCTCA-1","TTTAGCTGATACCG-1","TTTCACGAGGTTCA-1","TTTCCAGAGGTGAG-1","TTTGCATGCCTCAC-1"]]
n_genes: [[781,1131,533,751,866,...,933,887,721,873,724]]
percent_mito: [[0.030177759,0.008897362,0.011764706,0.010887772,0.010788382,...,0.02224871,0.022875817,0.013261297,0.0068587107,0.008064516]]
n_counts: [[2419,3147,1275,2388,2410,...,2517,2754,2036,2187,1984]]
louvain: [["CD4 T cells","CD4 T cells","CD4 T cells","CD4 T cells","CD4 T cells",...,"CD4 T cells","CD4 T cells","CD4 T cells","CD4 T cells","CD4 T cells"]]
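
To make the filter semantics concrete, here is a plain-Python sketch of the same predicate applied to a handful of rows. It does not use the tiledbsoma API, and the rows are hypothetical values invented for illustration, not data from pbmc3k. A row is kept only when both conditions hold: `n_genes` exceeds 500 and `louvain` is one of the listed labels.

```python
# Hypothetical rows for illustration only; they are not taken from pbmc3k.
rows = [
    {"obs_id": "cell-a", "n_genes": 781, "louvain": "CD4 T cells"},
    {"obs_id": "cell-b", "n_genes": 420, "louvain": "CD4 T cells"},
    {"obs_id": "cell-c", "n_genes": 900, "louvain": "B cells"},
    {"obs_id": "cell-d", "n_genes": 650, "louvain": "Megakaryocytes"},
]

wanted = {"Megakaryocytes", "CD4 T cells"}


def matches(row: dict) -> bool:
    """Plain-Python equivalent of the value_filter expression shown above."""
    return row["n_genes"] > 500 and row["louvain"] in wanted


selected = [row["obs_id"] for row in rows if matches(row)]
print(selected)  # ['cell-a', 'cell-d']
```

Here cell-b is dropped for having too few genes and cell-c for its cluster label, which mirrors how the query above narrows obs.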

Iterators for larger-than-memory operations

As stated above, the read() methods of DataFrame and SparseNDArray return an iterator. The batch size can be controlled with the soma.init_buffer_bytes config option; for this example it is set to 100 bytes:

context = tiledbsoma.options.SOMATileDBContext()
context = context.replace(tiledb_config = {"soma.init_buffer_bytes": 100})

with tiledbsoma.open(pbmc3k_uri, context = context) as pbmc3k_soma:
    
    pbmc3k_obs = pbmc3k_soma.obs.read()
  
    counter = 1
    for pbmc3k_obs_chunk in pbmc3k_obs:
        
        # Perform operations
        # pbmc3k_obs_chunk is a pyarrow.Table
        
        counter += 1

print(counter)

The counter starts at 1 and is incremented once per chunk, so the printed value is one more than the number of chunks read:

441
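
The chunked-read pattern itself can be sketched without tiledbsoma. In the example below, `read_in_chunks` is a hypothetical stand-in for a read() iterator and the input is made-up data; it only illustrates how to accumulate results across chunks. Starting the counter at 0 makes its final value equal to the number of chunks.

```python
from typing import Iterator, List


def read_in_chunks(values: List[int], chunk_size: int) -> Iterator[List[int]]:
    """Yield successive slices of at most chunk_size items (stand-in for read())."""
    for start in range(0, len(values), chunk_size):
        yield values[start:start + chunk_size]


values = list(range(10))  # hypothetical stand-in for the rows of obs

n_chunks = 0
n_rows = 0
for chunk in read_in_chunks(values, chunk_size=4):
    # Only one chunk is held in memory at a time.
    n_chunks += 1
    n_rows += len(chunk)

print(n_chunks, n_rows)  # 3 10
```

Keeping only running totals, rather than the chunks themselves, is what lets this style of loop handle data that does not fit in memory.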