Queryability of dataframe attribute columns #99

johnkerl · 2022-05-19T15:34:46Z

Overview

Users are going to want to have unicode values in their obs/var columns
Unicode values are not currently queryable in the TileDB query-condition logic
- Error message: tiledb.cc.TileDBError: [TileDB::QueryCondition] Error: Clause non-empty attribute may only be var-sized for ASCII strings: cell_type
Workaround for the near term:
- Force dataframe columns to be stored as ASCII
- At write time, this works — "α,β,γ" stores as "\xce\xb1,\xce\xb2,\xce\xb3"
- At read time: since SOMA is an API, we can UTF-8-decode those strings when a query is done and give the user back "α,β,γ"
Parent context: Reversible typecasts (parent task) #106.
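
As a minimal sketch of the round trip described above, using only the Python standard library: the variable names are illustrative and not taken from tiledbsc. It shows that the stored bytes are the UTF-8 encoding of the text, and that decoding recovers the original string.

```python
# Illustrative only: plain Python, no TileDB involved.
original = "α,β,γ"

# Write side: UTF-8-encode so the value can be stored in a bytes ("ASCII") column.
stored = original.encode("utf-8")

# Read side: decode the bytes back into a Python str for the user.
restored = stored.decode("utf-8")

# Each Greek letter takes two bytes in UTF-8, so the stored form is longer
# than the character count of the original string.
char_count = len(original)
byte_count = len(stored)

# Pure-ASCII text is unchanged by the encoding apart from its type.
ascii_stored = "lung neuroendocrine cell".encode("utf-8")
```

The encoding is lossless, which is what makes the workaround reversible.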

Details

Core issue

As of summer 2022:

String dims can only be ASCII -- core does not support Unicode storage for them
String attrs may be Unicode -- but are not queryable as detailed below
Therefore we:
- Convert Unicode to ASCII on writes (UTF-8-encode the strings and store them as bytes)
- Convert ASCII to Unicode on reads (UTF-8-decode the stored bytes back to strings)
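
The same rule can be sketched at the column level. This is a simplified stand-in, not the tiledbsc code: a dict of lists plays the role of a dataframe, and the helper names `encode_string_columns` and `decode_bytes_columns` are hypothetical. As in the library code referenced below, a column is classified by the type of its first element.

```python
from typing import Dict, List

# Hypothetical stand-in for a dataframe: column name -> list of cell values.
Columns = Dict[str, List[object]]


def encode_string_columns(columns: Columns) -> Columns:
    """Write side: UTF-8-encode str columns; leave other columns alone."""
    out: Columns = {}
    for name, values in columns.items():
        if len(values) > 0 and isinstance(values[0], str):
            out[name] = [v.encode("utf-8") for v in values]
        else:
            out[name] = list(values)
    return out


def decode_bytes_columns(columns: Columns) -> Columns:
    """Read side: UTF-8-decode bytes columns; leave other columns alone."""
    out: Columns = {}
    for name, values in columns.items():
        if len(values) > 0 and isinstance(values[0], bytes):
            out[name] = [v.decode("utf-8") for v in values]
        else:
            out[name] = list(values)
    return out


# Made-up example data, not taken from the Krasnow dataset.
obs = {"cell_type": ["α cell", "β cell"], "n_counts": [10, 20]}
stored = encode_string_columns(obs)
restored = decode_bytes_columns(stored)
```

Numeric columns pass through untouched, and the decoded result equals the input, so callers never need to know that bytes were stored.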

Comparison of writes

Referencing tiledbsc-py 0.1.7 just to make this a stable permalink:
- https://github.com/single-cell-data/TileDB-SingleCell/blob/0.1.7/apis/python/src/tiledbsc/annotation_dataframe.py#L271-L302
- The comments there also explain some of what's explained here.
Write two copies of a SOMA: one with the Unicode-to-ASCII conversion on write left in place (Krasnow-good), and one with that conversion commented out (Krasnow-bad).

$ tools/ingestor ~/s/a/Krasnow.h5ad ./Krasnow-good

# edit annotation_dataframe.py
$ git diff
diff --git a/apis/python/src/tiledbsc/annotation_dataframe.py b/apis/python/src/tiledbsc/annotation_dataframe.py
index 97941be8..627861fc 100644
--- a/apis/python/src/tiledbsc/annotation_dataframe.py
+++ b/apis/python/src/tiledbsc/annotation_dataframe.py
@@ -295,11 +295,11 @@ class AnnotationDataFrame(TileDBArray):
         #
         # TODO: when UTF-8 attributes are queryable using TileDB-Py's QueryCondition API we can remove this.
         column_types = {}
-        for column_name in dataframe.keys():
-            dfc = dataframe[column_name]
-            if len(dfc) > 0 and type(dfc[0]) == str:
-                # Force ASCII storage if string, in order to make obs/var columns queryable.
-                column_types[column_name] = np.dtype("S")
+#        for column_name in dataframe.keys():
+#            dfc = dataframe[column_name]
+#            if len(dfc) > 0 and type(dfc[0]) == str:
+#                # Force ASCII storage if string, in order to make obs/var columns queryable.
+#                column_types[column_name] = np.dtype("S")

         tiledb.from_pandas(
             uri=self.uri,

$ tools/ingestor ~/s/a/Krasnow.h5ad ./Krasnow-bad

Look at array schemas:

>>> with tiledb.open('Krasnow-bad/obs') as B:
...     print(B.schema)
...
ArraySchema(
  domain=Domain(*[
    Dim(name='obs_id', domain=(None, None), tile=None, dtype='|S0', var=True, filters=FilterList([ZstdFilter(level=-1), ])),
  ]),
  attrs=[
    ...
    Attr(name='cell_type', dtype='<U0', var=True, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
    ...
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=100000,
  sparse=True,
  allows_duplicates=False,
)

>>> with tiledb.open('Krasnow-good/obs') as G:
...     print(G.schema)
...
ArraySchema(
  domain=Domain(*[
    Dim(name='obs_id', domain=(None, None), tile=None, dtype='|S0', var=True, filters=FilterList([ZstdFilter(level=-1), ])),
  ]),
  attrs=[
    ...
    Attr(name='cell_type', dtype='|S0', var=True, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
    ...
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=100000,
  sparse=True,
  allows_duplicates=False,
)

Comparison of reads

If we try to query the Krasnow-bad SOMA -- the one where we omitted the crucial Unicode-to-ASCII on write -- then the query fails entirely due to the aforementioned core issue.

Query script (copied out of https://github.com/single-cell-data/TileDB-SingleCell/blob/0.1.7/apis/python/src/tiledbsc/annotation_dataframe.py#L194-L221 for brevity in presentation here):

#!/usr/bin/env python

import tiledb
import pandas as pd
import sys

# ----------------------------------------------------------------
def ascii_to_unicode_dataframe_readback(df: pd.DataFrame) -> pd.DataFrame:
    """
    Implements the 'decode on read' part of our logic.
    """
    for k in df:
        dfk = df[k]
        if len(dfk) > 0 and isinstance(dfk.iloc[0], bytes):
            df[k] = dfk.map(lambda e: e.decode())
    return df

# ----------------------------------------------------------------
obs_uri = "/Users/johnkerl/s/t/Krasnow/obs"
if len(sys.argv) == 2:
    obs_uri = sys.argv[1]

cfg = tiledb.Config()
cfg["py.init_buffer_bytes"] = 4 * 1024**3
ctx = tiledb.Ctx(cfg)

with tiledb.open(obs_uri, ctx=ctx) as O:
    qc = tiledb.QueryCondition('cell_type == "lung neuroendocrine cell"')
    slice_query = O.query(attr_cond=qc, attrs=['cell_type'])
    df = slice_query.df[:]
    print("Without ASCII-to-Unicode on read:")
    print(df)
    print()
    print("With ASCII-to-Unicode on read:")
    print(ascii_to_unicode_dataframe_readback(df))

Script output when run on Krasnow-bad which lacks the crucial Unicode-to-ASCII on write:

$ repro.py ./Krasnow-bad/obs
Traceback (most recent call last):
  File "/Users/johnkerl/git/single-cell-data/TileDB-SingleCell/apis/python/./repro.py", line 30, in <module>
    df = slice_query.df[:]
  File "/Users/johnkerl/git/TileDB-Inc/TileDB-Py/tiledb/multirange_indexing.py", line 210, in __getitem__
    return self if self.return_incomplete else self._run_query()
  File "/Users/johnkerl/git/TileDB-Inc/TileDB-Py/tiledb/multirange_indexing.py", line 341, in _run_query
    self.pyquery.submit()
tiledb.cc.TileDBError: [TileDB::QueryCondition] Error: Value node non-empty attribute may only be var-sized for ASCII strings: cell_type

Next, query the Krasnow-good SOMA, which has the crucial Unicode-to-ASCII on write, first without and then with the ASCII-to-Unicode conversion on read:

$ repro.py ./Krasnow-good/obs
Without ASCII-to-Unicode on read:
                                         cell_type
obs_id
P3_2_TTCGGTCTCCCTTGCA  b'lung neuroendocrine cell'
P3_3_AACCGCGTCGCGATCG  b'lung neuroendocrine cell'
P3_3_ACGGCCAGTCGATTGT  b'lung neuroendocrine cell'
P3_3_CCATGTCGTCTAAAGA  b'lung neuroendocrine cell'
P3_3_CTCGGAGAGTTAGCGG  b'lung neuroendocrine cell'
P3_3_CTTGGCTTCGCCCTTA  b'lung neuroendocrine cell'
P3_3_GACAGAGCATAACCTG  b'lung neuroendocrine cell'
P3_3_GCTCTGTTCTATCGCC  b'lung neuroendocrine cell'
P3_3_TGCGGGTTCCTGTAGA  b'lung neuroendocrine cell'
P3_3_TTAGGACCAATGGACG  b'lung neuroendocrine cell'
P3_6_GCCTCTATCCGCATCT  b'lung neuroendocrine cell'

With ASCII-to-Unicode on read:
                                      cell_type
obs_id
P3_2_TTCGGTCTCCCTTGCA  lung neuroendocrine cell
P3_3_AACCGCGTCGCGATCG  lung neuroendocrine cell
P3_3_ACGGCCAGTCGATTGT  lung neuroendocrine cell
P3_3_CCATGTCGTCTAAAGA  lung neuroendocrine cell
P3_3_CTCGGAGAGTTAGCGG  lung neuroendocrine cell
P3_3_CTTGGCTTCGCCCTTA  lung neuroendocrine cell
P3_3_GACAGAGCATAACCTG  lung neuroendocrine cell
P3_3_GCTCTGTTCTATCGCC  lung neuroendocrine cell
P3_3_TGCGGGTTCCTGTAGA  lung neuroendocrine cell
P3_3_TTAGGACCAATGGACG  lung neuroendocrine cell
P3_6_GCCTCTATCCGCATCT  lung neuroendocrine cell

Here we see that if we omit the ASCII-to-Unicode on read, we are handing the user back a dataframe which has columns as bytes, not strings, which breaks the UX for them. (In particular, any users of non-English languages with things like umlauts or accents in their column data are going to have a negative experience.)
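
A small sketch of why bytes columns hurt users, in plain Python with made-up values ("basal cell" and "Müller cell" are illustrative, not rows from this dataset): a `bytes` value never compares equal to a `str`, so ordinary string filters silently match nothing, and non-ASCII text shows up as escape sequences.

```python
# Values as they would come back without the decode-on-read step.
raw_cell_types = [b"lung neuroendocrine cell", b"basal cell"]
wanted = "lung neuroendocrine cell"

# In Python 3, bytes and str are never equal, so this filter finds nothing.
matches_raw = [v for v in raw_cell_types if v == wanted]

# After decoding, the same filter behaves as a user would expect.
decoded_cell_types = [v.decode("utf-8") for v in raw_cell_types]
matches_decoded = [v for v in decoded_cell_types if v == wanted]

# Non-ASCII text is unreadable in its bytes form until decoded.
raw_label = "Müller cell".encode("utf-8")
shown_without_decode = repr(raw_label)
shown_with_decode = raw_label.decode("utf-8")
```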


johnkerl · 2022-07-05T14:55:20Z

Nothing more to do here.

When we have true Unicode values on disk, that will be an internal change, and existing unit-test cases will ensure we don't have a regression.

johnkerl self-assigned this May 19, 2022

johnkerl added the active label May 19, 2022

This was referenced May 19, 2022

Store to-be-queried obs/var columns as ASCII (workaround) #101

Merged

Reversible typecasts (parent task) #106

Closed

This was referenced May 27, 2022

Add an ingestor option to only update obs/var, nothing else #132

Merged

Extend ASCII-queryability workaround #138

Merged

Correct ASCII-to-Unicode readback for attribute_filter #141

Merged

johnkerl closed this as completed Jul 5, 2022

johnkerl removed the active label Jul 5, 2022

johnkerl mentioned this issue Jul 23, 2022

SOMA API v1 experiments [WIP/RFC] #227

Closed

This was referenced Aug 25, 2022

[python] Support return_arrow for various queries #256

Merged

Improving Usability of ASCII Strings TileDB-Inc/TileDB-Py#1304

Draft

Restore ability to store and query non-ASCII dataframe attributes #274

Closed

This was referenced Sep 16, 2022

[python] Update ASCII storage for dataframes #273

Merged

TileDB-API support needs #302

Closed

johnkerl mentioned this issue Oct 3, 2022

[python] Use true ASCII attributes in dataframes #359

Merged

johnkerl mentioned this issue Oct 31, 2023

[r] Write string attrs as UTF-8 (Python compatibility) #1843

Merged
