Queryability of dataframe attribute columns #99

Closed
johnkerl opened this issue May 19, 2022 · 1 comment

johnkerl commented May 19, 2022

Overview

  • Users will want Unicode values in their obs/var columns
  • Unicode values are not currently queryable in the TileDB query-condition logic
    • Error message: tiledb.cc.TileDBError: [TileDB::QueryCondition] Error: Clause non-empty attribute may only be var-sized for ASCII strings: cell_type
  • Workaround for the near term:
    • Force string-valued dataframe columns to be stored with an ASCII (bytes) type
    • At write time, this works: "α,β,γ" is stored as its UTF-8 bytes, "\xce\xb1,\xce\xb2,\xce\xb3"
    • At read time, since SOMA is an API, we UTF-8-decode those strings when a query is done and give the user back "α,β,γ"
  • Parent context: Reversible typecasts (parent task) #106.
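
The write/read round trip described in the workaround above can be sketched in plain Python, with no TileDB involved. This is only an illustration, assuming UTF-8 as the encoding (which matches the byte values quoted above); the variable names are made up for the example.

```python
# Minimal sketch of the encode-on-write / decode-on-read round trip.
# No TileDB involved; variable names are illustrative only.
original = "α,β,γ"

# Write side: UTF-8-encode so the value fits in a bytes-typed column.
stored = original.encode("utf-8")

# Read side: decode before handing the value back to the user.
returned = stored.decode("utf-8")

print(stored)
print(returned)
```

The point of the sketch is that the encoding step is lossless, so the user never needs to see the bytes representation.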

Details

Core issue

As of summer 2022:

  • String dims can only be ASCII -- core does not support Unicode storage for them
  • String attrs may be Unicode -- but are not queryable as detailed below
  • Therefore we:
    • Convert Unicode to ASCII on writes
    • Convert ASCII to Unicode on reads
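
The write-side half of this can be sketched as a small standalone function. It follows the same idea as the column-type loop in annotation_dataframe.py, but it is a hedged illustration rather than the project's code: the function name `bytes_column_types` and the sample dataframe are assumptions made for this example, and how the resulting mapping is handed to TileDB is not shown.

```python
import numpy as np
import pandas as pd


def bytes_column_types(dataframe: pd.DataFrame) -> dict:
    """Map each string-valued column to NumPy's bytes dtype ("S").

    Hypothetical helper for illustration; not part of tiledbsc.
    """
    column_types = {}
    for column_name in dataframe.keys():
        column = dataframe[column_name]
        # Look at the first value by position, since the index may hold strings.
        if len(column) > 0 and isinstance(column.iloc[0], str):
            column_types[column_name] = np.dtype("S")
    return column_types


# Made-up sample data: one string column, one numeric column.
obs = pd.DataFrame(
    {"cell_type": ["lung neuroendocrine cell", "α cell"], "n_genes": [1200, 950]},
    index=["cell_1", "cell_2"],
)
print(bytes_column_types(obs))
```

Only the string column is selected; numeric columns keep whatever type would be inferred for them.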

Comparison of writes

$ tools/ingestor ~/s/a/Krasnow.h5ad ./Krasnow-good

# edit annotation_dataframe.py
$ git diff
diff --git a/apis/python/src/tiledbsc/annotation_dataframe.py b/apis/python/src/tiledbsc/annotation_dataframe.py
index 97941be8..627861fc 100644
--- a/apis/python/src/tiledbsc/annotation_dataframe.py
+++ b/apis/python/src/tiledbsc/annotation_dataframe.py
@@ -295,11 +295,11 @@ class AnnotationDataFrame(TileDBArray):
         #
         # TODO: when UTF-8 attributes are queryable using TileDB-Py's QueryCondition API we can remove this.
         column_types = {}
-        for column_name in dataframe.keys():
-            dfc = dataframe[column_name]
-            if len(dfc) > 0 and type(dfc[0]) == str:
-                # Force ASCII storage if string, in order to make obs/var columns queryable.
-                column_types[column_name] = np.dtype("S")
+#        for column_name in dataframe.keys():
+#            dfc = dataframe[column_name]
+#            if len(dfc) > 0 and type(dfc[0]) == str:
+#                # Force ASCII storage if string, in order to make obs/var columns queryable.
+#                column_types[column_name] = np.dtype("S")

         tiledb.from_pandas(
             uri=self.uri,

$ tools/ingestor ~/s/a/Krasnow.h5ad ./Krasnow-bad
  • Look at array schemas:
>>> with tiledb.open('Krasnow-bad/obs') as B:
...     print(B.schema)
...
ArraySchema(
  domain=Domain(*[
    Dim(name='obs_id', domain=(None, None), tile=None, dtype='|S0', var=True, filters=FilterList([ZstdFilter(level=-1), ])),
  ]),
  attrs=[
    ...
    Attr(name='cell_type', dtype='<U0', var=True, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
    ...
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=100000,
  sparse=True,
  allows_duplicates=False,
)
>>> with tiledb.open('Krasnow-good/obs') as G:
...     print(G.schema)
...
ArraySchema(
  domain=Domain(*[
    Dim(name='obs_id', domain=(None, None), tile=None, dtype='|S0', var=True, filters=FilterList([ZstdFilter(level=-1), ])),
  ]),
  attrs=[
    ...
    Attr(name='cell_type', dtype='|S0', var=True, nullable=False, filters=FilterList([ZstdFilter(level=-1), ])),
    ...
  ],
  cell_order='row-major',
  tile_order='row-major',
  capacity=100000,
  sparse=True,
  allows_duplicates=False,
)
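
The only difference between the two schemas is the dtype of cell_type: '<U0' in Krasnow-bad versus '|S0' in Krasnow-good. As a rough sketch using NumPy alone, the two dtype strings can be told apart by their kind. The helper name `is_queryable_string_dtype` is hypothetical, and the rule it encodes (bytes kind is queryable, Unicode kind is not) reflects only the situation described in this issue, not a general statement about TileDB.

```python
import numpy as np


def is_queryable_string_dtype(dtype_str: str) -> bool:
    """Return True for bytes-kind dtypes ("S"), False for anything else.

    Hypothetical helper; the rule mirrors the behavior reported in this issue.
    """
    return np.dtype(dtype_str).kind == "S"


# The dtype strings are the ones printed in the two schemas above.
for name, dtype_str in [("Krasnow-bad", "<U0"), ("Krasnow-good", "|S0")]:
    print(name, dtype_str, is_queryable_string_dtype(dtype_str))
```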

Comparison of reads

If we try to query the Krasnow-bad SOMA, the one where we omitted the crucial Unicode-to-ASCII conversion on write, then the query fails entirely due to the aforementioned core issue.

Query script (copied out of https://github.com/single-cell-data/TileDB-SingleCell/blob/0.1.7/apis/python/src/tiledbsc/annotation_dataframe.py#L194-L221 for brevity in presentation here):

#!/usr/bin/env python

import tiledb
import pandas as pd
import sys

# ----------------------------------------------------------------
def ascii_to_unicode_dataframe_readback(df: pd.DataFrame) -> pd.DataFrame:
    """
    Implements the 'decode on read' part of our logic.
    """
    for k in df:
        dfk = df[k]
        if len(dfk) > 0 and type(dfk[0]) == bytes:
            df[k] = dfk.map(lambda e: e.decode())
    return df

# ----------------------------------------------------------------
obs_uri = "/Users/johnkerl/s/t/Krasnow/obs"
if len(sys.argv) == 2:
    obs_uri = sys.argv[1]

cfg = tiledb.Config()
cfg["py.init_buffer_bytes"] = 4 * 1024**3
ctx = tiledb.Ctx(cfg)

with tiledb.open(obs_uri, ctx=ctx) as O:
    qc = tiledb.QueryCondition('cell_type == "lung neuroendocrine cell"')
    slice_query = O.query(attr_cond=qc, attrs=['cell_type'])
    df = slice_query.df[:]
    print("Without ASCII-to-Unicode on read:")
    print(df)
    print()
    print("With ASCII-to-Unicode on read:")
    print(ascii_to_unicode_dataframe_readback(df))

Script output when run on Krasnow-bad, which lacks the crucial Unicode-to-ASCII conversion on write:

$ repro.py ./Krasnow-bad/obs
Traceback (most recent call last):
  File "/Users/johnkerl/git/single-cell-data/TileDB-SingleCell/apis/python/./repro.py", line 30, in <module>
    df = slice_query.df[:]
  File "/Users/johnkerl/git/TileDB-Inc/TileDB-Py/tiledb/multirange_indexing.py", line 210, in __getitem__
    return self if self.return_incomplete else self._run_query()
  File "/Users/johnkerl/git/TileDB-Inc/TileDB-Py/tiledb/multirange_indexing.py", line 341, in _run_query
    self.pyquery.submit()
tiledb.cc.TileDBError: [TileDB::QueryCondition] Error: Value node non-empty attribute may only be var-sized for ASCII strings: cell_type

Next, query the Krasnow-good SOMA, which has the crucial Unicode-to-ASCII conversion on write, first without and then with the ASCII-to-Unicode conversion on read:

$ repro.py ./Krasnow-good/obs
Without ASCII-to-Unicode on read:
                                         cell_type
obs_id
P3_2_TTCGGTCTCCCTTGCA  b'lung neuroendocrine cell'
P3_3_AACCGCGTCGCGATCG  b'lung neuroendocrine cell'
P3_3_ACGGCCAGTCGATTGT  b'lung neuroendocrine cell'
P3_3_CCATGTCGTCTAAAGA  b'lung neuroendocrine cell'
P3_3_CTCGGAGAGTTAGCGG  b'lung neuroendocrine cell'
P3_3_CTTGGCTTCGCCCTTA  b'lung neuroendocrine cell'
P3_3_GACAGAGCATAACCTG  b'lung neuroendocrine cell'
P3_3_GCTCTGTTCTATCGCC  b'lung neuroendocrine cell'
P3_3_TGCGGGTTCCTGTAGA  b'lung neuroendocrine cell'
P3_3_TTAGGACCAATGGACG  b'lung neuroendocrine cell'
P3_6_GCCTCTATCCGCATCT  b'lung neuroendocrine cell'

With ASCII-to-Unicode on read:
                                      cell_type
obs_id
P3_2_TTCGGTCTCCCTTGCA  lung neuroendocrine cell
P3_3_AACCGCGTCGCGATCG  lung neuroendocrine cell
P3_3_ACGGCCAGTCGATTGT  lung neuroendocrine cell
P3_3_CCATGTCGTCTAAAGA  lung neuroendocrine cell
P3_3_CTCGGAGAGTTAGCGG  lung neuroendocrine cell
P3_3_CTTGGCTTCGCCCTTA  lung neuroendocrine cell
P3_3_GACAGAGCATAACCTG  lung neuroendocrine cell
P3_3_GCTCTGTTCTATCGCC  lung neuroendocrine cell
P3_3_TGCGGGTTCCTGTAGA  lung neuroendocrine cell
P3_3_TTAGGACCAATGGACG  lung neuroendocrine cell
P3_6_GCCTCTATCCGCATCT  lung neuroendocrine cell

Here we see that if we omit the ASCII-to-Unicode conversion on read, we hand the user back a dataframe whose columns are bytes, not strings, which breaks the UX for them. (In particular, users whose column data contains non-ASCII characters such as umlauts or accents are going to have a negative experience.)
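
To make the non-ASCII case concrete, here is a small sketch of decode-on-read applied to values containing an umlaut and an accent. It is a variant written for illustration, not the project's implementation: the function name `decode_bytes_columns`, the column name, and the sample values are all assumptions, and unlike the readback function in the script above it returns a copy instead of modifying its argument.

```python
import pandas as pd


def decode_bytes_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with bytes-valued columns UTF-8-decoded to str.

    Hypothetical helper for illustration; not part of tiledbsc.
    """
    out = df.copy()
    for name in out.columns:
        column = out[name]
        # Check the first value by position, since the index may hold strings.
        if len(column) > 0 and isinstance(column.iloc[0], bytes):
            out[name] = column.map(lambda value: value.decode("utf-8"))
    return out


# Made-up values as they might come back from a bytes-typed column.
raw = pd.DataFrame(
    {"label": [b"M\xc3\xbcnchen", b"caf\xc3\xa9"]},
    index=["cell_1", "cell_2"],
)
decoded = decode_bytes_columns(raw)
print(raw)
print(decoded)
```

Without the decoding step the user would see the escaped byte sequences; with it they get back readable strings.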

johnkerl commented Jul 5, 2022

Nothing more to do here.

When we have true Unicode values on disk, that will be an internal change, and existing unit-test cases will ensure we don't have a regression.
