SpatialData specs proposal #12
Hey @giovp. Thanks for this! Just to clarify, are these the "building block" classes/types that you are proposing? If so, this overall looks good to me. I have left a couple of minor comments below. As a follow-up, I would be interested in seeing what the building block API might look like (i.e., what properties and methods these objects will have).
Do we want to explicitly have different types of tables based on what they are annotating (e.g.,
May be of interest: example of storing polygons in an AnnData table https://forum.image.sc/t/roi-annotations-and-tracking-information-in-ome-zarr/65975
What about something like this for the API of the building blocks to start? Based on #11, it looks pretty similar to what @giovp and @LucaMarconato are thinking.

```python
import abc
from typing import Any, Dict, Optional, Tuple

from anndata import AnnData


class BaseBuildingBlock(abc.ABC):
    # store the actual data (e.g., array for image, coordinates for points, etc.)
    data: Any

    # the annotation table (if present)
    annotations: Optional[AnnData]

    # store the transform objects as a dictionary with keys
    # (source_coordinate_space, destination_coordinate_space)
    transforms: Dict[Tuple[str, str], "TransformObject"]

    def to_ome_ngff(self, path: str) -> None:
        """method to write to ome-ngff zarr file - could be called to_zarr?"""

    @abc.abstractmethod
    def transform(self, new_coordinate_space: str, inplace: bool = False):
        """convenience function to transform the object to a new coordinate space"""

    @classmethod
    @abc.abstractmethod
    def from_path(cls) -> "BaseBuildingBlock":
        """constructor for instantiating from paths (maybe split into local and remote?)"""
        raise NotImplementedError
```

Things to be added:
**Annotations with data?**

What do you all think about grouping the annotations and data together (e.g., label image + label table in the same building block object)? This makes a lot of things more convenient (e.g., querying the data based on annotation/feature values). However, people who just want to process the tables will then need to separate the tables from the building block object. To me, the convenience of grouping the annotations with the data seems worth it, as the user doesn't need to keep track of two objects and querying will be nicer. Thoughts?
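To make the trade-off concrete, here is a minimal sketch of what such a grouped block could look like; the `LabelsBlock` class, its field names, and the assumption that label IDs are the table's obs index are all hypothetical:

```python
from dataclasses import dataclass
from typing import Optional

import numpy as np
from anndata import AnnData


@dataclass
class LabelsBlock:
    # hypothetical block grouping a label image with its annotation table
    data: np.ndarray                 # label image; 0 = background
    annotations: Optional[AnnData]   # one obs row per label ID

    def query_obs(self, mask: np.ndarray) -> "LabelsBlock":
        """Subset the table, then keep only the matching labels in the image."""
        subset = self.annotations[mask].copy()
        keep = subset.obs.index.astype(int).to_numpy()
        data = np.where(np.isin(self.data, keep), self.data, 0)
        return LabelsBlock(data=data, annotations=subset)
```

With this, querying pixels by feature values is one call, e.g. `block.query_obs((block.annotations.obs["cluster"] == "celltypeZ").to_numpy())`, while table-only users have to reach for `block.annotations` first.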
interesting, didn't think about this. I think it'd potentially make sense, yes, although maybe it's also redundant? As in, if it's an annotation table, I feel it should have the same entries/attributes regardless of what it annotates.
Regarding https://forum.image.sc/t/roi-annotations-and-tracking-information-in-ome-zarr/65975: maybe I misunderstood, but by looking at the repo, they are not storing polygons in anndata, but polygon features and the centroid positions. The polygons are stored as collections of csv files (couldn't open them, I guess they contain the vertices). I think what we'd want is to store polygons directly in anndata, the same way that we store points. One way to do it is via awkward arrays or geopandas; the former might get into anndata faster, but I don't know what the support from the ngff side would be. I believe it's best to wait and see what they want to do with polygons; maybe worth pinging on zulip to get a general feeling of what direction they want to take.
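For illustration, a minimal geopandas sketch of what storing polygons directly looks like (vertices in a geometry column, one row per region); how this would map onto anndata/ngff is exactly the open question:

```python
import geopandas as gpd
from shapely.geometry import Polygon

# one row per region; the vertices live in the geometry column
polygons = gpd.GeoDataFrame(
    {"cell_id": ["cell_0", "cell_1"]},
    geometry=[
        Polygon([(0, 0), (0, 2), (2, 2), (2, 0)]),
        Polygon([(3, 3), (3, 5), (5, 4)]),
    ],
)

# centroids can be derived on the fly instead of being stored separately
centroids = polygons.geometry.centroid
```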
great point, I think it makes sense to start discussing this indeed and get an early prototype working. I agree it might make sense. However, I'd also think that we might want to have a single anndata table for the object by default, as it might be easier to design an api around for cropping/subsetting. I think it'd be nice to have something like this:

```python
sdata = SpatialData(
    images=[image1, image2, image3],
    labels=[labels1, labels2],
    points=[points3],
    tables=table,
)

sdata[sdata.obs.cluster == "celltypeZ"]
>>> SpatialData with:
>>> images: [image1, image3]
>>> labels: [labels1]
>>> points: [points3]
>>> table: n_obs X n_vars

sdata.images[["image1", "image2"]].crop(0, 2000, 0, 2000)
>>> SpatialData with:
>>> images: [image1, image2]
>>> labels: [labels1, labels2]
>>> points: []
>>> table: n_obs X n_vars
```

It would anyway be possible if we store views of the tables associated to labels. What do you think?
I would enable views of tables associated to labels in separate SpatialData objects, created on the fly and pointing to the same memory/file, but with the possibility to detach them and then merge the information back thanks to helper functions.
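AnnData's existing view semantics already cover most of this; a rough sketch, using `anndata.concat` as a stand-in for the merge helper:

```python
import anndata as ad
import numpy as np

table = ad.AnnData(np.ones((5, 2)))
table.obs["labels_id"] = ["labels1"] * 3 + ["labels2"] * 2

# a per-labels view pointing at the same underlying table
view1 = table[(table.obs["labels_id"] == "labels1").to_numpy()]
assert view1.is_view

# detach it, work on it independently, then merge the information back
detached = view1.copy()
merged = ad.concat(
    [detached, table[(table.obs["labels_id"] == "labels2").to_numpy()].copy()]
)
```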
Thanks for the feedback, @giovp. I'm not sure if I am understanding you correctly. Just to make sure I am on the same page, here is my current understanding of the classes:
Is that correct? If so, what does the "single table" represent? Would each building block have its own table? Similarly, what is
I agree that when slicing/sampling the
that is correct, I would describe the class the same way.
potentially yes, but I'm thinking more and more that we might want to have a single, concatenated table to expose the table api at the higher level of the hierarchy for SpatialData. Let's consider the case of 1 Visium and 1 merfish slide in the same spatial dataset, of the same biological specimen:

```python
sdata = SpatialData(
    images=[images_visium, images_merfish],
    labels=[labels_merfish],
    points=[points_visium],
    tables=[tables_visium, tables_merfish],
)
```

Now let's say that the analyst wants to subset the object by only selecting cells of cluster `celltypeZ`:

```python
sdata[sdata.obs.cluster == "celltypeZ"]
>>> SpatialData with:
>>> images = [images_visium, images_merfish],
>>> labels = [labels_merfish],
>>> points = [points_visium],
>>> table: n_obs X n_vars  # with obs only of `celltypeZ`
```

So in that case, e.g. if the filter is on a cluster that is present only in the merfish data:

```python
sdata[sdata.obs.cluster == "celltypeY"]
>>> SpatialData with:
>>> images = [images_merfish],
>>> labels = [labels_merfish],
>>> points = [],
>>> table: n_obs X n_vars  # with obs only of `celltypeY`
```

I understand this can be a stretch, and that we can come up with many counter-examples where this would not be desirable. However, I feel one strength of SpatialData could really be to have an AnnData-like API with additional methods to operate on the images. I'd love to hear @ivirshup's thoughts on this. Anyway, just thoughts for now; maybe too early to discuss this...
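A sketch of how a single concatenated table could drive that behaviour, assuming each obs row carries a column naming the element ("region") it annotates, as in the ngff tables proposal; all names here are illustrative:

```python
import anndata as ad
import numpy as np
import pandas as pd

# toy concatenated table: obs from both samples, tagged with their element
table = ad.AnnData(
    X=np.zeros((4, 1)),
    obs=pd.DataFrame({
        "region": ["points_visium", "points_visium", "labels_merfish", "labels_merfish"],
        "cluster": ["celltypeZ", "celltypeZ", "celltypeY", "celltypeZ"],
    }),
)

subset = table[(table.obs["cluster"] == "celltypeY").to_numpy()]
kept = set(subset.obs["region"])
print(kept)  # {'labels_merfish'} -> the visium elements would be dropped from sdata
```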
agree as well, if we decide to group ...

```python
...
# the annotation table (if present)
annotations: Optional[AnnData]  # view
...
```
Thanks for the explanation, @giovp!

**querying**
I don't think it's too early to discuss. I think now is the right time to discuss what operations we want to support!
I agree with this goal! I would state it in a slightly different way: "SpatialData provides simple methods for querying across datasets contained within a `SpatialData` object." I am not sure it will make sense for the API to be too "AnnData-like", as the data are not strictly tabular. In your example, what would each row in the table represent?

I don't think we need to expose a single table in order to allow sampling across the different datasets. Instead, I think we can make a sampling API that allows the user to request data that meets specific criteria (e.g., has an obs column value and is within a certain spatial bounding box). This is basically a query language (e.g., sql, graphql), which sounds big, but I think it can be well scoped (e.g., just queries based on spatial information and obs to start). For spatial queries, I think we can take inspiration from xarray, and for table queries, anndata, pandas, etc. Ideally, our building blocks and

Just spitballing, but perhaps something like this:

```python
sdata = SpatialData(
    images=[images_visium, images_merfish],
    labels=[labels_merfish],
    points=[points_visium],
    tables=[tables_visium, tables_merfish],
)

my_spatial_data_query = {
    # add query by the spatial location
    "spatial": {
        # get data within global coordinates x=0:10, y=5:15
        "coordinate_space": "global",
        "coordinates": {
            "x": [0, 10],
            "y": [5, 15]
        }
    },
    # add query by the building block annotations
    "annotations": {
        # get data where obs.cluster == "celltypeY"
        "obs": {
            "column": "cluster",
            "comparator": "equals",
            "value": "celltypeY"
        }
    }
}

data_view = sdata.query(my_spatial_data_query)
```

Some advantages of this approach:

Some disadvantages:
**table types**

The types of queries above are why I think we want different types for the tables.
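To make the semantics concrete, a toy evaluator for the "annotations" half of such a query dict (the "spatial" half would dispatch to the coordinate transforms); the comparator names are illustrative:

```python
import operator

import numpy as np
from anndata import AnnData

COMPARATORS = {
    "equals": operator.eq,
    "less_than": operator.lt,
    "greater_than": operator.gt,
}


def evaluate_obs_query(adata: AnnData, obs_query: dict) -> np.ndarray:
    # returns a boolean obs mask for a query like the "annotations" block above
    column = adata.obs[obs_query["column"]]
    comparator = COMPARATORS[obs_query["comparator"]]
    return comparator(column, obs_query["value"]).to_numpy()


# mask = evaluate_obs_query(table, my_spatial_data_query["annotations"]["obs"])
```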
thanks for the explanation @kevinyamauchi, I very much agree with your points, but I think we could essentially support both styles of query from the start.

**querying**
an observation (either cell or spot).
potentially yes, but I believe if the idea is to analyze them together (in the same spatialdata object) then the differences would be minimal (e.g. same cell types etc.). MuData is also an option we could consider already early on.
I don't think this is a problem: since both of these data representations are "regions", they have the same metadata as defined by ngff. The radius would be an attribute of the
Yes, and we could have a mudata object for that. The storage would still be an anndata, but the in-memory representation could be a mudata.
sorry, I don't understand what you mean here. If you mean a column mentioning which regions the table annotates, this is already part of the ngff proposal.
I 100% agree on this, this type of query should be used for the image-level data and could be extended to operate on the table-level data.

**table types**
this is interesting, I was thinking that an even easier approach (discussed briefly with @LucaMarconato) was to store e.g. visium

I think that even if we want to keep the annotation table and "region" (labels, points) together (in the same class, as you proposed above), we could still keep the building block as simple as possible without the need to specify a subtype for the tables.

this is a very interesting discussion and indeed we should probably come to a conclusion relatively early on, to then focus and build the API. I think I can totally be convinced that we should focus on the type of query you describe for both spatial and tables early on (and maybe even only support that). Yet I would not dismiss the anndata-style query just yet, as I don't see it as only ergonomics.
**spatial queries**

I like the idea, especially that you can serialize the queries.

**building blocks**
I'll also comment on this. There are some imprecisions above. The original idea was to have
But then we opted for a more explicit alternative, which expands
From the storage point of view, points and circles are the same, except that they require different types of indices, so we are examining if we can simplify the structure above to:
This differs from the two previous options because now we also allow single molecule data to be annotated with rows from

I am fully convinced of the advantages of having a merged table with information across multiple samples, when the type of spatial information for each sample is the same. I also think that representing the table as a muon object would make the object more flexible. But I think that muon should not be used to represent annotations of different modalities when the obs are not of the same type and refer to overlapping spatial regions. That will be technically possible, but I would discourage it. In fact, the power of SpatialData relies on the independence of the building blocks and the possibility to merge and split them at the user's convenience. For complex datasets I would use either a collection of SpatialData objects, or even better the idea of the SpatialDataPool (renamed SpatialDataContainer), which allows creating on the fly the SpatialData objects that one needs. To stay practical, since right now we are coding the SpatialData class and not SpatialDataContainer, I would put emphasis on coding methods that allow combining and separating building blocks.
This is also what I have in mind, thanks for the clarification @LucaMarconato! Also @kevinyamauchi, this way we could have:
And from the storage side they are both Tables, yet with the same coordinateTransform and metadata as Labels.
👍
Could you elaborate on that? I don't think I fully get it; isn't this the purpose of mudata?
Agree, another option would be to support the spatial query for both
In the case of spatial data you can instead have very different layers that overlap in space but not in a unique way: for instance, if you have the output of two segmentation algorithms you will have overlapping regions that are not pixel-perfect matches, or if you have consecutive slides you can heuristically match instances, but if you change the heuristic/alignment the mapping will be different. So the mapping between different obs is not unique.
Note that we can always create

Another point is that in MuData the obs are all of the same type, like cells, while in our case we could have cells, regions, larger regions, etc. And to complicate things, they can all be overlapping in space. So I think it would be more polished to make those tables live in different objects. A more appropriate data structure would not just allow multiple modalities (= different classes of vars), but would also allow different types of obs. @gtca is actually experimenting with this (maybe you could point to some code?), but this goes beyond MuData.
Hey, I haven't updated myself on the discussion above, but just in order to provide some context about the observation groups that @LucaMarconato mentioned, here it is. I'm experimenting with a grid-like data structure, which is pretty lean but generalises both AnnData and MuData.

```python
# one scrna-seq dataset
data["pbmc3k", "genes"]

# another, multiome dataset
data["pbmc10k", "genes"]
data["pbmc10k", "peaks"]
```

That gives other benefits, but this is not an AnnData-like API. E.g. a table can also be defined across datasets:

```python
data["pbmc3k+pbmc10k", "qc"]
```

I am not sure this is the approach you need here, but if you're curious about any details, let me know.
thanks, this is very clear and very convincing. From this point of view then, mudata would really only be useful for "true" multimodal spatial data (e.g. visium with RNA + CITE-seq (or AIRR, see scverse/scirpy#354)).
Dialing back on the anndata-like query, i.e. the ability to subset SpatialData like
Yes.
That should work, let's give it a try.
Just catching up on all this. I have a couple questions about this:

**Single table in SpatialData**

From #12 (comment)

It seems like we could move to SpatialData having a single table. E.g. one SpatialData object refers to one set of observations. This was my initial understanding of what the SpatialData object would do, since it fits with the analysis capabilities of scverse tools. Can we agree on this scope?

**MuData in SpatialData**

I think it would be useful to have a multimodal SpatialData object. This is for use cases like the ones we talked about at EMBL, where you have created a segmentation based on one modality, then applied it to another. This seems like the case for most spatial transcriptomic experiments (e.g. fluorescence for segmentation, then counting transcripts within those masks).
I agree that there are kinds of spatial data analyses where having a common set of observations doesn't make sense. But I think there are plenty of multimodal operations where you do use the same set of observations. Would it make sense to have a SpatialData object that has a MuData for annotations specifically for the case where observations are shared across modalities?
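For that shared-observations case, plain MuData already expresses the idea; a minimal sketch (the hookup into `SpatialData` is hypothetical):

```python
import numpy as np
from anndata import AnnData
from mudata import MuData

n_cells = 100  # one segmentation -> the same set of obs in both modalities
rng = np.random.default_rng(0)
rna = AnnData(rng.poisson(1.0, size=(n_cells, 500)).astype(np.float32))
protein = AnnData(rng.poisson(1.0, size=(n_cells, 30)).astype(np.float32))

mdata = MuData({"rna": rna, "protein": protein})
# sdata = SpatialData(labels=[segmentation], tables=mdata)  # hypothetical
```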
Yes, let's proceed this way (the latest commit from Giovanni from this evening has now one single table).
I am not convinced it would work. I think the best will be to open a branch and make some experiments.
Some technical details of what I am using at the moment (particularly relevant to @giovp); some of these choices are driven by coding needs, need to be reviewed, and finally put into the design doc.

**Images**

**Points**

**Tables**
yes!
Woo! Amazing. 🚀 🚀 🚀
**Current proposal 04-08-22**

| Element | Representation |
| --- | --- |
| Image | NDArray |
| Labels | AnnData, NDArray |
| Points | AnnData, AnnData |
| Polygons | AnnData |
| Tables | AnnData |