SpatialData specs proposal #12

Closed
giovp opened this issue Aug 4, 2022 · 23 comments

@giovp
Member

giovp commented Aug 4, 2022

Current proposal 04-08-22

Image

  • Annotation: No (MUST)
  • Type: NDArray
  • NGFF spec: Image
  • Group: Image
  • coordinateTransform: Yes (same as ngff)
  • Misc: None

Labels

  • Annotation: Yes (MAY)
    • Type: AnnData
    • NGFF spec: Tables
    • Group: Tables or Image
  • Type: NDArray
  • NGFF spec: Labels
  • Group: Image
  • coordinateTransform: Yes (same as ngff)
  • Misc: None

Points

  • Annotation: Yes (MAY)
    • Type: AnnData
    • NGFF spec: Tables
    • Group: separate group or same group
  • Type: AnnData
  • NGFF spec: Tables
  • coordinateTransform: Yes (same as ngff)
  • Misc: it MAY contain additional parameters (e.g. radius)

Polygons

  • Annotation: Yes (MAY)
    • Type: AnnData
    • NGFF spec: Tables
    • Group: separate group or same group
  • Type: ?
  • NGFF spec: ?
  • coordinateTransform: Yes (same as ngff)?
  • Misc: not implemented.

Tables

  • Annotation: No (MUST)
  • Type: AnnData
  • NGFF spec: Tables
  • coordinateTransform: No (MUST)
  • Misc: it can annotate Labels, Polygons and Points
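
To make the storage side concrete, here is a minimal sketch of how these building blocks could map onto groups of a single Zarr store (the store name and group names below are illustrative assumptions, not part of the proposal):

import zarr

root = zarr.open_group("example.spatialdata.zarr", mode="w")

root.create_group("images")    # NGFF Image spec, NDArray data
root.create_group("labels")    # NGFF Labels spec, NDArray data, MAY be annotated by a table
root.create_group("points")    # AnnData written under the NGFF Tables spec
root.create_group("polygons")  # storage format still undecided
root.create_group("tables")    # AnnData annotating Labels, Polygons and Points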
@kevinyamauchi
Collaborator

kevinyamauchi commented Aug 4, 2022

Hey @giovp. Thanks for this! Just to clarify, are these the "building block" classes/types that you are proposing?

If so, this overall looks good to me. I have left a couple of minor comments below. As a follow-up, I would be interested in seeing what the building block API might look like (i.e., what properties and methods these objects will have).

Tables

Annotation: No (MUST)
Type: AnnData
NGFF spec: Tables
coordinateTransform: No (MUST)
Misc: it can annotate Labels, Polygons and Points

Do we want to explicitly have different types of tables based on what they are annotating (e.g., PointsTable, LabelsTable)? I think it makes sense to be explicit so one knows what the rows represent. Also, I think we will likely have this in the OME-NGFF spec.

Polygons

Annotation: Yes (MAY)
Type: AnnData
NGFF spec: Tables
Group: separate group or same group
Type: ?
NGFF spec: ?
coordinateTransform: Yes (same as ngff)?
Misc: not implemented.

May be of interest: example of storing polygons in an AnnData table

https://forum.image.sc/t/roi-annotations-and-tracking-information-in-ome-zarr/65975

@kevinyamauchi
Collaborator

kevinyamauchi commented Aug 5, 2022

What about something like this for the API of the building blocks to start? Based on #11 , it looks pretty similar to what @giovp and @LucaMarconato are thinking.

import abc
from typing import Any, Dict, Optional, Tuple

from anndata import AnnData


class BaseBuildingBlock(abc.ABC):
    # store the actual data (e.g., array for image, coordinates for points, etc.)
    data: Any

    # the annotation table (if present)
    annotations: Optional[AnnData]

    # store the transform objects as a dictionary with keys (source_coordinate_space, destination_coordinate_space)
    transforms: Dict[Tuple[str, str], "TransformObject"]

    def to_ome_ngff(self, path: str) -> None:
        """method to write to ome-ngff zarr file - could be called to_zarr?"""

    @abc.abstractmethod
    def transform(self, new_coordinate_space: str, inplace: bool = False):
        """convenience method to transform the object to a new coordinate space"""

    @classmethod
    @abc.abstractmethod
    def from_path(cls, path: str) -> "BaseBuildingBlock":
        """constructor for instantiating from paths (maybe split into local and remote?)"""
        raise NotImplementedError

Things to be added:

  • method for spatial query
  • method for query based on annotations (e.g., return object with just the data where the annotations meet some criteria)
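
As a rough sketch of how a concrete building block could fill in this interface (the class name, constructor and placeholder bodies below are illustrative assumptions, not a final design):

import numpy as np


class ImageBlock(BaseBuildingBlock):
    """Illustrative image building block; not a final design."""

    def __init__(self, data: np.ndarray, transforms=None):
        self.data = data
        self.annotations = None          # images are not annotated in the proposal above
        self.transforms = transforms or {}

    def transform(self, new_coordinate_space: str, inplace: bool = False):
        # look up the transform for (current space, new_coordinate_space) and apply it
        raise NotImplementedError

    @classmethod
    def from_path(cls, path: str) -> "ImageBlock":
        # e.g. read an OME-NGFF zarr image from disk
        raise NotImplementedError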

Annotations with data?

What do you all think about grouping the annotations and data together (e.g., label image + label table in the same building block object)? This makes a lot of things more convenient (e.g., querying the data based on annotation/feature values). However, then people who just want to process the tables will need to separate the tables from the building block object (e.g., [building_block.annotations for building_block in all_building_blocks]). It seems like we could make a convenience method on the SpatialData object to make that easy (e.g., a labels_table property)?

To me it seems the convenience of grouping the annotations with the data seems worth it, as the user doesn't need to keep track of two objects and querying will be nicer. Thoughts?

@giovp
Member Author

giovp commented Aug 5, 2022

Do we want to explicitly have different types of tables based on what they are annotating (e.g., PointsTable, LabelsTable)? I think it makes sense to be explicit so one knows what the rows represent. Also, I think we will likely have this in the OME-NGFF spec.

interesting, didn't think about this. I think it could potentially make sense, yes, although it might also be redundant? As in, if it's an annotation table, I feel it should have the same entries/attributes regardless of whether it's a PointsTable or LabelsTable.

May be of interest: example of storing polygons in an AnnData table

https://forum.image.sc/t/roi-annotations-and-tracking-information-in-ome-zarr/65975

maybe I misunderstood, but by looking at the repo, they are not storing polygons in anndata, but polygon features and the centroid positions. The polygons are stored as collections of csv files (couldn't open them, I guess they're the vertices) under labels: https://github.com/openssbd/bdz/tree/main/wt-N2-081015-01.ome.zarr/labels/masks/0

I think what we'd want is to store polygons directly in anndata, the same way that we store points. One way to do it is via awkward arrays or geopandas; the former might get into anndata faster, but I don't know what the support from the ngff side would be. I believe it's best to wait and see what they want to do with polygons; maybe it's worth pinging on Zulip to get a general feeling of what direction they want to take.

@giovp
Member Author

giovp commented Aug 5, 2022

(e.g., querying the data based on annotation/feature values). However, then people who just want to process the tables will need to separate the tables from the building block object (e.g., [building_block.annotations for building_block in all_building_blocks]). It seems like we could make a convenience method on the SpatialData object to make that easy (e.g., labels_table property)?

great point, I think it indeed makes sense to start discussing this and get an early prototype working. I agree it might make sense. However, I'd also think that we might want to have a single anndata table for the object by default, as it might be easier to design an API around for cropping/subsetting. I think it'd be nice to have something like this:

sdata = SpatialData(
    images = [image1, image2, image3],
    labels = [labels1, labels2],
    points = [points3],
    tables = table
)

sdata[sdata.obs.cluster == "celltypeZ"]
>>> SpatialData with:
>>>     images: [image1, image3]
>>>     labels: [labels1]
>>>     points: [points3]
>>>     table: n_obs X n_vars

sdata.images[["image1", "image2"]].crop(0, 2000, 0, 2000)
>>> SpatialData with:
>>>     images: [image1, image2]
>>>     labels: [labels1, labels2]
>>>     points: []
>>>     table: n_obs X n_vars

it would anyway be possible if we store views of the tables associated to labels. What do you think?

@LucaMarconato
Member

I would enable views of tables associated to labels in separate SpatialData objects, created on the fly and pointing to the same memory/file, but with the possibility to detach them and then merge the information back thanks to helper functions.

@kevinyamauchi
Collaborator

kevinyamauchi commented Aug 5, 2022

However, I'd also think that we might want to have a single anndata table for the object by default, as it might be easier to design an api around for cropping subsetting

Thanks for the feedback, @giovp . I'm not sure if I am understanding you correctly. Just to make sure I am on the same page, here is my current understanding of the classes:

  • SpatialData: this is a class that represents a dataset that may include multiple building blocks. Generally speaking, this object will map to a OME-NGFF file where the individual Image, Labels, etc. building blocks will be datasets in the OME-NGFF file. Just to be concrete, if you have a visium experiment, the SpatialData object would have an Image (H&E image) and Points (measured expression at each spot). The Points would have a table that has the gene expression for each spot.
  • Image, Labels, Points, are all "building blocks"

Is that correct?

If so, what does the "single table" represent? Would each Labels, Points, etc. each have their own table?

Similarly, what is sdata.obs referring to in your sdata[sdata.obs.cluster == "celltypeZ"] example?

@kevinyamauchi
Collaborator

I would enable views of tables associated to labels in separate SpatialData objects, either created on the fly and pointing to the same memory/file, but with the possibility to detach them and then merge the information back thanks to helper functions.

I agree that when slicing/sampling the SpatialData object, a view should be created that is a reference to the original object (likely with an option to make a copy).

@giovp
Member Author

giovp commented Aug 6, 2022

Thanks for the feedback, @giovp . I'm not sure if I am understanding you correctly. Just to make sure I am on the same page, here is my current understanding of the classes:

SpatialData: this is a class that represents a dataset that may include multiple building blocks. Generally speaking, this object will map to a OME-NGFF file where the individual Image, Labels, etc. building blocks will be datasets in the OME-NGFF file. Just to be concrete, if you have a visium experiment, the SpatialData object would have an Image (H&E image) and Points (measured expression at each spot). The Points would have a table that has the gene expression for each spot.
Image, Labels, Points, are all "building blocks"
Is that correct?

that is correct, I would describe the class the same way.

If so, what does the "single table" represent? Would each Labels, Points, etc. each have their own table?

potentially yes, but I'm thinking more and more that we might want to have a single, concatenated table to expose the table API at the higher level of the hierarchy for SpatialData.

Let's consider the case of 1 Visium and 1 MERFISH slide in the same spatial dataset, of the same biological specimen:

sdata = SpatialData(
    images = [images_visium, images_merfish],
    labels = [labels_merfish],
    points = [points_visium],
    tables = [tables_visium, tables_merfish]
)

Now let's say that the analyst wants to subset the object by only selecting cells of cluster celltypeZ.
If we expose one single table, we can support slicing such as:

sdata[sdata.obs.cluster == "celltypeZ"]
>>> SpatialData with:
>>> images = [images_visium, images_merfish],
>>> labels = [labels_merfish],
>>> points = [points_visium],
>>> table: n_obs X n_vars # with obs only of `celltypeZ`

so in that case, obs would be the obs of the single table. This could potentially be helpful, as we could inherit a bunch of methods from anndata (e.g. concat), which we'd have to extend to propagate the results, according to the region and region_key instances, to the rest of the building blocks (sketched after the example below).

e.g. if the filter is on celltypeY, only present in MERFISH, then the result would be:

sdata[sdata.obs.cluster == "celltypeY"]
>>> SpatialData with:
>>> images = [images_merfish],
>>> labels = [labels_merfish],
>>> points = [],
>>> table: n_obs X n_vars # with obs only of `celltypeY`
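
To make the propagation concrete, here is a minimal sketch assuming the single table records which element each row annotates in an obs column (called region_key here, following the ngff tables proposal; names and values are toy examples):

import numpy as np
import pandas as pd
from anndata import AnnData

# toy concatenated table: two rows annotate the MERFISH labels, one the Visium points
table = AnnData(
    X=np.zeros((3, 1)),
    obs=pd.DataFrame(
        {
            "cluster": ["celltypeY", "celltypeZ", "celltypeZ"],
            "region_key": ["labels_merfish", "labels_merfish", "points_visium"],
        },
        index=["cell_0", "cell_1", "cell_2"],
    ),
)

# subset the single table, then propagate: keep only the elements still referenced
filtered = table[table.obs["cluster"] == "celltypeY"]
referenced = set(filtered.obs["region_key"])                   # {"labels_merfish"}

all_elements = {"labels_merfish": ..., "points_visium": ...}   # stand-ins for building blocks
kept_elements = {k: v for k, v in all_elements.items() if k in referenced}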

I understand this can be a stretch, and that we can come up with many counter-examples where this would not be desirable. However, I feel one strength of SpatialData could really be to have an AnnData-like API with additional methods to operate on the images. I'd love to hear @ivirshup's thoughts on this.

Anyway, just thoughts for now, maybe too early to discuss this ....


I agree that when slicing/sampling the SpatialData object, a view should be created that is a reference to the original option (likely with an option to make a copy).

agree as well, if we decide to group Labels and AnnotationTables for the same data in the same building block, we could assign a view by default to the attribute you described above:

...
    # the annotation table (if present)
    annotations: Optional[AnnData]  # view
...

@kevinyamauchi
Collaborator

kevinyamauchi commented Aug 7, 2022

Thanks for the explanation, @giovp !

querying

Anyway, just thoughts for now, maybe too early to discuss this ....

I don't think it's too early to discuss. I think now is the right time to discuss what operations we want to support!

However, I feel one strength of SpatialData could really be to have an AnnData-like API with additional methods to operate on the images. I'd love to hear @ivirshup thoughts on this.

I agree with this goal! I would state it in a slightly different way: "SpatialData provides simple methods for querying across datasets contained within a `SpatialData` object." I am not sure it will make sense for the API to be too "AnnData-like", as the data are not strictly tabular.

In your example, what would each row in SpatialData.table represent? Would it be either a row from the labels_merfish table or the points_visium table? If so, does that mean we will have columns with a bunch of nan when the concatenated tables have different columns? For example, points_visium will have a radius, but labels_merfish won't. Also, will the columns in X likely vary widely across modalities? Will we also have to add columns to indicate the source data?

I don't think we need to expose a single table in order to allow sampling across the different datasets. Instead, I think we can make a sampling API that allows the user to request data that meets specific criteria (e.g., has an obs column value and is within a certain spatial bounding box). This is basically a query language (e.g., sql, graphql), which sounds big, but I think it can be well scoped (e.g., just queries based on spatial information and obs to start). For spatial queries, I think we can take inspiration from xarray, and for table queries, anndata, pandas, etc. Ideally, our building blocks and SpatialData objects will have the same query API.

Just spitballing, but perhaps something like this:

sdata = SpatialData(
    images = [images_visium, images_merfish],
    labels = [labels_merfish],
    points = [points_visium],
    tables = [tables_visium, tables_merfish]
)

my_spatial_data_query = {
    # add query by the spatial location
    "spatial": {
        # get data within global coordinates x=0:10, y=5:15
        "coordinate_space": "global",
        "coordinates": {
            "x": [0, 10],
            "y": [5, 15]
        }
    },

    # add query by the building block annotations
    "annotations": {
        # get data where obs.cluster == "celltypeY"
        "obs": {
            "column": "cluster",
            "comparator": "equals",
            "value": "celltypeY"
        }
    }
}

data_view = sdata.query(my_spatial_data_query)

Some advantages of this approach:

  • queries can be validated before running (might be nice for large data)
  • works well for mixing spatial and tabular queries
  • much more readable than a long pandas-style slicing operation (e.g., table.loc[(column_a == a) & (column_b == b) & (column_c == c) & ...])
  • queries are easily serialized; this is useful for saving and quickly recalling a big query (see the sketch after the next list).

Some disadvantages:

  • may be unfamiliar to people used to anndata/pandas style query. However, if this ends up being a huge deal, we can consider adding a pandas-style .loc and .iloc method that translates the anndata/pandas style query into our query.
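
On the serialization advantage above, a minimal sketch assuming the query stays a plain dict of JSON-compatible values (reusing the hypothetical sdata.query from the example above):

import json

# the query round-trips through JSON, so it can be saved and recalled later
with open("my_query.json", "w") as f:
    json.dump(my_spatial_data_query, f, indent=2)

with open("my_query.json") as f:
    recalled_query = json.load(f)

data_view = sdata.query(recalled_query)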

table types

Do we want to explicitly have different types of tables based on what they are annotating (e.g., PointsTable, LabelsTable)? I think it makes sense to be explicit so one knows what the rows represent. Also, I think we will likely have this in the OME-NGFF spec.

interesting, didn't think about this, I think it'd make sense potentially yes, although maybe also redundant? As in, if it's an annotation table, I feel it should have the same entries/attributes regardless whether it's a PointsTable or LabelsTable.

The types of queries above are why I think we want different types for PointsTable, LabelsTable, etc. Having a different type for specific tables allows us to specify where key information is stored (e.g., point coordinates should be stored in a specific obsm field). While more rigid, this will make querying much easier. If we want to allow for flexibility, we could consider allowing the user to define a schema which says where information important for querying is stored. I'm not sure we should start there though. Ideally we can build standards around a lot of this stuff...

@giovp
Member Author

giovp commented Aug 7, 2022

thanks for the explanation @kevinyamauchi, I very much agree with your points but I think we could essentially support both styles of query from the start.


querying

In your example, what would each row in SpatialData.table represent?

an observation (either cell or spot).

Would it be either a row from the labels_merfish table or the points_visium table? If so, does that mean we will have columns with a bunch of nan when the concatenated tables have different columns?

potentially yes, but I believe that if the idea is to analyze them together (in the same SpatialData object) then such differences would be minimal (e.g. same cell types etc.). MuData is also an option we could consider early on.

For example, points_visium will have a radius, but labels_merfish won't.

I don't think this is a problem, since both of these data representations are "regions", they have the same metadata as defined by ngff. The radius would be an attribute of the points_visium but it would not concern the annotation table that stores genes and cell annotations.

Also, will the columns in X likely vary widely across modalities?

Yes and we could have a mudata object for that. The storage would still be an anndata but the in-memory representation could be a mudata.

Will we also have to add columns to indicate the source data?

sorry, I don't understand what you mean here. If you mean a column mentioning which regions the table annotates, this is already part of the ngff proposal (regions, region_key and instance_key).

I don't think we need to expose a single table in order to allow sampling across the different datasets. Instead, I think we can make a sampling API that allows the user to request data that meets specific criteria (e.g., has an obs column value and is within a certain spatial bounding box). This is basically a query language (e.g., sql, graphql), which sounds big, but I think it can be well scoped (e.g., just queries based on spatial information and obs to start). For spatial queries, I think we can take inspiration from xarray and for table queries, anndata, pandas, etc.. Ideally, our building blocks and SpatialData objects will has the same query API.

I 100% agree on this, this type of query should be used for the image-level data and could be extended to operate on the table-level data.


table types

The types of queries above is why I think we want different types for PointsTable, LabelTable, etc. Having a different type for specific tables allows us to specify where key information is stored (e.g., point coordinates should be stored in a specific obsm field).

this is interesting, I was thinking that an even easier approach (discussed briefly with @LucaMarconato) was to store e.g. Visium Points as an anndata with shape (N, 2) (also in X, no need to be in obsm) and the radius, coordinate transform etc. in .zattrs. Instead, gene expression, cluster annotation etc. that annotate Points would be stored as a Table (another AnnData) separately.

I think that even if we want to keep annotation table and "region" (labels, points) together (in the same class, as you proposed above), we could still keep the building block as simple as possible without the need to specify a subtype for the tables.


this is a very interesting discussion and indeed we should probably come to a conclusion relatively early on so we can focus and build the API. I think I can totally be convinced that we should be focusing on the type of query you describe for both spatial and tables early on (and maybe even only supporting that). Yet I would not dismiss the anndata-style query just yet, as I don't see it as only ergonomics.

@LucaMarconato
Member

LucaMarconato commented Aug 8, 2022

spatial queries

I like the idea, especially that you can serialize the queries.

building blocks

In your example, what would each row in SpatialData.table represent? Would it be either a row from the labels_merfish table or the points_visium table? If so, does that mean we will have columns with a bunch of nan when the concatenated tables have different columns? For example, points_visium will have a radius, but labels_merfish won't. Also, will the columns in X likely vary widely across modalities? Will we also have to add columns to indicate the source data?

I'll also comment on this. There are some imprecisions above. The original idea was to have

SpatialData(
      table,
      images,
      points,
      regions,
  ) -> None:

But then we opted for a more explicit alternative, which expands regions into the types of regions we have:

SpatialData(
      table,
      images,
      points,
      circles,  # previously called shapes
      labels,  # previously called raster regions (i.e. tensor segmentation masks)
      polygons  # for the moment skipped, not implemented
  ) -> None:

From the storage point of view, points and circles are the same, except that they require different types of indices, so we are examining if we can simplify the structure above to:

SpatialData(
      table,
      images,
      points,
      labels,  # previously called raster regions (i.e. tensor segmentation masks)
      polygons  # for the moment skipped, not implemented
  ) -> None:

This differs from the two previous options because now we also allow single-molecule data to be annotated with rows from table, while in the original design that we discussed in the Heidelberg hackathon, the points slot could not be annotated with a table, and it already contained all the information needed (i.e. the gene of each single-molecule location and eventual point spread function information).

I am fully convinced of the advantage of having a merged table with information across multiple samples, when the type of spatial information for each sample is the same. I also think that representing the table as a muon object would make the object more flexible. But I think that muon should not be used to represent annotations of different modalities when the obs are not of the same type and refer to overlapping spatial regions. That will be technically possible, but I would discourage it. In fact the power of SpatialData relies on the independence of the building blocks and the possibility to merge and split them at the user's convenience. For complex datasets I would use either a collection of SpatialData objects, or, even better, the idea of the SpatialDataPool (renamed SpatialDataContainer), which allows creating on the fly the SpatialData objects that one needs.

To stay practical, since now we are coding the SpatialData class and not SpatialDataContainer, I would put the stress on coding methods that allow combining and separating building blocks.

@giovp
Member Author

giovp commented Aug 9, 2022

SpatialData(
      table,
      images,
      points,
      labels,  # previously called raster regions (i.e. tensor segmentation masks)
      polygons  # for the moment skipped, not implemented
  ) -> None:

This differs from the two previous options because now we also allow single-molecule data to be annotated with rows from table, while in the original design that we discussed in the Heidelberg hackathon, the points slot could not be annotated with a table, and it already contained all the information needed (i.e. the gene of each single-molecule location and eventual point spread function information).

This is also what I have in mind, thanks for the clarification @LucaMarconato !

Also @kevinyamauchi this way we could have:

  • Transcript locations (MERFISH): coordinates saved in adata.X and gene names (and other metadata) saved in adata.obs. Unclear where to store the spatial index yet.
  • Visium spots: coordinates saved in adata.X and radius and metadata in .zattrs

And from the storage side they are both Tables, yet with the same coordinateTransform and metadata as Labels.
They could of course also be annotated by a Table (a feature Table), which makes a lot of sense for Visium, a bit less for MERFISH.

I am fully convinced of the advantage of having a merged table with information across multiple samples, when the type of spatial information for each sample is the same. I also think that representing the table as a muon object would make the object more flexible.

👍

But I think that muon should not be used to represent annotations of different modalities when the obs are not of the same type and refer to overlapping spatial regions. That will be technically possible, but I would discourage it.

Could you elaborate on that? Don't think I fully get it, isn't this the purpose of mudata?

In fact the power of SpatialData relies on the independence of the building blocks and the possibility to merge and split them at the user's convenience. For complex datasets I would use either a collection of SpatialData objects, or, even better, the idea of the SpatialDataPool (renamed SpatialDataContainer), which allows creating on the fly the SpatialData objects that one needs.

To stay practical, since now we are coding the SpatialData class and not SpatialDataContainer, I would put the stress on coding methods that allow combining and separating building blocks.

Agree, another option would be to support the spatial query for both SpatialData and SpatialDataContainer and the anndata query only for SpatialData; thus SpatialData would require more assumptions on the dataset structure.

@LucaMarconato
Member

Agree, another option would be to support the spatial query for both SpatialData and SpatialDataContainer and the anndata query only for SpatialData; thus SpatialData would require more assumptions on the dataset structure.

I think it should work nicely, I like this option.

Could you elaborate on that? Don't think I fully get it, isn't this the purpose of mudata?

MuData is designed to handle the case of having, for example, 1000 cells, 900 of them have 3 modalities, 50 have only one modality, and 50 have only one modality but a different one. So basically it represents a multimodal readout for a set of instances, allowing for some incomplete data.

In the case of spatial data you can have instead very different layers that overlap in space but not in a unique way, for instance if you have the output of two segmentation algorithms you will have overlapping regions but not pixel-perfect, or if you have consecutive slides you can heuristically match instances, but if you change heuristic/alignment the mapping will be different. So the mapping between different obs is:

  1. Based on space
  2. Based on the particular alignment/segmentation heuristic the user chooses

This goes against what MuData expects, that is, that cells/obs are known immutable objects, and the only source of missing data/mismatch is that some of them have missing modalities.

Note that we can always create MuData objects on the fly after the instances are determined.
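
For example, a minimal sketch assuming two modality tables have already been matched to the same set of instances (toy data; MuData takes a dict of AnnData objects):

import numpy as np
import pandas as pd
from anndata import AnnData
from mudata import MuData

obs = pd.DataFrame(index=[f"cell_{i}" for i in range(3)])
rna = AnnData(np.random.rand(3, 5), obs=obs)
prot = AnnData(np.random.rand(3, 2), obs=obs)

# built on the fly once the instances (here, cells) are determined
mdata = MuData({"rna": rna, "prot": prot})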

Another point is that in MuData the obs are all of the same type, like cells, while in our case we could have cells, regions, larger regions, etc. And to complicate things, they can all be overlapping in space. So I think it would be more polished to make those tables live in different objects. A more appropriate data structure would not just allow multiple modalities (= different classes of vars), but would also allow different types of obs. @gtca is actually experimenting with this (maybe you could point to some code?), but this goes beyond MuData.

@gtca
Collaborator

gtca commented Aug 9, 2022

Hey, I haven't caught up on the discussion above, but just to provide some context about the observation groups that @LucaMarconato mentioned, here it is.

I'm experimenting with a grid-like data structure, which is pretty lean but generalises both AnnData and MuData.
The code is WIP in a private repo at the moment, but the gist is using labelled axes (and getting rid of pandas indices), e.g. imagine an object data that is aware of those axes and stores references to NumPy / PyTorch / JAX arrays (or any object with 2 dimensions like tables / data frames):

# one scrna-seq dataset
data["pbmc3k","genes"]
# another multiome dataset
data["pbmc10k","genes"]
data["pbmc10k","peaks"]

That gives other benefits but this is not an AnnData-like API.

E.g. a table can also be defined across datasets:

data["pbmc3k+pbmc10k","qc"]

I am not sure this is the approach you need here but if you're curious about any details, let me know.

@giovp
Member Author

giovp commented Aug 9, 2022

Agree, another option would be to support the spatial query for both SpatialData and SpatialDataContainer and the anndata query only for SpatialData; thus SpatialData would require more assumptions on the dataset structure.

I think it should work nicely, I like this option.

Could you elaborate on that? Don't think I fully get it, isn't this the purpose of mudata?

MuData is designed to handle the case of having, for example, 1000 cells, 900 of them have 3 modalities, 50 have only one modality, and 50 have only one modality but a different one. So basically it represents a multimodal readout for a set of instances, allowing for some incomplete data.

In the case of spatial data you can have instead very different layers that overlap in space but not in a unique way, for instance if you have the output of two segmentation algorithms you will have overlapping regions but not pixel-perfect, or if you have consecutive slides you can heuristically match instances, but if you change heuristic/alignment the mapping will be different. So the mapping between different obs is:

  1. Based on space
  2. Based on the particular alignment/segmentation heuristic the user chooses

This goes against what MuData expects, that is, that cells/obs are known immutable objects, and the only source of missing data/mismatch is that some of them have missing modalities.

Note that we can always create MuData objects on the fly after the instances are determined.

Another point is that in MuData the obs are all of the same type, like cells, while in our case we could have cells, regions, larger regions, etc. And to complicate things, they can all be overlapping in space. So I think it would be more polished to make those tables live in different objects. A more appropriate data structure would not just allow multiple modalities (= different classes of vars), but would also allow different types of obs. @gtca is actually experimenting with this (maybe you could point to some code?), but this goes beyond MuData.

thanks, this is very clear and very convincing. From this point of view then, mudata would really only be useful for "true" multimodal spatial data (e.g. visium with RNA + CITE-seq (or AIRR, see scverse/scirpy#354 )).

@giovp
Member Author

giovp commented Aug 9, 2022

Coming back to the anndata-like query, i.e. the ability to subset SpatialData like SpatialData[SpatialData.obs.cluster == "cluster"] (or see these examples: #12 (comment)), a couple of points:

  • should we then support it from the start on SpatialData but not SpatialDataContainer (assumptions should be made)?
  • would something like this syntax make sense: SpatialData[SpatialData.obs.cluster == "cluster"]? Otherwise, if we want to support it, how would it look?

@LucaMarconato
Member

LucaMarconato commented Aug 9, 2022

should we then support it from the start on SpatialData but not SpatialDataContainer (assumptions should be made)?

Yes.

would something like this syntax make sense: SpatialData[SpatialData.obs.cluster == "cluster"]? Otherwise, if we want to support it, how would it look?

That should work, let's give it a try.

@ivirshup
Member

Just catching up on all this. I have a couple questions about this:

Single table in SpatialData

From #12 (comment)

I am fully convinced of the advantage of having a merged table with information across multiple samples, when the type of spatial information for each sample is the same.

It seems like we could move to SpatialData having a single table. E.g. one SpatialData object refers to one set of observations.

This was my initial understanding of what the SpatialData object would do, since it fits with the analysis capabilities of scverse tools.

Can we agree on this scope?

MuData in SpatialData

I think it would be useful to have a multimodal SpatialData object. This is for use cases like the ones we talked about at EMBL, where you have created a segmentation based on one modality, then applied it to another. This seems like the case for most spatial transcriptomic experiments (e.g. fluorescence for segmentation, then counting transcripts within those masks).

In the case of spatial data you can have instead very different layers that overlap in space but not in a unique way

I agree that there are kinds of spatial data analyses where having a common set of observations doesn't make sense. But I think there are plenty of multimodal operations where you do use the same set of observations.

Would it make sense to have a SpatialData object that has a MuData for annotations specifically for the case where observations are shared across modalities?

@LucaMarconato
Member

LucaMarconato commented Aug 16, 2022

Single table in spatial data

Yes, let's proceed this way (the latest commit from Giovanni from this evening now has one single table).

MuData in SpatialData

I am not convinced it would work. I think the best will be to open a branch and run some experiments.

@ivirshup changed the title from "SpatialData specs" to "SpatialData specs proposal" on Aug 24, 2022
@LucaMarconato
Member

LucaMarconato commented Sep 6, 2022

Some technical details of what I am using at the moment (particularly relevant to @giovp); some of these choices are driven by coding needs and need to be reviewed, and finally put into the design doc.

Images

  • At the moment axes is not implemented, so I am expecting one of the following three cases (sketched below):
    • [c, y, x], with c in {3, 4} -> treated as an RGB(A) image
    • [c, y, x], with c not in {3, 4} -> treated as a multi-channel image
    • [y, x] -> treated as a [1, y, x] image
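
A minimal sketch of that dispatch (the function name and return convention are illustrative, not the actual implementation):

from typing import Tuple

import numpy as np


def interpret_image(arr: np.ndarray) -> Tuple[np.ndarray, str]:
    if arr.ndim == 2:                               # [y, x] -> treated as a [1, y, x] image
        return arr[np.newaxis, ...], "multi-channel"
    if arr.ndim == 3 and arr.shape[0] in (3, 4):    # [c, y, x] with c in {3, 4}
        return arr, "rgb(a)"
    if arr.ndim == 3:                               # [c, y, x] with c not in {3, 4}
        return arr, "multi-channel"
    raise ValueError("expected an array with [c, y, x] or [y, x] axes")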

Points

  • This is an AnnData object whose shape is always (n, 0), with n being the number of points (a sketch of this layout follows after the list).
  • The actual spatial information is in .obsm['spatial'].
  • This object does not contain gene expression/feature annotation
    • The exception to the above is cell types for single-molecule data: the cell-type information is then stored in .obsm['spatial_type']. It's a bit of an inconsistency, we can discuss, but it is a consequence of unifying "circular regions" and "single-molecule points"
  • The radius is stored as an array and saved in .obsm['region_radius'] (n points with identical radius still require an array of length n, e.g. shape (1000,) for 1000 points).
  • The radius information could be stored in .zattrs instead.
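
A minimal in-memory sketch following the layout above (toy values; shapes are illustrative):

import numpy as np
from anndata import AnnData

n = 1000
points = AnnData(X=np.empty((n, 0)))                  # shape (n, 0): no features stored here
points.obsm["spatial"] = np.random.rand(n, 2)         # the actual spatial information
points.obsm["region_radius"] = np.full((n, 1), 5.0)   # one radius entry per point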

Tables

  • This object does not contain spatial information, only annotation.
  • The information regarding the mapping to regions is a triplet regions, regions_key, instance_key (see hackathon hackmd or code for usage). I am saving this in .uns['mapping_info']. Probably this should be moved to .zattrs and parsed into the Table building block object.
    • At the moment I am not doing it because the SpatialData object is initialized from a collection of arrays and AnnData objects and not from a collection of BaseElements (spatial building blocks), so it is handier to keep that in the AnnData object (sketched below).
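
A minimal sketch of such a Table in memory, with the triplet kept in .uns['mapping_info'] as described (column names and values are toy examples):

import numpy as np
import pandas as pd
from anndata import AnnData

# toy feature table annotating instances of two labels elements
table = AnnData(
    X=np.random.rand(4, 10),
    obs=pd.DataFrame(
        {
            "library": ["labels1", "labels1", "labels2", "labels2"],  # regions_key column
            "cell_id": [1, 2, 1, 2],                                  # instance_key column
        },
        index=[f"cell_{i}" for i in range(4)],
    ),
)

table.uns["mapping_info"] = {
    "regions": ["labels1", "labels2"],
    "regions_key": "library",
    "instance_key": "cell_id",
}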

@kevinyamauchi
Collaborator

@giovp, can we close this issue now that the design doc and #52 have merged?

@giovp
Member Author

giovp commented Dec 15, 2022

yes!

@kevinyamauchi
Collaborator

Woo! Amazing. 🚀 🚀 🚀
