Explicit STAC Types in JSON [was] Ability to identify STAC file types with tools like file #889
@m-mohr Can you give a little more than just a thumbs down? The entire idea? unix file (which isn't the most flexible tool)? The specific suggestion? If you were faced with a file of STAC json (possibly unsorted) and didn't have a STAC library, how would you distinguish the types? Some use cases I'm thinking of: A user sends me some stac files to be added to Earth Engine. Or all stac found by crawling the internet. Or someone puts some stac files and data in a GCS bucket and we want to figure out what it is and what issues there might be. |
Sorry for the thumbs down without a comment. I started with the reaction and started to comment, but got distracted and forgot it then. There are a couple of suggestions in the best practices guide, which should make it a bit better. Does your catalog follow them? The only thing that I think could be improved there is to allow collection.json for collections instead of using catalog.json. |
Sorry, I will try to clarify. Hopefully this will be more obvious. I was not thinking of file names at all, but the json contents. e.g. I might just be getting a string blob of json without any file as an RPC, from some database, or who knows where. I will look more closely at best-practices.md, but those don't seem to be along the lines of what I was intending with this issue. Users are likely going to generate and pass around STAC files. The names of files may be wonky at best. Users are not likely to create a tree as described in Catalog Layout e.g. with Earth Engine, we have > 500 collections, so naming them collection.json means we suddenly need a tree structure with one file in each directory. Users pass us yaml entries right now that don't have context. When we switch to STAC, they won't be in place for a user contribution. They may sometimes be malformed or copied from other sources without modifying them to fit into our structure. Does that make more sense? e.g. Since json doesn't allow shebangs ( {
"stac_version": "1.0.0-beta.2",
"stac_type": "Item", or {
"stac_version": "1.0.0-beta.2",
"stac_type": "Collection", or {
"stac_version": "1.0.0-beta.2",
"stac_type": "Catalog", Then I could submit a regex to the unix file magic database that would pick out the values for stac_version and stac_type. Then
And it would be trivial for any language with a JSON reader to lookup the stac_type field from the JSON and know what they were dealing with. |
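As a rough illustration of that lookup, here is a minimal Python sketch. It assumes the proposed `stac_type` field exists, which is only a suggestion in this thread and not part of the spec; the function name `read_stac_type` is made up for the example.

```python
import json


def read_stac_type(text):
    """Return (stac_version, stac_type) from a JSON string.

    `stac_type` is the field proposed in this issue, not an existing
    spec field. Missing keys come back as None.
    """
    doc = json.loads(text)
    if not isinstance(doc, dict):
        # Not a JSON object, so it cannot be a STAC entity.
        return None, None
    return doc.get("stac_version"), doc.get("stac_type")


blob = '{"stac_version": "1.0.0-beta.2", "stac_type": "Collection", "id": "example"}'
print(read_stac_type(blob))
```

Because the lookup goes through a JSON parser, it does not care where in the object the two keys appear.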
Note that the unix file command won't look very far into a file for the |
Interesting - I didn't know you could give hints like that to unix. I'm definitely +1 on a recommendation (though probably not a requirement) to put stac_version at the top. |
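To make the "only peek at the start of the file" idea concrete, here is a small Python sketch of a regex applied to the first bytes of a document. The 512-byte window, the pattern, and the name `sniff_stac_header` are assumptions for illustration only; this is not the actual syntax or behaviour of the unix file magic database, and `stac_type` is still just the field proposed in this issue.

```python
import re

# Assumed peek window; the real limit used by `file` is not stated here.
HEAD_BYTES = 512

# Matches "stac_version" immediately followed by "stac_type" (hypothetical field).
HEADER_PATTERN = re.compile(
    rb'"stac_version"\s*:\s*"([^"]+)"\s*,\s*"stac_type"\s*:\s*"([^"]+)"'
)


def sniff_stac_header(data):
    """Return (version, type) if both keys lead the document, else None."""
    match = HEADER_PATTERN.search(data[:HEAD_BYTES])
    if match is None:
        return None
    return match.group(1).decode(), match.group(2).decode()


ordered = b'{\n  "stac_version": "1.0.0-beta.2",\n  "stac_type": "Item",\n  "id": "x"\n}'
shuffled = b'{"id": "x", "stac_type": "Item", "stac_version": "1.0.0-beta.2"}'
print(sniff_stac_header(ordered))
print(sniff_stac_header(shuffled))
```

The second call finds nothing, which shows how much this approach depends on writers emitting the two keys first and in that order.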
JSON Objects don't have an order, that's coming from the JSON spec. So as long as you don't write JSON files by hand, you can't really guarantee in which order the properties occur in the file or you need a custom JSON writer that somehow sorts by property name. Some writers may do that already, but that's nothing we can really ask for. For example, the Google Earth Engine catalogs (I'm reading 0.6.2) changed the order of the properties in the past quite often. So it seems your writer can't guarantee an order. For the validation, it would be cleaner to have a type field. At the moment it's really a mess with reporting, as we need to validate against all schemas and guess what could be most appropriate. We could use type:Feature for Items already and just add type:Catalog and type:Collection or just say: no type=Catalog (and Collection is a catalog). I'm a bit hesitant to put another field (which must be required to be useful for validation and thus is breaking) in the 1.0 spec.
Strange, most larger catalogs I've seen have folders?! Conclusion: I don't think we can really do what you are asking for. I'd suggest using a library such as PySTAC. |
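On the ordering point: a writer can choose to emit certain keys first, even though readers must not rely on it. A minimal Python sketch, assuming a writer that preserves insertion order (Python's `json.dumps` does unless `sort_keys=True` is passed); the helper name `dump_with_leading_keys` and the `stac_type` key are hypothetical.

```python
import json


def dump_with_leading_keys(doc, leading=("stac_version", "stac_type")):
    """Serialize `doc` with the `leading` keys first, the rest in original order."""
    ordered = {}
    for key in leading:
        if key in doc:
            ordered[key] = doc[key]
    for key, value in doc.items():
        if key not in ordered:
            ordered[key] = value
    # json.dumps keeps dict insertion order when sort_keys is left at False.
    return json.dumps(ordered)


doc = {"id": "example", "stac_type": "Catalog", "stac_version": "1.0.0-beta.2"}
print(dump_with_leading_keys(doc))
# {"stac_version": "1.0.0-beta.2", "stac_type": "Catalog", "id": "example"}
```

This only helps for files produced by a cooperating writer; any other tool that rewrites the JSON is free to reorder the keys again.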
Well, this is the beta period, so it's the last possible time we have to do something like this. I agree we should make sure it's really a win, and not consider it'll break things. But it's a pretty easy upgrade path. I like the idea of using type:Feature. |
So for validation I don't see a big win. I'd say for validation one could rely on the availability of the type field in items and all other files would be validated against either collection or catalog. Of course it would be a bit more straightforward in validation to have type:Catalog and type:Collection, but I don't feel that it would be worth the breaking change. In the end, a catalog is a collection. What do @lossyrob and @jbants think? What other advantages would it have? |
Having a |
This is outdated with regards to ItemCollections
Maybe we should come up with a best practice for how to detect the files. It seems that you are using a different mechanism than the one I'm using in the Node Validator and there's likely more out there. I feel it should be the same across the board. Something like:
Would that also work for PySTAC, @lossyrob ? |
I've implemented this in the Node Validator, seems to work so far. |
Hi, what if "providers" occur as property ? According to the spec this would mean that the type is a collection and not a catalog ? would be good to allow a provider for catalog also (without having to add extent information which would make it a Catalog object)... |
Then it's likely that clients may not recognize it as it's undefined for catalogs. As it's the "spatio-temporal" asset catalog, collections expect related extents. Why would you not want to provide them? Catalogs are for grouping and the provider metadata should be in the Collection or Item. |
Is not a problem. We can provide "extents" even though they may not be very meaningful. It looks to me that most "catalogs" will then actually become "collections", and very few "catalogs" (which are not also "Collections") will remain. Note that in DCAT, https://www.w3.org/TR/vocab-dcat-2/ the equivalent of provider can be attached to either a dcat:Catalog or dcat:Dataset and not only a dcat:Dataset. |
Maybe better to be discussed somewhere else (not really related to the issue), but could you guide me through your use case? It seems there's something different from how we thought it would be. Feel free to contact me through Gitter... |
One thing we could do here is add |
Extent is required in collections so I'm not sure an 'itemType' field is needed, it just seems like a redundant field. On the other hand, if we want to add it to the API it makes sense here. Wish it wasn't camelCase, I hate the fact that we keep on adding fields that don't follow the rest of the STAC convention. Guess we should have aligned to OGC API earlier on. |
It doesn't add any benefit to the STAC spec itself, it's more an API thing. I think it would be okay if we say an API needs to add itemType if they implement the Features part of the API. It is only required there... |
Well, I was thinking it could be used to help tools / validators to distinguish between collections and catalogs. Since we'll likely use it on the API side anyways. But I'm also fine to just keep it as something API's add. |
How would that help in distinguishing between collections and catalogs? I don't really understand that... |
👋 Sorry to jump on this, but today I was asked how to differentiate Catalog/Collection/Item json files (ok it's @kylebarron who asked me), and I was surprised that I had to loop through the specs to see which keys/objects were in each one to tell what the json file was. I'm def +1 on something like |
I realized that I never really came out as +1 on this. I think I was hoping for a stronger 'yes' from people doing validation. But I think we mostly heard from people who had already written their logic to detect, so didn't see a need for it. But I think it makes sense to make it easy to figure out from one look what type of file you have. Note we are in the process of adding a media type #851, but I think that only helps if the link to the file uses it. |
I'm also +1. I think being able to understand the type of input data without a heuristic would be helpful for applications using STAC. |
I'm more on the -1 side. While it could be helpful, I feel it's too late for it and also somehow tackled by the recent best practice update which says collections are called collection.json, catalogs are called catalog.json, everything else is an item. So if you follow the best practice, it's clear what is inside the file. Also, if you want to support old versions of STAC in tooling, you still need the (simple) heuristic mentioned above. From my side, I wouldn't use it in the node validator or STAC Browser. The heuristic can actually be simplified if you just support 0.8+, which is even more backward compatible than stac_type and would not be much longer in code than an if-elseif-else for stac_type:
|
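For readers without the validator code at hand, a heuristic of this general kind could look like the Python sketch below. It is an illustration only, not the snippet from the comment above: it assumes Items are GeoJSON Features and that Collections carry the required `extent` field mentioned earlier in the thread, and the function name `guess_stac_kind` is invented.

```python
def guess_stac_kind(doc):
    """Guess the STAC entity type of a parsed JSON object by its shape.

    Assumptions (not taken from any official algorithm):
      - Items are GeoJSON Features,
      - ItemCollections are GeoJSON FeatureCollections,
      - Collections have an `extent`, Catalogs do not.
    """
    geojson_type = doc.get("type")
    if geojson_type == "Feature":
        return "Item"
    if geojson_type == "FeatureCollection":
        return "ItemCollection"
    if "extent" in doc:
        return "Collection"
    return "Catalog"


print(guess_stac_kind({"type": "Feature", "stac_version": "1.0.0-beta.2"}))
print(guess_stac_kind({"stac_version": "1.0.0-beta.2", "extent": {}}))
print(guess_stac_kind({"stac_version": "1.0.0-beta.2", "links": []}))
```

Note the fall-through: anything that is not recognised ends up labelled a Catalog, which is exactly the sort of behaviour that becomes fragile once new entity types appear.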
I'm nervous about having only the heuristic to fall back on -- relying on the heuristic means that adding any entity types that are
This is a difficult thing to enforce, and it's also not obvious that enforcement is possible/correct, since it's a best practice rather than a part of the spec. Also, the best practices doc is pretty long, and you'd have to know that that recommendation exists and lives in "Static and Dynamic Catalogs" in order to find it. For regular STAC producers / consumers, checking best practices might be pretty common, but if I'm firing up a python shell with pystac for the first time and mad with power STAC-ifying all the things, I'm probably not going to check to make sure I chose the correct file names. |
Do these features contain the stac_version, too?
I'm trying to understand the issue better. For the example with decoding catalogs/collections:
Sure, that's more an additional indicator.
AFAIK PySTAC generates catalogs that comply to the best practice, so that should be a non-issue for PySTAC. Not sure about other tooling though. |
Of course they do, but versioned heuristics just add an extra if-else to the existing heuristics.
This is already I think pretty difficult. In validation it's not so bad, since you can read the schema dynamically and the concern is only the validity with the spec. In Franklin / stac4s, right now, we only support one STAC version at a time. The reason for this is that our first concern is validity in the context of the program. I don't have a good answer for this question. I have multi-STAC-version provably internally consistent programs filed away in my head under "hard problems" and I've made no progress any time I've tried to think about it other than "well if migration tooling were magic and perfect..."
I think the larger point is that relying on STACs to be compliant not only with a json schema but also a large best practice markdown doc will lead to really brittle tooling. If someone follows the spec to the letter and tosses their resulting STAC through the validator only to have it rejected because they didn't give a file the right name, they're going to be pretty cranky. |
I'm working on an application intending to support STAC from any source, whether a static URL, from an API, JSON pasted in, etc. so I can't depend on file naming, and besides if it's a best practice and not part of the spec, I can't rely on it anyways.
Personally I'd implement the heuristic for legacy data, but prefer a
Agree 💯 % |
Some use cases off the top of my head:
Thanks all for having this discussion! |
Thanks for the discussion. Many good arguments. I don't agree with all of them, but from the participation alone (most active topic) it seems there's a demand, so we should discuss it at the next STAC meeting on Monday and get a decision before RC1 (assigned the milestone now), otherwise it's likely really too late. One thing about the itemType discussed above: We should clearly check what the use is in OGC API, but I guess it's better not to fiddle with it and just make our own, e.g. stac_type. |
+1 on discussing on monday. For those who have not joined before the call is at 8am pacific time, and everyone is welcome. Just ask here / gitter / email and we can add you. |
(added must-have and discussion needed labels so I can track this easier. That is not presupposing that the types are must have, just that we must discuss and decide this definitively before RC1) |
We discussed this on the call. The consensus seemed to be to add 'type' to catalog and collection, and Items already have type=Feature (from geojson). Thus to fully figure out it's a stac item it'll be a two step approach - type=feature and then look if there's a stac_version. This also makes it so every item out there doesn't need to add a new type field. But this will still help with catalog vs collection. Other feedback was to make this well documented, probably add a section to 'best practices' on how to distinguish stac items. |
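The two-step check described here can be sketched in a few lines of Python. This assumes the outcome of the call as summarised above (Catalogs and Collections get a `type` field, Items keep GeoJSON's `type: Feature`); the function name `stac_kind` is made up for the example.

```python
def stac_kind(doc):
    """Classify a parsed JSON object using `type`, with a second step for Items.

    Returns "Item", "Catalog", "Collection", or None when the object
    does not look like one of those STAC entities.
    """
    doc_type = doc.get("type")
    if doc_type == "Feature":
        # Step two: a plain GeoJSON Feature has no stac_version.
        return "Item" if "stac_version" in doc else None
    if doc_type in ("Catalog", "Collection"):
        return doc_type
    return None


print(stac_kind({"type": "Feature", "stac_version": "1.0.0-beta.2"}))
print(stac_kind({"type": "Feature", "geometry": None}))
print(stac_kind({"type": "Collection", "stac_version": "1.0.0-beta.2"}))
```

Documents written before the `type` field existed on Catalogs and Collections would return None here, so tooling that supports older versions would still need a fallback.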
When crawling a catalog, you normally get a good hint that you're looking at an |
PR is up - feedback much appreciated: #971 |
I just realized the solution above doesn't work: The catalog schema should succeed if I use it to validate a Collection as the Collection spec claims to inherit from the Catalog spec. It doesn't work though, as the Catalog JSON Schema would expect a |
Ok, we discussed this on a call today. The latest thought is that we just remove the idea of 'inheritance' from the spec. We would no longer say 'Collection is a Catalog', and we just have type=Collection and type=Catalog. We don't try to make a JSON Schema where one comes from the other, or where we try to validate Collections as Catalogs. They are distinct things. Both are used in creating catalogs, and can be crawled interchangeably by clients. But the clients need to know to expect both. We agreed that the core idea would be that Catalog and Collection share an abstract class, but there's no way to represent that, and it's not worth including that in the spec. Implementations can choose to model things however they want. If they want to model the relationship between the two as an inheritance, they can. Or they can do a mix-in, and they may want to do a mix-in with 'links' for items too, since that construct is shared as well. The small group on the phone did want to get some more feedback on this, if it feels like a reasonable path, and if there's anything we 'lose' by no longer saying that a collection is a catalog. We'd still want them to share all the same fields, it's mostly just changes in how we talk about it in the spec. Thoughts? In particular we'd love to hear from people with STAC libraries / reading STAC's - @lossyrob @emmanuelmathot @kylebarron @jisantuc |
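One way an implementation might model "shared abstract class plus a links mix-in" is sketched below in Python. All class names and the reduced field set are hypothetical and chosen for brevity; this is not how PySTAC or any other library mentioned in this thread is actually structured.

```python
from abc import ABC, abstractmethod


class LinksMixin:
    """Shared 'links' helper; an Item class could reuse it as well."""

    def links_with_rel(self, rel):
        return [link for link in self.links if link.get("rel") == rel]


class CatalogLike(LinksMixin, ABC):
    """Hypothetical abstract base holding the fields both types share."""

    @property
    @abstractmethod
    def stac_type(self):
        """Concrete subclasses provide the value of the `type` field."""

    def __init__(self, id, description, links=None):
        self.id = id
        self.description = description
        self.links = list(links or [])

    def to_dict(self):
        return {"type": self.stac_type, "id": self.id,
                "description": self.description, "links": self.links}


class Catalog(CatalogLike):
    stac_type = "Catalog"


class Collection(CatalogLike):
    stac_type = "Collection"

    def __init__(self, id, description, license, extent, links=None):
        super().__init__(id, description, links)
        self.license = license
        self.extent = extent

    def to_dict(self):
        data = super().to_dict()
        data.update(license=self.license, extent=self.extent)
        return data


col = Collection("c1", "demo", "proprietary", {},
                 links=[{"rel": "item", "href": "./a.json"}])
print(col.to_dict()["type"], isinstance(col, Catalog), isinstance(col, CatalogLike))
```

In this layout a Collection is not a Catalog, yet a crawler can still accept anything that is `CatalogLike` and follow its links the same way.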
Also include server implementers... Franklin, Staccato, arturo api,... :) |
That's a good point too, this would be a breaking change...albeit a relatively minor one since Items are not changed. I also don't have a strong preference on this. My main concern is do we lose anything by not having a Collection strictly be a Catalog anymore? Breaking that inheritance potentially makes things more flexible in the future, as it means Collections aren't tied to changes in Catalog anymore (not a very compelling reason since I hope we do not make such changes in the future). |
To me that's a great outcome. Decoupling collections and catalogs provides some nice future flexibility and also gets us out of the self-imposed constraints in this discussion. This also gets closer to the spec concerning itself with JSON data at rest and leaving technical details to implementers, which I think is a more correct separation of concerns.
I'm confused about what we gain from that strict relationship. We can still mix in the relevant shared JSON schema pieces, so I don't think that's a unique benefit of saying "every collection is a catalog." |
One concern could also be that the types will be used to weaken the Collection field requirements. People could implement a Collection-like structure with type = Catalog. It would validate and people could expect tooling to support it while Collection requirements are not met (e.g. missing license or so). It seems like it weakens the Collection spec. In contrast to @jisantuc, I'm actually more afraid of the decoupling as it may lead to divergence between the specs. At the moment the inheritance seems to simplify implementation as you can just decide to skip implementing Collections if you don't care. They are simply Catalogs themselves and that was also the main reason to make them inherit, I think. By the way, why did we drop the regexp approach? It felt a bit weird, but it doesn't feel as weird as decoupling the specs to me. |
This weekend I wound up more sure about the
Has anyone done this? All of |
Yes, the mentioned interaction led to my comment above as I realized it could weaken the collection spec: people could just choose whatever they like from Collection, then put type = Catalog and say/think it's valid although those additional fields actually don't get validated (maybe what Tom assumed there?). (With the old STAC version used in the example, it would not have helped anyway.) I think most people actually implement Collections, but I'm not sure about "internal" tooling. The reason given above was more me recalling the reason that we initially had to do inheritance. Collections were meant to gracefully work as Catalogs in code that had not seen Collections, as they were not specified yet. |
A small note - going through the spec we have it so Catalog requires at least one child or item link, while Collection doesn't. Does that break the idea that a Collection is a Catalog? We should be clear there. Also I'll try to respond more later, but I think I lean with James - the gitter interaction pushes me towards having the |
Currently in DotNetStac, Collections are implemented as objects inheriting from Catalog. I do not mind decoupling them, but I would keep a common base structure translated into an abstract class.
|
Good point. Yes, it does. |
Probably not something the average user is going to care about, but in exploring STAC files on my local machine, I find it hard sometimes to know which type of STAC file I'm looking at. Or if I'm trying to find the catalog, collection, or item, it might be hard to know from the command line. e.g. from Earth Engine:
This is a bummer! It would be awesome if file was able to return STAC, the version, and one of {Catalog, Collection, Item}. There might be a better way, but this could possibly work for file:
- stac_version be 1st
- stac_type (or some similar name) be 2nd
That runs a bit counter to the idea of the general flexibility of json, which would also be a bummer. If the idea is loosened to just letting readers know the type (like if I can read the json in some other language that doesn't have a stac library), then adding something that gives the type might be useful.
Thoughts?