Several times now I have been able to parse iids, only to have the URL search return nothing. I think I finally understand why.
OK, first let's establish two iid lists, one that works with get_urls_from_esgf and one that doesn't:
Note
All of the intake-esgf parts below run from a PR branch that modifies the code to output the file info without downloading any data. The details do not matter much here; what matters is that intake-esgf actually finds this information, whereas pangeo-forge-esgf does not!
intake-esgf finds info for ALL of the iids in either set!
So what the heck am I doing wrong here? Digging further into the intake-esgf code, I am getting a suspicion:
The general pattern of intake-esgf is to run two kinds of queries against the ESGF REST API:
A 'search' query, which takes facets as input and populates the catalog with facets and, importantly, an id field formatted as "<dataset_instance_id>|<data_node>".
A 'get dataset info' query, which takes these 'id' values from above as input. Importantly, this query DOES NOT USE the full set of facets (it only uses 'variable', but if I read this correctly this is mainly to ensure compatibility with collections other than CMIP6?)
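To make the id format concrete, here is a minimal sketch of splitting such a value into its two parts. The helper name `split_dataset_id` and the example id are hypothetical, made up for illustration only; they are not taken from intake-esgf.

```python
def split_dataset_id(dataset_id):
    """Split '<dataset_instance_id>|<data_node>' into its two parts.

    If no '|' is present, the data node comes back as an empty string.
    """
    instance_id, _, data_node = dataset_id.partition("|")
    return instance_id, data_node


# Placeholder id, not a real ESGF record.
print(split_dataset_id("some.dataset.v1|node.example.org"))
```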
So this represents some sort of 'nested' query. If we try that approach with vanilla requests, we see that it works!
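A rough sketch of that two-step pattern follows. It is not the code from intake-esgf or pangeo-forge-esgf. It uses only the Python standard library rather than requests so that it has no dependencies. The endpoint URL, the function names, the `limit` value and the choice to keep only `HTTPServer` URLs are all my assumptions; adjust them to the index node and access method you actually use.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed search endpoint; substitute the ESGF index node you actually use.
SEARCH_URL = "https://esgf-node.llnl.gov/esg-search/search"


def fetch_docs(params):
    """Send one query to the search endpoint and return its list of records."""
    with urlopen(f"{SEARCH_URL}?{urlencode(params)}", timeout=60) as resp:
        return json.load(resp)["response"]["docs"]


def dataset_query(facets):
    """Step 1: faceted search for Dataset records; only the 'id' field is needed."""
    return {
        "type": "Dataset",
        "format": "application/solr+json",
        "fields": "id",
        "limit": 1000,  # assumed page size; large result sets need paging
        **facets,
    }


def file_query(dataset_id):
    """Step 2: File records for one dataset id, deliberately without other facets."""
    return {
        "type": "File",
        "format": "application/solr+json",
        "dataset_id": dataset_id,
        "fields": "url",
        "limit": 1000,  # assumed page size
    }


def http_urls(docs):
    """Pick the HTTPServer entries out of 'url|mime_type|service' strings."""
    urls = []
    for doc in docs:
        for entry in doc.get("url", []):
            url, _mime, service = entry.split("|")
            if service == "HTTPServer":
                urls.append(url)
    return urls


def nested_url_search(facets, fetch=fetch_docs):
    """Run the dataset search, then one file search per returned dataset id."""
    dataset_ids = [doc["id"] for doc in fetch(dataset_query(facets))]
    return {ds_id: http_urls(fetch(file_query(ds_id))) for ds_id in dataset_ids}
```

Calling `nested_url_search({...})` with a dict of facets would hit the network. The `fetch` argument exists only so that the two steps can be exercised with a stand-in function instead.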
This is honestly pretty damn frustrating, since as far as I can tell nothing about this is mentioned in the API docs. In fact, they state that the 'type' input defines which kind of 'record' (File or Dataset) you get back, then show examples of faceted search and say this:
The “type” facet must be always specified as part of any request to the ESGF search services, so that the appropriate records can be searched and returned. If not specified explicitly, the default value is type=Dataset.
All of this led me to believe that if I specify the identical set of facets and only switch the 'type', I would get the matching set of files or iids, depending on the value I provide. I guess I was wrong 😩.
The most disturbing thing is that some entries clearly work the way I expected (otherwise I would never have gotten any results)...
Well, at least I have a clue how to progress on this for now. Big thanks to @nocollier for all the work on intake-esgf. I would be curious to know where you learned that these 'nested' requests are needed to get all the data (I might just have missed something important).
I am fairly confident that with this knowledge I would be able to refactor large parts of pangeo-forge-esgf.
It might, however, be more practical to add a dependency on intake-esgf, even though the async requests might still be a bit faster.
jbusecke changed the title from "Investigation why intake-esgf has information about urls that" to "Investigation why intake-esgf has information about urls that we dont!" on May 1, 2024
This confirms that we found NO info on any of the first set of iids, and info for all of the second set.
Now let's test this with intake-esgf: