Cannot access/retrieve above offset 10,000 of Congressional Hearings packages #60

Closed
uwieske opened this issue Mar 28, 2020 · 3 comments

Comments

@uwieske

uwieske commented Mar 28, 2020

Hi
The API response indicates that there are about 30,000 packages, but I can only retrieve packages from offset 0 up to 10,000. At offset 10000 I get an error. I have tried some arbitrary offsets above 10,000, and a GET on packages returns an error for those as well.
This is the URL I am calling when the error occurs: https://api.govinfo.gov/collections/CHRG/2010-01-01T00:00:00Z?offset=10000&pageSize=100&api_key=<MY_API_KEY>

@jonquandt
Member

jonquandt commented Mar 29, 2020

@uwieske - 10,000 records is a current limitation of the collections service - see collection update. The collections request is based on the lastModified date of each package, so your request is looking for every CHRG package that has been added or updated since 2010-01-01.

You can try to use a smaller time range for the request, or use our newly available published endpoint to retrieve a list of packages based on official publication date. See #25 for some additional detail on usage, or view the interactive documentation at https://api.govinfo.gov/docs

This provides functionality similar to, but enhanced over, our sitemaps (CHRG index).

All 2010 CHRG packages
https://api.govinfo.gov/published/2010-01-01/2010-12-31?pageSize=100&offset=0&collection=CHRG&api_key=DEMO_KEY

Using this, you can get multiple years at a time.
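
As a rough sketch of how such requests could be assembled, the snippet below builds published URLs in the same shape as the example above and computes the offsets needed to page through a result set. The helper names `published_url` and `page_offsets` are hypothetical, and the total count is assumed to be read from the API response; nothing here calls the service.

```python
from urllib.parse import urlencode

BASE_URL = "https://api.govinfo.gov"


def published_url(start_date, end_date, collection,
                  offset=0, page_size=100, api_key="DEMO_KEY"):
    """Build a published-endpoint URL for a publication date range.

    Hypothetical helper: it only mirrors the URL shape shown in this thread.
    """
    query = urlencode({
        "pageSize": page_size,
        "offset": offset,
        "collection": collection,
        "api_key": api_key,
    })
    return f"{BASE_URL}/published/{start_date}/{end_date}?{query}"


def page_offsets(total_count, page_size=100):
    """Return the offsets needed to page through total_count results."""
    return list(range(0, total_count, page_size))


# One request range covering two publication years, first page only.
print(published_url("2010-01-01", "2011-12-31", "CHRG"))
```

Each offset from `page_offsets` can be passed back into `published_url` to request the next page.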

I hope this is helpful. The majority of hearings are from after 1993, but we have also been making a large corpus of select digitized hearings available - currently they go back to 1957.

@uwieske
Author

uwieske commented Mar 30, 2020

@jonquandt Thank you for the thorough information. I completely missed the note and have only just read it. So requests are capped relative to a given date (corresponding to lastModified), and at most 10,000 records are retrieved from that date onward. If I pick a time interval (start date to end date) that is likely to contain fewer than 10,000 records, could I then shift that interval as a window along the timeline and retrieve more than 10,000 records in total, with almost no overlap?
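
The window idea could be sketched roughly as below. This is an illustration only: `time_windows`, `iso_z`, and `collections_url` are hypothetical helpers, the window size is arbitrary, and the placement of the end date as a second path segment after the start date is an assumption not shown in this thread.

```python
from datetime import datetime, timedelta


def time_windows(start, end, step):
    """Yield consecutive (window_start, window_end) pairs covering start..end."""
    cursor = start
    while cursor < end:
        window_end = min(cursor + step, end)
        yield cursor, window_end
        cursor = window_end


def iso_z(dt):
    """Format a datetime in the timestamp style used in the URLs above."""
    return dt.strftime("%Y-%m-%dT%H:%M:%SZ")


def collections_url(collection, start, end,
                    offset=0, page_size=100, api_key="DEMO_KEY"):
    # Assumption: the end date follows the start date as a path segment.
    return (f"https://api.govinfo.gov/collections/{collection}/"
            f"{iso_z(start)}/{iso_z(end)}"
            f"?offset={offset}&pageSize={page_size}&api_key={api_key}")


# Split the first quarter of 2010 into 30-day lastModified windows.
for w_start, w_end in time_windows(datetime(2010, 1, 1),
                                   datetime(2010, 4, 1),
                                   timedelta(days=30)):
    print(collections_url("CHRG", w_start, w_end))
```

Adjacent windows share a boundary timestamp, so a package modified exactly on a boundary might appear twice; deduplicating by package identifier would cover that case.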

I will try the published endpoint too.
Thank you very much!

@jonquandt
Member

@uwieske - yes, that seems reasonable, though I would suggest that published may make it easier to get a predictable number of results, since you can limit by official publication date (dateIssued).

The lastModified values for the packages are the driver for the collections endpoint startDate/endDate. Since the lastModified value is updated whenever a package is republished for a metadata edit, replacement version, or sometimes reindexing purposes, there may be a large number of updates across a given collection.

published has a modifiedSince parameter that returns only packages modified since the given time - similar to setting startDate in the collections endpoint, but further filtered by publication date.

e.g.
https://api.govinfo.gov/published/2019-01-01/2019-12-31?pageSize=100&offset=0&collection=CHRG&api_key=DEMO_KEY&modifiedSince=2020-03-01T00:00:00Z - 225 results

vs. no modifiedSince:
https://api.govinfo.gov/published/2019-01-01/2019-12-31?pageSize=100&offset=0&collection=CHRG&api_key=DEMO_KEY - 738 results
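
A minimal sketch of toggling that filter when building the query string follows. `published_query` is a hypothetical helper; only the parameter names seen in the URLs above are used, and `safe=":"` simply keeps the timestamp colons unescaped so the output matches those URLs.

```python
from urllib.parse import urlencode


def published_query(collection, modified_since=None,
                    offset=0, page_size=100, api_key="DEMO_KEY"):
    """Build the query string for a published request.

    modifiedSince is appended only when a timestamp is supplied.
    """
    params = {
        "pageSize": page_size,
        "offset": offset,
        "collection": collection,
        "api_key": api_key,
    }
    if modified_since is not None:
        params["modifiedSince"] = modified_since
    # Keep ':' literal so timestamps read the same as in the examples above.
    return urlencode(params, safe=":")


base = "https://api.govinfo.gov/published/2019-01-01/2019-12-31"
print(f"{base}?{published_query('CHRG', '2020-03-01T00:00:00Z')}")
print(f"{base}?{published_query('CHRG')}")
```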

You could do something similar with the collections endpoint by limiting to a given congress - in this case:
https://api.govinfo.gov/collections/CHRG/2020-03-01T00:00:00Z?offset=0&pageSize=100&congress=116&api_key=DEMO_KEY - 232 packages, including 7 CHRG packages from 2020

So, either way will work -- just depends on your preference for calling.

I hope that helps.
