Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

finisher: Set cached_*_count fields #370

Closed
jpmckinney opened this issue Jun 8, 2022 · 7 comments
Closed

finisher: Set cached_*_count fields #370

jpmckinney opened this issue Jun 8, 2022 · 7 comments
Labels
feature Relating to loading data from the web API or CLI command
Milestone

Comments

@jpmckinney
Copy link
Member

jpmckinney commented Jun 8, 2022

Workaround:

SELECT COUNT(*) FROM release WHERE collection_id IN :collection_ids;
SELECT COUNT(*) FROM record WHERE collection_id IN :collection_ids;
SELECT COUNT(*) FROM compiled_release WHERE collection_id IN :collection_ids;
@jpmckinney jpmckinney added the feature Relating to loading data from the web API or CLI command label Jun 8, 2022
@jpmckinney jpmckinney added this to the Priority milestone Jun 8, 2022
@jpmckinney
Copy link
Member Author

@yolile I forget what we use these for. Are they just a quick "sense check" as to whether the collection succeeded? If so, we have other ways of determining that.

@yolile
Copy link
Member

yolile commented Jun 8, 2022

The use case is described in #183, but maybe we now have faster queries to calculate these numbers?

@jpmckinney
Copy link
Member Author

I ran:

time psql "dbname=ocdskingfisherprocess user=jmckinney host=postgres-readonly.kingfisher.open-contracting.org sslmode=require" -c '\t' \
    -c 'SELECT COUNT(*), collection_id FROM compiled_release GROUP BY collection_id'

and it completes in 25s. Having run it once, it now completes in 5s (caching, I assume). I can do a JOIN (again, maybe caching is helping), and it returns in 7s.

time psql "dbname=ocdskingfisherprocess user=jmckinney host=postgres-readonly.kingfisher.open-contracting.org sslmode=require" -c '\t' \
    -c 'SELECT COUNT(*), collection_id FROM collection c INNER JOIN compiled_release cr ON c.id = cr.collection_id GROUP BY collection_id'

Note that in a real-world scenario, you'd probably limit by source_id, so the query time would be shorter.

At the time #183 was resolved, there was no collection_id on release, record, and compiled_release, so the multiple joins through intermediate tables were very slow (minutes, and the DB was smaller back then).

These days, I think analysts mostly use https://kingfisher-colab.readthedocs.io/en/latest/#ocdskingfishercolab.list_collections, which returns the cached counts for all 3 tables (release, record, compiled_release).

Having the cached counts offers efficiency in terms of:

  1. 1 query instead of 3
  2. milliseconds instead of seconds

I don't think (2) is that important. Seconds is fine for an operation that is performed rarely, e.g. once per feedback report.

For (1), if the queries are being done by list_collections, then the user is not really concerned about the number of queries. However, I don't know if analysts are directly querying the collection table.

Furthermore, do analysts care about the numbers in cached_releases_count and cached_records_count? The new version has data_type which stores the type of input (releases or records); I don't know if the number is useful. cached_compiled_releases_count is the main indicator of the dataset's size for analysis purposes. If we only care about one of these numbers, then 1 query is all that is needed after all.

@yolile
Copy link
Member

yolile commented Jun 9, 2022

I don't know if the number is useful.

Having the number of releases and records or compiled releases is useful for a quick check if the publisher implements change history or not (eg, if number of releases = number of compiled releases them there is no change history)

@jpmckinney
Copy link
Member Author

Okay, I guess we probably want to keep these. They are pretty easy to implement in the finisher worker.

@fppenna
Copy link
Member

fppenna commented Jul 31, 2023

In addition to what Yohanna mentioned before, it would be useful to easily review whether the publisher is including the same number of processes in its records, releases, and bulk publications

@jpmckinney
Copy link
Member Author

Indeed - for now you'll have to run the alternative queries.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Relating to loading data from the web API or CLI command
Projects
None yet
Development

No branches or pull requests

3 participants