finisher: Set cached_*_count fields #370

jpmckinney · 2022-06-08T19:04:35Z

Workaround:

SELECT COUNT(*) FROM release WHERE collection_id IN :collection_ids;
SELECT COUNT(*) FROM record WHERE collection_id IN :collection_ids;
SELECT COUNT(*) FROM compiled_release WHERE collection_id IN :collection_ids;

jpmckinney · 2022-06-08T19:13:25Z

@yolile I forget what we use these for. Are they just a quick "sense check" as to whether the collection succeeded? If so, we have other ways of determining that.

yolile · 2022-06-08T19:39:45Z

The use case is described in #183, but maybe we now have faster queries to calculate these numbers?

jpmckinney · 2022-06-08T20:28:01Z

I ran:

time psql "dbname=ocdskingfisherprocess user=jmckinney host=postgres-readonly.kingfisher.open-contracting.org sslmode=require" -c '\t' \
    -c 'SELECT COUNT(*), collection_id FROM compiled_release GROUP BY collection_id'

and it completes in 25s. Having run it once, it now completes in 5s (caching, I assume). I can do a JOIN (again, maybe caching is helping), and it returns in 7s.

time psql "dbname=ocdskingfisherprocess user=jmckinney host=postgres-readonly.kingfisher.open-contracting.org sslmode=require" -c '\t' \
    -c 'SELECT COUNT(*), collection_id FROM collection c INNER JOIN compiled_release cr ON c.id = cr.collection_id GROUP BY collection_id'

Note that in a real-world scenario, you'd probably limit by source_id, so the query time would be shorter.

At the time #183 was resolved, there was no collection_id on release, record, and compiled_release, so the multiple joins through intermediate tables were very slow (minutes, and the DB was smaller back then).

These days, I think analysts mostly use https://kingfisher-colab.readthedocs.io/en/latest/#ocdskingfishercolab.list_collections, which returns the cached counts for all 3 tables (release, record, compiled_release).

Having the cached counts offers efficiency in terms of:

1 query instead of 3
milliseconds instead of seconds

I don't think (2) is that important. Seconds is fine for an operation that is performed rarely, e.g. once per feedback report.

For (1), if the queries are being done by list_collections, then the user is not really concerned about the number of queries. However, I don't know if analysts are directly querying the collection table.

Furthermore, do analysts care about the numbers in cached_releases_count and cached_records_count? The new version has data_type which stores the type of input (releases or records); I don't know if the number is useful. cached_compiled_releases_count is the main indicator of the dataset's size for analysis purposes. If we only care about one of these numbers, then 1 query is all that is needed after all.

yolile · 2022-06-09T00:52:49Z

I don't know if the number is useful.

Having the number of releases and records or compiled releases is useful for a quick check if the publisher implements change history or not (eg, if number of releases = number of compiled releases them there is no change history)

jpmckinney · 2022-06-09T15:52:11Z

Okay, I guess we probably want to keep these. They are pretty easy to implement in the finisher worker.

fppenna · 2023-07-31T18:10:54Z

In addition to what Yohanna mentioned before, it would be useful to easily review whether the publisher is including the same number of processes in its records, releases, and bulk publications

jpmckinney · 2023-07-31T18:17:43Z

Indeed - for now you'll have to run the alternative queries.

jpmckinney added the feature Relating to loading data from the web API or CLI command label Jun 8, 2022

jpmckinney mentioned this issue Jun 8, 2022

Preserve existing behaviors in Django branch #274

Closed

22 tasks

jpmckinney added this to the Priority milestone Jun 8, 2022

jpmckinney modified the milestones: Internal deployment, Priority Jan 25, 2023

jpmckinney modified the milestones: Priority, Internal server May 27, 2023

jpmckinney mentioned this issue May 27, 2023

Rebuild Kingfisher server open-contracting/deploy#402

Closed

11 tasks

jpmckinney modified the milestones: Data support server, Priority Jun 14, 2023

jpmckinney closed this as completed in 9a741c5 Apr 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

finisher: Set cached_*_count fields #370

finisher: Set cached_*_count fields #370

jpmckinney commented Jun 8, 2022 •

edited

Loading

jpmckinney commented Jun 8, 2022

yolile commented Jun 8, 2022

jpmckinney commented Jun 8, 2022

yolile commented Jun 9, 2022

jpmckinney commented Jun 9, 2022

fppenna commented Jul 31, 2023

jpmckinney commented Jul 31, 2023

finisher: Set cached_*_count fields #370

finisher: Set cached_*_count fields #370

Comments

jpmckinney commented Jun 8, 2022 • edited Loading

jpmckinney commented Jun 8, 2022

yolile commented Jun 8, 2022

jpmckinney commented Jun 8, 2022

yolile commented Jun 9, 2022

jpmckinney commented Jun 9, 2022

fppenna commented Jul 31, 2023

jpmckinney commented Jul 31, 2023

jpmckinney commented Jun 8, 2022 •

edited

Loading