Add execution rate metering to the command engine #9356

Closed
landreev opened this issue Feb 1, 2023 · 10 comments · Fixed by #10211 or #10407
Labels: D: FixRateLimitingBehaviors (address this collection of issues that impact rate limiting), Size: 3 (a percentage of a sprint; 2.1 hours)

@landreev (Contributor) commented Feb 1, 2023

The proposed idea is to modify the command engine so as to make it possible to keep accurate counts of executed commands per specific time interval. We would want to count the totals for all commands, for specific commands (and, possibly, classes of commands?), and the counts and rates for individual users.

Note that this can be achieved without adding any new data structures (the values can be obtained by counting the entries in the ActionLogRecord table). However, that table is quite unwieldy on any busy installation, so we almost certainly want to add some efficient way of caching and updating the counts in the database.
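
A minimal sketch of what such a cached counter could look like, purely illustrative (the class and method names here are hypothetical, and a real implementation would also need to flush the counts to the database and expire old buckets):

```java
// Hypothetical sketch of a cached per-interval command counter; names are
// illustrative, not the actual Dataverse implementation.
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class CommandExecutionMeter {

    // key: "<commandName>|<userIdentifier>|<epochMinute>"
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    /** Called by the command engine every time a command is submitted. */
    public void record(String commandName, String userIdentifier) {
        long epochMinute = Instant.now().getEpochSecond() / 60;
        String key = commandName + "|" + userIdentifier + "|" + epochMinute;
        counts.computeIfAbsent(key, k -> new LongAdder()).increment();
    }

    /** Executions of a given command by a given user in the current minute. */
    public long countThisMinute(String commandName, String userIdentifier) {
        long epochMinute = Instant.now().getEpochSecond() / 60;
        LongAdder adder = counts.get(commandName + "|" + userIdentifier + "|" + epochMinute);
        return adder == null ? 0 : adder.sum();
    }
}
```

Counts kept in memory like this would periodically need to be written back to a table so they survive restarts and can be shared between servers.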

Then we want to add a system of configurable rate limits for specific commands, in a way that would allow setting different limits for different users/groups of users, etc.
In addition to limits on the number of executed commands (for example, a certain number of UpdateDatasetVersionCommand executions per minute allowed for an otherwise unprivileged logged-in user), the Action Log also records how much time it took to execute each command, so we may want to consider limiting use by that measure too (?). For example, if a specific user keeps making edits, but for whatever reason the UpdateDatasetVersionCommand on their dataset is especially CPU- and time-consuming, maybe that could trigger some red flags too.

The simplest use of this functionality would be to set a rate limit on, say, Get*Version commands for anonymous (guest) users; that would address the somewhat common case of a scripted crawler plowing through our holdings. Past a certain number of calls in too short a period of time, the API would start giving them "too busy, try again later".

More care will need to be taken to make our pages communicate this "try again later" message to UI users. We don't want anyone to lose a page's worth of edited metadata they are trying to save, etc.
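
Building on the counter sketched above, a rough sketch of what a per-user, per-command gate in the engine's submit path could look like (all names here are hypothetical; the eventual implementation may well look different):

```java
// Rough sketch of a per-user, per-command rate-limit gate; the limit map keys
// and the use of IllegalStateException are placeholders, not the real design.
import java.util.Map;

public class RateLimitGate {

    private final CommandExecutionMeter meter;        // from the sketch above
    private final Map<String, Long> perMinuteLimits;  // e.g. "GetLatestPublishedDatasetVersionCommand|:guest" -> 30

    public RateLimitGate(CommandExecutionMeter meter, Map<String, Long> perMinuteLimits) {
        this.meter = meter;
        this.perMinuteLimits = perMinuteLimits;
    }

    /** Throws if this user has already used up their per-minute allowance for the command. */
    public void checkOrDeny(String commandName, String userIdentifier) {
        Long limit = perMinuteLimits.get(commandName + "|" + userIdentifier);
        if (limit != null && meter.countThisMinute(commandName, userIdentifier) >= limit) {
            // The API layer would translate this into HTTP 429 "too busy, try again later";
            // the UI would need to surface the same message without discarding unsaved edits.
            throw new IllegalStateException("Rate limit exceeded for " + commandName);
        }
        meter.record(commandName, userIdentifier);
    }
}
```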

@poikilotherm (Contributor) commented Feb 2, 2023

How about MicroProfile Metrics?

https://download.eclipse.org/microprofile/microprofile-metrics-5.0.0/microprofile-metrics-spec-5.0.0.html#_counted
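
For reference, a minimal sketch of what that could look like if the engine's submit path were a CDI-managed method (the class and metric names here are made up; assumes MicroProfile Metrics 5.x on Jakarta EE 10):

```java
// Sketch of MicroProfile Metrics annotations on a hypothetical submit method;
// the interceptors only apply if the bean is CDI-managed.
import jakarta.enterprise.context.ApplicationScoped;
import org.eclipse.microprofile.metrics.annotation.Counted;
import org.eclipse.microprofile.metrics.annotation.Timed;

@ApplicationScoped
public class MeteredCommandEngine {

    @Counted(name = "commands_submitted", absolute = true)
    @Timed(name = "command_execution_time", absolute = true)
    public Object submit(Object command) {
        // ... delegate to the real command engine ...
        return null;
    }
}
```

One caveat: @Counted/@Timed metrics are per annotated method, so per-command or per-user breakdowns would still require tags or programmatic registration against the MetricRegistry.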

@landreev (Contributor, Author) commented Feb 2, 2023

(This was opened as a follow-up to a meeting with @siacus and @scolapasta, and as part of the "rate limiting logic" spike.)

@landreev (Contributor, Author) commented Feb 2, 2023

There's some interesting information in that ActionLogRecord table.

@landreev (Contributor, Author) commented Feb 6, 2023

Some quick numbers extracted from the ActionLogRecord table for the month of January 2023.
It needs to be pointed out first that the execution of commands as recorded in this table must constitute only a small fraction of the overall system load. Specifically, the total execution time recorded for the month was ~890,000 seconds, i.e. something along the lines of 10 days of execution time - but that's from 2 servers with 16 CPUs each (!!). Granted, the CPUs are obviously not spending 100% of their time doing anything useful, so the true ratio is not exactly known. (But it may/should be possible to get that info from the Payara diagnostics.)
A total of ~7.45M command executions were recorded. Almost all of that (~7.44M) comes from the execution of the top 7 commands (out of 52 different commands):

```
3090384 GetLatestPublishedDatasetVersionCommand
3003763 GetPrivateUrlCommand
1001020 CreateGuestbookResponseCommand
 120995 GetLatestAccessibleDatasetVersionCommand
 116116 GetDatasetCommand
  72352 DeleteDataFileCommand
  39837 UpdateDatasetVersionCommand
```

These numbers appear to be somewhat predictable, aside from the "get private url" one maybe (it may be executed somewhat unnecessarily on every load of the dataset page), although it does not appear to be particularly time-consuming.

When counting by execution time, almost all of it was spent executing these top 7 (total seconds per command):

```
UpdateDatasetVersionCommand 711323.074
DestroyDatasetCommand 51769.637
DeleteDataFileCommand 50938.873
CreateGuestbookResponseCommand 41263.791
FinalizeDatasetPublicationCommand 10254.528
PublishDatasetCommand 5841.523
GetPrivateUrlCommand 4356.787
```

Again, I'd say these numbers are fairly predictable/expected, specifically the part about UpdateDatasetVersionCommand being the top hog. The destroy and delete commands that follow are somewhat expected too; an interesting data point for sure.
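
A guess at the kind of aggregation behind the numbers above; the table and column names used here (actionlogrecord, actiontype, actionsubtype, starttime, endtime) are assumptions about the schema and may not match it exactly:

```java
// Hypothetical aggregation over the action log; schema names are assumptions.
import jakarta.persistence.EntityManager;
import java.util.List;

public class ActionLogStats {

    @SuppressWarnings("unchecked")
    public static List<Object[]> commandTotalsForJanuary2023(EntityManager em) {
        return em.createNativeQuery(
            "SELECT actionsubtype, COUNT(*) AS executions, "
          + "       SUM(EXTRACT(EPOCH FROM (endtime - starttime))) AS total_seconds "
          + "FROM actionlogrecord "
          + "WHERE actiontype = 'Command' "
          + "  AND starttime >= '2023-01-01' AND starttime < '2023-02-01' "
          + "GROUP BY actionsubtype "
          + "ORDER BY executions DESC")
          .getResultList();
    }
}
```
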
(will post more in separate comments)

@landreev (Contributor, Author) commented Feb 6, 2023

An obvious quick check was to separately count the commands executed by :guest, a.k.a. the anonymous user. By sheer count, almost all of the commands are in fact executed by guest (~7.28M out of 7.45M). ~7.27M of those are these top 5 (out of only 12 different commands executed by guest):

```
3079215 GetLatestPublishedDatasetVersionCommand
2976447 GetPrivateUrlCommand
 985736 CreateGuestbookResponseCommand
 117295 GetLatestAccessibleDatasetVersionCommand
 113720 GetDatasetCommand
```

These, however, constitute only a small fraction of the total execution time. Again, a perfectly expected result, seeing how it's the update and delete commands that are the highest consumers. The total number of seconds spent executing anonymous commands was around 52K, i.e. under 15 hours total; the full list (total seconds per command) is below:

```
CreateGuestbookResponseCommand 41038.013
GetLatestPublishedDatasetVersionCommand 4167.891
GetPrivateUrlCommand 4097.658
GetDatasetCommand 1236.595
GetLatestAccessibleDatasetVersionCommand 827.454
ListVersionsCommand 31.144
GetSpecificPublishedDatasetVersionCommand 25.766
ListDataverseContentCommand 11.812
GetDataFileCommand 0.120
GetDataverseCommand 0.092
DestroyDatasetCommand 0.007
GetUserTracesCommand 0.000
```

Note that the bulk of the above is spent running the CreateGuestbookResponseCommand; and that almost all of our downloads are in fact anonymous (but we already knew that).

(You may be wondering what's up with that lone DestroyDatasetCommand executed by the guest user - there's only one, yes - at the cost of 7 milliseconds. Most likely, an unsuccessful attempt by an admin - possibly me - to destroy a dataset via the API without properly supplying the API key on the first attempt.)
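
For what it's worth, the per-user split above could come from the same kind of (assumed) query, grouped by user instead of command; this would slot into the ActionLogStats sketch from the earlier comment, and useridentifier is again a guess at the column name:

```java
// Another assumed cut of the same table: totals per user, which is how the
// :guest (anonymous) share could be derived.
@SuppressWarnings("unchecked")
public static List<Object[]> perUserTotalsForJanuary2023(EntityManager em) {
    return em.createNativeQuery(
        "SELECT useridentifier, COUNT(*) AS executions, "
      + "       SUM(EXTRACT(EPOCH FROM (endtime - starttime))) AS total_seconds "
      + "FROM actionlogrecord "
      + "WHERE actiontype = 'Command' "
      + "  AND starttime >= '2023-01-01' AND starttime < '2023-02-01' "
      + "GROUP BY useridentifier "
      + "ORDER BY total_seconds DESC")
      .getResultList();
}
```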

@landreev (Contributor, Author) commented Feb 6, 2023

Once again, let's be careful about interpreting these numbers. They do NOT by any means indicate that it doesn't cost us much to serve read-only content to anonymous users. Again, these are only the commands. There's plenty of evidence that the search API/Solr becomes the performance bottleneck during aggressive robot crawls, and none of that heavy lifting goes through the command engine or gets recorded there - and that's only one example.

@mreekie moved this from Dataverse Team (Gustavo) to ▶SPRINT- NEEDS SIZING in IQSS Dataverse Project on Feb 13, 2023
@mreekie added the D: FixRateLimitingBehaviors label (address this collection of issues that impact rate limiting) on Feb 28, 2023
@pdurbin added the Size: 80 label (a percentage of a sprint; 56 hours) on Mar 9, 2023
@pdurbin (Member) commented Mar 9, 2023

Unclear what the objective is. Work on design. Sized at 80. Some actual coding can certainly happen. We can add an outline.

@mreekie moved this from SPRINT- NEEDS SIZING to SPRINT READY in IQSS Dataverse Project on Mar 13, 2023
@cmbz added this to the 6.1 milestone on Sep 13, 2023
@cmbz commented Sep 13, 2023

2023/09/13: Note, this issue is paired with another: #9409.

@cmbz modified the milestones: 6.1, 6.2 on Nov 15, 2023
@stevenwinship self-assigned this on Nov 27, 2023
@scolapasta moved this from SPRINT READY to Clear of the Backlog in IQSS Dataverse Project on Dec 6, 2023
@stevenwinship (Contributor) commented:

If I understand this correctly, there are two asks here.
The first ask is to aggregate the data from the Action Log table into a rollup table with a set time interval (e.g., 1 hour). This could be code triggered by a cron job a few times a day. The aggregated data could be grouped/filtered by user and command, and then used for various queries that report metrics by user, command, or groups of commands. I do not know who needs this or how it will be used.

The second ask is rate limiting. This feature does not need the metrics/aggregated data from the first task; it would only need to cache the command API calls, possibly using bucket4j or by setting up a global Redis cache. It would be nice to know whether other caching or rate limiting will be needed in the future, to help decide which caching system to use.
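
As an illustration of the bucket4j option (not the implementation that was eventually merged), a minimal single-JVM sketch, assuming the bucket4j 8.x API:

```java
// Minimal bucket4j sketch with in-memory buckets; a setup shared across servers
// (e.g. backed by Redis) would need one of bucket4j's external-store integrations.
import io.github.bucket4j.Bandwidth;
import io.github.bucket4j.Bucket;
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CommandRateLimiter {

    // One bucket per "<userIdentifier>|<commandName>" key.
    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

    private Bucket newBucket() {
        // Example policy: 30 calls per minute.
        return Bucket.builder()
                .addLimit(Bandwidth.simple(30, Duration.ofMinutes(1)))
                .build();
    }

    /** Returns false when the caller should get a "too busy, try again later" response. */
    public boolean tryAcquire(String userIdentifier, String commandName) {
        Bucket bucket = buckets.computeIfAbsent(userIdentifier + "|" + commandName, k -> newBucket());
        return bucket.tryConsume(1);
    }
}
```

If the limits have to be shared across multiple application servers, a distributed back end would be needed; bucket4j provides integrations for that, but they add operational complexity.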

@cmbz moved this from Clear of the Backlog to In Progress 💻 in IQSS Dataverse Project on Dec 14, 2023
@cmbz removed the Size: 80 label (a percentage of a sprint; 56 hours) on Jan 3, 2024
@cmbz added the Size: 3 label (a percentage of a sprint; 2.1 hours) on Jan 3, 2024
@cmbz commented Jan 3, 2024

2024/01/03: Resized during kickoff
