Add execution rate metering to the command engine #9356

Closed
landreev opened this issue Feb 1, 2023 · 10 comments · Fixed by #10211 or #10407
Labels: D: FixRateLimitingBehaviors (address this collection of issues that impact rate limiting), Size: 3 (a percentage of a sprint; 2.1 hours)

@landreev (Contributor) commented Feb 1, 2023

The proposed idea is to modify the command engine so as to make it possible to keep accurate counts of executed commands per specific time interval. We would want to count the totals for all commands, for specific commands (and, possibly, classes of commands?), and the counts and rates for individual users.

Note that this can be achieved without adding any new data structures (the values can be obtained by counting the entries in the ActionLogRecord table). However, that table is quite unwieldy on any busy installation, so we almost certainly want to add some efficient way of caching and updating the counts in the database.
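
A minimal sketch of what such a cached counter could look like, purely illustrative (the class and method names here are hypothetical, and a real implementation would also need to flush the counts to the database and expire old buckets):

```java
// Hypothetical sketch of a cached per-interval command counter; names are
// illustrative, not the actual Dataverse implementation.
import java.time.Instant;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

public class CommandExecutionMeter {

    // key: "<commandName>|<userIdentifier>|<epochMinute>"
    private final Map<String, LongAdder> counts = new ConcurrentHashMap<>();

    /** Called by the command engine every time a command is submitted. */
    public void record(String commandName, String userIdentifier) {
        long epochMinute = Instant.now().getEpochSecond() / 60;
        String key = commandName + "|" + userIdentifier + "|" + epochMinute;
        counts.computeIfAbsent(key, k -> new LongAdder()).increment();
    }

    /** Executions of a given command by a given user in the current minute. */
    public long countThisMinute(String commandName, String userIdentifier) {
        long epochMinute = Instant.now().getEpochSecond() / 60;
        LongAdder adder = counts.get(commandName + "|" + userIdentifier + "|" + epochMinute);
        return adder == null ? 0 : adder.sum();
    }
}
```

Counts kept in memory like this would periodically need to be written back to a table so they survive restarts and can be shared between servers.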

Then we want to add a system of configurable rate limits for specific commands, in a way that would allow setting different limits for different users/groups of users, etc.
In addition to limits on the number of executed commands (for example, a certain number of UpdateDatasetVersionCommand executions per minute allowed for an otherwise unprivileged logged-in user), the Action Log also records how much time it took to execute each command, so we may want to consider limiting use by that measure too (?). For example, if a specific user keeps making edits, but for whatever reason the UpdateDatasetVersionCommand on their dataset is especially CPU- and time-consuming, maybe that could trigger some red flags too.

The simplest use of this functionality would be to set a rate limit on, say, Get*Version commands for anonymous (guest) users; that would address the somewhat common case of a scripted crawler plowing through our holdings. Past a certain number of calls in too short a period of time, the API would start giving them "too busy, try again later".

More care will need to be taken to make our pages communicate this "try again later" message to UI users. We don't want anyone to lose a page's worth of edited metadata they are trying to save, etc.
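
Building on the counter sketched above, a rough sketch of what a per-user, per-command gate in the engine's submit path could look like (all names here are hypothetical; the eventual implementation may well look different):

```java
// Rough sketch of a per-user, per-command rate-limit gate; the limit map keys
// and the use of IllegalStateException are placeholders, not the real design.
import java.util.Map;

public class RateLimitGate {

    private final CommandExecutionMeter meter;        // from the sketch above
    private final Map<String, Long> perMinuteLimits;  // e.g. "GetLatestPublishedDatasetVersionCommand|:guest" -> 30

    public RateLimitGate(CommandExecutionMeter meter, Map<String, Long> perMinuteLimits) {
        this.meter = meter;
        this.perMinuteLimits = perMinuteLimits;
    }

    /** Throws if this user has already used up their per-minute allowance for the command. */
    public void checkOrDeny(String commandName, String userIdentifier) {
        Long limit = perMinuteLimits.get(commandName + "|" + userIdentifier);
        if (limit != null && meter.countThisMinute(commandName, userIdentifier) >= limit) {
            // The API layer would translate this into HTTP 429 "too busy, try again later";
            // the UI would need to surface the same message without discarding unsaved edits.
            throw new IllegalStateException("Rate limit exceeded for " + commandName);
        }
        meter.record(commandName, userIdentifier);
    }
}
```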

@poikilotherm (Contributor) commented Feb 2, 2023

How about MicroProfile Metrics?

https://download.eclipse.org/microprofile/microprofile-metrics-5.0.0/microprofile-metrics-spec-5.0.0.html#_counted
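
For reference, a minimal sketch of what that could look like if the engine's submit path were a CDI-managed method (the class and metric names here are made up; assumes MicroProfile Metrics 5.x on Jakarta EE 10):

```java
// Sketch of MicroProfile Metrics annotations on a hypothetical submit method;
// the interceptors only apply if the bean is CDI-managed.
import jakarta.enterprise.context.ApplicationScoped;
import org.eclipse.microprofile.metrics.annotation.Counted;
import org.eclipse.microprofile.metrics.annotation.Timed;

@ApplicationScoped
public class MeteredCommandEngine {

    @Counted(name = "commands_submitted", absolute = true)
    @Timed(name = "command_execution_time", absolute = true)
    public Object submit(Object command) {
        // ... delegate to the real command engine ...
        return null;
    }
}
```

One caveat: @Counted/@Timed metrics are per annotated method, so per-command or per-user breakdowns would still require tags or programmatic registration against the MetricRegistry.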

@landreev (Contributor, Author) commented Feb 2, 2023

(This was opened as a follow-up to a meeting with @siacus and @scolapasta, and as part of the "rate limiting logic" spike.)

@landreev (Contributor, Author) commented Feb 2, 2023

There's some interesting information in that ActionLogRecord table.

@landreev (Contributor, Author) commented Feb 6, 2023

Some quick numbers extracted from the ActionLogRecord table for the month of January 2023.
It needs to be pointed out first that the execution of commands as recorded in this table must constitute only a small fraction of the overall system load. Specifically, the total execution time recorded for the month was ~890,000 seconds, i.e. something along the lines of 10 days of execution time - but that's from 2 servers with 16 CPUs each (!!). Granted, the CPUs are obviously not spending 100% of their time doing anything useful, so the true ratio is not exactly known. (But it may/should be possible to get that info from the Payara diagnostics.)
A total of ~7.45M command executions were recorded. Almost all of that (~7.44M) comes from the execution of the top 7 commands (out of 52 different commands):

```
3090384 GetLatestPublishedDatasetVersionCommand
3003763 GetPrivateUrlCommand
1001020 CreateGuestbookResponseCommand
 120995 GetLatestAccessibleDatasetVersionCommand
 116116 GetDatasetCommand
  72352 DeleteDataFileCommand
  39837 UpdateDatasetVersionCommand
```

These numbers appear to be somewhat predictable, aside from the "get private url" one maybe (it may be executed somewhat unnecessarily on every load of the dataset page), although it does not appear to be particularly time-consuming.

When counting by execution time, almost all of it was spent executing these top 7 (total seconds per command):

```
UpdateDatasetVersionCommand 711323.074
DestroyDatasetCommand 51769.637
DeleteDataFileCommand 50938.873
CreateGuestbookResponseCommand 41263.791
FinalizeDatasetPublicationCommand 10254.528
PublishDatasetCommand 5841.523
GetPrivateUrlCommand 4356.787
```

Again, I'd say these numbers are fairly predictable/expected, specifically the part about UpdateDatasetVersionCommand being the top hog. The destroy and delete commands that follow are somewhat expected too; an interesting data point for sure.
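
A guess at the kind of aggregation behind the numbers above; the table and column names used here (actionlogrecord, actiontype, actionsubtype, starttime, endtime) are assumptions about the schema and may not match it exactly:

```java
// Hypothetical aggregation over the action log; schema names are assumptions.
import jakarta.persistence.EntityManager;
import java.util.List;

public class ActionLogStats {

    @SuppressWarnings("unchecked")
    public static List<Object[]> commandTotalsForJanuary2023(EntityManager em) {
        return em.createNativeQuery(
            "SELECT actionsubtype, COUNT(*) AS executions, "
          + "       SUM(EXTRACT(EPOCH FROM (endtime - starttime))) AS total_seconds "
          + "FROM actionlogrecord "
          + "WHERE actiontype = 'Command' "
          + "  AND starttime >= '2023-01-01' AND starttime < '2023-02-01' "
          + "GROUP BY actionsubtype "
          + "ORDER BY executions DESC")
          .getResultList();
    }
}
```
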
(will post more in separate comments)

@landreev (Contributor, Author) commented Feb 6, 2023

An obvious quick check was to separately count the commands executed by :guest, a.k.a. the anonymous user. By sheer count, almost all of the commands are in fact executed by guest (~7.28M out of 7.45M). ~7.27M of those are these top 5 (out of only 12 different commands executed by guest):

```
3079215 GetLatestPublishedDatasetVersionCommand
2976447 GetPrivateUrlCommand
 985736 CreateGuestbookResponseCommand
 117295 GetLatestAccessibleDatasetVersionCommand
 113720 GetDatasetCommand
```

These, however, constitute only a small fraction of the total execution time. Again, a perfectly expected result, seeing how it's the update and delete commands that are the highest consumers. The total number of seconds spent executing anonymous commands was around 52K, i.e. under 15 hours total; the full list (total seconds per command) is below:

```
CreateGuestbookResponseCommand 41038.013
GetLatestPublishedDatasetVersionCommand 4167.891
GetPrivateUrlCommand 4097.658
GetDatasetCommand 1236.595
GetLatestAccessibleDatasetVersionCommand 827.454
ListVersionsCommand 31.144
GetSpecificPublishedDatasetVersionCommand 25.766
ListDataverseContentCommand 11.812
GetDataFileCommand 0.120
GetDataverseCommand 0.092
DestroyDatasetCommand 0.007
GetUserTracesCommand 0.000
```

Note that the bulk of the above is spent running the CreateGuestbookResponseCommand; and that almost all of our downloads are in fact anonymous (but we already knew that).

(You may be wondering what's up with that lone DestroyDatasetCommand executed by the guest user - there's only one, yes - at the cost of 7 milliseconds. Most likely, an unsuccessful attempt by an admin - possibly me - to destroy a dataset via the API without properly supplying the API key on the first attempt.)
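
For what it's worth, the per-user split above could come from the same kind of (assumed) query, grouped by user instead of command; this would slot into the ActionLogStats sketch from the earlier comment, and useridentifier is again a guess at the column name:

```java
// Another assumed cut of the same table: totals per user, which is how the
// :guest (anonymous) share could be derived.
@SuppressWarnings("unchecked")
public static List<Object[]> perUserTotalsForJanuary2023(EntityManager em) {
    return em.createNativeQuery(
        "SELECT useridentifier, COUNT(*) AS executions, "
      + "       SUM(EXTRACT(EPOCH FROM (endtime - starttime))) AS total_seconds "
      + "FROM actionlogrecord "
      + "WHERE actiontype = 'Command' "
      + "  AND starttime >= '2023-01-01' AND starttime < '2023-02-01' "
      + "GROUP BY useridentifier "
      + "ORDER BY total_seconds DESC")
      .getResultList();
}
```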

@landreev (Contributor, Author) commented Feb 6, 2023

Once again, let's be careful about interpreting these numbers. They do NOT by any means indicate that it doesn't cost us much to serve read-only content to anonymous users. Again, these are only the commands. There's plenty of evidence that the search API/Solr becomes the performance bottleneck during aggressive robot crawls, and none of that heavy lifting goes through the command engine or gets recorded there - and that's only one example.

@mreekie moved this from Dataverse Team (Gustavo) to ▶SPRINT- NEEDS SIZING in IQSS Dataverse Project on Feb 13, 2023
@mreekie added the D: FixRateLimitingBehaviors label (address this collection of issues that impact rate limiting) on Feb 28, 2023
@pdurbin added the Size: 80 label (a percentage of a sprint; 56 hours) on Mar 9, 2023
@pdurbin (Member) commented Mar 9, 2023

Unclear what the objective is. Work on design. Sized at 80. Some actual coding can certainly happen. We can add an outline.

@mreekie moved this from SPRINT- NEEDS SIZING to SPRINT READY in IQSS Dataverse Project on Mar 13, 2023
@cmbz added this to the 6.1 milestone on Sep 13, 2023
@cmbz commented Sep 13, 2023

2023/09/13: Note, this issue is paired with another: #9409.

@cmbz modified the milestones: 6.1, 6.2 on Nov 15, 2023
@stevenwinship self-assigned this on Nov 27, 2023
@scolapasta moved this from SPRINT READY to Clear of the Backlog in IQSS Dataverse Project on Dec 6, 2023
@stevenwinship (Contributor) commented:

If I understand this correctly, there are two asks here.
The first ask is to aggregate the data from the Action Log table into a rollup table with a set time interval (e.g., 1 hour). This could be code triggered by a cron job a few times a day. The aggregated data could be grouped/filtered by user and command, and then used for various queries that report metrics by user, command, or groups of commands. I do not know who needs this or how it will be used.

The second ask is rate limiting. This feature does not need the metrics/aggregated data from the first task; it would only need to cache the command API calls, possibly using bucket4j or by setting up a global Redis cache. It would be nice to know whether other caching or rate limiting will be needed in the future, to help decide which caching system to use.
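
As an illustration of the bucket4j option (not the implementation that was eventually merged), a minimal single-JVM sketch, assuming the bucket4j 8.x API:

```java
// Minimal bucket4j sketch with in-memory buckets; a setup shared across servers
// (e.g. backed by Redis) would need one of bucket4j's external-store integrations.
import io.github.bucket4j.Bandwidth;
import io.github.bucket4j.Bucket;
import java.time.Duration;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class CommandRateLimiter {

    // One bucket per "<userIdentifier>|<commandName>" key.
    private final Map<String, Bucket> buckets = new ConcurrentHashMap<>();

    private Bucket newBucket() {
        // Example policy: 30 calls per minute.
        return Bucket.builder()
                .addLimit(Bandwidth.simple(30, Duration.ofMinutes(1)))
                .build();
    }

    /** Returns false when the caller should get a "too busy, try again later" response. */
    public boolean tryAcquire(String userIdentifier, String commandName) {
        Bucket bucket = buckets.computeIfAbsent(userIdentifier + "|" + commandName, k -> newBucket());
        return bucket.tryConsume(1);
    }
}
```

If the limits have to be shared across multiple application servers, a distributed back end would be needed; bucket4j provides integrations for that, but they add operational complexity.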

@cmbz moved this from Clear of the Backlog to In Progress 💻 in IQSS Dataverse Project on Dec 14, 2023
@cmbz removed the Size: 80 label (a percentage of a sprint; 56 hours) on Jan 3, 2024
@cmbz added the Size: 3 label (a percentage of a sprint; 2.1 hours) on Jan 3, 2024
@cmbz commented Jan 3, 2024

2024/01/03: Resized during kickoff
