
Add comments to the UDP server #921

Conversation

@josecelano (Member) commented Jun 28, 2024

After a discussion with @da2ce7, I understood the reasons why he implemented the active requests buffer the way it is implemented. This PR adds some clarifications.

My main concern was task starvation, meaning that if we always accept new requests (spawning new tasks) we might not give the currently active tasks enough time to finish (a reasonable time).

@da2ce7 clarified that yielding with tokio::task::yield_now().await; should give the tasks enough time to finish.

My second concern was why we immediately spawn new tasks for incoming requests instead of waiting until we have space in the buffer. @da2ce7 replied that we are going to run those tasks anyway, because we force a place in the buffer for each new task.
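For context, here is a minimal sketch of the pattern being described; it is not the actual torrust-tracker code. It uses a VecDeque as a stand-in for the fixed-capacity ring buffer of AbortHandles used by the tracker, and the names, capacity and sleep are illustrative. The handler task is spawned first; space is then forced in the buffer by dropping finished handles, yielding once to the oldest unfinished one, and aborting it only as a last resort; finally the new handle is stored only if its task has not already finished.

use std::collections::VecDeque;
use std::time::Duration;
use tokio::task::AbortHandle;

/// Simplified stand-in for the active-requests buffer
/// (the tracker uses a fixed-capacity ring buffer of `AbortHandle`s).
struct ActiveRequests {
    rb: VecDeque<AbortHandle>,
    capacity: usize,
}

impl ActiveRequests {
    fn new(capacity: usize) -> Self {
        Self { rb: VecDeque::with_capacity(capacity), capacity }
    }

    /// Force space for a new task: drop finished handles, give the oldest
    /// unfinished one a chance to complete via `yield_now`, abort it only if
    /// it still has not finished, and store the new handle unless its task
    /// already finished while we were making room.
    async fn force_push(&mut self, new_task: AbortHandle) {
        while self.rb.len() >= self.capacity {
            let Some(old_task) = self.rb.pop_front() else { break };

            if old_task.is_finished() {
                continue; // the slot is free, nothing else to do
            }

            // Give the old task one more chance to complete.
            tokio::task::yield_now().await;

            if !old_task.is_finished() {
                old_task.abort(); // last resort: evict the oldest task
            }
        }

        if !new_task.is_finished() {
            self.rb.push_back(new_task);
        }
    }
}

#[tokio::main]
async fn main() {
    let mut active = ActiveRequests::new(50);

    // The UDP server would spawn one task per incoming request and register
    // its abort handle; the sleep stands in for handling the request.
    let handle = tokio::spawn(async {
        tokio::time::sleep(Duration::from_millis(10)).await;
    });

    active.force_push(handle.abort_handle()).await;
}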

@josecelano josecelano requested a review from da2ce7 June 28, 2024 10:23
@josecelano josecelano marked this pull request as ready for review June 28, 2024 10:30
@josecelano (Member Author)

Starvation

Hi @da2ce7, I'm still not sure we can't have a starvation issue with this implementation. It seems we are assuming that, just by giving the task a second chance to finish with "yield", the task can finish; but in the end resources are limited, and the task could be holding or waiting on a lock, for example waiting to acquire the lock to update the torrent repository. I have the impression that we can have scenarios where:

  • We continue spawning tasks, consuming memory faster than the server can handle.
  • We abort tasks without giving them a minimum time to be processed.

I have the impression that we should only reject new requests when the buffer is full. Maybe we can try to simulate that scenario with a smaller buffer.

LocalRb

@da2ce7, shouldn't it be possible to use a LocalRb instead of a StaticRb? If I'm not wrong, only one thread owns the ActiveRequests variable and it is not sent between threads.

@josecelano (Member Author)

ACK a897de4

if !abort_handle.is_finished() {
    self.rb
        .try_push(abort_handle)
        .expect("it should remove at least one element.");
@da2ce7 (Contributor) Jun 28, 2024:

"it should have space for this new task"

if h.is_finished() {
    finished += 1;
    continue;
Err(abort_handle) => {
Contributor:

abort_handle, -> new_task

let mut finished: u64 = 0;
let mut unfinished_task = None;

for removed_abort_handle in self.rb.pop_iter() {
Contributor:

removed_abort_handle, -> old_task

@da2ce7 (Contributor) commented Jun 28, 2024

@josecelano

@da2ce7 clarified that yielding with tokio::task::yield_now().await; should give the tasks enough time to finish.

In fact we can yield many times without aborting tasks if each yield leads to the completion of the old task.

My second concern was why we immediately spawn new tasks for incoming requests instead of waiting until we have space in the buffer. @da2ce7 replied that we are going to run those tasks anyway, because we force a place in the buffer for each new task.

We start the process of removing tasks after starting the new task; then we do not even insert the new task if it has already finished by the time we have made space for it.

@josecelano (Member Author) commented Jun 28, 2024

@josecelano

@da2ce7 clarified that yielding with tokio::task::yield_now().await; should give the tasks enough time to finish.

In fact we can yield many times without aborting tasks if each yield leads to the completion of the old task.

My second concern was why we immediately spawn new tasks for incoming requests instead of waiting until we have space in the buffer. @da2ce7 replied that we are going to run those tasks anyway, because we force a place in the buffer for each new task.

We start the process of removing tasks after starting the new task; then we do not even insert the new task if it has already finished by the time we have made space for it.

Hey @da2ce7, I think I was missing a critical point. Since we are only receiving new requests in a single thread, there is no way to grow the number of request-processor tasks beyond 51. The socket buffer is doing the contention. New "raw" requests have to wait until a currently active request finishes or is aborted.

Regarding the other topic, "having enough time to be completed" before we force the task to abort, the scenario I'm worried about is:

  • Request processors (tasks spawned to handle a single request) take 20 seconds to finish because all requests are announce requests that require locking the torrent repository, and for some reason (high load on the API as well, for example) they have to wait longer.
  • The current 50 active requests are unfinished, but it is normal for them to take that long.
  • Every new request is going to abort one of the pending requests, not letting the whole system advance.

In that case we are just swapping old tasks for new tasks. In that scenario I would prefer to abort current tasks if the time we have to wait is longer than a reasonable response time for the client. I mean, it does not make sense to wait 120 seconds, because the client will make a new request, but it makes sense to wait 2 seconds and reject all new raw requests. In general, I would prefer the system to reply to fewer clients, more slowly, rather than not replying to any client at all. Am I missing something?

UPDATE

Maybe we can store in the buffer not only the abort handle but also a timestamp for when the task was added. We would only remove a task after a timeout. If all 50 tasks are under that timeout, we continue yielding. See the sketch below.
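A rough sketch of that idea (not part of this PR; the entry type, the two-second grace period and the helper name are made up for illustration): each buffer entry carries an Instant, finished tasks are dropped first, and the oldest entry is aborted only once it has exceeded its grace period; otherwise we just yield and let the caller decide what to do with the new raw request.

use std::collections::VecDeque;
use std::time::{Duration, Instant};
use tokio::task::AbortHandle;

/// Hypothetical buffer entry: the abort handle plus the moment the
/// request task was registered.
struct Entry {
    abort_handle: AbortHandle,
    started_at: Instant,
}

/// Illustrative grace period a task gets before it may be aborted.
const GRACE_PERIOD: Duration = Duration::from_secs(2);

/// Try to free one slot in the buffer. Returns true if a slot was freed,
/// false if every task is unfinished and still within its grace period.
async fn try_free_slot(rb: &mut VecDeque<Entry>) -> bool {
    // Drop entries whose tasks have already finished.
    let before = rb.len();
    rb.retain(|entry| !entry.abort_handle.is_finished());
    if rb.len() < before {
        return true;
    }

    // The oldest entry is at the front; abort it only if it has exceeded
    // its grace period.
    if let Some(oldest) = rb.front() {
        if oldest.started_at.elapsed() > GRACE_PERIOD {
            if let Some(entry) = rb.pop_front() {
                entry.abort_handle.abort();
            }
            return true;
        }
    }

    // All tasks are still within the grace period: yield so they get CPU
    // time; the caller can keep retrying or reject the new raw request.
    tokio::task::yield_now().await;
    false
}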

@da2ce7 (Contributor) commented Jun 28, 2024

@josecelano I think that you are mixing up task throughput and anti-DoS.

For the server to start thrashing (i.e. to not process any requests), any particular request will have to be yielded to 50 times.

If somebody is making lots of very-difficult-to-fulfil requests and filling the buffer so it thrashes, short-lived tasks created by a non-DoSing user can still slip in and will get yielded to.

@josecelano (Member Author)

@josecelano I think that you are mixing up task throughput and anti-DoS.

What's the difference? I mean, I'm trying to analyse what would happen in a high-load scenario, whether it's unintended or an attack: simply having a lot of announce requests that are competing with other services to access the torrent repository.

For the server to start thrashing (i.e. to not process any requests), any particular request will have to be yielded to 50 times.

Sorry, I hadn't realized that, because I was not assuming we are popping the eldest item. Of course, that's the expected behaviour of a ring buffer, but the documentation does not make it explicit for this method, and since there are other methods I was assuming this "pop" was somehow random.
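For what it's worth, the oldest-first behaviour is easy to check in isolation. The example below is only an illustration; it assumes the ringbuf 0.4-style API, in which the owned StaticRb exposes the producer and consumer traits directly.

use ringbuf::{traits::{Consumer, Producer}, StaticRb};

fn main() {
    let mut rb = StaticRb::<u32, 3>::default();

    rb.try_push(1).unwrap();
    rb.try_push(2).unwrap();
    rb.try_push(3).unwrap();

    // pop_iter drains the buffer oldest-first (FIFO), so the eldest element
    // pushed is the first one returned.
    let drained: Vec<u32> = rb.pop_iter().collect();
    assert_eq!(drained, vec![1, 2, 3]);
}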

On the other hand, I don't know whether 50 thread context switches are enough compared to what it takes to process a request. I guess you are assuming so.

If somebody is making lots of very-difficult-to-fulfil requests and filling the buffer so it thrashes, short-lived tasks created by a non-DoSing user can still slip in and will get yielded to.

I suppose you assume the attacker uses a kind of heavy announce request that takes longer to process than a normal one. And I think you mean that all tasks are going to have a chance to be executed, and there is no way in this solution to stop a bad actor. That's fine; I was not expecting to detect and stop them. I was only trying to avoid wasting time just switching contexts.

If 50 thread context switches are enough to process a request within a reasonable response time on a server under pressure, then the solution is perfect. It's a kind of indirect timeout where we are not using a timestamp but a context-switch counter. I have the impression that the timeout should be bound to user expectations (what a reasonable response time is). This one is more empirical: if we can't handle a request in 50 context switches, then the system is under heavy load and we abort tasks. I think this would be the perfect solution if the reason the server underperforms is that there are requests that are heavy to process.

But if the CPU load is high in general because there are a lot of requests in parallel that are waiting for a lock (for example), and 50 context switches amount to a relatively short time compared with a reasonable response time, then it could lead to not replying to any client instead of sending responses with a longer response time.

Anyway, I think 🤔 I have to think deeper about it :-).

@josecelano josecelano force-pushed the 918-udp-server-alternative-implementation-to-avoid-aborting-too-many-requests branch from edbf0d4 to 2c3edaa on June 28, 2024 20:23
@da2ce7 (Contributor) left a comment

Good Comments and Clarification for the Task Ring Buffer Implementation 👍

@josecelano josecelano added the Needs Rebase Base Branch has Incompatibilities label Jul 1, 2024
@josecelano josecelano force-pushed the 918-udp-server-alternative-implementation-to-avoid-aborting-too-many-requests branch from 2c3edaa to d1c2d15 on July 1, 2024 09:01
@josecelano josecelano removed the Needs Rebase Base Branch has Incompatibilities label Jul 1, 2024
@josecelano (Member Author)

ACK d1c2d15

codecov bot commented Jul 1, 2024

Codecov Report

Attention: Patch coverage is 62.50000% with 9 lines in your changes missing coverage. Please review.

Project coverage is 77.29%. Comparing base (d4e3208) to head (d1c2d15).

Files                                      | Patch % | Lines
src/servers/udp/server/request_buffer.rs   | 59.09%  | 9 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           develop     #921   +/-   ##
========================================
  Coverage    77.29%   77.29%           
========================================
  Files          183      183           
  Lines         9689     9689           
========================================
  Hits          7489     7489           
  Misses        2200     2200           

☔ View full report in Codecov by Sentry.

@josecelano josecelano merged commit 5f1fdbd into torrust:develop Jul 1, 2024
16 of 17 checks passed
Development

Successfully merging this pull request may close these issues.

UDP server: alternative implementation to avoid aborting too many requests