
Add comments to the UDP server #921

Conversation

@josecelano (Member) commented Jun 28, 2024

After a discussion with @da2ce7, I understood the reasons why he implemented the active requests buffer the way it is implemented. This PR adds some clarifications.

My main concern was task starvation, meaning that if we always accept new requests (spawning new tasks) we might not give the currently active tasks enough time to finish (a reasonable time).

@da2ce7 clarified that yielding with tokio::task::yield_now().await; should give the tasks enough time to finish.

My second concern was why we immediately spawn new tasks for incoming requests instead of waiting until we have space in the buffer. @da2ce7 replied that we are going to run those tasks anyway, because we force a place in the buffer for each new task.
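For context, here is a minimal sketch of the pattern being described; it is not the actual torrust-tracker code. It uses a VecDeque as a stand-in for the fixed-capacity ring buffer of AbortHandles used by the tracker, and the names, capacity and sleep are illustrative. The handler task is spawned first; space is then forced in the buffer by dropping finished handles, yielding once to the oldest unfinished one, and aborting it only as a last resort; finally the new handle is stored only if its task has not already finished.

use std::collections::VecDeque;
use std::time::Duration;
use tokio::task::AbortHandle;

/// Simplified stand-in for the active-requests buffer
/// (the tracker uses a fixed-capacity ring buffer of `AbortHandle`s).
struct ActiveRequests {
    rb: VecDeque<AbortHandle>,
    capacity: usize,
}

impl ActiveRequests {
    fn new(capacity: usize) -> Self {
        Self { rb: VecDeque::with_capacity(capacity), capacity }
    }

    /// Force space for a new task: drop finished handles, give the oldest
    /// unfinished one a chance to complete via `yield_now`, abort it only if
    /// it still has not finished, and store the new handle unless its task
    /// already finished while we were making room.
    async fn force_push(&mut self, new_task: AbortHandle) {
        while self.rb.len() >= self.capacity {
            let Some(old_task) = self.rb.pop_front() else { break };

            if old_task.is_finished() {
                continue; // the slot is free, nothing else to do
            }

            // Give the old task one more chance to complete.
            tokio::task::yield_now().await;

            if !old_task.is_finished() {
                old_task.abort(); // last resort: evict the oldest task
            }
        }

        if !new_task.is_finished() {
            self.rb.push_back(new_task);
        }
    }
}

#[tokio::main]
async fn main() {
    let mut active = ActiveRequests::new(50);

    // The UDP server would spawn one task per incoming request and register
    // its abort handle; the sleep stands in for handling the request.
    let handle = tokio::spawn(async {
        tokio::time::sleep(Duration::from_millis(10)).await;
    });

    active.force_push(handle.abort_handle()).await;
}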

@josecelano josecelano requested a review from da2ce7 June 28, 2024 10:23
@josecelano josecelano marked this pull request as ready for review June 28, 2024 10:30
@josecelano (Member Author)

Starvation

Hi @da2ce7, I'm still not sure we can't have a starvation issue with this implementation. It seems we are assuming that, just by giving the task a second chance to finish with "yield", the task can finish; but in the end resources are limited, and the task could be holding or waiting on a lock, for example waiting to acquire the lock to update the torrent repository. I have the impression that we can have scenarios where:

  • We continue spawning tasks, consuming memory faster than the server can handle.
  • We abort tasks without giving them a minimum time to be processed.

I have the impression that we should only reject new requests when the buffer is full. Maybe we can try to simulate that scenario with a smaller buffer.

LocalRb

@da2ce7, shouldn't it be possible to use a LocalRb instead of a StaticRb? If I'm not wrong, only one thread owns the ActiveRequests variable and it is not sent between threads.

@josecelano (Member Author)

ACK a897de4

if !abort_handle.is_finished() {
    self.rb
        .try_push(abort_handle)
        .expect("it should remove at least one element.");
@da2ce7 (Contributor) Jun 28, 2024:

"it should have space for this new task"

if h.is_finished() {
    finished += 1;
    continue;
Err(abort_handle) => {
Contributor:

abort_handle, -> new_task

let mut finished: u64 = 0;
let mut unfinished_task = None;

for removed_abort_handle in self.rb.pop_iter() {
Contributor:

removed_abort_handle, -> old_task

@da2ce7 (Contributor) commented Jun 28, 2024

@josecelano

@da2ce7 clarified that yielding with tokio::task::yield_now().await; should give the tasks enough time to finish.

In fact we can yield many times without aborting tasks if each yield leads to the completion of the old task.

My second concern was why we immediately spawn new tasks for incoming requests instead of waiting until we have space in the buffer. @da2ce7 replied that we are going to run those tasks anyway, because we force a place in the buffer for each new task.

We start the process of removing tasks after starting the new task; then we do not even insert the new task if it has already finished by the time we have made space for it.

@josecelano (Member Author) commented Jun 28, 2024

@josecelano

@da2ce7 clarified that yielding with tokio::task::yield_now().await; should give the tasks enough time to finish.

In fact we can yield many times without aborting tasks if each yield leads to the completion of the old task.

My second concern was why we immediately spawn new tasks for incoming requests instead of waiting until we have space in the buffer. @da2ce7 replied that we are going to run those tasks anyway, because we force a place in the buffer for each new task.

We start the process of removing tasks after starting the new task; then we do not even insert the new task if it has already finished by the time we have made space for it.

Hey @da2ce7, I think I was missing a critical point. Since we are only receiving new requests in a single thread, there is no way to grow the number of request-processor tasks beyond 51. The socket buffer is doing the contention. New "raw" requests have to wait until a currently active request finishes or is aborted.

Regarding the other topic, "having enough time to be completed" before we force the task to abort, the scenario I'm worried about is:

  • Request processors (tasks spawned to handle a single request) take 20 seconds to finish because all requests are announce requests that require locking the torrent repository, and for some reason (high load on the API as well, for example) they have to wait longer.
  • The current 50 active requests are unfinished, but it is normal for them to take that long.
  • Every new request is going to abort one of the pending requests, not letting the whole system advance.

In that case we are just swapping old tasks for new tasks. In that scenario I would prefer to abort current tasks if the time we have to wait is longer than a reasonable response time for the client. I mean, it does not make sense to wait 120 seconds, because the client will make a new request, but it makes sense to wait 2 seconds and reject all new raw requests. In general, I would prefer the system to reply to fewer clients, more slowly, rather than not replying to any client at all. Am I missing something?

UPDATE

Maybe we can store in the buffer not only the abort handle but also a timestamp for when the task was added. We would only remove a task after a timeout. If all 50 tasks are under that timeout, we continue yielding. See the sketch below.
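A rough sketch of that idea (not part of this PR; the entry type, the two-second grace period and the helper name are made up for illustration): each buffer entry carries an Instant, finished tasks are dropped first, and the oldest entry is aborted only once it has exceeded its grace period; otherwise we just yield and let the caller decide what to do with the new raw request.

use std::collections::VecDeque;
use std::time::{Duration, Instant};
use tokio::task::AbortHandle;

/// Hypothetical buffer entry: the abort handle plus the moment the
/// request task was registered.
struct Entry {
    abort_handle: AbortHandle,
    started_at: Instant,
}

/// Illustrative grace period a task gets before it may be aborted.
const GRACE_PERIOD: Duration = Duration::from_secs(2);

/// Try to free one slot in the buffer. Returns true if a slot was freed,
/// false if every task is unfinished and still within its grace period.
async fn try_free_slot(rb: &mut VecDeque<Entry>) -> bool {
    // Drop entries whose tasks have already finished.
    let before = rb.len();
    rb.retain(|entry| !entry.abort_handle.is_finished());
    if rb.len() < before {
        return true;
    }

    // The oldest entry is at the front; abort it only if it has exceeded
    // its grace period.
    if let Some(oldest) = rb.front() {
        if oldest.started_at.elapsed() > GRACE_PERIOD {
            if let Some(entry) = rb.pop_front() {
                entry.abort_handle.abort();
            }
            return true;
        }
    }

    // All tasks are still within the grace period: yield so they get CPU
    // time; the caller can keep retrying or reject the new raw request.
    tokio::task::yield_now().await;
    false
}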

@da2ce7 (Contributor) commented Jun 28, 2024

@josecelano I think that you are mixing up task throughput and anti-DoS.

For the server to start thrashing (i.e. to not process any requests), any particular request will have to be yielded to 50 times.

If somebody is making lots of very-difficult-to-fulfil requests and filling the buffer so it thrashes, short-lived tasks created by a non-DoSing user can still slip in and will get yielded to.

@josecelano (Member Author)

@josecelano I think that you are mixing up task throughput and anti-DoS.

What's the difference? I mean, I'm trying to analyse what would happen in a high-load scenario, whether it's unintended or an attack: simply having a lot of announce requests that are competing with other services to access the torrent repository.

For the server to start thrashing (i.e. to not process any requests), any particular request will have to be yielded to 50 times.

Sorry, I hadn't realized that, because I was not assuming we are popping the eldest item. Of course, that's the expected behaviour of a ring buffer, but the documentation does not make it explicit for this method, and since there are other methods I was assuming this "pop" was somehow random.
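For what it's worth, the oldest-first behaviour is easy to check in isolation. The example below is only an illustration; it assumes the ringbuf 0.4-style API, in which the owned StaticRb exposes the producer and consumer traits directly.

use ringbuf::{traits::{Consumer, Producer}, StaticRb};

fn main() {
    let mut rb = StaticRb::<u32, 3>::default();

    rb.try_push(1).unwrap();
    rb.try_push(2).unwrap();
    rb.try_push(3).unwrap();

    // pop_iter drains the buffer oldest-first (FIFO), so the eldest element
    // pushed is the first one returned.
    let drained: Vec<u32> = rb.pop_iter().collect();
    assert_eq!(drained, vec![1, 2, 3]);
}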

On the other hand, I don't know whether 50 thread context switches are enough compared to what it takes to process a request. I guess you are assuming so.

If somebody is making lots of very-difficult-to-fulfil requests and filling the buffer so it thrashes, short-lived tasks created by a non-DoSing user can still slip in and will get yielded to.

I suppose you assume the attacker uses a kind of heavy announce request that takes longer to process than a normal one. And I think you mean that all tasks are going to have a chance to be executed, and there is no way in this solution to stop a bad actor. That's fine; I was not expecting to detect and stop them. I was only trying to avoid wasting time just switching contexts.

If 50 thread context switches are enough to process a request within a reasonable response time on a server under pressure, then the solution is perfect. It's a kind of indirect timeout where we are not using a timestamp but a context-switch counter. I have the impression that the timeout should be bound to user expectations (what a reasonable response time is). This one is more empirical: if we can't handle a request in 50 context switches, then the system is under heavy load and we abort tasks. I think this would be the perfect solution if the reason the server underperforms is that there are requests that are heavy to process.

But if the CPU load is high in general because there are a lot of requests in parallel that are waiting for a lock (for example), and 50 context switches amount to a relatively short time compared with a reasonable response time, then it could lead to not replying to any client instead of sending responses with a longer response time.

Anyway, I think 🤔 I have to think deeper about it :-).

@josecelano josecelano force-pushed the 918-udp-server-alternative-implementation-to-avoid-aborting-too-many-requests branch from edbf0d4 to 2c3edaa on June 28, 2024 20:23
@da2ce7 (Contributor) left a comment

Good Comments and Clarification for the Task Ring Buffer Implementation 👍

@josecelano josecelano added the Needs Rebase Base Branch has Incompatibilities label Jul 1, 2024
@josecelano josecelano force-pushed the 918-udp-server-alternative-implementation-to-avoid-aborting-too-many-requests branch from 2c3edaa to d1c2d15 on July 1, 2024 09:01
@josecelano josecelano removed the Needs Rebase Base Branch has Incompatibilities label Jul 1, 2024
@josecelano (Member Author)

ACK d1c2d15

codecov bot commented Jul 1, 2024

Codecov Report

Attention: Patch coverage is 62.50000% with 9 lines in your changes missing coverage. Please review.

Project coverage is 77.29%. Comparing base (d4e3208) to head (d1c2d15).

Files                                      | Patch % | Lines
src/servers/udp/server/request_buffer.rs   | 59.09%  | 9 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           develop     #921   +/-   ##
========================================
  Coverage    77.29%   77.29%           
========================================
  Files          183      183           
  Lines         9689     9689           
========================================
  Hits          7489     7489           
  Misses        2200     2200           

☔ View full report in Codecov by Sentry.

@josecelano josecelano merged commit 5f1fdbd into torrust:develop Jul 1, 2024
16 of 17 checks passed
Development

Successfully merging this pull request may close these issues.

UDP server: alternative implementation to avoid aborting too many requests