Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Nonblocking callback #459

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

erik-overdahl
Copy link

This PR makes it so that the ARA callback module no longer blocks the flow of Ansible playbook execution at all. Now the only blocking happens for the requests sent at the end of the playbook (this is unavoidable). This is accomplished by moving all logic for the callback to a separate thread and pushing received callback messages to the worker thread via a queue.

I ran some very simple benchmarks to confirm that this does in fact improve performance. I would like to have more robust measurements, but right now these are what I had time for.

All measurements were taken with 10 hosts performing 5 tasks across 10 forks. Times are reported in milliseconds and averaged over 5 runs.

First, the test used in the ARA benchmark article - each task is just debug

strat client db threads master nonblock diff
free 984
linear 1042
linear http sqlite 0 2628 2380 -248
free http sqlite 0 3588 3073 -515
free http postgres 4 4463 3888 -575
linear http postgres 0 4847 4267 -580
linear offline postgres 4 4892 4722 -170
free http postgres 0 7194 6244 -950
linear offline sqlite 0 8026 7426 -600
linear offline postgres 0 8262 7487 -775
free offline postgres 4 8771 8039 -732
free offline sqlite 0 11835 10850 -985
free offline postgres 0 11939 10771 -1168

The second test used the ansible.builtin.shell module to sleep for 1 - 3 seconds in order to mimic more realistic conditions

strat client db threads master nonblock diff
free 15641
free http postgres 4 15704 16150 446
free http sqlite 0 16221 14842 -1379
linear 18190
free offline postgres 4 18353 15907 -2446
free http postgres 0 18454 15572 -2882
linear http sqlite 0 19229 18668 -561
free offline sqlite 0 19590 17511 -2079
linear http postgres 4 19919 19405 -514
free offline postgres 0 20170 16072 -4098
linear http postgres 0 20595 19226 -1369
linear offline postgres 4 20849 19746 -1103
linear offline postgres 0 22578 20276 -2302
linear offline sqlite 0 23304 19960 -3344

The nonblocking client is faster than the current ARA master branch, and, importantly, makes a big performance difference in the situations where the previous threading-patch cannot help ( using sqlite as the database backend).

Move all callback operations into an auxiliary `AraWorker` class that
runs in a separate thread started by the `CallbackModule` class, which
is now just a shell that puts messages into a queue for the worker
thread to process. The callback plugin now does not block the main
Ansible thread.
@softwarefactory-project-zuul
Copy link

Build failed.

✔️ ara-tox-py3 SUCCESS in 4m 23s
✔️ ara-tox-linters SUCCESS in 4m 30s
ara-basic-ansible-core-devel FAILURE in 4m 10s (non-voting)
ara-basic-ansible-6 FAILURE in 4m 15s
ara-basic-ansible-core-2.14 FAILURE in 3m 56s
ara-basic-ansible-core-2.13 FAILURE in 4m 14s
ara-basic-ansible-core-2.12 FAILURE in 4m 08s
ara-basic-ansible-core-2.11 FAILURE in 4m 11s
ara-basic-ansible-2.9 FAILURE in 4m 30s
✔️ ara-container-images SUCCESS in 11m 45s

Running the thread as "daemon" makes the previous cleanup unecessary.
@softwarefactory-project-zuul
Copy link

Build failed.

✔️ ara-tox-py3 SUCCESS in 4m 01s
✔️ ara-tox-linters SUCCESS in 4m 01s
ara-basic-ansible-core-devel FAILURE in 2m 56s (non-voting)
ara-basic-ansible-6 FAILURE in 3m 21s
ara-basic-ansible-core-2.14 FAILURE in 2m 51s
ara-basic-ansible-core-2.13 FAILURE in 4m 44s
ara-basic-ansible-core-2.12 FAILURE in 2m 49s
ara-basic-ansible-core-2.11 FAILURE in 3m 16s
ara-basic-ansible-2.9 FAILURE in 4m 00s
✔️ ara-container-images SUCCESS in 11m 50s

Prevents program from exiting before all ARA work is finished.

(meant to commit this earlier - whoops!)
@dmsimard
Copy link
Contributor

dmsimard commented Nov 17, 2022

Hey @erik-overdahl and thanks for the PR !

I've looked at the CI failure and I think I know what's going on but it's a bit tricky.

This is the error:

TASK [Generate a failure that will be rescued] *********************************
task path: /home/zuul-worker/src/github.com/ansible-community/ara/tests/integration/lookups.yaml:76
fatal: [localhost]: FAILED! => {
    "changed": false,
    "msg": "Message from a failed task"
}

TASK [We can go on, it was a failure for test purposes] ************************
task path: /home/zuul-worker/src/github.com/ansible-community/ara/tests/integration/lookups.yaml:80
ok: [localhost] => {
    "msg": "👍"
}
2022-11-15 17:10:51,558 DEBUG ara.clients.http: HTTP 201: post on /api/v1/tasks
2022-11-15 17:10:51,561 DEBUG ara.plugins.callback.worker: v2_runner_on_start
2022-11-15 17:10:51,561 DEBUG ara.plugins.callback.worker: v2_runner_on_ok

TASK [Find the task with the ignored errors and validate the status] ***********
task path: /home/zuul-worker/src/github.com/ansible-community/ara/tests/integration/lookups.yaml:85
2022-11-15 17:10:51,615 DEBUG ara.clients.http: HTTP 200: get on /api/v1/tasks?name=ignore_errors&playbook=2
fatal: [localhost]: FAILED! => {
    "msg": "The conditional check '_task['results'][0]['name'] == 'Fail something with ignore_errors'' failed. The error was: error while evaluating conditional (_task['results'][0]['name'] == 'Fail something with ignore_errors'): list object has no element 0"
}

PLAY RECAP *********************************************************************
localhost                  : ok=9    changed=1    unreachable=0    failed=1    skipped=1    rescued=1    ignored=1   

It's these tasks:

- block:
- name: Fail something with ignore_errors
command: /usr/bin/false
ignore_errors: true
- name: Generate a failure that will be rescued
fail:
msg: Message from a failed task
rescue:
- name: We can go on, it was a failure for test purposes
debug:
msg: 👍
# failed is a new status for a task, validate that it actually works
- name: Find the task with the ignored errors and validate the status
vars:
_task: "{{ lookup('ara_api', '/api/v1/tasks?name=ignore_errors&playbook=' + playbook_id ) }}"
assert:
that:
- _task['results'][0]['name'] == 'Fail something with ignore_errors'
- _task['results'][0]['status'] == 'completed'

In other words, we are generating a failure for the sake of testing that we can query and search for it within the playbook.
This is meant to be integration testing coverage for ara itself and I guess it assumes that by the time we query for the failure, it should be there -- but since we're moving stuff around in threads, it could be possible that the failure will get posted later so it can't find it.

I don't have a good suggestion at this time but does my explanation make sense ?

Edit: I haven't got around to reviewing and testing the code yet, apologies for the delay

@erik-overdahl
Copy link
Author

That is my understanding as well. I tried adding a retry... until... delay to that assertion, but when I run the tests locally I am getting all sorts of weird "connection timeout" errors at random and inconsistent points in the test playbooks.

@hille721
Copy link
Contributor

Was reading your conversations with much interest. Where I'm a bit afraid of and I feel confirmed after the latest findings is that the reporting is no longer live. As long as the playbook runs to the end everything might be fine. But if the playbook run will be aborted or crashs due to any other reason then there is the possibility that later tasks are already performed before former tasks are fully reported to Ara.

Thus the question of this change is speed vs accuracy. For me speed is important but more important is accuracy. I prefer correct results in Ara then faster playbook runs. (btw that's why I tell everyone in my company to not use the free strategy, because this led to problems in Ara too).
Are the benchmark results really this good that we want to go for a more complex solution with maybe less accuracy?
Imo it would be also great to make such a think optional, meaning to support the traditional behavior and making this a feature which you can enable if performance is really important for you.

@erik-overdahl
Copy link
Author

I do not agree with your assessment.

All results will still be logged by ARA. A failing task "crashes" the playbook but the Ansible program underneath continues to run. The ARA worker added in this PR runs as a daemon thread, so the program does not exit until the thread is done with its work. So there is no problem with accuracy or missing results unless the Ansible program itself crashes, and if your playbooks are causing that to happen, you have bigger problems to deal with.

Additionally, in a more realistic Ansible scenario, tasks on hosts do not complete immediately like they do in our simple benchmark, giving ARA more time to complete the HTTP calls. In a normal situation, the reporting should line up closely to the actual run of the playbook. I attempted to reproduce this in my second set of benchmarks, and indeed you'll notice that the time differences between the runs with ARA and those without are much less severe than in the simple benchmark where all tasks complete immediately.

Refactoring the callback to be nonblocking addresses more than just performance. The synchronous nature of callbacks in the Ansible architecture means that, in its current form, ARA makes callbacks for things like timing the execution of tasks completely useless. Not only is this bad in itself, but it is also not transparent; people using those callbacks may enable ARA and still be relying on those timings to be accurate representations of the actual task executions times on the hosts, unaware of the slowdown coming from ARA.

Avoiding the free strategy because ARA cannot handle it is backwards thinking, imo. If there is a language feature that is perfectly fine by itself but you cannot use it because your logging system cannot handle it, then the problem is with the logging system. ARA should be implemented better. No one should even have to think about including this in their Ansible toolkit and they should not have to change their Ansible playbooks to accommodate ARA.

@hille721
Copy link
Contributor

unless the Ansible program itself crashes

that is exactly what I meant. Maybe a user kills it, or the controller (e.g. a VM) crashs due to any reason, must not be related to the playbook itself. But in such cases I want to have a propper Ara reporting as well.

The synchronous nature of callbacks in the Ansible architecture...

But wouldn't it better then if directly Ansible supports kind of async callbacks in general?

@erik-overdahl
Copy link
Author

erik-overdahl commented Nov 18, 2022

If the machine you are running Ansible on crashes, there isn't anything you can do, and current ARA also has the risk of losing results in this case. A task may finish, but if we are still waiting on the POST requests in the callback to finish when the crash happens, then you just lost all results you didn't send yet.

If any exception happens in the Ansible process, the exception is caught at the top level and the process shuts down gracefully - see the cli_executor method. Because the thread runs as a daemon, your outstanding ARA requests will complete before the Ansible process exits. Which, actually, I'm not sure is currently true for ARA? if you cancel the playbook before the request for a task has been sent and you aren't using threads for ARA, then the UserInterrupt error will be thrown and the program will exit and because your requests were happening on that main thread that just got interrupted, you lose that data. So if anything, this PR makes it less likely that you will lose data, not more.

EDIT: after running some tests, you are indeed correct that user cancelling will cause lost data when this patch is applied. I can fix this by bringing back the __del__ method on the CallbackModule lass that I removed in 6397d24. In this case, there is a hang after the user cancels the playbook, but all results are successfully recorded. It's up to you whether that hang is acceptable (we could put up a message to the user?). I haven't encountered lost data from ARA master in my tests, but I'm not sure if that's because it is impossible or if I am not running a test that will cause that problem to arise.

I agree that Ansible SHOULD implement the callback plugins in a non-blocking manner, but they don't, so it is up to the callback plugins themselves to not block.

Even if the user cancels the run.

Brings back the `__del__` method on CallbackModule removed in 6397d24
@softwarefactory-project-zuul
Copy link

Build failed.

✔️ ara-tox-py3 SUCCESS in 3m 49s
✔️ ara-tox-linters SUCCESS in 3m 33s
ara-basic-ansible-core-devel FAILURE in 3m 25s (non-voting)
ara-basic-ansible-6 FAILURE in 3m 30s
ara-basic-ansible-core-2.14 FAILURE in 3m 22s
ara-basic-ansible-core-2.13 FAILURE in 3m 33s
ara-basic-ansible-core-2.12 FAILURE in 3m 23s
ara-basic-ansible-core-2.11 FAILURE in 4m 41s
ara-basic-ansible-2.9 FAILURE in 3m 41s
✔️ ara-container-images SUCCESS in 12m 18s

@dmsimard
Copy link
Contributor

dmsimard commented Nov 19, 2022

unless the Ansible program itself crashes

that is exactly what I meant. Maybe a user kills it, or the controller (e.g. a VM) crashs due to any reason, must not be related to the playbook itself. But in such cases I want to have a propper Ara reporting as well.

If the machine you are running Ansible on crashes, there isn't anything you can do, and current ARA also has the risk of losing results in this case. A task may finish, but if we are still waiting on the POST requests in the callback to finish when the crash happens, then you just lost all results you didn't send yet.

Right.

I might describe ara's attempts to record data as "best effort", to some extent, due to the nature of how the callback interface works. ara's callback gets "woken up" when it receives a hook but if it stops receiving hooks, it can't do anything.
It's why interrupted playbooks and tasks may end up in the "running" status forever because the callback didn't receive the last hook, v2_playbook_on_stats.

There isn't much we can do about that, unfortunately.

I agree that Ansible SHOULD implement the callback plugins in a non-blocking manner, but they don't, so it is up to the callback plugins themselves to not block.

Over the years, ara has helped with finding issues in the Ansible callback interface and while some were fixed, others were closed because fixing them would be too complicated or break stuff.

One of those is about the blocking nature of the callback interface but that was more than 5 years ago now: ansible/ansible#27705

We could revisit that and ask about it upstream but the challenge is that any meaningful change would probably break compatibility and only be available in newer versions of ansible-core since it wouldn't be backported.

Anyway, before we settled on the current threading implementation, every request was synchronous.
This worked reliably well at a small scale but performed poorly at scale because, as you note, ansible would only continue running the playbook once ara's API requests had been sent and returned.

Now, there are still some amount of synchronous requests because, for example, to create a play we need a playbook ID and so we need to make sure the playbook has been created before we try to create the play.

For what it's worth, I went through the exercise of picturing a high level overview of how the sequence currently works from the perspective of the ara callback:
Selection_051

@dmsimard
Copy link
Contributor

Refactoring the callback to be nonblocking addresses more than just performance. The synchronous nature of callbacks in the Ansible architecture means that, in its current form, ARA makes callbacks for things like timing the execution of tasks completely useless. Not only is this bad in itself, but it is also not transparent; people using those callbacks may enable ARA and still be relying on those timings to be accurate representations of the actual task executions times on the hosts, unaware of the slowdown coming from ARA.

I think to call the current durations of tasks completely useless is a bit of a stretch.
Yes, they include the overhead of ara processing the task because ansible waits for the task to be finished -- but this overhead is typically negligible in the grand scheme of things when tasks can last several minutes long.

There are certain tradeoffs in ara for the sake of simplicity and sometimes it comes at the expense of performance.

Don't get me wrong, performance is important and I understand what you mean and how it would be more accurate to record the task duration as 3m 22s instead 3m 22.5s but that's neither useless or the end of the world.

Thus the question of this change is speed vs accuracy. For me speed is important but more important is accuracy. I prefer correct results in Ara then faster playbook runs. (btw that's why I tell everyone in my company to not use the free strategy, because this led to problems in Ara too).

 Avoiding the free strategy because ARA cannot handle it is backwards thinking, imo. If there is a language feature that is perfectly fine by itself but you cannot use it because your logging system cannot handle it, then the problem is with the logging system. ARA should be implemented better. No one should even have to think about including this in their Ansible toolkit and they should not have to change their Ansible playbooks to accommodate ARA.

ara should work reliably with the free strategy. It historically hasn't because I did not personally have a use case for it and so a lot of ara was built around assumptions of running synchronously with the serial strategy.

There are some fixes for the free strategy in 1.6.0 and if there are still issues, we should document and fix them.
I still don't personally have a use case for the free strategy so any help is welcome here.

@erik-overdahl
Copy link
Author

Yes, upon further reflection, I agree with you that ARA calls being reported in execution time is mostly a non-issue. I was thinking of it as reporting the execution time of a task on a host, but that is plainly not the case when the number of hosts is greater than ANSIBLE_FORKS.

I personally think that the performance gains from this patch are worth it, but I understand if you do not want to take on the added complexity. It isn't much more complexity, but it isn't none either. I can try to update that integration test to wait until the record is available, if you like (although I'm not sure of the best way to do this - you can't really retry until the record appears, because the test would never fail if the record wasn't actually there, so then your option is, what, a timeout? but that seems flaky, especially in a test, which you would hope is 100% deterministic).

Perhaps the most useful datapoint on ARA performance is this comment from the issue related to the benchmark blog post which claims that the 30% gains they saw from using Ansible with Mitogen were wiped out by ARA.

I would like to have a wider range of benchmarks. The numbers in the original ARA blogpost make the situation look a lot worse than it actually is. But I think it would be difficult to develop something useful; there are a lot of potential scenarios in a realistic workload.

Re: the free strategy - One of the things I found when I sprinkled logging statements throughout the callback code was that the ARA threading strategy ends up killing and creating a lot of ThreadPoolExecutors as the free strategy calls v2_playbook_on_task_start when it schedules a task on a host. In the ARA callback module, this triggers the _end_task method, as ARA assumes the linear strategy in which all runners for a task complete before the next task begins. Oddly, though... I don't know if this is actually an issue?? Look at the benchmarks in my first post; the free strategy with multiple threads still seems to perform pretty well.

@dmsimard
Copy link
Contributor

Hi @erik-overdahl,

I just wanted to say I haven't forgotten about your PR and that I'll get back to it when the time is right.
I believe the callback is due for a refactor but I don't know what that would look like in practice yet.

The callback has grown organically over the years: it has had features and fixes patched on along some reverse engineering of internal ansible-core's "API"s. It could be a good idea to take a step back, look at what it does and what we've learned over the years to come up with something a bit cleaner.

ara didn't always have an API and the callback used to make database queries directly with flask so a lot has changed since then :p

Performance, threading, queueing or asynchronicity can be part of an eventual refactor but I mean to say that a lot of what the callback does could benefit from being put in a library -- things like creating a playbook, a task, a host or a result, for example.

I want to have a better idea of what this looks like before going down the road of optimizing performance such that we don't step on each other's feet.

I appreciate your patience in the meantime !

@phemmer
Copy link
Contributor

phemmer commented May 6, 2023

If the risk of losing updates on crash is an issue, I think we can address those easily enough.

As @erik-overdahl mentioned we should be able to delay the shutdown until the queue drains. A typical approach I see in other applications (including my own) with similar concerns is that SIGINT/SIGTERM will initiate a graceful shutdown (wait for things to terminate cleanly, and in this case for the queue to drain), and a second SIGINT/SIGTERM causes an immediate exit.

We can also put in a max queue size. For example if we make the max size 0, then the request has to be passed synchronously to the thread, but once the thread has it ansible can continue and the thread would make the API call in the background. Any further attempts to push something onto the queue will block until the API call completes.
This can be customizable, allowing the user to specify how big that max size should be based on how much risk of lost data they're willing to accept. And a max size of -1 would disable async entirely (though might make it more intuitive to shift the values by +1, so 0 would disable async).

I also just want to add my own real world experience. The benchmarks in the original post were overly optimistic for my experience. My playbooks have lots of very short tasks that run across hundreds of hosts. I saw playbooks taking over 3x longer when ARA was enabled. A playbook that took about 7 minutes was taking almost 30 minutes. Though this was a while ago, and I cant remember details. I might have been using django auth, and might have been using sqlite, of which I now use neither, and am using the async wrapper I posted in #171 (comment).

@berendt
Copy link
Contributor

berendt commented Dec 4, 2023

I have tested the Async Wrapper. It does not work for us because it changes the timestamps in ARA (or in other words, these do not correspond to the actual time of execution). But I think the idea of the async wrapper (thanks for sharing @phemmer) is a good one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants