Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add job status endpoint to server #263

Merged
merged 5 commits into from
Jul 26, 2024
Merged

Add job status endpoint to server #263

merged 5 commits into from
Jul 26, 2024

Conversation

val500
Copy link
Contributor

@val500 val500 commented Apr 29, 2024

Description

This adds an endpoint to Testflinger server that receives status updates from Testflinger agents. These updates require a webhook that the server will forward the rest of the update message to.

Documentation

Documentation was added to job_schema.rst to reflect new job_status_webhook option in the job yaml. The server ReadMe was updated with details on the new status endpoint.

Web service API changes

  • Interface: [POST] /v1/status

  • Request Structure:

    {
        "agent_id": "<string>",
        "job_queue": "<string>",
        "job_status_webhook": "<URL as string>",
        "events": [
            {
            "event_name": "<string enum of events>",
            "timestamp": "<string>",
            "detail_msg": "<string>"
            },
            ...
         ]
    }
    
  • Response Status Codes:

    • HTTP 200 (OK)
    • HTTP 400 (Bad request) - The arguments could not be processed by the server
    • HTTP 504 (Gateway Timeout) - The webhook URL timed out
  • Rationale: This endpoint receives status updates from the agent if there is a webhook specified in the job yaml and posts the update to this webhook.

Tests

For the server, unit tests were added to test_v1.py. For the agents, unit tests were added to both test_client.py and test_agent.py to test both webhook posting behavior and overall status updating.

@val500 val500 requested a review from a team April 29, 2024 16:03
@mz2
Copy link
Collaborator

mz2 commented May 2, 2024

@val500 @plars please consider documentation in all implementation changes: if it's the case that documentation is missing on the topic affected altogether, consider it a part of the scope to address that too (so we amortise that implementation cost across PRs as opposed to expect some future documentation catch-up).

Copy link

@omar-selo omar-selo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Main concern is reliability, as in I wonder if we should add this to a queue rather than forward immediately

server/src/api/v1.py Outdated Show resolved Hide resolved
server/src/api/v1.py Outdated Show resolved Hide resolved
@val500 val500 force-pushed the agent_status_endpoint branch from 3db4ff7 to 570769d Compare May 3, 2024 20:43
Copy link
Collaborator

@plars plars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like you're missing the enums.py at the very least. Can you add that here or tell me where to find it if it's external?

@val500 val500 force-pushed the agent_status_endpoint branch from 570769d to 71cd870 Compare May 13, 2024 13:41
@val500 val500 marked this pull request as draft May 16, 2024 08:29
@val500 val500 force-pushed the agent_status_endpoint branch 3 times, most recently from 13d8725 to 3499736 Compare May 24, 2024 14:32
@val500 val500 marked this pull request as ready for review May 24, 2024 16:11
docs/reference/job-schema.rst Show resolved Hide resolved
common/README.rst Outdated Show resolved Hide resolved
agent/testflinger_agent/agent.py Show resolved Hide resolved
agent/testflinger_agent/client.py Outdated Show resolved Hide resolved
server/src/api/schemas.py Show resolved Hide resolved
agent/testflinger_agent/client.py Outdated Show resolved Hide resolved
agent/testflinger_agent/job.py Show resolved Hide resolved
agent/testflinger_agent/stop_condition_checkers.py Outdated Show resolved Hide resolved
server/src/api/v1.py Outdated Show resolved Hide resolved
@val500 val500 force-pushed the agent_status_endpoint branch from 3499736 to 00e7ce1 Compare June 4, 2024 20:01
@val500 val500 force-pushed the agent_status_endpoint branch 2 times, most recently from 343ea2a to ebd5f7c Compare June 7, 2024 19:52
@val500 val500 requested review from plars and omar-selo June 7, 2024 19:55
@val500 val500 force-pushed the agent_status_endpoint branch from ebd5f7c to f22ca7b Compare June 7, 2024 20:08
omar-selo
omar-selo previously approved these changes Jun 10, 2024
Copy link

@omar-selo omar-selo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did have a few comments left, but to avoid blocking anything I'll approve. As I think this is a good first step

Copy link
Collaborator

@plars plars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a few more comments below. I think it would be nice to add a bit more on the unit tests, especially on the possible error/failure conditions. And I would definitely recommend and end-to-end test of this now that you have the test-observer side prototyped.

agent/testflinger_agent/agent.py Outdated Show resolved Hide resolved
agent/testflinger_agent/agent.py Outdated Show resolved Hide resolved
agent/testflinger_agent/client.py Outdated Show resolved Hide resolved
agent/testflinger_agent/event_emitter.py Outdated Show resolved Hide resolved
agent/testflinger_agent/runner.py Outdated Show resolved Hide resolved
server/src/api/v1.py Outdated Show resolved Hide resolved
@omar-selo
Copy link

omar-selo commented Jun 11, 2024

It occurred to me that it would be useful for TO if TF informs it about the last event. As in TF can mark an event as last to inform listeners that there will be no more events emitted after it. That way TO can say that this test execution has ended. Otherwise TO will need to know in detail which events would be final events. I prefer if TO doesn't actually know what the events mean for now. wdyt @plars @val500 ?

Update:
Just had a meeting with Andrej and Matias about these events. Here are some notes:

  • Would be great to have traceability, such that we can get to a job from TO through the events. This could be in the form of a url.
  • TO needs more information about the events, namely it needs to know if an event is the last event, and if an event is a failure/cancellation. We don't want to deduce these on TO by the event name because that couples TO to TF. Instead it would be useful if along with every event perhaps a couple of booleans are set, something like isFinal and isFailure.

wdyt @plars @val500

@val500
Copy link
Contributor Author

val500 commented Jun 11, 2024

It occurred to me that it would be useful for TO if TF informs it about the last event. As in TF can mark an event as last to inform listeners that there will be no more events emitted after it. That way TO can say that this test execution has ended. Otherwise TO will need to know in detail which events would be final events. I prefer if TO doesn't actually know what the events mean for now. wdyt @plars @val500 ?

Update: Just had a meeting with Andrej and Matias about these events. Here are some notes:

* Would be great to have traceability, such that we can get to a job from TO through the events. This could be in the form of a url.

* TO needs more information about the events, namely it needs to know if an event is the last event, and if an event is a failure/cancellation. We don't want to deduce these on TO by the event name because that couples TO to TF. Instead it would be useful if along with every event perhaps a couple of booleans are set, something like `isFinal` and `isFailure`.

wdyt @plars @val500

@omar-selo I think every job will end with the cleanup phase regardless of whether the job was cancelled or something threw an error, so the last event will always be ended_cleanup - @plars can confirm

@plars
Copy link
Collaborator

plars commented Jun 11, 2024

@omar-selo I think every job will end with the cleanup phase regardless of whether the job was cancelled or something threw an error, so the last event will always be ended_cleanup - @plars can confirm

That's true for testflinger specifically, but the idea is to make it more generic on both sides. We could set something like a boolean "IsFinal" but for us, it would always be on that cleanup event because that's the last thing that would happen, regardless of whether any part of the job failed.

However, that means that something like an event showing that the job timed out, cancelled, etc... would have already happened. If this is what you are really interested in, then perhaps we should mark those with an "IsTerminal" event for event types that will wind up terminating the whole job? would that be enough for you to pick those out as the reason for the overall job termination?

agent/testflinger_agent/agent.py Outdated Show resolved Hide resolved
agent/testflinger_agent/client.py Outdated Show resolved Hide resolved
@val500 val500 force-pushed the agent_status_endpoint branch 2 times, most recently from 07bd454 to b5e77c9 Compare June 17, 2024 15:54
Copy link
Collaborator

@plars plars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the updates! we should definitely go ahead and implement the things agreed to on the card unless you have further input on it. A few more comments, but looking better :)

common/testflinger_common/enums.py Outdated Show resolved Hide resolved
agent/testflinger_agent/agent.py Outdated Show resolved Hide resolved
agent/testflinger_agent/runner.py Outdated Show resolved Hide resolved
agent/testflinger_agent/runner.py Outdated Show resolved Hide resolved
@val500 val500 force-pushed the agent_status_endpoint branch from b5e77c9 to 7500809 Compare June 24, 2024 16:17
Copy link
Collaborator

@plars plars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meeting notes: need resource_url added - we discussed adding a JOB_START event at the beginning like the JOB_END event we have at the end. The detail section for that JOB_START event can contain the resource_url which would just be https://testflinger.canonical.com/jobs/<JOB_ID>. That isn't going to change throughout the job, so there's no need to send it with every single event.

common/testflinger_common/enums.py Outdated Show resolved Hide resolved
server/src/api/v1.py Outdated Show resolved Hide resolved
server/README.rst Outdated Show resolved Hide resolved
@val500 val500 force-pushed the agent_status_endpoint branch from 7500809 to c744181 Compare July 11, 2024 22:40
@val500 val500 force-pushed the agent_status_endpoint branch from c744181 to b074c27 Compare July 11, 2024 22:54
@val500 val500 force-pushed the agent_status_endpoint branch from 371a33b to 8a469fc Compare July 12, 2024 21:28
@val500 val500 requested a review from plars July 12, 2024 21:32
Copy link
Collaborator

@plars plars left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it should be good, and thanks for the video showing your testing of it against the TO changes on the other PR.

As discussed yesterday, please wait until the TO changes land so that we can land these PRs at the same time, and test them in staging together.

@val500 val500 merged commit c0f59b1 into main Jul 26, 2024
6 checks passed
@val500 val500 deleted the agent_status_endpoint branch July 26, 2024 18:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants