
Improved logic for worker connections #196

Closed
nathangoulding opened this issue Jun 29, 2020 · 3 comments · Fixed by #216
Labels
priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now.

Comments

nathangoulding (Contributor) commented Jun 29, 2020

Description

There are a couple of issues that need resolution:

  • Workers connect to tink and are supposed to sleep for 3 seconds between "do I have more work" checks. The sleep logic is buggy, however, and results in thousands of lines of log messages being dumped as the worker continually retries. Solution: switch to a gRPC stream instead of a 3-second sleep.
  • When fetching workflows, failed and completed workflows are returned; tink should only return workflows that are in progress or pending.

Why is this needed

  • The buggy sleep floods the logs, crashing users' terminal sessions, overflowing RAM/disk, etc.

Checklist:

I have:

  • updated the documentation and/or roadmap (if required)
  • added unit or e2e tests
  • provided instructions on how to upgrade
@mrmrcoleman added the priority/critical-urgent label Jul 7, 2020
@parauliya parauliya changed the title improved logic for worker connections Improved logic for worker connections Jul 8, 2020
parauliya pushed a commit to infracloudio/tink that referenced this issue Jul 13, 2020
…r by server

There is a bit of change in worker as well:
1. Changed Info logs to Debug during sleep
2. Changed a bit of logic of starting the workflow action to execute
parauliya (Contributor) commented:
Hi @nathangoulding,
There are a few things I would like to discuss:

  1. I think adding a gRPC stream between tink-worker and tink-server is not a good solution, for the following reasons:

         - It converts the protocol between worker and server from stateless to stateful, adding conditions based on stream connectivity.
         - If there are multiple workers, the server has to maintain a session for each worker.
         - It complicates the interaction between tink-worker and tink-server.
         - Performance-wise, a stream does not give any benefit.

  2. For this issue we can make a few optimizations and fix the sleep so that the terminal is not filled with too many logs.

  3. When fetching workflows, tink should only return workflows that are in a pending or in-progress state, and only those belonging to that particular worker. This change is very much required on the server side.

  4. There is one more thing, not covered by this issue, that we are planning: in the current implementation, the tink-worker container on the worker machine exits as soon as it has executed all the tasks assigned to it. We would like that container to keep running and polling the server for new workflows created for it, so that it starts executing them automatically. I would like to raise a separate issue for this enhancement and work on it after this one is fixed.

Please let me know your views.

nathangoulding (Contributor, author) commented Jul 13, 2020

Responses inline:

> This will convert the protocol between worker and server from stateless to stateful which will add the condition based on stream connectivity.

If there is an established session it will use it; otherwise, it will wait for a worker to fetch it like it does today.

> If there are multiple workers then server has to maintain sessions for each worker.

Correct, this is how it works with osie-runner and hegel in production today.

> This will complicate things between tink-worker and tink-server.

I'm not sure I understand this justification.

> Performance wise stream does not give any benefit.

The overhead that will be eliminated by converting from polling to a gRPC stream is:

  • Every polling request that returns no actions
  • TCP+SSL connection establishment and teardown

In addition, actions are started immediately instead of delayed until after the polling period.

nathangoulding (Contributor, author) commented:

Responses to the bullets inline:

> For this issue we can make a few optimizations and fix the sleep so that the terminal is not filled with too many logs.

We should convert to a gRPC stream, not use polling.

> When fetching workflows, tink should only return workflows that are in a pending or in-progress state, and only those belonging to that particular worker. This change is very much required on the server side.

Yes, correct.

> There is one more thing, not covered by this issue, that we are planning: in the current implementation, the tink-worker container on the worker machine exits as soon as it has executed all the tasks assigned to it. We would like that container to keep running and polling the server for new workflows created for it, so that it starts executing them automatically. I would like to raise a separate issue for this enhancement and work on it after this one is fixed.

I'm fine breaking this out into a separate task, i.e. the tink-worker container adds a parameter that causes it either to exit on completion or to wait for additional actions.
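A sketch of such a parameter, assuming a hypothetical --exit-on-completion flag (the real option name and default would be decided in the follow-up issue):

```go
package main

import (
	"flag"
	"fmt"
)

// mode describes the worker's behaviour after its last action.
func mode(exitOnDone bool) string {
	if exitOnDone {
		return "exit after last action"
	}
	return "keep polling for new workflows"
}

func main() {
	// Hypothetical flag name -- illustrative only, not tink-worker's CLI.
	exitOnDone := flag.Bool("exit-on-completion", true,
		"exit once all assigned workflow actions have been executed")
	flag.Parse()
	fmt.Println("worker will", mode(*exitOnDone))
}
```

With the flag set to false, the main loop would simply return to its fetch-workflows step instead of terminating the container.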

parauliya pushed a commit to infracloudio/tink that referenced this issue Jul 16, 2020
parauliya pushed a commit to infracloudio/tink that referenced this issue Jul 17, 2020
parauliya pushed a commit to infracloudio/tink that referenced this issue Jul 17, 2020
parauliya pushed a commit to infracloudio/tink that referenced this issue Jul 17, 2020
parauliya pushed a commit to infracloudio/tink that referenced this issue Jul 17, 2020
gauravgahlot pushed a commit to infracloudio/tink that referenced this issue Jul 20, 2020
parauliya pushed a commit to infracloudio/tink that referenced this issue Jul 20, 2020
parauliya pushed a commit to infracloudio/tink that referenced this issue Jul 29, 2020
parauliya pushed a commit to infracloudio/tink that referenced this issue Jul 30, 2020
parauliya added a commit that referenced this issue Aug 4, 2020
Fix Issue #196: Filtered the workflow context sent to worker by server

3 participants