Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RFC: API Change: k_work #28794

Closed
pabigot opened this issue Sep 29, 2020 · 6 comments
Closed

RFC: API Change: k_work #28794

pabigot opened this issue Sep 29, 2020 · 6 comments
Labels
Breaking API Change Breaking changes to stable, public APIs RFC Request For Comments: want input from the community

Comments

@pabigot
Copy link
Collaborator

pabigot commented Sep 29, 2020

On Hold

Pending maintainer response to #28579 (comment)

tl;dr This RFC addresses the requirement to notify the community in the case of changes to stable API that might require changes to existing code.

Problem Description

As identified in #27356 and the issues and comments it links to, Zephyr's k_work API has a variety of race conditions that prevent users from being able to use it to determine whether a work item is delayed, pending, in progress, or completed. This has resulted in a few bugs, and in a mistaken assumption that the results returned by API functions can be used to infer the work item state.

They cannot: work items by their nature proceed asynchronously based on operations initiated from threads and interrupts of various priorities, and by the time a return value is made available to a caller the state may have changed.

The goal is to revise the API slightly to conform to its documented behavior and original architecture. As a consequence the several bugs have been fixed and some behavior and return values changed.

Proposed change

k_delayed_work_cancel() will no longer remove an item submitted through k_delayed_work_submit_to_queue() if the item has progress to where it has been made pending (i.e. the delay has elapsed and k_work_submit_to_queue() has been invoked).

-EALREADY is no longer an expected return value from k_delayed_work_submit_to_queue().

-EALREADY is no longer an expected return value from k_delayed_work_cancel(). EINPROGRESS has been added as a return value. The interpretation of -EINVAL has changed.

In this case return values from k_delayed_work functions have changed along with behavior related to canceling delayed work items.

The goal is to make the implementation consistent with the API documentation, and to represent as accurately as possible the work item state, while reducing complexity and bugs.

Detailed RFC

The changes have been implemented in #28579.

The core behavior change is that k_delayed_work_cancel() will no longer remove an item submitted through k_delayed_work_submit_to_queue() if the item has progress to where it has been made pending (i.e. the delay has elapsed and k_work_submit_to_queue() has been invoked).

This path has been chosen because:

  • The API documentation has always indicated that cancellation was allowed only while the delay was pending;
  • There is no corresponding ability to cancel an item submitted by k_work_submit_to_queue(), making this feature specific to delayed work items.

Dependencies

Users of the non-delayed k_work API will see no change.

Most users of the k_delayed_work API will see no change.

Code that uses the return value of k_delayed_work_cancel() to infer whether a task had been scheduled but had not completed will need to rework the code to track task completion within the task state, including detecting that the task was completed externally prior to the work item being processed. #28579 demonstrates an alternative approach for the in-tree use of this idiom in the Bluetooth subsystem.

Concerns and Unresolved Questions

The ability to remove an item from a work queue was motivated by cases where a task needed to be completed synchronously and immediately. However a task may be in a state where it can't be cancelled. The remediation necessary to handle that case also handles the case where it could have been cancelled, reducing the value of cancellation. That remediation requires that the code providing the work item handler track the state of the work.

This concern is being discussed at #28579 (review) and other comments.

Alternatives Considered

A more complex use of atomic flags could be used to trace the work item's progress through the states identified in #27356, but doing so would be very complex and would not change the fundamental issue that API return values are stale representations of work item state.

@pabigot pabigot added RFC Request For Comments: want input from the community Breaking API Change Breaking changes to stable, public APIs labels Sep 29, 2020
@pdunaj
Copy link
Collaborator

pdunaj commented Sep 30, 2020

This change will be extremally troublesome.
The fact that k_delayed_work_cancel was able to both cancel countdown and unsubmit the work from the queue was very useful. The race is not the problem if all activities are performed from a workqueue thread context. Our application use that in multiple places as being able to synchronously unsubmit greatly simplifies design.

I will not argue with the fact that documentation was clear about it but you plan on taking away a useful feature. If keeping and documenting existing behavior is not an option can you add an API call that would perform cancel&ubsubmit mentioning that it requires to be performed in a right context?

@jukkar
Copy link
Member

jukkar commented Sep 30, 2020

Not sure if it belong to this PR, but the k_work_submit_to_queue() API is currently not returning anything which is problematic as it might not actually submit anything to queue if there is already a work pending. So the caller needs to call k_work_pending() to check whether the work was submitted or not. It would be more practical if the k_work_submit_to_queue() function would return this status to the caller.

@pabigot
Copy link
Collaborator Author

pabigot commented Sep 30, 2020

Not sure if it belong to this PR, but the k_work_submit_to_queue() API is currently not returning anything which is problematic as it might not actually submit anything to queue if there is already a work pending. So the caller needs to call k_work_pending() to check whether the work was submitted or not. It would be more practical if the k_work_submit_to_queue() function would return this status to the caller.

This doesn't make sense to me, but I think the misunderstanding is one I originally had too:

k_work_submit_to_queue() can't fail. If the work item is already submitted, it stays submitted. If it isn't already submitted, it will be submitted.

The whole point of the discussion in #22826 (review) is that k_work_pending()'s information is stale the instant it returns. Without accurate knowledge of relative thread priority and interrupt behavior the work item may be newly pending, or may have started being worked and no longer be pending, by the time the code tries to react to that setting.

@jukkar
Copy link
Member

jukkar commented Sep 30, 2020

k_work_submit_to_queue() can't fail. If the work item is already submitted, it stays submitted. If it isn't already submitted, it will be submitted.

Whether the work submit can fail or not was not the point I tried to make, it just would be practical for the caller to know if the work was submitted or not. The code path in the caller might be different according to this info. See example in subsys/net/ip/net_tc.c:net_tc_submit_to_tx_queue()

@pabigot
Copy link
Collaborator Author

pabigot commented Sep 30, 2020

Personally I like being able to use non-negative results to distinguish success states, so I'd be fine with having k_work_submit_to_queue() return 0 if the work item was newly submitted, and 1 if it had already been submitted and was still pending. That would need to go through an enhancement issue; it's not related to this.

I don't think it'd help you though. A return value from k_work_submit_to_queue() also can't tell you whether the work item is being processed, only that its pending bit hadn't been cleared.

net_tc_submit_to_tx_queue, is an example of the sort of potential race we have to deal with: k_work_pending() will return true if the work item hasn't been started yet, but it will return false if it has been started and is still running. In this case you've got local work queues that are known to use cooperative processing, so on a uniprocessor system if net_tc_submit_to_tx_queue() is invoked from cooperative thread context I think it's safe. If it's invoked from interrupt context, or on a multiprocessor, it might not be safe.

Note: I'm not saying the net code is wrong. I'm saying it's not obvious that it's correct.

In the general case you cannot determine that a work item has completed its task based on the return value of k_work_pending(). That information can only be reliably obtained by having the work handler mark completion in a parent object that owns the work item, or if you are absolutely certain of the contexts in which all work queue manipulation is done (and that those contexts will never change, e.g. by moving to an SMP platform).

@pabigot
Copy link
Collaborator Author

pabigot commented Oct 4, 2020

No longer relevant. The existing API will be retained with documentation explaining its limitations, and new API added that doesn't have those limitations.

See #28579 (comment)

@pabigot pabigot closed this as completed Oct 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Breaking API Change Breaking changes to stable, public APIs RFC Request For Comments: want input from the community
Projects
None yet
Development

No branches or pull requests

3 participants