Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

netdev_event_thread: add support for event_thread irq handler #13573

Closed
wants to merge 5 commits into from

Conversation

jia200x
Copy link
Member

@jia200x jia200x commented Mar 6, 2020

Contribution description

As promised in #13562, this PR adds an optional event_thread based IRQ handler.

Now it's possible to write radio applications without an explicit recv thread for calling dev->driver->isr. Also, network stacks are not required to handle their radio events if this module is enabled, so the glue code only requires an implementation of event_callback. With some minor tricks we could easily expand the device support for LWIP and OpenThread.

As an example, the `tests/driver_at86rf2xx` looks like this:
/* ... */
static void _event_cb(netdev_t *dev, netdev_event_t event)
{
    switch (event) {
        case NETDEV_EVENT_RX_COMPLETE:
        {
            recv(dev);

            break;
        }
        default:
            puts("Unexpected event received");
            break;
    }
}

int main(void)
{
    puts("AT86RF2xx device driver test");

    unsigned dev_success = 0;
    for (unsigned i = 0; i < AT86RF2XX_NUM; i++) {
        netopt_enable_t en = NETOPT_ENABLE;
        const at86rf2xx_params_t *p = &at86rf2xx_params[i];
        netdev_t *dev = (netdev_t *)(&devs[i]);

        printf("Initializing AT86RF2xx radio at SPI_%d\n", p->spi);
        at86rf2xx_setup(&devs[i], p);

        /* Just call this and voilà! */
        netdev_event_thread_init(dev);

        dev->event_callback = _event_cb;
        if (dev->driver->init(dev) < 0) {
            continue;
        }
        dev->driver->set(dev, NETOPT_RX_END_IRQ, &en, sizeof(en));
        dev_success++;
    }

    if (!dev_success) {
        puts("No device could be initialized");
        return 1;
    }

    /* start the shell */
    puts("Initialization successful - starting the shell now");

    char line_buf[SHELL_DEFAULT_BUFSIZE];
    shell_run(shell_commands, line_buf, SHELL_DEFAULT_BUFSIZE);

    return 0;
}

Testing procedure

I used this patch to run tests/driver_at86rf2xx with 2 samr21-xpro boards:

Patch
diff --git a/tests/driver_at86rf2xx/Makefile b/tests/driver_at86rf2xx/Makefile
index 76daf84719..d4578e841a 100644
--- a/tests/driver_at86rf2xx/Makefile
+++ b/tests/driver_at86rf2xx/Makefile
@@ -6,6 +6,7 @@ USEMODULE += od
 USEMODULE += shell
 USEMODULE += shell_commands
 USEMODULE += ps
+USEMODULE += netdev_event_thread
 
 # define the driver to be used for selected boards
 ifneq (,$(filter samr21-xpro,$(BOARD)))
diff --git a/tests/driver_at86rf2xx/main.c b/tests/driver_at86rf2xx/main.c
index 7edab95bb5..9acd2754b3 100644
--- a/tests/driver_at86rf2xx/main.c
+++ b/tests/driver_at86rf2xx/main.c
@@ -31,9 +31,6 @@
 #define _STACKSIZE      (THREAD_STACKSIZE_DEFAULT + THREAD_EXTRA_STACKSIZE_PRINTF)
 #define MSG_TYPE_ISR    (0x3456)
 
-static char stack[_STACKSIZE];
-static kernel_pid_t _recv_pid;
-
 at86rf2xx_t devs[AT86RF2XX_NUM];
 
 static const shell_command_t shell_commands[] = {
@@ -44,44 +41,16 @@ static const shell_command_t shell_commands[] = {
 
 static void _event_cb(netdev_t *dev, netdev_event_t event)
 {
-    if (event == NETDEV_EVENT_ISR) {
-        msg_t msg;
-
-        msg.type = MSG_TYPE_ISR;
-        msg.content.ptr = dev;
-
-        if (msg_send(&msg, _recv_pid) <= 0) {
-            puts("gnrc_netdev: possibly lost interrupt.");
-        }
-    }
-    else {
-        switch (event) {
-            case NETDEV_EVENT_RX_COMPLETE:
-            {
-                recv(dev);
-
-                break;
-            }
-            default:
-                puts("Unexpected event received");
-                break;
-        }
-    }
-}
+    switch (event) {
+        case NETDEV_EVENT_RX_COMPLETE:
+        {
+            recv(dev);
 
-void *_recv_thread(void *arg)
-{
-    (void)arg;
-    while (1) {
-        msg_t msg;
-        msg_receive(&msg);
-        if (msg.type == MSG_TYPE_ISR) {
-            netdev_t *dev = msg.content.ptr;
-            dev->driver->isr(dev);
-        }
-        else {
-            puts("unexpected message type");
+            break;
         }
+        default:
+            puts("Unexpected event received");
+            break;
     }
 }
 
@@ -97,6 +66,7 @@ int main(void)
 
         printf("Initializing AT86RF2xx radio at SPI_%d\n", p->spi);
         at86rf2xx_setup(&devs[i], p);
+        netdev_event_thread_init(dev);
         dev->event_callback = _event_cb;
         if (dev->driver->init(dev) < 0) {
             continue;
@@ -110,15 +80,6 @@ int main(void)
         return 1;
     }
 
-    _recv_pid = thread_create(stack, sizeof(stack), THREAD_PRIORITY_MAIN - 1,
-                              THREAD_CREATE_STACKTEST, _recv_thread, NULL,
-                              "recv_thread");
-
-    if (_recv_pid <= KERNEL_PID_UNDEF) {
-        puts("Creation of receiver thread failed");
-        return 1;
-    }
-
     /* start the shell */
     puts("Initialization successful - starting the shell now");
 

Issues/PRs references

Requires #13565 and #13562

@jia200x jia200x added Area: network Area: Networking Type: new feature The issue requests / The PR implemements a new feature for RIOT State: waiting for other PR State: The PR requires another PR to be merged first Area: drivers Area: Device drivers labels Mar 6, 2020
@jia200x jia200x requested a review from benpicco March 6, 2020 13:00
@jia200x
Copy link
Member Author

jia200x commented Mar 6, 2020

I will introduce a netdev_init wrapper to automatically init this (if available), call dev->driver->init, etc.

@kaspar030
Copy link
Contributor

needs rebase

@jia200x
Copy link
Member Author

jia200x commented Mar 6, 2020

needs rebase

Yup. I will rebase when #13565 and #13562 are merged.

@kaspar030
Copy link
Contributor

Yup. I will rebase when #13565 and #13562 are merged.

Yould you giv it a ninbetween rebase now that #13562 is merged?

@jia200x jia200x force-pushed the pr/netdev_event_thread branch from 5af0690 to f448c07 Compare March 9, 2020 10:24
@bergzand
Copy link
Member

@jia200x This PR allows the netdev code to move the netdev::isr call to a separate thread. How are race conditions between the netif and the event thread prevented? From what I see, at least the at86rf2xx driver modifies the driver state during the netdev::isr call (in the at86rf2xx_set_state() function) and this is not guarded.

If you want to continue with this I think we should first make sure that any driver context modifications are guarded. Furthermore I think there should also be a modification to the netdev documentation to point out that the netdev::isr function can be called from a different thread and that the necessary measures should be in place.

@bergzand
Copy link
Member

@jia200x Do you have any measurements on what the impact is on performance when netdev::isr is handled in a separate thread? Both on flash/ram and radio reception to the end of the netif recv latency?

@jia200x
Copy link
Member Author

jia200x commented Mar 12, 2020

@jia200x This PR allows the netdev code to move the netdev::isr call to a separate thread. How are race conditions between the netif and the event thread prevented? From what I see, at least the at86rf2xx driver modifies the driver state during the netdev::isr call (in the at86rf2xx_set_state() function) and this is not guarded.

This problem pops up because most of our drivers are not thread safe :/
The only way to solve this would be to design them to be like that.

I agree, that in the current state we cannot just plug in a netif on top.

If you want to continue with this I think we should first make sure that any driver context modifications are guarded. Furthermore I think there should also be a modification to the netdev documentation to point out that the netdev::isr function can be called from a different thread and that the necessary measures should be in place.

Maybe we should also start making the drivers thread safe. I know some of them are (ethernet devices).

@jia200x
Copy link
Member Author

jia200x commented Mar 12, 2020

@jia200x Do you have any measurements on what the impact is on performance when netdev::isr is handled in a separate thread? Both on flash/ram and radio reception to the end of the netif recv latency?

It's hard to do a fair comparison now (since the standard "recv" thread is heavier than the event_thread thread), but I wouldn't expect big differences (besides the fact you cannot miss an event_post and the fact they are more efficient compared to IPC messages).

Radio reception shouldn't be affected, since it's the same principle (a thread wakes up ASAP to process the radio ISR)

@bergzand
Copy link
Member

This problem pops up because most of our drivers are not thread safe :/
The only way to solve this would be to design them to be like that.

Yeah, there was never a need to design them in a thread safe way. It would be good to have a rough idea of the impact here as making them thread safe is not free.

I agree, that in the current state we cannot just plug in a netif on top.

I would phrase this as that we can't plug this PR on top of netdev in the current state, but yeah. Do you want me to block this to prevent accidental merges?

It's hard to do a fair comparison now (since the standard "recv" thread is heavier than the event_thread thread), but I wouldn't expect big differences (besides the fact you cannot miss an event_post and the fact they are more efficient compared to IPC messages).

I still think it is good to have this confirmed. Current netif uses messages, but it could easily use events for the netdev::isr part and reuse the bits here (but keep it in a single thread). That might even be a good intermediate step which doesn't require thread safe drivers (except for the actual irq part, but that's already covered).

Radio reception shouldn't be affected, since it's the same principle (a thread wakes up ASAP to process the radio ISR)

Well, in this case I think it could be interesting to get some performance measurements in situations where two network interfaces (or any two drivers) are competing for attention from this event thread. So is there a worst case situation for the event handler here and how does it perform compared to the current implementation. This way we can determine whether it is worth the additional overhead, same for making the drivers thread safe.

@miri64
Copy link
Member

miri64 commented Mar 12, 2020

I still think it is good to have this confirmed. Current netif uses messages, but it could easily use events for the netdev::isr part and reuse the bits here (but keep it in a single thread). That might even be a good intermediate step which doesn't require thread safe drivers (except for the actual irq part, but that's already covered).

See #9326

@bergzand
Copy link
Member

bergzand commented Mar 12, 2020

a thread wakes up ASAP to process the radio ISR

In case of the event thread architecture, this is not necessarily true and it is possible that the event thread does not pre-empt the 'normal' netif thread to handle the netdev::isr event.

In a situation with two network interfaces, the event thread could be blocked on a mutex lock because it was triggered while that mutex was already locked (for example during a relative long frame buffer load of an SPI-connected radio). This means that the event thread is blocked on the mutex waiting for a lower priority thread to exit the critical section. In this situation if a second event (from a different radio) would be submitted, it would not be handled 'ASAP' but after the first event is done. Now assuming both radio's have an identical priority this should . But if there is a situation where these two threads do not have equal priority, the netif thread could be scheduled first to handle a new netdev::send request from a higher layer.

The second netif thread here is only an example and this could be swapped for anything using the same event thread, such as a graphics stack (lvgl) requiring a callback after rendering is done or a high priority sensor read. This example applies to any shared usage of the event thread which includes the usage of blocking functions in the event handlers.

I can draw this example in a timing schedule as I can imagine that it is not entirely clear. Let me know if you want me to do that.

Edit: okay, maybe it is still ASAP, but it might not be the same ASAP you're thinking of here.

@benpicco
Copy link
Contributor

Mind you that currently it's the responsibility of the radio driver to make calls to send() block while a transmission is ongoing, so this would also be an issue with multiple interfaces.

(Ideally the driver would return -EBUSY and the upper layer would would queue the packet and transmit it when it receives the TX complete event from the driver)

@jia200x
Copy link
Member Author

jia200x commented Mar 12, 2020

In a situation with two network interfaces, the event thread could be blocked on a mutex lock because it was triggered while that mutex was already locked (for example during a relative long frame buffer load of an SPI-connected radio). This means that the event thread is blocked on the mutex waiting for a lower priority thread to exit the critical section. In this situation if a second event (from a different radio) would be submitted, it would not be handled 'ASAP' but after the first event is done. Now assuming both radio's have an identical priority this should . But if there is a situation where these two threads do not have equal priority, the netif thread could be scheduled first to handle a new netdev::send request from a higher layer.

But if you are loading a packet, it's because you already put your radio in "TX_ON" mode. That's why the PD-DATA command of the IEEE802.15.4 PHY specification (section 6.2.1.1), requires a PLME.SET-TRX-STATE to TX_ON before.
So, the ISR won't be triggered randomly.

As @benpicco describes, the MAC layer should queue the packet and retransmit if the radio was receiving.

Also, ideally the radio won't be in RX_ON or TRX_OFF if it didn't receive a TX_DONE event (the radio will wait until the state transition, so there aren't spurious ISRs).

Basically, we can avoid all kind of priority inversion if we are careful enough

@jia200x
Copy link
Member Author

jia200x commented Mar 12, 2020

Anyway, we can evaluate the effect of thread safe drivers and several devices being processed by the same event_thread

@jia200x
Copy link
Member Author

jia200x commented Apr 3, 2020

On seconds thoughts, the whole problem would be solved if the scheduler is not affected by priority inversion

@kaspar030
Copy link
Contributor

On seconds thoughts, the whole problem would be solved if the scheduler is not affected by priority inversion

You mean for example, when the radio SPI is also used by a low-priority driver thread? Otherwise, which mutex would the radio event handler be blocked on that would benefit from priority inversion?

@jia200x
Copy link
Member Author

jia200x commented Apr 3, 2020

You mean for example, when the radio SPI is also used by a low-priority driver thread? Otherwise, which mutex would the radio event handler be blocked on that would benefit from priority inversion?

Let's say you have 3 threads A, B, C (A high prio, B mid, C low).

C uses the SPI configures the HW address of the radio. An interrupt from the radio arrives so A (e.g event_thread) tries to access the radio, which is blocked by C. So far so good.

The problem happens when B gets control and starts doing a long task (let's say there's a 5 seconds spin lock). C won't get back the control until B finishes. So, A will be locked until B finishes.

When B finishes, C will release the mutex and A will be released. But the event thread was locked all the time

@jia200x
Copy link
Member Author

jia200x commented Apr 3, 2020

It' could be ok that A waits for some time, but what shouldn't be tolerable is the fact that A can be blocked by B

@kaspar030
Copy link
Contributor

kaspar030 commented Apr 8, 2020

Let's say you have 3 threads A, B, C (A high prio, B mid, C low).

C uses the SPI configures the HW address of the radio. An interrupt from the radio arrives so A (e.g event_thread) tries to access the radio, which is blocked by C. So far so good.

The problem happens when B gets control and starts doing a long task (let's say there's a 5 seconds spin lock). C won't get back the control until B finishes. So, A will be locked until B finishes.

When B finishes, C will release the mutex and A will be released. But the event thread was locked all the time

Hm, if B has a higher priority than C, it might have more important stuff to do than B C. RIOT's scheduler guarantees that in that case, B stays active. Temporarily giving C the priority of A messes with that guarantee.

C should send something to A, asking it to please "use the SPI to configure the HW address of the radio" when there is time.

Or, C should have higher priority than B.

@jia200x
Copy link
Member Author

jia200x commented Apr 8, 2020

Hm, if B has a higher priority than C, it might have more important stuff to do than B C. RIOT's scheduler guarantees that in that case, B stays active. Temporarily giving C the priority of A messes with that guarantee.

This is not entirely true. B was defined on purpose as a thread with lower priority than C. There was a reason why the code in B was not in C.
Actually what is happening here is a violation of the priority model. A lower priority thread (B) takes the control over the highest priority thread.

C should send something to A, asking it to please "use the SPI to configure the HW address of the radio" when there is time.

This is the way we handle things now. Are we OK with this procedure?, considering that:

  • There's an overhead of passing messages to a higher priority thread (runtime cost and memory consumption)
  • It makes things more complicated (a user could simply cally a "send" function from the thread instead of requiring an interface to pass instructions to the SPI

Or, C should have higher priority than B.

E.g C is the IRQ handler, B is a beacon sender (with timers) and A is the main thread. You cannot give more priority to B in this case

@jia200x
Copy link
Member Author

jia200x commented Apr 8, 2020

Note this is regardless of the outcome of #9379, but to me this IS indeed a use case for #7445

@kaspar030
Copy link
Contributor

E.g C is the IRQ handler, B is a beacon sender (with timers) and A is the main thread. You cannot give more priority to B in this case

I think you mixed the priorities, in your previous example, "A" had the highest. I think you mean this:

E.g A is the IRQ handler, B is a beacon sender (with timers) and C is the main thread. You cannot give more priority to B in this case

The issue was A waiting on a resource shared with C, and B jumping in, causing C and thus A to wait. If that is an issue, consider incresing the priority of C to something higher than B.

@bergzand
Copy link
Member

bergzand commented Apr 8, 2020

The issue was A waiting on a resource shared with C, and B jumping in, causing C and thus A to wait. If that is an issue, consider increasing the priority of C to something higher than B.

This only makes sense when C is locking the shared resources is locked most of the time while active. If C is active for a duration where the shared resource is only a small portion of this time, reversing the priority of C and B would make B block for this full duration instead of only the duration the shared resource is required.

@jia200x
Copy link
Member Author

jia200x commented Apr 8, 2020

Ah yes! I swapped A and C. C should have the highest

The issue was A waiting on a resource shared with C, and B jumping in, causing C and thus A to wait. If that is an issue, consider incresing the priority of C to something higher than B.

Consider this:

  • A is a thread that gets the value of a sensor, performs heavy calculation and send a packet. Let's say the whole process takes 700 ms, but sending is in the order of 1 ms.
  • B is a thread that checks every 100 ms that a sensor value doesn't exceed a threshold and activate an alarm if happened. The process take 30 ms.
  • C is a time sensitive MAC layer (e.g TSCH) that must receive packets with timestamps (with reception windows of ~10 ms)

Clearly C needs the highest priority here.
If we give A a higher priority than B, B won't be able to detect a threshold violation in 700 ms, so B requires a higher priority than A.

Let's say A blocks the radio for 1 ms, and just before state changes a packet arrives. C will try to access the radio, but it's locked by the mutex in A. At the same time, the timer of B triggers and the thread starts measuring. C cannot be released until B finished (30 ms).
So, without a priority inheritance mechanism, we potentially lost 2-3 packets.

With priority inversion, A will take back the control and finish sending (1 ms). Then, C is scheduled again (packet is not lost) and finally B gets control.

EDIT: Just noticed @bergzand comment. It's basically what he describes

@benpicco
Copy link
Contributor

benpicco commented Oct 2, 2020

This needs a rebase.

@stale
Copy link

stale bot commented Jun 3, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want me to ignore this issue, please mark it with the "State: don't stale" label. Thank you for your contributions.

@stale stale bot added the State: stale State: The issue / PR has no activity for >185 days label Jun 3, 2021
@jia200x
Copy link
Member Author

jia200x commented Jun 9, 2021

I think we can close this PR. There are currently more mechanisms that replace this (gnrc_netif_events and the newer approaches with event threads).

@jia200x jia200x closed this Jun 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: drivers Area: Device drivers Area: network Area: Networking State: stale State: The issue / PR has no activity for >185 days State: waiting for other PR State: The PR requires another PR to be merged first Type: new feature The issue requests / The PR implemements a new feature for RIOT
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants