
Arm64 tick timer fixes #13674

Merged
merged 3 commits on Sep 30, 2024

Conversation

jlaitine
Contributor

Summary

It has been observed in our testing with the imx9 platform that nxsem_tickwait_uninterruptible randomly wakes up too early. As this function is commonly used in device drivers as a timeout timer, the early timeouts result in many errors. Further investigation revealed that the tick timer in arch_alarm/oneshot_callback sometimes advances by two ticks in one call. This then causes a wdog timer to time out immediately even if it was supposed to wait for one more tick.

The issue was root-caused to the following:

  • The arm64 tick timer is drifting because the new compare value is set only after the tick has been handled, and it is based on the timer value at that moment.
  • The current tick retrieved from the arm64 timer changes during the tick period, depending on the tick start counter value, due to division rounding.
  • A signal can wake up nxsem_tickwait_uninterruptible one tick too early.
  • arch_alarm/oneshot_callback doesn't read the current tick and start the new tick atomically with regard to the freely running HW timer.

This PR proposes a solution to these issues by

  • aligning the tick start to a value which is evenly divisible by the tick time, and
  • fixing the signal wakeup issue for nxsem_tickwait_uninterruptible.

Aligning the tick start fixes both the tick timer drift and the current tick retrieval issue. It is therefore not necessary to add an interface for the oneshot alarm to support an atomic get-and-start of the tick timer.
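
For illustration, a minimal sketch of the alignment idea (a hypothetical helper written for this description, not the actual arm64_arch_timer.c code):

#include <stdint.h>

/* Program the compare register only to exact multiples of cycle_per_tick.
 * The tick start then never drifts, and dividing the free-running counter
 * by cycle_per_tick yields the same tick number for the whole tick period.
 */

static uint64_t next_compare_aligned(uint64_t now, uint64_t cycle_per_tick)
{
  /* Round the counter down to the start of the ongoing tick, then step
   * forward by one full tick period.
   */

  return (now / cycle_per_tick + 1) * cycle_per_tick;
}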

This patchset also adds a cleanup of the arch_alarm oneshot_callback. It shouldn't fix any bug per se, but since that patch was included in all of our testing, it is part of this patchset. What it fixes in theory is that oneshot_callback could be called more frequently than the tick period without errors; this is not used for anything at the moment.
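
As a rough illustration of that idea (hypothetical names only; process_one_tick() stands in for the real per-tick work, and this is not the actual arch_alarm.c source):

#include <nuttx/clock.h>

/* Placeholder for the real per-tick processing (e.g. advancing the system
 * timer); defined here only so the sketch is self-contained.
 */

static void process_one_tick(void)
{
}

static clock_t g_last_tick;   /* Last tick that has already been processed */

static void oneshot_callback_sketch(clock_t current_tick)
{
  clock_t elapsed = current_tick - g_last_tick;

  g_last_tick = current_tick;

  /* Handle exactly as many ticks as really elapsed; a callback firing more
   * often than once per tick period then processes zero ticks instead of
   * duplicating one.
   */

  while (elapsed-- > 0)
    {
      process_one_tick();
    }
}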

Impact

Fixes random errors in several device drivers for the imx9 platform in longer test runs.

Testing

Tested on i.MX93 platform with some added debug instrumentation.

@github-actions github-actions bot added the labels Arch: arm64, Area: OS Components, and Size: S on Sep 27, 2024
@nuttxpr

nuttxpr commented Sep 27, 2024

[Experimental Bot, please feedback here]

Let's break down this pull request (PR) against the NuttX requirements:

Summary:

  • Why? This PR addresses a critical issue where the nxsem_tickwait_uninterruptible function, crucial for device driver timeouts, prematurely wakes up on the imx9 platform.
  • What? The PR tackles the root causes of this issue: timer drift in the arm64 architecture, inconsistent tick retrieval, early signal wake-ups, and non-atomic tick handling in arch_alarm/oneshot_callback.
  • How? The PR fixes the timer drift and tick retrieval by aligning the tick start. It also resolves the signal wake-up issue and cleans up the oneshot_callback.
  • Issues: The PR should ideally reference a specific NuttX issue number where this problem is discussed (e.g., "Fixes arch/stackframe: fix heap buffer overflow #1234").

Impact:

  • New Feature: NO
  • Existing Feature Changed: YES, the timing behavior of nxsem_tickwait_uninterruptible and the underlying timer mechanism are fixed.
  • User Impact: Potentially YES. While this fixes bugs, users relying on the previous (incorrect) timing behavior might experience changes. This should be clearly documented.
  • Build Impact: Possibly YES. If the cleanup in oneshot_callback affects its interface, there might be build implications for code using it. This needs clarification.
  • Hardware Impact: YES, specifically for the imx9 platform where the issue manifests.
  • Documentation Impact: YES, documentation for nxsem_tickwait_uninterruptible and potentially oneshot_callback should be updated to reflect the fixes and any behavioral changes.
  • Security Impact: Possibly YES. Timing issues can sometimes have security implications. While the PR aims to fix a problem, it's crucial to analyze if the changes introduce any new vulnerabilities.
  • Compatibility Impact: Potentially YES, as mentioned under User Impact. Backwards compatibility needs careful consideration.
  • Anything else? The PR mentions additional debug instrumentation used for testing. Consider removing this instrumentation or clearly marking it as such before merging.

Testing:

  • Insufficient: The PR lacks details about the build hosts and target configurations used for testing. Provide specific information like OS, CPU architecture, compiler, and the exact imx9 board and configuration.
  • Logs Missing: The PR needs to include actual testing logs from before and after the changes. These logs are essential for reviewers to understand the problem and verify the fix.

Conclusion:

This PR addresses a significant issue but requires further refinement to meet NuttX's standards:

  1. Link to NuttX Issue: Reference the relevant issue.
  2. Clarify Impact: Provide specifics on build, user, and compatibility impacts.
  3. Detail Testing: Include comprehensive testing information and logs.
  4. Security Review: Given the nature of the fix, a security review is recommended.

By addressing these points, the PR will be well-positioned for review and merging into NuttX.

@Fix-Point
Contributor

However, we were unable to reproduce the issue on qemu/x86_64 and qemu/armv8a (commit d47076). Here is the test code:

#include <nuttx/config.h>

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <semaphore.h>

#include <nuttx/clock.h>
#include <nuttx/semaphore.h>

/****************************************************************************
 * Public Functions
 ****************************************************************************/

int main(int argc, FAR char *argv[])
{
  sem_t sem;
  clock_t tick;
  uint32_t wait;

  /* waittick table */

  static const uint32_t waittick_tbl[] =
  {
    0, 1, 10, 100, 1000, 10000, 100000, 1000000
  };

  sem_init(&sem, 0, 0);

  for (uint32_t i = 0; i < sizeof(waittick_tbl) / sizeof(uint32_t); i++)
    {
      wait = waittick_tbl[i];

      tick = clock_systime_ticks();
      nxsem_tickwait(&sem, wait);
      tick = clock_systime_ticks() - tick;
      printf("tick1 %u : %lu\n", wait, (unsigned long)tick);
      // assert(wait <= tick);

      tick = clock_systime_ticks();
      nxsem_tickwait_uninterruptible(&sem, wait);
      tick = clock_systime_ticks() - tick;
      printf("tick2 %u : %lu\n", wait, (unsigned long)tick);
      // assert(wait <= tick);
    }

  return 0;
}

@jlaitine
Contributor Author

jlaitine commented Sep 29, 2024

I actually never tested whether this is reproducible in qemu; that would depend on the timer accuracy in qemu/linux. It is really easy to reproduce on real hw within approx. 10 minutes of running. I can check the qemu reproducibility tomorrow. Thanks!

Oh! And the test above wouldn't even show the issue; it counts the correct number of ticks, just too fast (sometimes two ticks at once). So you need to measure the real time spent over the tickwait: take clock_gettime or up_timer_gettime before and after the tickwait and check that the correct amount of time has passed, not the number of ticks :)

@Fix-Point
Contributor

Fix-Point commented Sep 29, 2024

I actually never tested whether this is reproducible in qemu; that would depend on the timer accuracy in qemu/linux. It is really easy to reproduce on real hw within approx. 10 minutes of running. I can check the qemu reproducibility tomorrow. Thanks!

Oh! And the test above wouldn't even show the issue; it counts the correct number of ticks, just too fast (sometimes two ticks at once). So you need to measure the real time spent over the tickwait: take clock_gettime or up_timer_gettime before and after the tickwait and check that the correct amount of time has passed, not the number of ticks :)

Thank you for providing more information. However, after switching to clock_gettime, I still cannot reproduce the issue on qemu/aarch64. I will try to conduct further investigation on a real armv8 machine.

Here is my test code:

#include <nuttx/config.h>

#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <semaphore.h>
#include <time.h>

#include <nuttx/arch.h>
#include <nuttx/clock.h>
#include <nuttx/semaphore.h>

/* Return the elapsed time between two timespecs in nanoseconds */

static inline int64_t diff_time(struct timespec *ts_start,
                                struct timespec *ts_end)
{
  return (int64_t)(ts_end->tv_sec - ts_start->tv_sec) * NSEC_PER_SEC +
         (ts_end->tv_nsec - ts_start->tv_nsec);
}

/****************************************************************************
 * Public Functions
 ****************************************************************************/

int main(int argc, FAR char *argv[])
{
  sem_t sem;
  clock_t tick;
  uint32_t wait;
  int64_t diff;
  struct timespec ts_start;
  struct timespec ts_end;

  /* waittick table */

  static const uint32_t waittick_tbl[] =
  {
    0, 1, 10, 100, 1000, 10000, 100000, 1000000
  };

  sem_init(&sem, 0, 0);

  for (uint32_t i = 0; i < 10000000; i++)
    {
      wait = waittick_tbl[i % 2];

      tick = clock_systime_ticks();
      clock_gettime(CLOCK_REALTIME, &ts_start);
      nxsem_tickwait(&sem, wait);
      clock_gettime(CLOCK_REALTIME, &ts_end);
      tick = clock_systime_ticks() - tick;
      diff = diff_time(&ts_start, &ts_end);
      printf("tick1 %u : %lu, %lld ns\n", wait, (unsigned long)tick,
             (long long)diff);
      assert(wait <= tick && diff >= 0);

      tick = clock_systime_ticks();
      clock_gettime(CLOCK_REALTIME, &ts_start);
      nxsem_tickwait_uninterruptible(&sem, wait);
      clock_gettime(CLOCK_REALTIME, &ts_end);
      tick = clock_systime_ticks() - tick;
      diff = diff_time(&ts_start, &ts_end);
      printf("tick2 %u : %lu, %lld ns\n", wait, (unsigned long)tick,
             (long long)diff);
      assert(wait <= tick && diff >= 0);
    }

  return 0;
}

@jlaitine
Contributor Author

jlaitine commented Sep 29, 2024

The issue also reproduces in qemu, at least with this simple test.

#include <nuttx/config.h>
#include <nuttx/nuttx.h>

#include <assert.h>
#include <semaphore.h>
#include <stdint.h>
#include <stdio.h>

#include <nuttx/semaphore.h>

#define read_sysreg(reg)                         \
  ({                                             \
    uint64_t __val;                              \
    __asm__ volatile ("mrs %0, " STRINGIFY(reg)  \
                    : "=r" (__val) :: "memory"); \
    __val;                                       \
  })

static inline uint64_t arm64_arch_timer_count(void)
{
  return read_sysreg(cntvct_el0);
}

static inline uint64_t arm64_arch_timer_get_cntfrq(void)
{
  return read_sysreg(cntfrq_el0);
}

static inline uint64_t time_us(void)
{
  return arm64_arch_timer_count() * 1000000 / arm64_arch_timer_get_cntfrq();
}

static inline uint64_t minutes_in_ticks(uint64_t mins)
{
  return mins * 60 * 1000000 / CONFIG_USEC_PER_TICK;
}

int main(int argc, FAR char *argv[])
{
  sem_t sem;
  uint64_t t1;
  uint64_t t2;

  sem_init(&sem, 0, 0);

  for (int64_t i = 0; i < minutes_in_ticks(10); i++)
    {
      t1 = time_us();
      nxsem_tickwait(&sem, 1);
      t2 = time_us();

      if (i % 100 == 0)
        {
          printf("%lu %lu %lu\n", t1, t2, t2 - t1);
        }

      if (t2 - t1 < CONFIG_USEC_PER_TICK)
        {
          printf("ERROR: slept %lu us instead of minimum %u\n",
                 t2 - t1, CONFIG_USEC_PER_TICK);
        }

      assert(t2 - t1 > CONFIG_USEC_PER_TICK);
    }

  return 0;
}


…k timer

This fixes two issues with the tick timer:
1) Each tick was longer than the requested period. This is because the
   compare register was set by first reading the current time and only
   after that writing the compare register. In addition, when handling the
   timer interrupt in arch_alarm.c / oneshot_callback, the current_tick is
   read first, all the tick handling is done, and only after that the next
   tick is started. The whole tick processing time was added to the total
   tick time.

2) When the compare time is not aligned with the tick period and is drifting,
   eventually any call to ONESHOT_TICK_CURRENT will return either the current
   tick or the next one, depending on the rounding of the division by
   cycle_per_tick. This again leads to oneshot_callback randomly handling
   two ticks at a time, which breaks all wdog based timers, causing them to
   randomly time out too early.

The issues are fixed as follows:

Align the compare time register to be evenly divisible by cycle_per_tick.
This leads arm64_tick_current to always return the currently ongoing tick,
fixing 2). Calculating the next tick's start from the aligned current
count also fixes 1), as there is no time drift in the start cycle.

Signed-off-by: Jukka Laitinen <[email protected]>
…d condition

nxsem_tickwait correctly sleeps more than 1 tick, but nxsem_tickwait_uninterruptible
may wake up due to a signal (with -EINTR), in which case the tick + 1 must also
be taken into account. Otherwise nxsem_tickwait_uninterruptible may
wake up 1 tick too early.

Also fix nxsem_tickwait to return -ETIMEDOUT if called with delay 0.
This is similar to e.g. POSIX sem_timedwait.

Signed-off-by: Jukka Laitinen <[email protected]>
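
A minimal sketch of the retry logic described in the commit message above (an illustration under these assumptions, not the actual sem_tickwait.c source):

#include <nuttx/config.h>

#include <errno.h>
#include <stdint.h>

#include <nuttx/clock.h>
#include <nuttx/semaphore.h>

/* After a wakeup caused by a signal, recompute the remaining delay with one
 * extra tick so that the total wait can never be shorter than requested.
 */

static int tickwait_uninterruptible_sketch(FAR sem_t *sem, uint32_t delay)
{
  clock_t end = clock_systime_ticks() + delay;
  clock_t now;
  int ret;

  for (; ; )
    {
      ret = nxsem_tickwait(sem, delay);
      if (ret != -EINTR)
        {
          break;  /* Success, -ETIMEDOUT, or another error */
        }

      /* Interrupted by a signal: wait for the remaining ticks plus one,
       * because the currently ongoing tick may already be partly consumed.
       */

      now = clock_systime_ticks();
      delay = (now < end) ? (uint32_t)(end - now) + 1 : 1;
    }

  return ret;
}
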
@jlaitine jlaitine force-pushed the arm64_tick_timer_fixes branch from 8268a47 to a2c0d6f on September 29, 2024 19:34
@jlaitine
Contributor Author

I added minimal test code above, which can be used on both a real target and qemu; it shows the major issue typically within 1 minute.

@jlaitine
Contributor Author

@Fix-Point Were you able to reproduce the issue with the code I pasted above?

@xiaoxiang781216 xiaoxiang781216 merged commit 07d5bc2 into apache:master Sep 30, 2024
29 checks passed