
Arm64 tick timer fixes #13674

Merged
merged 3 commits on Sep 30, 2024

Conversation

jlaitine
Contributor

Summary

It has been observed in our testing with the imx9 platform that nxsem_tickwait_uninterruptible randomly wakes up too early. As this function is commonly used in device drivers as a timeout timer, the early timeouts result in many errors. Further investigation revealed that the tick timer in arch_alarm/oneshot_callback sometimes advances by two ticks in one call. This then causes a wdog timer to time out immediately even if it was supposed to wait for one more tick.

The issue was root-caused to the following:

  • The arm64 tick timer is drifting because the new compare value is set only after the tick has been handled, and it is based on the timer value at that moment.
  • The current tick retrieved from the arm64 timer changes during the tick period, depending on the tick start counter value, due to division rounding.
  • A signal can wake up nxsem_tickwait_uninterruptible one tick too early.
  • arch_alarm/oneshot_callback doesn't read the current tick and start the new tick atomically with regard to the freely running HW timer.

This PR proposes a solution to these issues by

  • aligning the tick start to a value which is evenly divisible by the tick time, and
  • fixing the signal wakeup issue for nxsem_tickwait_uninterruptible.

Aligning the tick start fixes both the tick timer drift and the current tick retrieval issue. It is therefore not necessary to add an interface for the oneshot alarm to support an atomic get-and-start of the tick timer.
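
For illustration, a minimal sketch of the alignment idea (a hypothetical helper written for this description, not the actual arm64_arch_timer.c code):

#include <stdint.h>

/* Program the compare register only to exact multiples of cycle_per_tick.
 * The tick start then never drifts, and dividing the free-running counter
 * by cycle_per_tick yields the same tick number for the whole tick period.
 */

static uint64_t next_compare_aligned(uint64_t now, uint64_t cycle_per_tick)
{
  /* Round the counter down to the start of the ongoing tick, then step
   * forward by one full tick period.
   */

  return (now / cycle_per_tick + 1) * cycle_per_tick;
}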

This patchset also adds a cleanup of the arch_alarm oneshot_callback. It shouldn't fix any bug per se, but since that patch was included in all of our testing, it is part of this patchset. What it fixes in theory is that oneshot_callback could be called more frequently than the tick period without errors; this is not used for anything at the moment.
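
As a rough illustration of that idea (hypothetical names only; process_one_tick() stands in for the real per-tick work, and this is not the actual arch_alarm.c source):

#include <nuttx/clock.h>

/* Placeholder for the real per-tick processing (e.g. advancing the system
 * timer); defined here only so the sketch is self-contained.
 */

static void process_one_tick(void)
{
}

static clock_t g_last_tick;   /* Last tick that has already been processed */

static void oneshot_callback_sketch(clock_t current_tick)
{
  clock_t elapsed = current_tick - g_last_tick;

  g_last_tick = current_tick;

  /* Handle exactly as many ticks as really elapsed; a callback firing more
   * often than once per tick period then processes zero ticks instead of
   * duplicating one.
   */

  while (elapsed-- > 0)
    {
      process_one_tick();
    }
}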

Impact

Fixes random errors in several device drivers for the imx9 platform in longer test runs.

Testing

Tested on i.MX93 platform with some added debug instrumentation.

@github-actions github-actions bot added the labels Arch: arm64, Area: OS Components, and Size: S on Sep 27, 2024
@nuttxpr

nuttxpr commented Sep 27, 2024

[Experimental Bot, please feedback here]

Let's break down this pull request (PR) against the NuttX requirements:

Summary:

  • Why? This PR addresses a critical issue where the nxsem_tickwait_uninterruptible function, crucial for device driver timeouts, prematurely wakes up on the imx9 platform.
  • What? The PR tackles the root causes of this issue: timer drift in the arm64 architecture, inconsistent tick retrieval, early signal wake-ups, and non-atomic tick handling in arch_alarm/oneshot_callback.
  • How? The PR fixes the timer drift and tick retrieval by aligning the tick start. It also resolves the signal wake-up issue and cleans up the oneshot_callback.
  • Issues: The PR should ideally reference a specific NuttX issue number where this problem is discussed (e.g., "Fixes arch/stackframe: fix heap buffer overflow #1234").

Impact:

  • New Feature: NO
  • Existing Feature Changed: YES, the timing behavior of nxsem_tickwait_uninterruptible and the underlying timer mechanism are fixed.
  • User Impact: Potentially YES. While this fixes bugs, users relying on the previous (incorrect) timing behavior might experience changes. This should be clearly documented.
  • Build Impact: Possibly YES. If the cleanup in oneshot_callback affects its interface, there might be build implications for code using it. This needs clarification.
  • Hardware Impact: YES, specifically for the imx9 platform where the issue manifests.
  • Documentation Impact: YES, documentation for nxsem_tickwait_uninterruptible and potentially oneshot_callback should be updated to reflect the fixes and any behavioral changes.
  • Security Impact: Possibly YES. Timing issues can sometimes have security implications. While the PR aims to fix a problem, it's crucial to analyze if the changes introduce any new vulnerabilities.
  • Compatibility Impact: Potentially YES, as mentioned under User Impact. Backwards compatibility needs careful consideration.
  • Anything else? The PR mentions additional debug instrumentation used for testing. Consider removing this instrumentation or clearly marking it as such before merging.

Testing:

  • Insufficient: The PR lacks details about the build hosts and target configurations used for testing. Provide specific information like OS, CPU architecture, compiler, and the exact imx9 board and configuration.
  • Logs Missing: The PR needs to include actual testing logs from before and after the changes. These logs are essential for reviewers to understand the problem and verify the fix.

Conclusion:

This PR addresses a significant issue but requires further refinement to meet NuttX's standards:

  1. Link to NuttX Issue: Reference the relevant issue.
  2. Clarify Impact: Provide specifics on build, user, and compatibility impacts.
  3. Detail Testing: Include comprehensive testing information and logs.
  4. Security Review: Given the nature of the fix, a security review is recommended.

By addressing these points, the PR will be well-positioned for review and merging into NuttX.

@Fix-Point
Contributor

However, we were unable to reproduce the issue on qemu/x86_64 and qemu/armv8a (commit d47076). Here is the test code:

#include <nuttx/config.h>

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <semaphore.h>

#include <nuttx/clock.h>
#include <nuttx/semaphore.h>

/****************************************************************************
 * Public Functions
 ****************************************************************************/

int main(int argc, FAR char *argv[])
{
  sem_t sem;
  clock_t tick;
  uint32_t wait;

  /* waittick table */

  static const uint32_t waittick_tbl[] =
  {
    0, 1, 10, 100, 1000, 10000, 100000, 1000000
  };

  sem_init(&sem, 0, 0);

  for (uint32_t i = 0; i < sizeof(waittick_tbl) / sizeof(uint32_t); i++)
    {
      wait = waittick_tbl[i];

      tick = clock_systime_ticks();
      nxsem_tickwait(&sem, wait);
      tick = clock_systime_ticks() - tick;
      printf("tick1 %u : %lu\n", wait, (unsigned long)tick);
      // assert(wait <= tick);

      tick = clock_systime_ticks();
      nxsem_tickwait_uninterruptible(&sem, wait);
      tick = clock_systime_ticks() - tick;
      printf("tick2 %u : %lu\n", wait, (unsigned long)tick);
      // assert(wait <= tick);
    }

  return 0;
}

@jlaitine
Contributor Author

jlaitine commented Sep 29, 2024

I actually never tested whether this is reproducible in qemu; that would depend on the timer accuracy in qemu/linux. It is really easy to reproduce on real hw within approx. 10 minutes of running. I can check the qemu reproducibility tomorrow. Thanks!

Oh! And the test above wouldn't even show the issue; it counts the correct number of ticks, just too fast (sometimes two ticks at once). So you need to measure the real time spent over the tickwait: take clock_gettime or up_timer_gettime before and after the tickwait and check that the correct amount of time has passed, not the number of ticks :)

@Fix-Point
Contributor

Fix-Point commented Sep 29, 2024

I actually never tested whether this is reproducible in qemu; that would depend on the timer accuracy in qemu/linux. It is really easy to reproduce on real hw within approx. 10 minutes of running. I can check the qemu reproducibility tomorrow. Thanks!

Oh! And the test above wouldn't even show the issue; it counts the correct number of ticks, just too fast (sometimes two ticks at once). So you need to measure the real time spent over the tickwait: take clock_gettime or up_timer_gettime before and after the tickwait and check that the correct amount of time has passed, not the number of ticks :)

Thank you for providing more information. However, after switching to clock_gettime, I still cannot reproduce the issue on qemu/aarch64. I will try to conduct further investigation on a real armv8 machine.

Here is my test code:

#include <nuttx/config.h>

#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <semaphore.h>
#include <time.h>

#include <nuttx/arch.h>
#include <nuttx/clock.h>
#include <nuttx/semaphore.h>

/* Return the elapsed time between two timespecs in nanoseconds */

static inline int64_t diff_time(struct timespec *ts_start,
                                struct timespec *ts_end)
{
  return (int64_t)(ts_end->tv_sec - ts_start->tv_sec) * NSEC_PER_SEC +
         (ts_end->tv_nsec - ts_start->tv_nsec);
}

/****************************************************************************
 * Public Functions
 ****************************************************************************/

int main(int argc, FAR char *argv[])
{
  sem_t sem;
  clock_t tick;
  uint32_t wait;
  int64_t diff;
  struct timespec ts_start;
  struct timespec ts_end;

  /* waittick table */

  static const uint32_t waittick_tbl[] =
  {
    0, 1, 10, 100, 1000, 10000, 100000, 1000000
  };

  sem_init(&sem, 0, 0);

  for (uint32_t i = 0; i < 10000000; i++)
    {
      wait = waittick_tbl[i % 2];

      tick = clock_systime_ticks();
      clock_gettime(CLOCK_REALTIME, &ts_start);
      nxsem_tickwait(&sem, wait);
      clock_gettime(CLOCK_REALTIME, &ts_end);
      tick = clock_systime_ticks() - tick;
      diff = diff_time(&ts_start, &ts_end);
      printf("tick1 %u : %lu, %lld ns\n", wait, (unsigned long)tick,
             (long long)diff);
      assert(wait <= tick && diff >= 0);

      tick = clock_systime_ticks();
      clock_gettime(CLOCK_REALTIME, &ts_start);
      nxsem_tickwait_uninterruptible(&sem, wait);
      clock_gettime(CLOCK_REALTIME, &ts_end);
      tick = clock_systime_ticks() - tick;
      diff = diff_time(&ts_start, &ts_end);
      printf("tick2 %u : %lu, %lld ns\n", wait, (unsigned long)tick,
             (long long)diff);
      assert(wait <= tick && diff >= 0);
    }

  return 0;
}

@jlaitine
Contributor Author

jlaitine commented Sep 29, 2024

The issue also reproduces in qemu, at least with this simple test.

#include <nuttx/config.h>
#include <nuttx/nuttx.h>

#include <assert.h>
#include <semaphore.h>
#include <stdint.h>
#include <stdio.h>

#include <nuttx/semaphore.h>

#define read_sysreg(reg)                         \
  ({                                             \
    uint64_t __val;                              \
    __asm__ volatile ("mrs %0, " STRINGIFY(reg)  \
                    : "=r" (__val) :: "memory"); \
    __val;                                       \
  })

static inline uint64_t arm64_arch_timer_count(void)
{
  return read_sysreg(cntvct_el0);
}

static inline uint64_t arm64_arch_timer_get_cntfrq(void)
{
  return read_sysreg(cntfrq_el0);
}

static inline uint64_t time_us(void)
{
  return arm64_arch_timer_count() * 1000000 / arm64_arch_timer_get_cntfrq();
}

static inline uint64_t minutes_in_ticks(uint64_t mins)
{
  return mins * 60 * 1000000 / CONFIG_USEC_PER_TICK;
}

int main(int argc, FAR char *argv[])
{
  sem_t sem;
  uint64_t t1;
  uint64_t t2;

  sem_init(&sem, 0, 0);

  for (int64_t i = 0; i < minutes_in_ticks(10); i++)
    {
      t1 = time_us();
      nxsem_tickwait(&sem, 1);
      t2 = time_us();

      if (i % 100 == 0)
        {
          printf("%lu %lu %lu\n", t1, t2, t2 - t1);
        }

      if (t2 - t1 < CONFIG_USEC_PER_TICK)
        {
          printf("ERROR: slept %lu us instead of minimum %u\n",
                 t2 - t1, CONFIG_USEC_PER_TICK);
        }

      assert(t2 - t1 > CONFIG_USEC_PER_TICK);
    }

  return 0;
}


…k timer

This fixes two issues with the tick timer:
1) Each tick was longer than the requested period. This is because the
   compare register was set by first reading the current time and only
   after that writing the compare register. In addition, when handling the
   timer interrupt in arch_alarm.c / oneshot_callback, the current_tick is
   read first, all the tick handling is done, and only after that the next
   tick is started. The whole tick processing time was added to the total
   tick time.

2) When the compare time is not aligned with the tick period and is drifting,
   eventually any call to ONESHOT_TICK_CURRENT will return either the current
   tick or the next one, depending on the rounding of the division by
   cycle_per_tick. This again leads to oneshot_callback randomly handling
   two ticks at a time, which breaks all wdog based timers, causing them to
   randomly time out too early.

The issues are fixed as follows:

Align the compare time register to be evenly divisible by cycle_per_tick.
This leads arm64_tick_current to always return the currently ongoing tick,
fixing 2). Calculating the next tick's start from the aligned current
count also fixes 1), as there is no time drift in the start cycle.

Signed-off-by: Jukka Laitinen <[email protected]>
…d condition

nxsem_tickwait correctly sleeps more than 1 tick, but nxsem_tickwait_uninterruptible
may wake up due to a signal (with -EINTR), in which case the tick + 1 must also
be taken into account. Otherwise nxsem_tickwait_uninterruptible may
wake up 1 tick too early.

Also fix nxsem_tickwait to return -ETIMEDOUT if called with delay 0.
This is similar to e.g. POSIX sem_timedwait.

Signed-off-by: Jukka Laitinen <[email protected]>
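
A minimal sketch of the retry logic described in the commit message above (an illustration under these assumptions, not the actual sem_tickwait.c source):

#include <nuttx/config.h>

#include <errno.h>
#include <stdint.h>

#include <nuttx/clock.h>
#include <nuttx/semaphore.h>

/* After a wakeup caused by a signal, recompute the remaining delay with one
 * extra tick so that the total wait can never be shorter than requested.
 */

static int tickwait_uninterruptible_sketch(FAR sem_t *sem, uint32_t delay)
{
  clock_t end = clock_systime_ticks() + delay;
  clock_t now;
  int ret;

  for (; ; )
    {
      ret = nxsem_tickwait(sem, delay);
      if (ret != -EINTR)
        {
          break;  /* Success, -ETIMEDOUT, or another error */
        }

      /* Interrupted by a signal: wait for the remaining ticks plus one,
       * because the currently ongoing tick may already be partly consumed.
       */

      now = clock_systime_ticks();
      delay = (now < end) ? (uint32_t)(end - now) + 1 : 1;
    }

  return ret;
}
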
@jlaitine jlaitine force-pushed the arm64_tick_timer_fixes branch from 8268a47 to a2c0d6f on September 29, 2024 19:34
@jlaitine
Contributor Author

I added minimal test code above, which can be used on both a real target and qemu; it shows the major issue typically within 1 minute.

@jlaitine
Contributor Author

@Fix-Point Were you able to reproduce the issue with the code I pasted above?

@xiaoxiang781216 xiaoxiang781216 merged commit 07d5bc2 into apache:master Sep 30, 2024
29 checks passed