Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

xtimer: timer & target overflow, hang resolved. #9211

Merged
merged 1 commit into from
Nov 2, 2018

Conversation

Josar
Copy link
Contributor

@Josar Josar commented May 28, 2018

This Problems encountered when testing for #8990

  1. The xtimer hung when 32 bit target overflowed, as the timer was not placed in the right list.
  2. The xtimer hung when the hardware timer overflowed, as the comparison was wrong.

grafik

Tested with xtimer_hang on jiminy board.

@smlng smlng requested a review from kaspar030 May 28, 2018 09:58
@smlng smlng added Type: enhancement The issue suggests enhanceable parts / The PR enhances parts of the codebase / documentation Area: timers Area: timer subsystems CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR labels May 28, 2018
@kYc0o kYc0o added this to the Release 2018.07 milestone May 28, 2018
@kYc0o
Copy link
Contributor

kYc0o commented May 28, 2018

Looks legit, I'll run the tests on hardware. @kaspar030 can you take a look? It seems a straightforward change.

@kYc0o kYc0o added CI: run tests If set, CI server will run tests on hardware for the labeled PR CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR and removed CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR labels May 28, 2018
@@ -190,6 +190,11 @@ int _xtimer_set_absolute(xtimer_t *timer, uint32_t target)

timer->target = target;
timer->long_target = _long_cnt;

/* Ensure timer is fired in right hardware timer periode. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/periode/period/

Copy link
Contributor

@kaspar030 kaspar030 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems legit to me, too. But xtimer is pretty central, we should run the tests on a couple of more architectures.

@kYc0o
Copy link
Contributor

kYc0o commented May 28, 2018

@Josar can you provide your test example? The current xtimer_hang test succeeds both in master and this PR, though it seems more stable with your patch. Tested on arduino-mega2560.

@kYc0o
Copy link
Contributor

kYc0o commented May 28, 2018

Much more stable with this PR in a z1 (msp430).

@kaspar030
Copy link
Contributor

@Josar Maybe make both commit message and PR title summarize that is changed instead of the intend of the change? This is not the first resolved hang in xtimer, and it won't be the last. ;)

@kaspar030
Copy link
Contributor

@gebart would you take a brief look if we're missing something obviuos?

@kYc0o
Copy link
Contributor

kYc0o commented May 28, 2018

Murdock just agreed, however for some reason in my setup I can't compile xtimer_hang test (both in current master and this PR):

make: Entering directory '/Users/facosta/git/RIOT-OS/RIOT/tests/xtimer_hang'
Building application "tests_xtimer_hang" for "samr21-xpro" with MCU "samd21".

In file included from /Users/facosta/git/RIOT-OS/RIOT/tests/xtimer_hang/main.c:28:0:
/Users/facosta/git/RIOT-OS/RIOT/sys/include/xtimer.h:34:10: fatal error: msg.h: No such file or directory
 #include "msg.h"
          ^~~~~~~
compilation terminated.

@Josar
Copy link
Contributor Author

Josar commented May 28, 2018

The xtimer hang is obviously designed to find errors like this but in the Python Skript than allows a deviation of 10% hiding errors like that again.

The error can also be found with xtimer sleep short when considering the test time, like introduced in #9037.

I tried to introduce probing in #9025 and #8865 so this errors could be be probed in a HIL setup.

Also the error can be visually seen in the output stream stutter, but the test neglect this.

I use the xtimer_hang test and added probing pins., as I saw the print stutter.

@Josar Josar closed this May 28, 2018
@Josar Josar reopened this May 28, 2018
@Josar Josar force-pushed the pr/xtimer_target_overflow branch from fc1047f to 986a636 Compare May 28, 2018 13:34
@Josar Josar changed the title xtimer_core: resolve hang xtimer_core: remove hang bug May 28, 2018
@kYc0o kYc0o added the Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) label May 28, 2018
@Josar Josar changed the title xtimer_core: remove hang bug xtimer: timer & target overflow, hang resolved. May 28, 2018
@Josar Josar force-pushed the pr/xtimer_target_overflow branch 3 times, most recently from 3fce999 to b33175f Compare May 28, 2018 16:16
@kYc0o
Copy link
Contributor

kYc0o commented May 29, 2018

Well now the tests passed.

@cladmi
Copy link
Contributor

cladmi commented May 30, 2018

I successfully run the xtimer_hang test on

  • iotlab-m3
  • iotlab-a8-m3
  • samr21-xpro
  • arduino-zero
  • wsn430-v1_4

I will re-run all automated tests on them.

Copy link
Contributor

@MrKevinWeiss MrKevinWeiss left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Quick spell check!

* This is crucial when using low CPU frequencies so reaching the '_xtimer_now()'
* call needs multiple xtimer ticks.
*
* '_xtimer_set()' and `_xtimer_periodic_wakeup()` ensure this by allready
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Spelling mistake allready->already

Copy link
Contributor

@jcarrano jcarrano left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm having a hard time trying to understand what was wrong and how this fixes it.

Also:

Problem is that all test are design in such a way that they do not fail, eg by allowing deviation of 10%.

What's the point of trying to correct for interrupt latency (e.g. XTIMER_OVERHEAD) when the tests allow such a big deviation?

sys/xtimer/xtimer_core.c Show resolved Hide resolved
sys/xtimer/xtimer_core.c Outdated Show resolved Hide resolved
@Josar
Copy link
Contributor Author

Josar commented Oct 26, 2018

@jcarrano

what was wrong

The timer where placed in the wrong list when they where set near the overflow. (line 201 - 247)

The next target (32 bit) was compared to a low lovel timer (unknown bit size) (line 536)

What's the point of trying to correct for interrupt latency (e.g. XTIMER_OVERHEAD) when the tests allow such a big deviation?

IMHO, the test allows such deviation as the underlying problem was not found.
But this fix has nothing to do with correcting interrupt latency, as this was allready implemented.

@jcarrano jcarrano dismissed their stale review October 29, 2018 11:09

Dismissing my own review.

Copy link
Contributor

@MichelRottleuthner MichelRottleuthner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks fine. I tested on arduino-mega2560 and z1 and measured with a logic analyzer. It improves the behavior a lot. On the z1 I noticed that in some cases one of the workers isn't executed until a full wrap-around of the timer happens. This doesn't happen very often, once after the first ~2 seconds and then again after ~80 seconds. With your Test supplied in #10174 this is much easier to see if you use two separate pins for both workers. I'd still agree with the others and say this is a huge improvement and we shuld investigate remaining problems with a follow-up.
Please check the typos / wording I commented below.
[edit]: please squash directly so we can merge soon ;)

/**
* @brief Sets the timer to the appropriate timer_list or list_head.
*
* @note The target to set the timer to has to be at least bigger then then
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(...) then then

* '_xtimer_now()' call needs multiple xtimer ticks to evaluate.
*
* @param[in] timer pointer to xtimer_t which is added to the list.
* @param[in] target Absolut target value in ticks.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Absolute

/* Ensure that offset is bigger than 'XTIMER_BACKOFF',
* 'target - now' will allways be the offset no matter if target < or > now.
*
* This expects that target was not set to close to now and overrun now, so
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not too close to

*
* This expects that target was not set to close to now and overrun now, so
* from setting target up until the call of '_xtimer_now()' above now has not
* become equal and than bigger target.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

equal or bigger than target?

* call needs multiple xtimer ticks.
*
* '_xtimer_set()' and `_xtimer_periodic_wakeup()` ensure this by already
* backing of for small values. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

backing off

* than 'now', also for values when now will overflow and the value of target
* is smaller then now.
* If `target < XTIMER_OVERHEAD` the new target will be at the end of this
* 32bit period, instead at the beginning of the next.*/
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new target will not be at the end of this 32bit period but instead at the beginning... ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, it will be at the end of the running period and not at the beginning of the next.

@Josar Josar force-pushed the pr/xtimer_target_overflow branch from bdf062a to 67f4c19 Compare October 31, 2018 01:26
@Josar
Copy link
Contributor Author

Josar commented Oct 31, 2018

@MichelRottleuthner

I noticed that in some cases one of the workers isn't executed until a full wrap-around of the timer happens. This doesn't happen very often, once after the first ~2 seconds and then again after ~80 seconds.

As i could narrow it down this is caused by setting the timer to the long_list.

_add_timer_to_long_list(&long_list_head, timer);

I think this is because setting the xtimer is not interrupt save.
It is hard to provoke it, but it happens when the low level counter is nearly overflowing.
I also provoked the timer to stop executing not only for a 16-bit wrap-around, but for an 32-bit wrap-around.

@cladmi
Copy link
Contributor

cladmi commented Oct 31, 2018

@Josar I am trying to provide an unit-test with mock for this to show the issue (implementation dependent but still a start).

I have xtimer using my own mock functions of timer while still allowing to use timers in the low levels.
I currently copied xtimer_hang test as a base to just validate it's executing.

https://github.com/cladmi/RIOT/pull/new/pr/wip/write_test_xtimer_overflow_and_hang
https://github.com/cladmi/RIOT/tree/pr/wip/write_test_xtimer_overflow_and_hang

Are you able to describe a minimal test that I could do to show this ?
With values that timer_read would return and would cause issues.

I can fake values of timer_read and execute the timer_callback with a delay.

I will also try to extract something from your description at the same time and do it myself but you may be faster than me.

@cladmi
Copy link
Contributor

cladmi commented Oct 31, 2018

Hmm my branch was based on your changes, I will revert your commits to be able to show an error.

*
* @note The target to set the timer to has to be at least bigger then the
* ticks needed to jump into the function and calculate '_xtimer_now()'.
* So that 'now' did not pass the target.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is hard to guarantee I think. The problem is that the calling task may be de-scheduled for a few ticks and then you miss it anyway.

As it is a private function, is it not possible to update the API to pass it the now and offset value to handle it ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really don't get this requirement. How can it be that for correct operation one needs to know the time it gets to get to a certain line? IMO this is very fragile.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To not pass now and offset, another solution could be to just assume (and ensure) that offset < UINT32_MAX - MAX_MARGIN where MAX_MARGIN is the maximum time we allow to enter the function and read now again.

This way, you are sure that a timer loop can be detected if you took less than MAX_MARGIN to go to now.

A large enough solution could be to say it fits in an int32_t so uses 31bits.
It is an easy thing to ensure over the code base.
With 1Mhz it would still allow working if you take 35 minutes between the first call and being in the function.

Copy link
Contributor Author

@Josar Josar Oct 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is no problem to guarantee. XTIMER_BACKOFF has to be set to as much ticks as it needs to reach the backing-off case.

I allready started an test which helps to extract the required values.
See: #8990

This requirement is based on the fact that the time is ticking while you try to set a timer. If you set your deadline too short you will miss it. Thats it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will look at #8990 as a base.

When you start calling _xtimer_set_absolute you are still in your context, and you could be de-scheduled, for some time (maybe more than XTIMER_BACKOFF), so you do not know that the target is not elapsed, or not too close from your calling function.
Of course if you call the function with something that ends up being in the past, it should _shoot it directly, but you cannot do better as it was asked too late.

All the calling functions of _xtimer_set_absolute already have a now value (2 functions + 1 test).
By getting now and offset you remove the issues with "calling offset was too short" issue and handle this in the function directly.

However I see another issue also with the current implementation, it is that when now is read in the function, and offset is evaluated it is not with masked interrupts. So everything could also go wrong there too, be de-scheduled and get stuck with xtimer_spin_until.

The only hard limitation we cannot do anything about, is that between the first now and the now in the function, there should not be more than UINT32_MAX / Frequency ticks.

I wanted to say 2min23 for 32Mhz timers, but it looks like there are only sub MHz timers.
Even if there is this comment about cc2538 using a 16Mhz timer and code to handle them.

/* e.g. cc2538 uses a 16 MHz timer */

git grep XTIMER_HZ  . ':!sys'
boards/common/arduino-atmega/include/board_common.h:#define XTIMER_HZ                   (250000UL)
boards/frdm-kw41z/include/board.h:#define XTIMER_HZ                   (32768ul)
boards/hifive1/include/periph_conf.h:#define XTIMER_HZ                   (32768ul)
boards/ikea-tradfri/include/board.h:#define XTIMER_HZ           (250000UL)
boards/jiminy-mega256rfr2/include/board.h:#define XTIMER_HZ           (125000UL)
boards/mega-xplained/include/board.h:#define XTIMER_HZ                   (125000UL)
boards/mulle/include/board.h:#define XTIMER_HZ                   (32768ul)
boards/slstk3401a/include/board.h:#define XTIMER_HZ           (250000UL)
boards/slstk3402a/include/board.h:#define XTIMER_HZ           (250000UL)
boards/sltb001a/include/board.h:#define XTIMER_HZ           (250000UL)
boards/slwstk6000b/include/board.h:#define XTIMER_HZ           (250000UL)
boards/stk3600/include/board.h:#define XTIMER_HZ           (250000UL)
boards/stk3700/include/board.h:#define XTIMER_HZ           (250000UL)
boards/waspmote-pro/include/board.h:#define XTIMER_HZ                   (62500UL)
cpu/esp32/periph/timer.c:    CHECK_PARAM_RET (freq == XTIMER_HZ_BASE, -1);
cpu/esp32/periph/timer.c:    CHECK_PARAM_RET (freq == XTIMER_HZ_BASE, -1);
cpu/esp8266/periph/timer.c:    CHECK_PARAM_RET (freq == XTIMER_HZ_BASE, -1);
cpu/esp8266/periph/timer.c:    CHECK_PARAM_RET (freq == XTIMER_HZ_BASE, -1);
tests/bench_timers/Makefile:test-xtimer: CFLAGS+=-DTEST_XTIMER -DTIM_TEST_FREQ=XTIMER_HZ -DTIM_TEST_DEV=XTIMER_DEV

Looking at all this really needs a lot of concentration D:

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could propose something with my ideas and see if it would help.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did you see #7630 regarding passing a reference time as argument to _xtimer_set_absolute?
It needs a rebase but it is related to this discussion I think.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did not, I just jumped in the xtimer issue to provide a test, but it means that I had a similar reasoning. I will provide a minimal test to show an error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gebart I think your fix still misses something, from what I get, the function should be calling xtimer_now from within a masked interrupt context, so that the time between this now and the periph_timer_set can be guaranteed to be fixed as you cannot be de-scheduled.

If not the comparison with XTIMER_BACKOFF does not guarantee anything.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try to show these with tests.

@Josar
Copy link
Contributor Author

Josar commented Oct 31, 2018

@cladmi i think a low level timer mock is a good idea.

I think a test just needs to set the counter register of the lltimer to the end of the period before setting the xtimers. And also set the 32 bit low counter of the xtimer to the end of a period before setting the xtimer.

Basically testing is needed for

  1. First timer is at end of lltimer period and 2 timer is in next lltimer period.
  2. First timer is at the end of the 32 bit period and 2 timer is at the beginning of the next 32 bit period.
  3. A lltimer overflow happens while timer 1 is allready processed and waiting for timer 2 in the interrupt routine. But after line 496 if (reference > _xtimer_lltimer_now()) {

@cladmi
Copy link
Contributor

cladmi commented Nov 1, 2018

And also set the 32 bit low counter of the xtimer to the end of a period before setting the xtimer

You are talking about _xtimer_high_cnt right ?
If I am right, it is to handle boards where the low level timer is less than 32 bit.
As setting the lltimer close to overflow would not be enough to trigger issues.

Your test description is to show the fix in the interrupt handler right ?

I think I will try to show only the issue with _xtimer_set_absolute alone first.
This way I have only one timer :)

@Josar
Copy link
Contributor Author

Josar commented Nov 1, 2018

  1. Placing the target in the right list affects both 32 and not 32 hardware timers.

If the target is at the beginning of the next low level timer period it will end in the long_list, regardless of the fact that target - XTIMER_BACKOFF is in this low level period. But when this period ends and it is set as new low level timer target it will be placed at the end of the new low level period.

  1. Sheduling the next target in _timer_callback is a problem for not 32 bit low level timer.

  2. The target overun problem is only stated here to ensure XTIMER_BACKOFF is chosen accordingly.

As the test supplied in #10174 enables to see issues, this does not mean that this PR handles them all. It was a first step to even out the most trivial bugs which had a big impact.

De-sheduling is a seperate problem. And can happen at multiple times, this is a problem of not being interrupt safe. Basically a thread might be de-sheduled between reading the xtimer value untill it sets its new target. But this is not meant to be handled in this PR and requires a substantial rework.

@MichelRottleuthner
Copy link
Contributor

MichelRottleuthner commented Nov 1, 2018

As the test supplied in #10174 enables to see issues, this does not mean that this PR handles them all. It >was a first step to even out the most trivial bugs which had a big impact.

I think this is important to underline - this PR is already a huge improvement and we shouldn't try to fix all the problems at once-> IMO: test the PR on various boards/archs as is and merge. We can fix the remaining problems (that are now easier to find) with follow-ups. This PR was hanging around here for long enough...

@jia200x
Copy link
Member

jia200x commented Nov 1, 2018

I'm running all test with this branch on samr21-xpro. Post results later

@leandrolanzieri
Copy link
Contributor

@Josar please rebase now that #10174 has been merged.

1. When the 32 bit target of the xtimer overflowed the timer was not placed in the right list.
2. When the hardware timer overflowed the comparison was wrong for setting next target.

3. Backoff condition
@Josar Josar force-pushed the pr/xtimer_target_overflow branch from b9a8fa1 to 6723689 Compare November 1, 2018 15:49
@Josar
Copy link
Contributor Author

Josar commented Nov 1, 2018

@leandrolanzieri done.

@jia200x
Copy link
Member

jia200x commented Nov 1, 2018

These are the results for compile-and-test-for-board.py with this PR on samr21-xpro. Only 2 errors, completely unrelated to this PR.

ERROR:samr21-xpro:Tests failed: 2
Failures during test:
- [tests/driver_my9221](tests/driver_my9221/test.failed)
- [tests/pkg_fatfs_vfs](tests/pkg_fatfs_vfs/test.failed)

@leandrolanzieri
Copy link
Contributor

Here are some measurements I did to compare results of master to this PR.
Arduino Duemilanove (ATmega328)

  • Master
    master_2009

  • With PR
    pr_applied_2009

  • Conclusion
    In master multiple gaps can be seen in workers and main pins. After the PR was applied that only happened 3 times (always with the same gap time, maybe caused by what's mentioned here)? but it was improved a lot.

Arduino DUE

  • Master
    master_due

  • With PR
    pr_applied_due

  • Conclusion
    Could not reproduce the gaps in master with this board. Workers and the main thread pin appear to be fine. After the PR all is the same.

Note: Capture files can be downloaded here

@Josar
Copy link
Contributor Author

Josar commented Nov 2, 2018

@leandrolanzieri with a 32bit hardware timer running with 1MHz this proplem may happen once in ca. 71 minutes, if i am not mistaken. To provoke this error you could let the timer run faster then 1MHz.

Besides that one of those fixes only affects not 32bit timers.

@cladmi
Copy link
Contributor

cladmi commented Nov 3, 2018

I really enjoy the upgrade documentation and improvement.

However I do not like that it was merged only on testing without reviewing the documentation and consequence out what was written there.
Many of the things put can in fact not be ensured in a multi-threaded environment.

I tried to prove my assumptions with #10321 please show me if my tests are wrong to say my critics are wrong.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: timers Area: timer subsystems CI: ready for build If set, CI server will compile all applications for all available boards for the labeled PR CI: run tests If set, CI server will run tests on hardware for the labeled PR Type: bug The issue reports a bug / The PR fixes a bug (including spelling errors) Type: enhancement The issue suggests enhanceable parts / The PR enhances parts of the codebase / documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.