Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

After upgrading from 5.6.1 to 5.7.0 there are random restarts due to "Hardware watchdog" #853

Closed
wiktorschmidt opened this issue Sep 9, 2017 · 13 comments

Comments

@wiktorschmidt
Copy link

On multiple devices I own (sonoff, sonoff 4ch pro, wemos)

@arendst
Copy link
Owner

arendst commented Sep 9, 2017

Cannot reproduce:

Sep  9 19:01:59 wkaku10 ESP-MQT: tele/wkaku10/UPTIME = {"Time":"2017-09-09T19:02:00", "Uptime":49}
Sep  9 19:01:59 led2 ESP-MQT: tele/led2/UPTIME = {"Time":"2017-09-09T19:02:00", "Uptime":49}
Sep  9 19:01:59 rf1 ESP-MQT: tele/rf1/UPTIME = {"Time":"2017-09-09T19:02:00", "Uptime":49}
Sep  9 19:01:59 wkaku7 ESP-MQT: tele/wkaku7/UPTIME = {"Time":"2017-09-09T19:02:00", "Uptime":49}
Sep  9 19:02:00 pow1 ESP-MQT: tele/pow1/UPTIME = {"Time":"2017-09-09T19:02:00", "Uptime":49}
Sep  9 19:02:00 wkaku6 ESP-MQT: tele/wkaku6/UPTIME = {"Time":"2017-09-09T19:02:00", "Uptime":49}
Sep  9 19:02:00 wkaku8 ESP-MQT: tele/wkaku8/UPTIME = {"Time":"2017-09-09T19:02:00", "Uptime":49}
Sep  9 19:02:00 wkaku11 ESP-MQT: tele/wkaku11/UPTIME = {"Time":"2017-09-09T19:02:00", "Uptime":49}
Sep  9 19:02:00 b1 ESP-MQT: tele/b1/UPTIME = {"Time":"2017-09-09T19:02:00", "Uptime":26}
Sep  9 19:02:00 ring1 ESP-MQT: tele/ring1/UPTIME = {"Time":"2017-09-09T19:02:00", "Uptime":26}
Sep  9 19:02:00 4ch1 ESP-MQT: tele/4ch1/UPTIME = {"Time":"2017-09-09T19:02:00", "Uptime":49}
Sep  9 19:02:00 dual ESP-MQT: tele/dual/UPTIME = {"Time":"2017-09-09T19:02:00", "Uptime":49}
Sep  9 19:02:00 led1 ESP-MQT: tele/led1/UPTIME = {"Time":"2017-09-09T19:02:00", "Uptime":49}
Sep  9 19:02:01 pow2 ESP-MQT: tele/pow2/UPTIME = {"Time":"2017-09-09T19:02:00", "Uptime":49}
Sep  9 19:02:01 th1 ESP-MQT: tele/th1/UPTIME = {"Time":"2017-09-09T19:02:00", "Uptime":49}
Sep  9 19:02:01 touch1 ESP-MQT: tele/touch1/UPTIME = {"Time":"2017-09-09T19:02:00", "Uptime":49}
Sep  9 19:02:01 sonoff3 ESP-MQT: tele/sonoff3/UPTIME = {"Time":"2017-09-09T19:02:00", "Uptime":49}
Sep  9 19:02:01 s20 ESP-MQT: tele/s20/UPTIME = {"Time":"2017-09-09T19:02:00", "Uptime":49}
Sep  9 19:02:01 wkaku9 ESP-MQT: tele/wkaku9/UPTIME = {"Time":"2017-09-09T19:02:00", "Uptime":49}
Sep  9 19:02:01 bridge1 ESP-MQT: tele/bridge1/UPTIME = {"Time":"2017-09-09T19:02:00", "Uptime":49}
Sep  9 19:02:02 t1 ESP-MQT: tele/t1/UPTIME = {"Time":"2017-09-09T18:02:00", "Uptime":49}
Sep  9 19:02:02 wkaku12 ESP-MQT: tele/wkaku12/UPTIME = {"Time":"2017-09-09T19:02:00", "Uptime":49}

I only see Blocked Loop on Sonoff SC...

Edit: Blocked Loop was due to low wifi signal. Selecting a nearer AP solved it.

@wiktorschmidt
Copy link
Author

Theo,

Can't really offer any more insight at this stage as I had to roll back all devices back to 5.6.1.

It's probably not related but the logs changed from showing "ESP-MQTT" to "ESP-MQT".

On 5.6.1 I now see some crashes as well (but only on wemos):

Sep 09 21:44:13 xxx logger: wemos-8115 ESP-MQTT: tele/wemos/INFO3 = {"Started":"Fatal exception:28 flag:2 (EXCEPTION) epc1:0x4020dcec epc2:0x00000000 epc3:0x00000000 excvaddr:0x00000004 depc:0x00
Sep 10 04:09:11 xxx logger: wemos-8115 ESP-MQTT: tele/wemos/INFO3 = {"Started":"Fatal exception:28 flag:2 (EXCEPTION) epc1:0x4020dcec epc2:0x00000000 epc3:0x00000000 excvaddr:0x00000004 depc:0x00
Sep 10 08:52:56 xxxx logger: wemos-8115 ESP-MQTT: tele/wemos/INFO3 = {"Started":"Fatal exception:28 flag:2 (EXCEPTION) epc1:0x4020dcec epc2:0x00000000 epc3:0x00000000 excvaddr:0x00000004 depc:0x00

@arendst
Copy link
Owner

arendst commented Sep 10, 2017

Log change is due to i18n where I changed log type from 4 to 3 characters. You can always make a change to your desired languga file.

Could you provide your status 0 output for the current wemos device.

Do you use the default images or compile yourself? I ask as I can only debug exceptions with the correct used source. If exception occurs you could use the stack trace to pinpoint the source area.

@wiktorschmidt
Copy link
Author

Got it.

Sure:
"{"Status":{"Module":18, "FriendlyName":"Sonoff", "Topic":"wemos", "ButtonTopic":"0", "Power":0, "PowerOnState":3, "LedState":1, "SaveData":1, "SaveState":1, "ButtonRetain":0, "PowerRetain":0}}

It's the default http://sonoff.maddox.co.uk/tasmota/sonoff-ds18x20.bin build

@arendst
Copy link
Owner

arendst commented Sep 11, 2017

I downloaded and Installed 5.6.1 sonoff-ds18x20.bin from my github on my wemos and executed command status 2 resulting in a software watchdog timeout...

I compiled 5.6.1 using Visual Studio / platformio and installed (sonoff-ds18x20) firmware.bin on my wemos and executed command status 2 resulting in exception 2 = software watchdog ...

I then compiled the same code with parameter USE_DS18x20 enabled using arduino and installed sonoff.ino.bin on my wemos. Executed command status 2 without any problem.

Comparing the binaries shows differences from the beginning of the code. This could be caused by compiler optimizations and linker differences between platformio and arduino. I always use arduino compiled code at home and that explains why I do not see the problems you are reporting.

Edit: I also changed the arduino libraries to the exact same libraries in the lib folder on my github, compiled using arduino and did not receive any exception.

Edit: Forgot to change PubSubClient specific parameter MATT_MAX_PACKET_SIZE to 758 and MESSZ to 600 for Arduino. After those changes the Arduino fails as well. Did fool around with other sizes and at one time made the status data global at which time the mqtt response went fine but it was still unstable.

I like to conclude that lack of stack space is causing the Watchdog timeouts during mqtt publish routine.

Pls try to use the basic sonoff.bin file and see what happens.

@wiktorschmidt
Copy link
Author

That makes sense. Thanks for investigating! I'm actually using up to 8 DS18x20 sensors, so would need those sizes larger unless there is a way to split the tele/+/SENSOR message? Is it possible to split sensor output into say 5 per MQTT msg?

@arendst
Copy link
Owner

arendst commented Sep 11, 2017

I've just finished a rewrite where the MQTT data is global for all routines. First tests are very promising; no exceptions.

I will further investigate where the boundaries lie now...

@wiktorschmidt
Copy link
Author

Thanks! Really appreciate it!

arendst added a commit that referenced this issue Sep 13, 2017
5.7.1d
 * Fix watchdog timeout caused by lack of stack space by now
using global buffers (#853)
 * Removed webserver syslog disable as now
no longer needed
 * Increased default MQTT message size from 368 to 405
bytes while keeping DMQTT_MAX_PACKET_SIZE = 512 (because we can)
 * Fix
MQTT Offline or Remove MQTT retained topic code
 * Allow command
SwitchTopic in group mode (#861)
 * Allow command SwitchMode if no
switches are defined (#861)
arendst added a commit that referenced this issue Sep 18, 2017
5.8.0 20170918
 * Remove the need for NeoPixelBus library for Hue
support
 * Consolidate WS2812 into Sonoff Led for flexible future led
strip library changes
 * Invert WS2812 fade speed to align with Sonoff
led (Speed 1 = fast, Speed 8 = slow)
 * Remove upper case MQTT receive
buffer
 * Reduce code and string length for output of commands Modules
and GPIOs
 * Add Sonoff SC debug information
 * Change syslog service
 *
Removed webserver syslog disable as now no longer needed
 * Increased
default MQTT message size from 368 to 405 bytes while keeping
MQTT_MAX_PACKET_SIZE = 512 (because we can)
 * Fix MQTT Offline or
Remove MQTT retained topic code
 * Fix Domoticz loop when Emulation is
selected
 * Add blink to WS2812 and Sonoff Led (#643)
 * Add option
WIFI_WAIT (5) to command WifiConfig to allow connection retry to same AP
without restart or update flash (#772, #869)
 * Add support for Witty
Cloud (#794)
 * Add GPIO14 to Sonoff Dual (#797, #839)
 * Add support
for Yunshan Wifi Relay (#802)
 * Add GPIO16 input pulldown (#827)
 * Add
timeout to DHT and DS18B20 sensors (#852)
 * Fix watchdog timeout caused
by lack of stack space by moving to heap (#853)
 * Allow LogPort and
MqttPort up to 65535 and add LogPort tot Status 3 (#859)
 * Allow
command SwitchTopic in group mode (#861)
 * Allow command SwitchMode if
no switches are defined (#861)
 * Add optional dimmer parameter to
command Wakeup for WS2812, AiLight, Sonoff B1, Led and BN-SZ01 (#867)
 *
Fix basic On, Off, Toggle, Blink and BlinkOff commands when other
language is selected (#874)
@stefanbode
Copy link
Contributor

Hi arend, did you remember my post regarding the timeout issue in ‚your‘ Pubsub version? I changed my one and since then all the blocking problem caused by mqtt are gone. Also startup time is now less than 6 seconds incl. status send to mqtt.

@stefanbode
Copy link
Contributor

Please insert in the PubSubClient.cpp at line 212 a yield(); and recompile from scratch. Known issue.

// reads a byte into result
boolean PubSubClient::readByte(uint8_t * result) {
uint32_t previousMillis = millis();
while(!_client->available()) {
yield();
uint32_t currentMillis = millis();
if(currentMillis - previousMillis >= ((int32_t) MQTT_SOCKET_TIMEOUT * 1000)){
return false;
}
}
*result = _client->read();
return true;
}

@arendst
Copy link
Owner

arendst commented Sep 23, 2017

@stefanbode To keep pubsubclient transparrent for both esp and atmega can you please verify that instead of yield() a delay() works just as fine?

esp delay should also perform a yield according to esp-arduino documentation and should make integration in pubsubclient a lot easier for knollery..

Edit: removed value from delay() as not needed:

void delay(unsigned long ms) {
    if(ms) {
        os_timer_setfn(&delay_timer, (os_timer_func_t*) &delay_end, 0);
        os_timer_arm(&delay_timer, ms, ONCE);
    } else {
        esp_schedule();
    }
    esp_yield();
    if(ms) {
        os_timer_disarm(&delay_timer);
    }
}

@stefanbode
Copy link
Contributor

Delay(1) is also fine. It is just to avoid the blocking of the loop.

@arendst
Copy link
Owner

arendst commented Sep 25, 2017

@stefanbode See #790 and let's continue discussion there if needed.

@arendst arendst closed this as completed Sep 25, 2017
curzon01 pushed a commit to curzon01/Tasmota that referenced this issue Sep 6, 2018
5.7.1d
 * Fix watchdog timeout caused by lack of stack space by now
using global buffers (arendst#853)
 * Removed webserver syslog disable as now
no longer needed
 * Increased default MQTT message size from 368 to 405
bytes while keeping DMQTT_MAX_PACKET_SIZE = 512 (because we can)
 * Fix
MQTT Offline or Remove MQTT retained topic code
 * Allow command
SwitchTopic in group mode (arendst#861)
 * Allow command SwitchMode if no
switches are defined (arendst#861)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants