1.5.0 Stability issues when boiler is not supporting enabled sensors #115

Open
SanFable opened this issue Dec 28, 2024 · 45 comments
Labels
bug Something isn't working

Comments

@SanFable

SanFable commented Dec 28, 2024

Hello,

I’ve been using OTGateway with my Beretta Ciao Green 25 C.S.I. boiler for about a year. It works fine with smart TRVs and Home Assistant, although the functionality is somewhat limited.

Recently, I updated to version 1.5.0 and switched to an ESP32-S3 (I also tried the ESP32-C3 with the same results). Previously, I was using version 1.4.5 with an ESP8266.

The Issue
After the update, the system became unstable:

  • After about a minute, the OpenTherm Gateway status shows "problematic," and everything becomes unavailable.

  • The ESP32-S3 seems to struggle:

      • The web interface is very slow and unresponsive.

      • Logs over Telnet are delayed and limited.

  • With the ESP32-C3, the system was completely unreachable.

What I Found
I think the issue is caused by new sensors/IDs, such as:

  • Return temperature
  • Flow rate
  • Exhaust temperature
  • Pressure
  • Minimum modulation
  • Maximum power

In the logs, I saw warnings like:
[WARN] Failed to receive ...

It looks like the boiler doesn’t support these IDs, and they might be overloading the OpenTherm communication, causing it to crash.

What Worked
I turned off these new sensors and reconnected the OTGateway to the boiler. Since then, it’s been running stable for over 30 minutes, and the OpenTherm Gateway status is "OK."

Attaching logs_1.txt from a setup where I didn’t disable all the mentioned sensors (minimum modulation and maximum power were still enabled).
In the end, the ESP32 crashed, and the Telnet connection was lost.

I just realized that even after turning off the mentioned sensors in my successful run, I’m still seeing invalid request IDs 15 and 14 (minimum modulation, maximum power, and maximum modulation).

Attaching logs_2.txt from over 30 minutes of stable operation. However, there are still warnings in the logs from sensors that should be disabled. Could it be that something is overlapping when these sensors are enabled?
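
For anyone comparing the logs against raw bus traffic, here is a minimal sketch of how a 32-bit OpenTherm frame splits into its fields, so that an UNKNOWN-DATAID reply for an ID such as 14 or 15 can be told apart from a valid READ-ACK. The function names are hypothetical and are not taken from the OTGateway firmware. It is also only an assumption that the "Failed to receive" warnings correspond to UNKNOWN-DATAID replies; they could equally be timeouts.

```python
# Sketch only: decode_frame/is_unsupported are hypothetical helper names.
MSG_TYPES = {
    0: "READ-DATA", 1: "WRITE-DATA", 2: "INVALID-DATA", 3: "RESERVED",
    4: "READ-ACK", 5: "WRITE-ACK", 6: "DATA-INVALID", 7: "UNKNOWN-DATAID",
}


def decode_frame(frame: int) -> dict:
    """Split a 32-bit OpenTherm frame into parity, type, ID and value."""
    frame &= 0xFFFFFFFF
    raw = frame & 0xFFFF                      # bits 0-15: data value
    signed = raw - 0x10000 if raw & 0x8000 else raw
    return {
        # The parity bit (bit 31) makes the total number of 1 bits even.
        "parity_ok": bin(frame).count("1") % 2 == 0,
        "msg_type": MSG_TYPES[(frame >> 28) & 0x7],   # bits 28-30
        "data_id": (frame >> 16) & 0xFF,              # bits 16-23
        "raw": raw,
        "f8_8": signed / 256.0,               # signed fixed-point reading
    }


def is_unsupported(frame: int) -> bool:
    """True when the boiler answered that it does not know the data ID."""
    return decode_frame(frame)["msg_type"] == "UNKNOWN-DATAID"


if __name__ == "__main__":
    print(decode_frame(0xF00F0000))  # UNKNOWN-DATAID reply for ID 15
    print(decode_frame(0xC01C0A80))  # READ-ACK for ID 28 (return temperature)
```

The `f8_8` field is only meaningful for IDs that carry a fixed-point temperature or pressure; for flag or integer IDs, use `raw` instead.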

EDIT:
After about 6 hours I had a few reconnections:
image

PS: Is there anything I could do to improve support for my boiler?

@tincanpete

I believe I'm having the same issue. The symptoms are the same, and I have also recently upgraded to 1.5.0 but have not been able to get it running in any stable or reliable way.

I think the issue might have something to do with MQTT as I get better results with it turned OFF but have not absolutely verified this yet.

@Daveblanche

Hi both,
Quick question: did you delete the OpenTherm Gateway device from MQTT in HA before updating the firmware or changing the chipset? Yuri warned of potential errors if this was not done.

@tincanpete

Thanks @Daveblanche for the tip, but yes I definitely did do that as recommended. I think the issue is on the OT Gateway end not the HA end though. I'm going to continue to try to figure this out and will post more here as I make progress.

@SanFable
Author

@Daveblanche
Before the setup I went to HA Settings -> Devices and services -> MQTT -> Devices, selected OpenTherm, and removed it. Then, after installing the new one, I removed the old cards from the dashboard and added new ones.

Regarding stability, I had a few reconnections yesterday, but at 23:00 I disabled logging (serial and Telnet), and since then it has failed only once in a whole day, for 40 seconds.

@Laxilef
Owner

Laxilef commented Dec 30, 2024

Hi guys,

I use S3 myself and tested the project on C3, but I can't reproduce the problem. Indeed, the web interface works slower when there is no connection to the boiler via OpenTherm, and I will try to fix this.

As for the fact that polling some IDs breaks the bus - I don't know why this could be. Perhaps there is some kind of bug in the boiler firmware. In the logs I did not see a poll of these IDs and loss of connection via OT.

If you have more information it will help.

@Laxilef
Owner

Laxilef commented Dec 31, 2024

Anyone who has problems with losing connection, test this build.
1.5.1-testing.zip

And happy holidays!

@tincanpete

Thank you @Laxilef I will give it a try!

@SanFable
Author

SanFable commented Dec 31, 2024

@Laxilef @tincanpete

I have tested 1.5.1.
I set it up as usual and kept Telnet logs enabled. For 20 minutes I stayed with the default settings (I only filled in the MQTT section), with warnings caused by the boiler's limited support, and everything was stable. After that I disabled the sensors that do not exist in my boiler (it's a shame that it lacks some useful sensors...).

No reconnections for 2 hours; it works now. The page is responsive. No complaints. This state was not achievable on 1.5.0. Good job :)

I'm interested in @tincanpete's feedback.

Attaching logs.txt. I have a few warnings, maybe something minor or expected.

Edit:

I hit Ctrl+C in the Telnet terminal (my bad, lol), then closed PuTTY and wanted to turn Telnet off in the settings, and the page is laggy, with a problem detected in HA. RIP, lol.

After turning off Telnet and restarting the ESP, everything is laggy again, so the issue is still open :/

About 2 minutes after boot it might have gone back to normal (not sure, maybe 70% of the usual responsiveness?). I will see if I get any reconnections.

I think we need more detailed logs.

I swapped to the C3 and it is nearly unusable, so it looks like the S3 barely handles the problem, whereas the C3 does not.
I swapped back to the S3 in the same way and now it's as snappy as it should be. No idea what's going on. I will keep it in this state and watch for reconnections.

@tincanpete

Hello, I have also just tested 1.5.1, and while I thought there was some initial success, it seems not. Having MQTT enabled definitely makes things significantly worse.

I have attached a file showing repeated 'ping' to the board. When it's working well, I always see about 10-50ms. However as you can see it quickly becomes unstable and unresponsive; and will eventually sort-of come back to life but it seems quite random.

During the time when ping is very slow or dropped, the UI is also unresponsive, and if connected, the MQTT server will report the device is off-line. Sometimes my boiler will also report an OpenTherm communication error on its display.

When the problem goes away, everything goes back to normal and works OK, but often not long enough to be useful.

I'm using S2-mini board, and this ping trace was done with the gateway serial port, telnet, and logging all turned OFF to try and make sure that high level logging wasn't causing high load to be part of the problem.

ping example.txt
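
A trace like this can be summarized with a few lines of Python. This is a rough sketch that assumes the file holds standard `ping` output with `time=... ms` on each reply and the word "timeout" or "timed out" on lost packets; the attached file's exact format is not shown here, and the function name is hypothetical. The 50 ms threshold comes from the "about 10-50ms" figure above.

```python
import re

# Matches "time=23.4 ms" as well as the Windows style "time<1ms".
TIME_RE = re.compile(r"time[=<]([\d.]+)\s*ms")


def summarize_ping(lines, slow_ms=50.0):
    """Count replies, lost packets and replies slower than slow_ms."""
    times = []
    lost = 0
    for line in lines:
        match = TIME_RE.search(line)
        if match:
            times.append(float(match.group(1)))
        elif "timeout" in line.lower() or "timed out" in line.lower():
            lost += 1
    return {
        "replies": len(times),
        "lost": lost,
        "slow": sum(1 for t in times if t > slow_ms),
        "max_ms": max(times, default=None),
    }


if __name__ == "__main__":
    # Made-up sample lines, for illustration only.
    sample = [
        "64 bytes from 192.168.1.50: icmp_seq=1 ttl=64 time=23.4 ms",
        "64 bytes from 192.168.1.50: icmp_seq=2 ttl=64 time=812.0 ms",
        "Request timeout for icmp_seq 3",
    ]
    print(summarize_ping(sample))
```

Running it over traces taken with MQTT on and off would give comparable numbers instead of an impression.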

Thanks for your help!

@SanFable
Author

SanFable commented Dec 31, 2024

@tincanpete
Try the following:
download a backup of your current settings
reflash the whole ESP32 from a PC (flash the factory image and filesystem, as for a new device)
when connecting to the ESP32 for the first time, restore the settings
connect the ESP32 to the boiler
After that I set up the settings I wanted, and it's working OK (2 hours now).

In my case the ESP32-S3 seems fine, but when I changed to the weaker ESP32-C3 it was a nightmare, just not working. I guess I would have had results similar to yours.

If you still have problems, maybe try disabling the sensors that are not available on your boiler (to minimize warnings in the logs).

I believe my pings.txt is fine; it's Wi-Fi with a signal of 68-76% reported on the OTG page.

@tincanpete

@SanFable I will try that, I did not use the "factory" image, just the normal one and upgraded via the UI.
I will report back my progress!
Thanks

@tincanpete

tincanpete commented Jan 1, 2025

Still no joy unfortunately even after fully erasing and re-flashing the S2 and using the factory bin.

I have attached a larger log but just look at this extract from the end:

[00:02:56][OT][DHW][NOTICE] Received flow rate: 0.00 (converted: 0.00)
[00:02:56][SENSORS][NOTICE] #6 'DHW flow rate' new value 0: 0.00, compensated: 0.00, raw: 0.00
[00:02:56][OT][HEATING][NOTICE] Received temp: 10.00
[00:02:56][SENSORS][NOTICE] #2 'Heating temp' new value 0: 10.00, compensated: 10.00, raw: 10.00
[00:02:57][OT][HEATING][NOTICE] Received return temp: 10.60 (converted: 10.60)
[00:02:57][SENSORS][NOTICE] #3 'Heating return temp' new value 0: 10.60, compensated: 10.60, raw: 10.60
[00:02:58][OT][NOTICE] Received exhaust temp: 10.00 (converted: 10.00)
[00:02:58][SENSORS][NOTICE] #7 'Exhaust temp' new value 0: 10.00, compensated: 10.00, raw: 10.00
[00:02:58][OT][NOTICE] Received pressure: 0.80 (converted: 0.80)
[00:02:58][SENSORS][NOTICE] #8 'Pressure' new value 0: 0.80, compensated: 0.80, raw: 0.80
[00:02:59][OT][NOTICE] Received boiler status. Heating: 0; DHW: 0; flame: 0; cooling: 0; fault: 0; diag: 0
[00:02:59][SENSORS][NOTICE] #9 'Modulation level' new value 0: 0.00, compensated: 0.00, raw: 0.00
[00:02:59][SENSORS] #10 'Power' new value 0: 0.00, compensated: 0.00, raw: 0.00
[00:03:11][SENSORS][NOTICE] #4 'Heating setpoint temp' new value 0: 30.00, compensated: 30.00, raw: 30.00
[00:03:11][OT][DHW][NOTICE] Received temp: 11.00 (converted: 11.00)
[00:03:18][SENSORS][NOTICE] #5 'DHW temp' new value 0: 11.00, compensated: 11.00, raw: 11.00
[00:03:28][SENSORS][NOTICE] #4 'Heating setpoint temp' new value 0: 30.00, compensated: 30.00, raw: 30.00
[00:03:37][OT][DHW][NOTICE] Received flow rate: 0.00 (converted: 0.00)
[00:03:38][SENSORS][NOTICE] #6 'DHW flow rate' new value 0: 0.00, compensated: 0.00, raw: 0.00
[00:03:44][OT][HEATING][NOTICE] Received temp: 10.20
[00:03:53][SENSORS][NOTICE] #2 'Heating temp' new value 0: 10.20, compensated: 10.20, raw: 10.20
[00:03:55][SENSORS][NOTICE] #4 'Heating setpoint temp' new value 0: 30.00, compensated: 30.00, raw: 30.00
[00:03:55][OT][HEATING][NOTICE] Received return temp: 10.80 (converted: 10.80)
Connection closed by foreign host.

Note the time stamps from 2:59 onwards, there's a big delay between each one.
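
To make such stalls easier to spot in a long log, here is a small sketch that scans the `[HH:MM:SS]` prefixes and lists every jump above a threshold. The function name and the 5-second default are arbitrary choices for illustration and are not part of OTGateway.

```python
import re

# Matches the "[00:02:59]" prefix at the start of each log line.
STAMP_RE = re.compile(r"^\[(\d+):(\d{2}):(\d{2})\]")


def find_gaps(lines, min_gap_s=5):
    """Return (previous, current, gap) tuples in seconds for large jumps."""
    gaps = []
    prev = None
    for line in lines:
        match = STAMP_RE.match(line)
        if not match:
            continue  # e.g. "Connection closed by foreign host."
        hours, minutes, seconds = map(int, match.groups())
        now = hours * 3600 + minutes * 60 + seconds
        if prev is not None and now - prev >= min_gap_s:
            gaps.append((prev, now, now - prev))
        prev = now
    return gaps


if __name__ == "__main__":
    extract = [
        "[00:02:59][SENSORS] #10 'Power' new value 0: 0.00",
        "[00:03:11][SENSORS][NOTICE] #4 'Heating setpoint temp' new value 0: 30.00",
        "[00:03:18][SENSORS][NOTICE] #5 'DHW temp' new value 0: 11.00",
    ]
    for prev, now, gap in find_gaps(extract):
        print(f"gap of {gap}s between t={prev}s and t={now}s")
```

On the extract above it reports the 12-second jump from 00:02:59 to 00:03:11 and the 7-second jump to 00:03:18.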

At the same time, the pings to the board are looking like this screenshot:

image

Full log file below:
1.5.1 log.txt

All of this was with MQTT turned OFF by the way, which I thought would be better, but did not help actually.

thanks!
Pete

@tincanpete

To add to my previous comment, I did try disabling the "power" sensor which my boiler does not support, but it didn't seem to make any difference.

@SanFable
Author

SanFable commented Jan 1, 2025

From my side, I had 2 reconnections every 24 hours using the ESP32-S3. I assume the ESP32-S3 is powerful enough to handle whatever is going wrong in the background. The C3 was not usable.

@Laxilef
Owner

Laxilef commented Jan 1, 2025

Guys, I can assume that the problem may be in the router. Let's check it: check how the web interface works when connected to the ESP access point, i.e. when the network is not yet configured on the ESP.

P.S. What routers do you use? Can you disable telnet and check the web? To view the logs at this moment, you can use the serial port.

@SanFable
Author

SanFable commented Jan 1, 2025

I'm using ubiquiti U7 pro. Everything was fine on previous versions (1.4.5 in my case with mini D1).

When configuring it for the first time, connected to the ESP access point, the web interface is blazing fast. I will try to determine whether it slows down after connecting to the boiler or not.

UniFi controller doesn't show me problems, it says wifi experience excellent (95%+), some spikes to good (88%)

@Laxilef
Owner

Laxilef commented Jan 1, 2025

> Everything was fine on previous versions (1.4.5 in my case with mini D1).

Test 1.5.0 on your D1 mini. Now you are testing on ESP32, these are different boards and there is a different SDK.

upd: If you are using mesh and multiple access points, this may not work correctly with ESP. I don't know why, but sometimes it happens. And I don't recommend using 2.4 GHz and 5 GHz APs with the same SSID.

@tincanpete

Interesting idea, I had considered it might be a wifi/router issue, however the network is stable and has been for a long time, running Home Assistant and many other ESP-based devices (Shelly Relays and similar) without a problem. Do you think there's a chance my Wemos S2-mini board just "doesn't like" the wifi network? The problem was present, although not as severe, with software 1.4.5.

@Daveblanche

I’m on UniFi, too. Check your retry rates on the front page of UniFi network app. I had an issue a few weeks ago, with high retries, and it was down to channel choice/availability.

My network is incredibly stable, too.

@Laxilef
Owner

Laxilef commented Jan 1, 2025

@SanFable
Author

SanFable commented Jan 1, 2025

I have only one U7 Pro, no meshing.
image
As for TX retries, I would say OpenTherm is in the very middle of the devices. My 2.4 GHz band is just noisy. I'm on channel 1 at 20 MHz, which UniFi chose automatically. I will experiment with other channels and check the TX rate.

I will try d1 mini tomorrow.

@Laxilef
Owner

Laxilef commented Jan 2, 2025

ESP32 C3 connected to mikrotik, OT not connected

esp.c3.webm

@tincanpete

I've not had a chance to try with Wireless AP only as the hardware is running in a shop we own and I've not been there for a couple of days. However, remotely monitoring it I have just noticed the "Uptime" on the UI homepage has been reset and the "Last Reset Reason" is showing "Reset due to other watchdogs". Does this offer any clues to you?

@Laxilef
Owner

Laxilef commented Jan 4, 2025

> Does this offer any clues to you?

Screenshot_9

@tincanpete

No, "save debug data" just gives me this:


{
  "build": {
    "version": "1.5.1-testing",
    "date": "Dec 31 2024 02:02:27",
    "env": "s2_mini",
    "core": "3.1.0",
    "sdk": "v5.3.2-174-g083aad99cf-dirty"
  },
  "heap": {
    "total": 188452,
    "free": 55288,
    "minFree": 48052,
    "maxFreeBlock": 30708,
    "minMaxFreeBlock": 25588
  },
  "chip": {
    "model": "ESP32-S2",
    "rev": 100,
    "cores": 1,
    "freq": 240
  },
  "flash": {
    "size": 4194304,
    "realSize": 4194304
  },
  "crash": {
    "reason": "Reset due to other watchdogs",
    "core": 0,
    "heap": 58416,
    "uptime": 680780475
  }
}

Under what circumstances will Watchdog cause a reboot?
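
As a side note, the heap figures in this dump can be turned into two quick indicators. The sketch below is only illustrative: `heap_report` is a hypothetical helper, and the fragmentation formula (one minus largest free block over total free heap) is a common rule of thumb, not something OTGateway reports itself. Neither number explains a watchdog reset on its own.

```python
import json


def heap_report(debug_json: str) -> dict:
    """Derive heap usage and a rough fragmentation figure from debug data."""
    heap = json.loads(debug_json)["heap"]
    used = heap["total"] - heap["free"]
    return {
        "used_pct": round(100 * used / heap["total"], 1),
        # Rule of thumb: share of free heap outside the largest free block.
        "fragmentation_pct": round(
            100 * (1 - heap["maxFreeBlock"] / heap["free"]), 1
        ),
        "min_free": heap["minFree"],
    }


if __name__ == "__main__":
    # Heap section copied from the debug data above.
    debug = """
    {"heap": {"total": 188452, "free": 55288, "minFree": 48052,
              "maxFreeBlock": 30708, "minMaxFreeBlock": 25588}}
    """
    print(heap_report(debug))
```

For the values above this gives roughly 71% of the heap in use and a minimum of about 48 KB free, which does not look like memory exhaustion.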

@Laxilef
Owner

Laxilef commented Jan 4, 2025

Hmm, strange, there is no backtrace in the debug data. Without a backtrace it is impossible to find out the reason.
There may be many reasons, sometimes it is related to poor power supply of the ESP.

@Symon84

Symon84 commented Jan 5, 2025

Hi, I have a similar issue with system instability (frequent disconnections).
The system seems to get saturated.
This happens exactly after modifying the values related to Emergency mode.

By default, I have these parameters:

Target temperature: 40
Threshold time: 120

My system is configured with a minimum flow temperature of 50 degrees.
If I change the target temperature to 50 degrees in Emergency mode and set the threshold to 120, the system stops working, becomes unstable, connects and disconnects continuously, and constantly activates Emergency mode.

I’ve tried this three times (always with firmware 1.5.0), and the issue has replicated every time.
The only way to restore the system is to erase the firmware and flash it again.

If I leave the Emergency mode values unchanged (40°C and 120 seconds), everything works correctly.

I hope this can help.
Thank you so much for the amazing work!

@Symon84

Symon84 commented Jan 5, 2025

Never mind… the problem has now reappeared even without modifying the parameters.
I’ll try replacing the ESP8266 (D1 Mini) to see if it’s a hardware issue. I’ll keep you updated.
For the record, it worked perfectly for four consecutive days.

The disconnection issue occurs even if the device is not connected to the boiler.

@Laxilef
Owner

Laxilef commented Jan 5, 2025

If your ESP is powered via USB, try replacing the power supply with a different one.

@Symon84

Symon84 commented Jan 7, 2025

> If your ESP is powered via USB, try replacing the power supply with a different one.

Initially, the D1 mini was connected with an external stabilized power supply (via 5V pin).
I tried replacing the power supply with a USB type power supply but the problem persists.
I also tried to change the D1 mini with a new one, but I still have the same problem.
Today I will try disabling the 5G Wi-Fi network, but I have many other devices (including D1 minis) connected that never have connection problems.

@Laxilef
Owner

Laxilef commented Jan 8, 2025

Guys, I'm not saying there is no problem. But I don't know what causes the problem and how to fix it because I can't reproduce it.

Perhaps for some reason the router is disconnecting the client. I think that you need to compile the firmware with core logs and see what is happening in more detail via COM port.

Example of additional build_flags for ESP8266:
-D DEBUG_ESP_CORE -D DEBUG_ESP_WIFI -D DEBUG_ESP_PORT=Serial

Laxilef added the bug label on Jan 8, 2025

@Symon84

Symon84 commented Jan 8, 2025

Ok, I’ve run all the tests I could, including using version 1.5.1 and disabling the 5G WiFi network, but I haven’t seen any improvement.
I’ll try setting up an S2 mini to see if things get better.

Sorry for the question, but on the release page, what’s the difference between these two firmware files:

firmware_s2_mini_1.5.1.bin
firmware_s2_mini_1.5.1.factory.bin
Thank you very much!

@Daveblanche

Daveblanche commented Jan 8, 2025 via email

@Laxilef
Owner

Laxilef commented Jan 8, 2025

> Ok, I’ve run all the tests I could, including using version 1.5.1 and disabling the 5G WiFi network, but I haven’t seen any improvement.

Maybe you have some other wifi router to compare with it?

> Sorry for the question, but on the release page, what’s the difference between these two firmware files:
>
> firmware_s2_mini_1.5.1.bin
> firmware_s2_mini_1.5.1.factory.bin

Factory for flashing via esptool, not factory for OTA.

> I’m making a massive assumption that this plugin is based on esphome?

No, we are talking about the firmware from this repository :)

@Symon84

Symon84 commented Jan 8, 2025

> Maybe you have some other wifi router to compare with it?

Yes, but replacing the router is not a simple operation (I have many devices connected, including 4 D2 mini that have been working without problems for about a year..)
I keep this option as the "final" test..

> Factory for flashing via esptool, not factory for OTA.

OK! I've tried with a Wemos S2 mini, but I have the same problem.
I've also disabled Telnet, the serial port, and logging, without any improvement.

As soon as I have some free time I will do more tests..

@Laxilef
Owner

Laxilef commented Jan 9, 2025

> Yes, but replacing the router is not a simple operation (I have many devices connected, including 4 D2 mini that have been working without problems for about a year..)
> I keep this option as the "final" test..

You can just turn on another router, connect the ESP and your computer to it. You do not need to configure Internet access for this router. To test, you don't need to change the router for all devices :)

@Symon84

Symon84 commented Jan 9, 2025

> You can just turn on another router, connect the ESP and your computer to it. You do not need to configure Internet access for this router. To test, you don't need to change the router for all devices :)

In this case I will try as soon as possible and I will check through a ping, even if it will not be connected to home Assistant via MQTT. Thanks!

@Symon84

Symon84 commented Jan 10, 2025

Guys, I found the issue in my case.
Even though the Wemos is installed just 2 meters away from the router and the RSSI is -54 dBm, the frequent disconnections are due to the placement of the device (near the boiler).
The boiler has a metal casing, which probably interferes with the signal in some way, even though the Wemos is close to it.
I’ll have to modify the Wemos by adding an external antenna.

It didn’t occur to me immediately because I have other D1 Minis inside walls, near mains voltage, and even 8 meters away from the router, which have no issues at all. Probably, the area near the boiler (even though very close to the router) is subject to interference.

In the following image, you can see how moving the D1 changes the situation: the red zone is where it was near the boiler, while even moving it just a few centimeters makes the connection stable enough.
image

@Laxilef
Owner

Laxilef commented Jan 10, 2025

Well, if you placed the ESP inside the boiler, then you made a shielded box for it 😄
In fact, there are many electronics in the boiler that can interfere with the ESP if the ESP is located near the boiler electronics. For example, the ignition works from high voltage, which can sometimes even interfere with the bus.

Now I wonder if other users from this issue have the same reason or not.

@SanFable
Author

I have installed my ESP32 on the bottom of the boiler (not inside a case). Previously I had a Wemos D1 mini with 1.4.6 without any problems.

Wemos S3 mini, current state:
The page is not 100% as fast as it was before connecting to the boiler.
I have 1 or 0 disconnections daily. This one is from today, 1 minute before the boiler started heating.
image

The second disconnect happened about 8 minutes after heating started.
image

But the third disconnect happened totally randomly, when everything was shut off.
image

The fourth (on 4 Jan) was offline for over a minute. Interesting.
image

PS
I moved the Wi-Fi channel from 1 to 6 and have fewer TX retries. FYI, UniFi auto selection seems bad :)
image

Sorry for the late reply. I will try the Wemos D1 this weekend; I had a lot of work.

@Laxilef
Owner

Laxilef commented Jan 10, 2025

@SanFable can you try moving the ESP further away from the boiler?

@SanFable
Author

It's glued and powered by a Mean Well power supply. I will try to disconnect it from the shield and move it about a meter away (I hope that a meter won't affect OT communication).

@Symon84

Symon84 commented Jan 10, 2025

> Well, if you placed the ESP inside the boiler, then you made a shielded box for it 😄
>
> In fact, there are many electronics in the boiler that can interfere with the ESP if the ESP is located near the boiler electronics. For example, the ignition works from high voltage, which can sometimes even interfere with the bus.
>
> Now I wonder if other users from this issue have the same reason or not.

😅 No, the ESP is outside the boiler, about 20 cm away inside a plastic box, but it is probably still too close to the boiler that somehow interferes with wireless communication.

@Symon84

Symon84 commented Jan 15, 2025

I need to revise my previous statement. The connection actually remained stable for almost two days, but then the same issues resurfaced. I tried moving the device closer to the router, but the problem persists. I downgraded to version 1.4.6, but this didn’t bring any improvements.

I’m investigating other possible causes. It seems that the Wemos gradually slows down over time, eventually causing intermittent Wi-Fi disconnections. I’ll share an update if I find a solution to the issue.

@Laxilef
Owner

Laxilef commented Jan 16, 2025

> It seems that the Wemos gradually slows down over time, eventually causing intermittent Wi-Fi disconnections.

In case of memory leaks, ESP restarts would occur.

5 participants