Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thing manager logic causes things to stuck in NOT_YET_READY #3823

Open
splatch opened this issue Oct 3, 2023 · 44 comments · May be fixed by #4444
Open

Thing manager logic causes things to stuck in NOT_YET_READY #3823

splatch opened this issue Oct 3, 2023 · 44 comments · May be fixed by #4444
Labels
bug An unexpected problem or unintended behavior of the Core

Comments

@splatch
Copy link
Contributor

splatch commented Oct 3, 2023

I'm in process of updating bindings to OH 4 and spotted a bug, which I believe is related to new logic which improves reliability and predictability of thing startup process.

Bindings I have are rather basic in their form, one which I test with use no dynamic (generated) channels or thing types. Its entirely provided through XML descriptors. However, OH 4.0 makes it stuck in "NOT_YET_READY" state, despite of all descriptors being available and valid.

Expected Behavior

Valid thing types should cause handlers to be initialized.

Current Behavior

Despite of thing descriptors being properly parsed and available thing is marked as not ready, because its handler is refused.
I've traced it to a logic which populated "missing prerequisites". It turns that this collection is being feed in few places, but released only in one. This means that depending on startup order (and performance of machine), some handlers may never be used, because thing manager will miss opportunity to remove missing prerequisites, even if these conditions are met.
Please see below screenshot where missing prerequisites contains co7io-amsads:network, but thing thing type is already registered in appropriate registry:

image

Steps to Reproduce (for Bugs)

So far I was not able to get a reproducer, as it is strictly speaking, dependent to startup order. If thing is added before thing descriptor is parsed - there is high probability that ThingManager will never update solved prerequisites.

Your Environment

  • openHAB 4.0.3
  • JVM Temurin 17
  • Linux 6.5
@splatch splatch added the bug An unexpected problem or unintended behavior of the Core label Oct 3, 2023
@openhab-bot
Copy link
Collaborator

This issue has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/creating-a-new-thing-in-own-developed-binding-works-but-after-restart-of-demo-app-the-thing-is-missing-the-thinghandler/159325/8

@lolodomo
Copy link
Contributor

lolodomo commented Oct 6, 2024

@J-N-K : maybe it "talks" to you?

@J-N-K
Copy link
Member

J-N-K commented Oct 6, 2024

I wonder why addThingHandlerFactory does not properly remove the missing thingType from the ThingPrerequisites. Can you enable debug logging for the ThingManagerImpl and see in which order the things and the thing handler factory get added?

@pashalaev
Copy link

pashalaev commented Nov 3, 2024

On my system, this behavior is caused by the fact that when the ItemRuntimeActivator parser is added, the FolderObserver component is not activated
изображение
Therefore ReadyMarker dsl:items is missing -> startlevel does not reach 20 -> no ReadyMarker for run ThingManagerImpl.onReadyMarkerAdded -> no schedule for checkMissingPrerequisites

After executing bundle:restart 200 (200 - openHAB Core :: Bundles :: Model Items Runtime), openhab goes into a working state.

@J-N-K
Copy link
Member

J-N-K commented Nov 3, 2024

It's fine that the FolderObserver is not activated when the ModelParser is added. The ready marker is correctly added later After the FolderObserver is activated.

We could run into issues when the FolderObserver is never activated. Is that the case in your setup?

@pashalaev
Copy link

FolderObserver is already activated when the next parser is added - PersistenceRuntimeActivator.
изображение
But at the same time, openhab will never reach startlevel 20. Because when ItemRuntimeActivator was added, FolderObserver was not activated. Openhab forever remains at startlevel 10 and only restarting openHAB Core :: Bundles :: Model Items Runtime allows you to reach the next startlevel

@pashalaev
Copy link

Openhab goes into an endless loop checking the dsl:items marker is always missing
изображение

@J-N-K
Copy link
Member

J-N-K commented Nov 3, 2024

But have a look at the activate method, the only place where activated ist set (l. 143): BEFORE all models that already have a ModelParser are added (l. 142) and ready markers are added for them (l. 205), AFTER the service is activated directories that are not yet processed are checked again, if a ModelParser is available, they are added and the ready markers set (l. 150).

Can you check whether the items ModelParser is available when these lines are processed?

@pashalaev
Copy link

pashalaev commented Nov 3, 2024

(l.205) Will only be executed if the openhab/conf/items folder is present.

So, for openhab to launch correctly, the openhab/conf/items folder must be present. If it is not there, then the problems will be described above. My system does not have this folder, items are stored in the file openhab/userdata/jsondb/org.openhab.core.items.Item.json. After an empty folder has been created. All things changed the status from NOT_YET_READY to Online the next time openhab was launched.

@lolodomo
Copy link
Contributor

lolodomo commented Nov 3, 2024

for openhab to launch correctly, the openhab/conf/items folder must be present

At my knowledge, this folder is there when you install OH.
Did you remove it at a certain time?

@lolodomo
Copy link
Contributor

lolodomo commented Nov 3, 2024

Just unzip the ZIP file of any release to verify.

@pashalaev
Copy link

pashalaev commented Nov 3, 2024

The folder is missing because it is running via docker-compose

version: '3.9'
services:

openhab:
image: openhab/openhab:4.3.0.M2
container_name: openhab
restart: always
volumes:
- /opt/openhab/conf/.ssh:/openhab/.ssh
- /opt/openhab/addons:/openhab/addons
- /opt/openhab/conf:/openhab/conf
- /opt/openhab/userdata:/openhab/userdata
ports:
- "8081:8080"
- "8101:8101"
- "8443:8443"
- "5555:5555"
- "5005:5005"
environment:
- EXTRA_JAVA_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=*:5005

@lolodomo
Copy link
Contributor

lolodomo commented Nov 3, 2024

The folder is missing because it is running via docker-compose

@J-N-K WDYT ? We should just update documentation for docker setup ?

@splatch : were you in this case on you side?

@openhab-bot
Copy link
Collaborator

This issue has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/things-not-yet-ready-after-restarting-the-demo-app-in-eclipse/160123/1

@openhab-bot
Copy link
Collaborator

This issue has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/openhab-4-3-milestone-discussion/158139/58

@lo92fr
Copy link

lo92fr commented Nov 10, 2024

Hello,

I was wondering about the state of this issue.
Is this issue will be fixed before the official 4.3 release ?
It seems really a blocker for this new release as the addons not initialized correctly when not defined in a .things file.

Thanks,
Laurent.

@J-N-K
Copy link
Member

J-N-K commented Nov 10, 2024

Do you have an empty conf directory (like @pashalaev)?

@lo92fr
Copy link

lo92fr commented Nov 10, 2024

No, on my side the conf directory is there.
I've test it on my production, and developement environment as well, and the result is the same.
I've trigger it when working on the linky binding development.

I've setup openhab core on the dev environement, and see the same behaviour. missing prerequisite map is populated, preventing the binding to start correctly. The OnReadyMarker is never called, so the prerequisite check thread is never started on my side (don't know why).

Only if I setup the thing in a .thing file (not in the registry), it start working.
I thing the troubles come from the fact that the OnReadyMarker never get called, but don't know where to look at.
I can do some more test if you need, let me know.

Laurent.

@J-N-K
Copy link
Member

J-N-K commented Nov 10, 2024

I don't see this issue. Can you provide a step-by-step was to reproduce it, starting with a fresh installation of latest Snapshot?

@lo92fr
Copy link

lo92fr commented Nov 10, 2024

Hum, strange. I was not able to reproduce it on clean last snapshot.
Definitively need to do some more test to understand what's wrong in my case.
I will give you more information if I find the difference between my setup that was on a snapshot from a few days back, and the clean install i've made this morning.

Laurent

@lolodomo
Copy link
Contributor

lolodomo commented Nov 10, 2024

There are apparently different reasons for this problem.
One which is now identified is when some conf folders are missing. This can happen in a docker setup because the user did not create the required folders. Can we enhance the core framework to automatically create these folders if they are missing ?

I never encountered this issue in my production environment because I am using config files. But in dev environment (OH started from Eclipse), I can provide the sequence to reproduce the issue systematically:
1 - add astro as an addon in your Eclipse environment (pom, bom, ... etc)
2 - start OH inside Eclipse
3 - create a new moon astro thing for example (from jar:binding-astro) or eclipse:binding-astro)
4 - everything is fine, new thing is ONLINE
5 - stop OH and restart it
6 - you will see that the thing you previously created remains always in NOT_YET_READY status

I am not sure since when this problem appeared in dev environment but we can no more use managed things in dev. We are many to encounter this issue. The current workaround is to not use managed things but only things declared in config files (like demo.things).

@lolodomo
Copy link
Contributor

In my production environment (RPI), I still use managed things for the Matter binding and I do not encounter this issue.

@J-N-K J-N-K linked a pull request Nov 10, 2024 that will close this issue
@Nadahar
Copy link

Nadahar commented Nov 10, 2024

I can also confirm the problem when running the "demo.app" from Eclipse. I'm working on a binding, and it's actually quite problematic at the moment, since when I'm forced to use file based Things, the configuration is read-only, so I can't test/debug configuration changes made from the UI.

Another issue that I suspect is related is that even with file based Things, if I modify one of the XML files while "demo.app" is running, the change is correctly detected when I save them and something is reloaded - which leads to the same NOT_YET_READY status for the Thing:

[ab.event.ThingStatusInfoChangedEvent] - Thing 'millheatlocal:panel-heater:e7c586766b' changed from ONLINE to UNINITIALIZED
[ab.event.ThingStatusInfoChangedEvent] - Thing 'millheatlocal:panel-heater:e7c586766b' changed from UNINITIALIZED to UNINITIALIZED (HANDLER_MISSING_ERROR)
[ab.event.ThingStatusInfoChangedEvent] - Thing 'millheatlocal:panel-heater:e7c586766b' changed from UNINITIALIZED (HANDLER_MISSING_ERROR) to UNINITIALIZED (NOT_YET_READY)

@Nadahar
Copy link

Nadahar commented Nov 10, 2024

Without knowing anything about the underlying problem, I can't help but think that making whatever fails more robust by making fewer assumptions would be much better than trying to "arrange the environment" to match the assumptions by for example creating files or folders.

@J-N-K
Copy link
Member

J-N-K commented Nov 10, 2024

Config from a text file should not be modified. If you want dynamic configuration, use managed things.

@Nadahar
Copy link

Nadahar commented Nov 10, 2024

Config from a text file should not be modified. If you want dynamic configuration, use managed things.

My point is that because of this issue, it's impossible to use managed things while developing - thus making debugging and testing of dynamic configuration impossible.

@Nadahar
Copy link

Nadahar commented Nov 10, 2024

Maybe the problem I'm having when running "demo.app" from Eclipse doesn't have the same cause - even though the symptoms are the same.

I've imported "core" into Eclipse and after some hours of building been able to try out the changes in 681308d. This makes no difference whatsoever for the issue I'm experiencing.

I'd like to add that my computer really is too slow to effectively work/debug in Eclipse when "core" is included, just a simple launch can easily take a couple of minutes. Building takes "forever". So, either you guys have some insane hardware, or build/run in a different way, because this isn't really usable.

@lo92fr
Copy link

lo92fr commented Nov 11, 2024

Hello,

@Nadahar : I've take very long time last past 2 week to configure core inside my Eclipse setup.
This is true that it's very slow to compile / start openhab with this setup, and I have a quite descent computer I thing : Ryzen 7 3700X, pretty good SSD (MP600), and 16Gb.

It take me some time to configure eclipse for some optimisation to help me working on this setup.
I recommand for example:

  • To tune your eclipse.ini, mine have the following modification 👍 -Xms4096m
    -Xmx12288m
    -Xverify:none
    -Xquickstart
    -XX:PermSize=256m
    -XX:MaxPermSize=1024m
    -XX:+UseParallelGC -
    -XX:+UseG1GC
    -XX:+UseStringDeduplication

  • To disable automatic build, the setup with core is just not workable with automatic build.

  • To disable validator that your don't need.

Even with that, operation on app.bdrun (saving, modification, run) is very slow. Each operation take around ~1mn on my setup.

It will perhaps be a good thing if we extend the eclipse Ide setup documentation (https://next.openhab.org/docs/developer/ide/eclipse.html) with some recommandation about setting up core in Eclipse. There is already a recommandation about -Xmx parameter, but I think we can put more recommandation there.

Laurent.

@lo92fr
Copy link

lo92fr commented Nov 11, 2024

@J-N-K , @pashalaev,

Thanks for your hint / investiguation about ReadyMarker and ModelParser;
I will try to take a look in my setup when I've got some time for.

Laurent.

@Nadahar
Copy link

Nadahar commented Nov 11, 2024

@lo92fr Some updates to the Eclipse developer documentations would certainly be helpful. The problem is that I don't feel that I've figured it out enough that I am in a situation to give anybody else advice at this point.

Automatic build must be turned off. With it on, my computer gets stuck on the org.openhab.core.model.item package/bundle for some reason. When automatic build is off, and I build it "manually", it completes in a somewhat reasonable time. It's like it's getting stuck in some kind of "refresh/rebuild" loop.

The validator you're referring to I'm not sure what is or where to disable, but if I could skip the "validation" when launching "demo.app" it would certainly save a lot of time - because that's what's causing the bulk of the delay during launch.

As for your Eclipse options, I have no idea what is suitable, nor "safe", and all the memory configuration stuff depends on the hardware. Some hints in the documentation would probably be very helpful though.

My laptop, which I'm using for this, also only has 16 GB RAM. I've got a very fast NVMe and the CPU is an i7-8550U. The motherboard doesn't seem to support any significantly faster CPUs, so not much to do there, but I am considering 32 GB RAM - which is supported. That said, when looking at the performance monitoring, it seems like the CPU is the main bottleneck for me. It works like crazy most of the time I do something in Eclipse. But, apart from running OH from Eclipse, this laptop is just fine for everything else I do - including my "regular" development in (a different installation of) Eclipse. I've never experienced Eclipse to be slow before, except when using the marketplace/updating software, but this "OH version" sure is extremely slow. I'm guessing it is because of the OSGi stuff.

One thing to note is that we're talking about two different places where you can give JVM arguments: Eclipse.ini and in the "demo.app" launch configuration, under Runtime Properties -> JVM Arguments. The former will decide how Eclipse itself runs, how much memory etc. - the latter will decide how the "demo.app" runs when you launch it.

@lo92fr
Copy link

lo92fr commented Nov 11, 2024

@lo92fr Some updates to the Eclipse developer documentations would certainly be helpful. The problem is that I don't feel that I've figured it out enough that I am in a situation to give anybody else advice at this point.

Automatic build must be turned off. With it on, my computer gets stuck on the org.openhab.core.model.item package/bundle for some reason. When automatic build is off, and I build it "manually", it completes in a somewhat reasonable time. It's like it's getting stuck in some kind of "refresh/rebuild" loop.

The validator you're referring to I'm not sure what is or where to disable, but if I could skip the "validation" when launching "demo.app" it would certainly save a lot of time - because that's what's causing the bulk of the delay during launch.

As for your Eclipse options, I have no idea what is suitable, nor "safe", and all the memory configuration stuff depends on the hardware. Some hints in the documentation would probably be very helpful though.

My laptop, which I'm using for this, also only has 16 GB RAM. I've got a very fast NVMe and the CPU is an i7-8550U. The motherboard doesn't seem to support any significantly faster CPUs, so not much to do there, but I am considering 32 GB RAM - which is supported. That said, when looking at the performance monitoring, it seems like the CPU is the main bottleneck for me. It works like crazy most of the time I do something in Eclipse. But, apart from running OH from Eclipse, this laptop is just fine for everything else I do - including my "regular" development in (a different installation of) Eclipse. I've never experienced Eclipse to be slow before, except when using the marketplace/updating software, but this "OH version" sure is extremely slow. I'm guessing it is because of the OSGi stuff.

One thing to note is that we're talking about two different places where you can give JVM arguments: Eclipse.ini and in the "demo.app" launch configuration, under Runtime Properties -> JVM Arguments. The former will decide how Eclipse itself runs, how much memory etc. - the latter will decide how the "demo.app" runs when you launch it.

@Nadahar

In regards of where time is spent.
For me, it's mainly before launch, and not after when demo app is starting.
So optimisation have to be done in eclipse.ini.

I'm align with you, we have not enough information to update the documentation.
I was not thinking about telling we should do it right now, but that it would be great in a futur time to have more information in the documentation.

About validator, I'm not convinced that they are in cause in the launch time delay.
I think it's mainly in the bndtool that time is spent, and don't know what can be done for this.

I come to this old bdntool documentation about perf issue on windows later this morning.
https://bnd.bndtools.org/chapters/650-windows.html
I give it a try, but don't see much improvement.
Only test antivirus exclusion so far, have not test to move tmp folder to other locations.

Laurent.

@Nadahar
Copy link

Nadahar commented Nov 11, 2024

@lo92fr There are some interesting things in your link - I just found 6.26 GB (over 58000 files and folders) of leftover OSGi "temp folders". I don't think that in itself causes a slowdown, but copying all these data for each launch probably does.

As for antivirus exclusions, I have already excluded the Eclipse folder, the repo folder and the .m2 folder. But, perhaps I should try to add some more to this.

In my case, because I use TortoiseGit, I also has a background task that TortoiseGit use to update file icon overlays that goes quite crazy - so I frequently have to kill that (I don't think it's possible to disable without removing the overlays - and they are VERY useful IMO).

edit: I added the .p2 folder to antivirus exclusion too, and moved the temp folder to an excluded location. I'll report back if this makes some significant difference.

I also see that some of the "optimizations" that you've put in eclipse.ini is for very old Java versions and is no longer supported - so they probably don't do anything.

@openhab-bot
Copy link
Collaborator

This issue has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/millheat-local-api-development-questions/160139/25

@pashalaev
Copy link

Hi! I couldn't write earlier, but maybe my research will help in some way. On my system, during system startup, the order of calling functions is as follows:

ItemRuntimeActivator.activate()
FolderObserver.addModelParser(ModelParser modelParser)
FolderObserver.activate()

As I believe, because of this sequence, the correct installation of ReadyMarker dsl:items does not occur and the creation of the openhab/conf/items folder is just a trick that allows the program to continue launching.

From the logic of the program, it becomes clear that the first thing to do is launch the FolderObserver.activate() function, but I’m not such an expert in OSGi that I can say why on my system the sequence of calls is exactly this.

Perhaps this depends on the performance of my system and then it is worth putting the Language Server bundle with ID 201 before the bundle with ID 198 Model Items to give OSGi more time to call FolderObserver.activate()

My openhab/conf folder is empty before opening openhab for the first time. Then the ./ssh and /automation folders created by openhab appear in it and then I create the /items folder manually

My system is esxi 6.5. Xeon E3-1260L 16Gb. For a virtual machine on debian12 with docker, 3 CPUs and 6GB RAM are allocated

@pashalaev
Copy link

изображение

@openhab-bot
Copy link
Collaborator

This issue has been mentioned on openHAB Community. There might be relevant details there:

https://community.openhab.org/t/millheat-local-api-development-questions/160139/118

@lo92fr
Copy link

lo92fr commented Nov 29, 2024

Hello all,

Thanks for your very good analyze @pashalaev.
I rework this on my development setup this morning, trying to trace / understand what’s happening.

@J-N-K :
I also find that you have made a modification on ThingManagerImpl.java for adding PersistenceService Bundle tracker.
In this modification 53889cd / September 9), you change the startLevel of the ReadyMarker of ThingManagerImpl.java from 20 to 30 (STARTLEVEL_STATES instead of STARTLEVEL_MODEL).

I'm not fully analyzed what happens so far, but what I can see:

  • When i switch back to STARTLEVEL_MODEL, everything initialized correctly, the onReadyMarkerAdded of ThingManagerImpl is called, things is correctly initialized, and I reach level 100 correctly.

  • When I set the marker to STARTLEVEL_STATES, for some reason, we never hit start level 30. Openhab seems to jump directly from startLevel 20 to startLevel 40. So onReadyMarkerAdded is not called, and things is not correctly initialized.

I will tell you if I find more on this.

Laurent.

@lo92fr
Copy link

lo92fr commented Nov 29, 2024

One more note, I don't see the marker at all for level 30 in the marker collections !
So not any chance that we hit level 30

{
80=[things=handler],
70=[dsl=sitemap],
50=[ruleengine=start],
40=[rules=refresh, managed=rule, dsl=rules, rules=dslprovider],
20=[dsl=persist, managed=thing, managed=itemchannellink, managed=item, dsl=items, dsl=things]
}

@J-N-K
Copy link
Member

J-N-K commented Nov 29, 2024

Please have a look at https://github.com/openhab/openhab-distro/blob/main/distributions/openhab/src/main/resources/runtime/services.cfg

There are several ready markers. It seems that you have an issue with your persistence and automation setup.

@lo92fr
Copy link

lo92fr commented Nov 29, 2024

Ok,

I don't undestrand fully the reason for now, but it seems that something is not correctly registered somewhere.
I'm not familliar with the way this marker / readyservice are working.
I need more time to fully understand how it works.

Laurent.

@lo92fr
Copy link

lo92fr commented Nov 29, 2024

I try to undestand why this level in not showing in StartLevelService.
When the activate method is called, i don't see them on the configuration Map.
They should come from the services.cfg file ?

{osgi.ds.satisfying.condition.target=(osgi.condition.id=true), component.name=org.openhab.core.service.StartLevelService, 80=things:handler, 50=ruleengine:start, 20=dsl:items,managed:item,dsl:things,managed:thing,managed:itemchannellink,dsl:persist, service.pid=org.openhab.startlevel, component.id=37, 70=dsl:sitemap, 40=dsl:rules,managed:rule,rules:refresh,rules:dslprovider}

@lo92fr
Copy link

lo92fr commented Nov 29, 2024

I just see that in openhab-distro there is a comment on the line !
I give it a try to see if it comes from there !

Start level definition

startlevel:20=dsl:items,managed:item,dsl:things,managed:thing,managed:itemchannellink,dsl:persist
#startlevel:30=persistence:restore // do not rely on the presence of a persistence service in the IDE
startlevel:40=dsl:rules,managed:rule,rules:refresh,rules:dslprovider
startlevel:50=ruleengine:start
startlevel:70=dsl:sitemap
startlevel:80=things:handler

org.jupnp:autoEnable=false
org.jupnp:threadPoolSize=20
org.ops4j.pax.logging:org.ops4j.pax.logging.log4j2.config.file=../../../../runtime/log4j2.xml

@lo92fr
Copy link

lo92fr commented Nov 29, 2024

So I can confirm now.
In developpement environment, the troubles come from the fact that the file

https://github.com/openhab/openhab-distro/blob/main/launch/app/runtime/services.cfg

is not in sync with

https://github.com/openhab/openhab-distro/blob/main/distributions/openhab/src/main/resources/runtime/services.cfg

So ready marker for level 30 are not there, and ThingManagerImpl is not initialized correctly in dev env.
If it's ok for you, I will make a pull request to fix the file.

One more question :

Do we really need this duplicates files in openhab-distro between the launch/app/runtime and openhab*/src folder.
If I look for exemple for the items folder, we have 3 instances:

  • openhab-distro\launch\app\runtime\conf\items
  • openhab-distro\distributions\openhab\src\main\resources\conf\items
  • openhab-distro\distributions\openhab-demo\src\main\resources\conf\items

Every time I ask myself which one is the good one to modify !

Laurent.

@lolodomo
Copy link
Contributor

Interesting at least for developers. I will test your change.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug An unexpected problem or unintended behavior of the Core
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants