Use BadTooManyPublishRequests as a means to limit publish requests #400
Conversation
This seems like a mostly reasonable approach, but I believe I added the `max_inflight_publish` limit for a reason. I'm not sure what the best approach is if you were running into issues with it.
Force-pushed from 3b347b5 to eeb056d.
I rebased this PR on #402, but will work on this one some more to take your comment into account as well (and I noticed not all my issues are solved yet, so there is something else that needs a tweak or two). So I'm marking this as a draft PR for now...
Force-pushed from 6867857 to 8e3796a.
Hi @einarmo, after testing some more it turns out that most of our issues were due to #404, which caused (over time) a lot of reconnects, which in turn caused a "subscription leak" (because both the client and our own code tried to transfer or recreate the subscriptions). So #404 fixed the root cause of that problem, and I will make another small PR to make it optional for the client to transfer or recreate subscriptions on reconnect (for now I have just disabled it in our fork).

Unfortunately, after running our service again for a day or so (without the changes from this PR), we still noticed an issue which seems related to the discussion we had in this thread. What we see is that over time the data from the server starts to lag behind. After running the service for about 12 hours, the data that is coming in is about 1.5 hours old. The data is correct for the timestamps associated with the values, but the lagging data is sent in an up-to-date publish response. After testing and debugging, our conclusion is that it is related to the newly introduced `max_inflight_publish` limit.
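For what it's worth, the lag itself is easy to quantify: compare the source timestamp of each received value against the wall clock at the moment the notification arrives. A minimal sketch (the `DataChange` struct and its field are placeholders for illustration, not types from the crate):

```rust
use std::time::{Duration, SystemTime};

/// Placeholder for a received data change; the real notification type
/// carries a source timestamp alongside the value.
struct DataChange {
    source_timestamp: SystemTime,
}

/// Returns how far behind "now" the received value is.
fn notification_lag(change: &DataChange) -> Duration {
    SystemTime::now()
        .duration_since(change.source_timestamp)
        .unwrap_or(Duration::ZERO)
}

fn main() {
    // Simulate a value sampled 90 minutes ago but only delivered now,
    // which is roughly what we observed after ~12 hours of runtime.
    let change = DataChange {
        source_timestamp: SystemTime::now() - Duration::from_secs(90 * 60),
    };
    println!("lag: {:?}", notification_lag(&change));
}
```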
I guess the main question is whether the client should have a hard limit on in-flight publish requests at all. In our case (at least) we don't know the details of the servers we connect to, and we also don't know the configuration of the monitored items that are going to be set up (e.g. sampling intervals and queue sizes). Since we do not know the total number of subscriptions upfront, we have no way to tell how many publish requests will be needed, and related to that, no way to tell how to set a sensible limit.

Now of course we could just make the limits ridiculously high, but that doesn't feel like a good and generic solution. In the end the server is in a better position to decide when there are too many publish requests. So I would like to remove the introduced limits and somewhat combine the two approaches described in the spec, mainly using `BadTooManyPublishRequests` as the signal to back off.
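To illustrate why a single static number is hard to pick: any meaningful cap would have to grow with the number of subscriptions. A rough sketch of what that would look like, using a `tokio` semaphore and entirely hypothetical names (this is not the crate's API):

```rust
use std::sync::Arc;
use tokio::sync::{OwnedSemaphorePermit, Semaphore};

/// Hypothetical limiter: allow roughly one in-flight publish request per
/// subscription, so the cap tracks the session instead of being guessed
/// up front.
struct PublishLimiter {
    permits: Arc<Semaphore>,
}

impl PublishLimiter {
    fn new() -> Self {
        Self { permits: Arc::new(Semaphore::new(0)) }
    }

    /// Called whenever a subscription is created; grows the budget.
    fn on_subscription_created(&self) {
        self.permits.add_permits(1);
    }

    /// Acquire a permit before sending a publish request. The permit is
    /// released when it is dropped, i.e. once the corresponding response
    /// (or timeout) arrives.
    async fn acquire(&self) -> OwnedSemaphorePermit {
        self.permits
            .clone()
            .acquire_owned()
            .await
            .expect("semaphore closed")
    }
}
```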
While it's true that this means the client can use more memory, each request also has its own timeout, so the maximum number of outstanding requests is already bounded by roughly timeout / publish interval, plus maybe some additional requests like browse or create subscription, which also have a timeout.

But before I go and update things, I thought I'd ask your take on this one, to prevent us bumping into one another with different perspectives or requirements 😏
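As a back-of-the-envelope illustration of the timeout argument (the numbers are made up, not the crate's defaults): with a 30 s request timeout and a 1 s shortest publishing interval, only on the order of timeout / interval publish requests can ever be outstanding at once.

```rust
use std::time::Duration;

fn main() {
    // Illustrative numbers only.
    let request_timeout = Duration::from_secs(30);
    let min_publish_interval = Duration::from_secs(1);

    // Upper bound on outstanding publish requests implied by the timeout
    // alone; other requests (browse, create subscription, ...) carry their
    // own timeouts and add a small constant on top of this.
    let implied_limit =
        request_timeout.as_millis() / min_publish_interval.as_millis();
    println!("at most ~{implied_limit} publish requests can be in flight");
}
```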
I don't fully understand how it's possible that the max_inflight_publish parameter could cause the server to delay the data by one and a half hours; that sounds like a server problem, not a client problem. I wonder if the server doesn't properly set the `MoreNotifications` flag when it has a backlog.
This sounds somewhat reasonable, but the session only runs a single publish loop. All we really do is send publish requests at the rate of the subscription with the shortest publishing interval, and rely on the server keeping up with that rate. It may be that what you really need is a different publishing strategy entirely, especially if you run many subscriptions. Can you tell if the client is constantly sending publish requests? If it's never taking a break, and always sending a new request when it receives a response, that could mean we need to rethink our publishing strategy.

The remark about timeouts is fair, and it may be that we should drop the inflight requests thing entirely. The ability to limit the number of other requests is something the caller could be responsible for. I'm open to dropping that feature entirely, in which case using `BadTooManyPublishRequests` as the limiting mechanism makes sense.
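For context, the publishing strategy described above boils down to something like the simplified loop below. `send_publish_request` is a hypothetical stand-in and the real session code is considerably more involved:

```rust
use std::time::Duration;

/// Hypothetical stand-in for sending a publish request and awaiting its
/// response; the real client does this through the session.
async fn send_publish_request() {
    // ...
}

/// Simplified publish loop: one request per tick, where the tick rate is
/// the shortest publishing interval across all current subscriptions.
async fn publish_loop(shortest_publishing_interval: Duration) {
    let mut ticker = tokio::time::interval(shortest_publishing_interval);
    loop {
        ticker.tick().await;
        // Fire the request; how many of these may be outstanding at once
        // is exactly what this PR is about.
        tokio::spawn(send_publish_request());
    }
}

#[tokio::main]
async fn main() {
    publish_loop(Duration::from_secs(1)).await;
}
```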
The "funny" thing is that we are only monitoring 3 items right now in our test setup. As far as I understand the spec, the `MoreNotifications` flag should only be set when more notifications are queued than fit into a single publish response. In our case that is never true for one publish interval, and it looks to me like the server is somehow still holding on to the samples gathered for each individual publish interval and is still trying to send those in individual publish responses. No idea if that is somehow following the spec (or someone's interpretation of it), but I agree it's questionable behavior.
I actually think it is the latter in our case. I think that every now and then, maybe when some additional processes run on the server, it takes too long to respond to a publish request. And so after that peak load it would actually need a small burst of publish requests in order to catch up again. We have seen in a 48-hour test that with the limits removed we do not see the same issue, so given enough publish requests the server does seem to be able to keep up.
Yes I know, and I agree that is not always the best solution, depending on the setup someone is using. But I have an idea for how to (to a certain extent) combine the two approaches described in the spec, which hopefully gives a working solution for all use cases. We can have a closer look when I'm done (I still need to write and test it) and have updated the PR.
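Very roughly, a combined strategy along those lines could look like the sketch below: let the server's responses drive extra requests when it signals a backlog, and back off when it answers with `BadTooManyPublishRequests`. The names and logic are hypothetical, not the code from this PR:

```rust
/// Hypothetical response summary; the real OPC UA PublishResponse carries a
/// MoreNotifications flag and a status code.
struct PublishOutcome {
    more_notifications: bool,
    too_many_publish_requests: bool,
}

/// Sketch of an adaptive in-flight target: grow while the server reports a
/// backlog, shrink when it rejects requests with BadTooManyPublishRequests.
fn adjust_inflight_target(current: usize, outcome: &PublishOutcome) -> usize {
    if outcome.too_many_publish_requests {
        // The server told us its queue is full; back off, but keep at
        // least one outstanding request.
        current.saturating_sub(1).max(1)
    } else if outcome.more_notifications {
        // The server has more data queued than fits in one response; allow
        // one extra outstanding request so it can catch up.
        current + 1
    } else {
        current
    }
}

fn main() {
    let mut target = 1;
    // Server reports a backlog twice, then rejects a request.
    for outcome in [
        PublishOutcome { more_notifications: true, too_many_publish_requests: false },
        PublishOutcome { more_notifications: true, too_many_publish_requests: false },
        PublishOutcome { more_notifications: false, too_many_publish_requests: true },
    ] {
        target = adjust_inflight_target(target, &outcome);
        println!("in-flight target is now {target}");
    }
}
```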
Force-pushed from 8e3796a to fcc89d8.
Closed by accident, so I opened a new PR with the updated solution: #408
This partially reverts some changes made when rewriting the client to a full async client. I encountered some issues with the new limit and was unable to (easily) figure out what would be a good limit for our setup.
After making the changes in this PR, things worked smoothly again (without actually receiving any `BadTooManyPublishRequests` messages from the 3rd-party servers we are connecting to). I will let our tests run for a little while and increase the load (number of subscriptions and monitored items), but I think this approach is both easier for users of the crate and better for the data flow.

Of course, I'm open to any insights or feedback that counter my suggestions!