
Use BadTooManyPublishRequests as a means to limit publish requests #400

Closed

Conversation

svanharmelen
Contributor

This partially reverts some changes made when rewriting the client to a full async client. I encountered some issues with the new limit and was unable to (easily) figure out what would be a good limit for our setup.

After making the changes in this PR things worked smoothly again (without actually receiving any BadTooManyPublishRequests messages from the 3rd-party servers we are connecting to). I will let our tests run for a little while and increase the load (number of subscriptions and monitored items), but I think this approach is both easier for users of the crate and better for the dataflow.

Of course I'm open to any insights or feedback that counter my suggestions!

@einarmo
Contributor

einarmo commented Nov 5, 2024

This seems like a mostly reasonable approach, but I believe I added the max_inflight_publish limit in part because the client itself has a max_inflight_request parameter, so if the server is unreasonable, and sets a very high limit on the number of queued publish requests, we risk running out of requests.

I'm not sure what the best approach is. If you were running into issues with max_inflight_publish that sounds like a bug on its own, and I'm curious how that happened. Ideally I think we'd make the changes you do here, but still have an upper limit on the number of inflight publish requests, to avoid creating tons for servers that don't behave well.

@svanharmelen svanharmelen force-pushed the feat/max-publish-requests branch from 3b347b5 to eeb056d Compare November 6, 2024 13:06
@svanharmelen
Contributor Author

I rebased this PR on #402, but will work on this one some more to also take your comment into account (and I noticed not all my issues are solved yet, so something else still needs a tweak or two). So I'm marking this as a draft PR for now...

@svanharmelen svanharmelen marked this pull request as draft November 6, 2024 13:12
@svanharmelen svanharmelen force-pushed the feat/max-publish-requests branch 2 times, most recently from 6867857 to 8e3796a Compare November 10, 2024 10:01
@svanharmelen
Contributor Author

Hi @einarmo, after testing some more it turns out that most of our issues were due to #404, which caused (over time) a lot of reconnects, which in turn caused a "subscription leak" (because both the client and our own code tried to transfer or recreate the subscriptions).

So #404 fixed the root cause of the problem and I will make another small PR to make it optional for the client to transfer or recreate subscriptions on reconnect (for now I have just disabled it in our fork).

Unfortunately, after running our service again for a day or so (without the changes from this PR), we still noticed an issue that seems related to the discussion we had in this thread.

What we see is that over time data from the server starts to lag behind. After running the service for about 12 hours, the data coming in is about 1.5 hours old. The data is correct for the timestamps associated with the values, and the lagging data is sent in up-to-date PublishResponse messages (so with a handle and timestamp of a recently sent PublishRequest), but it's just lagging way behind.

After testing and debugging, our conclusion is that it is related to the newly introduced max_inflight_publish and max_inflight_messages parameters, so I would like to discuss both of them 😊

This seems like a mostly reasonable approach, but I believe I added the max_inflight_publish limit in part because the client itself has a max_inflight_request parameter, so if the server is unreasonable, and sets a very high limit on the number of queued publish requests, we risk running out of requests.

I guess the main question is whether the client should have a max_inflight_messages parameter. There is nothing in the spec that mentions this is required, so I assume this is only done to protect the client from using up too much memory. While that makes sense, the problem is that it's near impossible to know what value to use for this parameter unless you have access to the config of the OPC server you are going to connect to.

However (in our case at least), we don't know the details of the servers we connect to, and we also don't know the config of the monitored items that are going to be set (e.g. sampling_interval and queue_size). Additionally, our sessions have one "main" subscription, but during runtime additional short-lived subscriptions can be added and/or removed from the session on the fly.

Since we do not know the total number of subscriptions upfront, we have no way to tell what max_inflight_publish value we should configure, as each subscription would need at least one publish request per publishing interval and, depending on the network latency, maybe even more to prevent an empty publish request queue on the server.
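As a rough sketch of that per-subscription arithmetic (a hypothetical helper, not part of the crate; it assumes one request is consumed per publishing interval and a fixed round-trip latency):

```rust
// Hypothetical back-of-the-envelope helper (not part of the crate): how many
// publish requests need to be in flight so the server's publish queue never
// runs empty, given round-trip latency and the publishing interval.
fn required_inflight(round_trip_ms: u64, publishing_interval_ms: u64) -> u64 {
    // While one response travels back, ceil(latency / interval) publishing
    // intervals elapse, each consuming a queued request on the server.
    let in_transit =
        (round_trip_ms + publishing_interval_ms - 1) / publishing_interval_ms;
    in_transit + 1 // plus the request currently being serviced
}
```

With the spec's example numbers (5 s network delay, 1 s publishing interval) this gives six requests in flight; multiply by the number of subscriptions you cannot predict upfront and the configuration problem becomes clear.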

And of course related to that: if we don't know how to set max_inflight_publish, we also don't know how to set max_inflight_messages, since the former must be lower than the latter to prevent other messages (e.g. browse or create subscription requests) from being sent.

Now of course we could just make the limits ridiculously high, but that doesn't feel like a good and generic solution. In the end, BadTooManyPublishRequests is there for a reason, so to me it makes the most sense to use that as the main input for determining whether a publish request could/should be sent.

So I would like to remove the introduced limits and somewhat combine the two approaches described in the spec, mainly using BadTooManyPublishRequests to limit what is being sent:

Client strategies for issuing Publish requests may vary depending on the networking delays between the Client and the Server. In many cases, the Client may wish to issue a Publish request immediately after creating a Subscription, and thereafter, immediately after receiving a Publish response.

In other cases, especially in high latency networks, the Client may wish to pipeline Publish requests to ensure cyclic reporting from the Server. Pipelining involves sending more than one Publish request for each Subscription before receiving a response. For example, if the network introduces a delay between the Client and the Server of 5 seconds and the publishing interval for a Subscription is one second, then the Client shall issue Publish requests every second instead of waiting for a response to be received before sending the next request.

While it's true that this means the client can use more memory, each request also has its own timeout, so the maximum number of outstanding requests is already limited to timeout / publish interval, plus maybe some additional requests like browse or create subscription requests, which also have a timeout.
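A minimal sketch of that natural bound (hypothetical helper; assumes requests are issued at one per publishing interval and abandoned once their timeout expires):

```rust
// Hypothetical helper illustrating the natural ceiling on outstanding
// requests even without an explicit inflight cap: a request unanswered
// after `request_timeout_ms` is dropped, so at most timeout / interval
// publish requests can ever be pending at the same time.
fn natural_inflight_bound(request_timeout_ms: u64, publishing_interval_ms: u64) -> u64 {
    request_timeout_ms / publishing_interval_ms
}
```

So with, say, a 30 s request timeout and a 1 s publishing interval, memory use is already bounded at roughly 30 outstanding publish requests regardless of any configured limit.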

But before I go and update stuff, I thought I'd ask your take on this one, to prevent us from bumping into one another with different perspectives or requirements 😏

@einarmo
Contributor

einarmo commented Nov 11, 2024

I don't fully understand how it's possible that the max_inflight_publish parameter could cause the server to delay the data by one and a half hours; that sounds like a server problem, not a client problem. I wonder if the server doesn't properly set the MoreNotifications parameter on publish responses? Having multiple queued publish requests should never really happen unless one of the following is true:

  • No data is arriving on the subscription.
  • It takes a very long time to process a publish response (> publish interval) (in this case you have a real problem)

Since we do not know the total number of subscriptions upfront, we have no way to tell what max_inflight_publish value we should configure, as each subscription would need at least one publish request per publishing interval and, depending on the network latency, maybe even more to prevent an empty publish request queue on the server.

This sounds somewhat reasonable, but the session only runs a single publish loop. All we really do is send publish requests at the rate of the subscription with the shortest publishing interval. We rely on MoreNotifications to tell if we need to send more publish requests than that.
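The loop described above can be modelled in a few lines (a sketch with hypothetical types, not the session's real code): one baseline request per shortest publishing interval, plus an immediate follow-up whenever a response signals more data is waiting.

```rust
// Hypothetical model of the single publish loop: the session issues one
// request per (shortest) publishing interval, and an extra request right
// away for every response that carries more_notifications.
struct Response {
    more_notifications: bool,
}

fn requests_issued(responses: &[Response]) -> usize {
    let baseline = responses.len(); // one per publishing interval
    let follow_ups = responses
        .iter()
        .filter(|r| r.more_notifications)
        .count(); // immediate extra request per "more data waiting" signal
    baseline + follow_ups
}
```

With a well-behaved server the follow-up count stays at zero, so the effective request rate equals the shortest publishing interval.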

It may be that what you really need is a different publishing strategy entirely. Especially if you run many subscriptions. Can you tell if the client is constantly sending publish requests? If it's never taking a break, and always sending a new request when it receives a response, that could mean we need to rethink our publishing strategy.

The remark about timeout is fair, and it may be that we should drop the inflight requests limit entirely. The ability to limit the number of other requests is something the caller could be responsible for. I'm open to dropping that feature entirely, in which case using BadTooManyPublishRequests would be the appropriate way to throttle publish requests.
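One shape such a BadTooManyPublishRequests-driven throttle could take (a minimal sketch under assumed names, not the crate's actual API): grow the inflight target optimistically and shrink it only when the server pushes back.

```rust
// Hypothetical throttle: the inflight target probes upward on success and
// backs off when the server answers BadTooManyPublishRequests, converging
// on whatever queue depth this particular server actually accepts.
struct PublishThrottle {
    target: usize,
}

impl PublishThrottle {
    fn new() -> Self {
        Self { target: 1 }
    }

    // Called for every publish response status received from the server.
    fn on_response(&mut self, too_many_publish_requests: bool) {
        if too_many_publish_requests {
            // Server's publish queue is full: back off, but keep at least one.
            self.target = self.target.saturating_sub(1).max(1);
        } else {
            // Room to pipeline more: probe upward.
            self.target += 1;
        }
    }
}
```

This removes the need for the user to guess a limit upfront: the server's own BadTooManyPublishRequests responses define the ceiling.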

@svanharmelen
Contributor Author

I don't fully understand how it's possible that the max_inflight_publish parameter could cause the server to delay the data by one and a half hours; that sounds like a server problem, not a client problem. I wonder if the server doesn't properly set the MoreNotifications parameter on publish responses?

The "funny" thing is that we are only monitoring 3 items right now in our test setup. As far as I understand the spec, the moreNotifications flag is set when the server wants to publish more notifications than the maxNotificationsPerPublish set on the subscription.
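That reading of the spec can be expressed in one line (hypothetical helper): with some number of notifications queued and maxNotificationsPerPublish as the per-response cap, the server needs several responses to drain them, setting moreNotifications on all but the last.

```rust
// Hypothetical helper modelling moreNotifications semantics: the number of
// publish responses needed to drain `queued` notifications when each
// response may carry at most `cap` (maxNotificationsPerPublish).
// moreNotifications would be set on every response except the final one.
fn responses_to_drain(queued: u32, cap: u32) -> u32 {
    (queued + cap - 1) / cap // ceiling division
}
```

With only 3 monitored items and any reasonable cap this is always 1, which matches the observation that moreNotifications should never be needed here.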

In our case that is never true for one publish interval, and it looks to me like the server is somehow still holding on to the samples gathered for each individual publish interval and is still trying to send those in individual publish responses. No idea if that somehow follows the spec (or someone's interpretation of it), but I agree it's questionable behavior.

Having multiple queued publish requests should never really happen unless one of the following are true:

No data is arriving on the subscription.
It takes a very long time to process a publish response (> publish interval) (in this case you have a real problem)

I actually think it is the latter in our case. I think that every now and then, maybe when some additional processes run on the server, it does take too long to process a publish response. And so after that peak load it would actually need a small burst of publish requests in order to catch up again. In a 48-hour test with the limits removed we did not see the same issue, so given enough publish requests it does seem able to keep up.

This sounds somewhat reasonable, but the session only runs a single publish loop. All we really do is send publish requests at the rate of the subscription with the shortest publishing interval. We rely on MoreNotifications to tell if we need to send more publish requests than that.

Yes, I know, and I agree that is not always the best solution, depending on the setup someone is using. But I have an idea how to (to a certain extent) combine the two approaches described in the spec, which hopefully gives a working solution for all use cases.

We can have a closer look when I'm done (still need to write and test it) and have updated the PR.

@svanharmelen svanharmelen force-pushed the feat/max-publish-requests branch from 8e3796a to fcc89d8 Compare November 14, 2024 10:14
@svanharmelen
Contributor Author

Closed by accident, so opened a new PR with the updated solution #408
