Skip to content
This repository has been archived by the owner on Apr 26, 2024. It is now read-only.

Shared rooms API doesn't work if the worker handling the request doesn't update the user directory #10339

Closed
Half-Shot opened this issue Jul 8, 2021 · 30 comments · Fixed by #12038
Labels
T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.

Comments

@Half-Shot
Copy link
Collaborator

Half-Shot commented Jul 8, 2021

#7785 added ^/_matrix/client/unstable/uk.half-shot.msc2666/user/shared_rooms/.*$.

We check the update_user_directory config flag when handling this API, to basically sanity check that the tables we're about to query are roughly up-to-date (users_who_share_private_rooms, users_in_public_rooms). However, if the request is handled on a different synapse worker which does not have that config flag set, then the request is wrongly refused.

I'm not sure if there is an easy way to check if another Synapse process is configured to update the user directory, or whether we should just remove the check. Another option would just be to ensure this endpoint is handled on the same worker.

@anoadragon453
Copy link
Member

One possibility would be to shift the update_user_directory from a boolean to something like worker_to_update_user_directory: worker_name.

Assuming this option is set for all workers, then workers can both:

  1. Determine whether or not they should be the one to update the user directory.
  2. Determine whether another worker is handling that, or if the feature is disabled entirely.
    • Though one caveat is that I don't believe workers are able to check if a worker with a given name is actually running. They'd just have to assume if the option is set, then that worker is up and active.

We could also sort of munge that today with the boolean option (True vs False vs None), especially as the option is not included in the sample config today.

@ShadowJonathan
Copy link
Contributor

ShadowJonathan commented Nov 6, 2021

I think the best course of action would be to simply note that this api needs to also be served on the same worker as the user directory one, and return 501 from the local worker if it isnt. The problem of different APIs needing worker-affinity is a wider one, and i don't think it can be easily solved here.

@anoadragon453
Copy link
Member

The check on update_user_directory seems like a bad solution as it prevents the separation of a user directory worker and a worker that handles the shared rooms API. When picking between teaching people to read the documentation and limiting everyone from being able to scale, I would rather choose the former.

On the other side however, we have a choice between failing loudly (returning an error on the shared rooms API if update_user_directory is false) and failing subtly (the shared rooms API works, but is out of date). The latter being especially subtle, as the rooms you have in common with a user are not likely to change often.

@ShadowJonathan
Copy link
Contributor

ShadowJonathan commented Nov 8, 2021

If you were to guess, how out-of-date would the table be, if it's requested from another worker?

Is the user directory worker configured to only flush every so often? What'd be a reason why the table be out of date? Database driver caching?

@anoadragon453
Copy link
Member

@ShadowJonathan I don't think it's problematic to have another process handle shared room API calls while another worker is updating the user directory. The user directory also updates the database immediately, rather than routinely flushing results, so there shouldn't be much delay if any.

When I said the table would be out of date, I meant in the case that there was no worker process at all updating the user directory.

@ShadowJonathan
Copy link
Contributor

So the problem is essentially that we need to be sure that the user directory is constantly being updated while this API is available to be queried from? I guess that is indeed something i'd like to see assured as well before we expose this.

In that case, I definitely like the worker_to_update_user_directory: worker_name approach, which can have a worker name, or be empty to have the main thread handle it. The only assurance that there has to be is that the admin has to make sure that worker exists. (Or is there a way to "ping" for worker availability while the server is running? If so, I'll suggest that the main thread tries to continually ping for that worker (every minute or so), and if it hasnt responded for 5 minutes, it'll temporarily take over the duty of updating the user directory, all while spitting out warnings.)

@anoadragon453
Copy link
Member

It's not quite that simple unfortunately. Not all workers need to have HTTP listeners configured, so there is no standard way to ping them currently.

I don't think we need to come up with too technical of a solution though. I would be in favour of switching to a worker_to_update_user_directory option. If that's set to None, then this endpoint will raise an error, otherwise it will try to return the most up-to-date information it has.

If you've forgotten to start a worker which you've disabled configuration elsewhere for, then that'll cause problems far beyond this endpoint. It may be worth making that situation less idiot-proof, but it's outside the scope of this issue.

@ShadowJonathan
Copy link
Contributor

Wouldn't it be easy to map None to have the main process do this task? I don't see the point of assigning a specific worker for updating user directories, if the main process couldn't also be doing it.

@anoadragon453
Copy link
Member

Aha, yes, technically speaking that would work out. Though that's something we've not yet done before, so we should be cautious before choosing that solution. As a devil's advocate, I can think of two potential reasons to be wary:

  1. We want to reduce the responsibility of the main process, such that at some point we don't have a "main process" anymore. This would increase that responsibility.
  2. Is there any reason to disable the functionality to update the user directory? This would leave no option to.

Something that may solve both of these problems is to give the main process a "name" that could be put in place of None/null here.

@ShadowJonathan
Copy link
Contributor

ShadowJonathan commented Nov 10, 2021

I think a prerequisite to siding more fully with workers is to develop some comprehensive solution to the two main problems ive seen floating around;

  1. Workers cannot actively check the health of other workers, and so are "flying blind" on configuration, this may cause problems (such as here)
  2. Some things cannot be changed dynamically, and changing number of workers of some part of the server requires a complete reboot of all workers.

Anyways, those two problems are separate to this, i think the best course of action then is to introduce worker_to_update_user_directory, with "main_process" as a pseudonym for the main process.

How's that sound?

@anoadragon453
Copy link
Member

Hmm, actually, now that I've had some time to think about it I go back on what I said. Canonicalising a name for the main process only further cements it as being a single worker. And I don't think there's really a valid reason to disable updating the user directory, especially if we start relying on its functionality for other endpoints...

So I'm actually now in favour of your original suggestion in #10339 (comment), where None means the main process.

@ShadowJonathan
Copy link
Contributor

ShadowJonathan commented Nov 17, 2021

Alright then, the next course of action seems to be to;

  • Add worker_to_update_user_directory to the main/shared config, with default value None, and it meaning the main process is responsible for it.
  • Enabling the main process to run user directory updates when worker_to_update_user_directory is None
  • Remove the check on update_user_directory.

@anoadragon453
Copy link
Member

@ShadowJonathan That sounds sensible to me.

It's acknowledged that if this value is set to something other than a running worker then the user directory will not be updated and that the shared rooms API may become out of date.

@ShadowJonathan
Copy link
Contributor

👍

I’ll be sure to add a config documentation note about that as well.

@richvdh richvdh changed the title Shared rooms API (MSC2666) doesn't work if the worker handling the request doesn't update the user directory Shared rooms API ([MSC2666](https://github.com/matrix-org/matrix-doc/pull/2666) doesn't work if the worker handling the request doesn't update the user directory Feb 8, 2022
@richvdh richvdh changed the title Shared rooms API ([MSC2666](https://github.com/matrix-org/matrix-doc/pull/2666) doesn't work if the worker handling the request doesn't update the user directory Shared rooms API ( [MSC2666](https://github.com/matrix-org/matrix-doc/pull/2666) ) doesn't work if the worker handling the request doesn't update the user directory Feb 8, 2022
@richvdh richvdh changed the title Shared rooms API ( [MSC2666](https://github.com/matrix-org/matrix-doc/pull/2666) ) doesn't work if the worker handling the request doesn't update the user directory Shared rooms API (MSC2666) doesn't work if the worker handling the request doesn't update the user directory Feb 8, 2022
@richvdh
Copy link
Member

richvdh commented Feb 8, 2022

wait, what endpoint are we actually complaining about here? MSC2666 is /_matrix/client/unstable/uk.half-shot.msc2666/user/mutual_rooms/{user_id}, but that's not implemented in synapse. So, you should get a 404. Correct? Please don't make me reverse-engineer this stuff.

@richvdh
Copy link
Member

richvdh commented Feb 8, 2022

Ok, I think, in spite of what the MSC says, we're talking about ^/_matrix/client/unstable/uk.half-shot.msc2666/user/shared_rooms/.*$ here. Confirmation would be appreciated. This is really, really, basic stuff that ought to be at the top of the issue description - I shouldn't have to spend 15 minutes trying to figure out what we are talking about before I can even start thinking about the problem.

@Half-Shot
Copy link
Collaborator Author

(The endpoint was changed in the MSC long after the Synapse impl, and this issue was created)

@richvdh richvdh changed the title Shared rooms API (MSC2666) doesn't work if the worker handling the request doesn't update the user directory Shared rooms API doesn't work if the worker handling the request doesn't update the user directory Feb 8, 2022
@richvdh
Copy link
Member

richvdh commented Feb 8, 2022

(The endpoint was changed in the MSC long after the Synapse impl, and this issue was created)

(a) then the synapse impl needs updating, no?
(b) if the issue mentioned the actual endpoint rather than making me google an MSC number in the first place, then that wouldn't be a problem.

I've now updated the title and description. I'm annoyed at having to waste time on this.

@richvdh
Copy link
Member

richvdh commented Feb 8, 2022

I remain confused about the problem and the proposed solution here.

AIUI, the current situation is that the user directory must be updated and searched on a single worker process. "Searching" means the current documented endpoint ^/_matrix/client/(api/v1|r0|v3|unstable)/user_directory/search$, as well as the currently undocumented endpoint ^/_matrix/client/unstable/uk.half-shot.msc2666/user/shared_rooms/.*$.

This issue proposes to relax that requirement for shared_rooms, I think. Instead of "you must search on the worker performing the update", the requirement becomes "you can search on any worker, provided there exists a worker which is doing updates.

Please could somebody explain to me why they believe that is safe? The guard must have been there for a reason; if it no longer applies, what has changed? And if it is safe, does this also apply to user_directory/search, and if not, why not?

@ShadowJonathan
Copy link
Contributor

(a) then the synapse impl needs updating, no?

I was planning to do that after #11450, I wanted that to finish first before I touched that change.

The guard must have been there for a reason; if it no longer applies, what has changed?

Nothing wrt the code, the idea behind this config value was that it'd force the admin to take a closer look at what they'd put there, while defaulting the value to the main process.

The idea is that, if it is at least running somewhere, and because there is higher visibility as to where it is running from the config (because you have to specify the specific worker's name), there's a higher chance that the user directory updater is running somewhere in the worker cluster.

#11450 tries to encode this in the config, as it defaults the user directory updater to the main process (which we'll assume is always running), and makes any change have to have the admin point it to a specific worker, which has a slightly better chance of being corrected if the worker doesnt exist anymore, if an admin were to glance it. (Compared to a False from update_user_directory)

This would allow the endpoint to run on any client-server servlet, though I see this isn't marginally any more safe.

The real safe way to do this would be to introduce application-level worker health awareness in synapse, but i'm not sure how manageable that is.

Frankly, this is the issue that was pointed to that was blocking everything, and I wanted to put some progress in this feature.

@anoadragon453
Copy link
Member

TL;DR this has gotten complicated and I'm no longer convinced that the proposed solution actually improves the situation over the config we have today. Shall we just remove the check instead?

In summary:

MSC2666 introduced an endpoint for clients to discover what rooms the user shares with any other given user. The endpoint has changed shape in December, but was initially implemented in Synapse as ^/_matrix/client/unstable/uk.half-shot.msc2666/user/shared_rooms/.*$.

The handler for this endpoint uses the get_shared_rooms_for_users db query, which relies on up-to-date entries in the users_in_public_rooms and users_who_share_private_rooms table. These tables are maintained by the UserDirectoryHandler, specifically _track_user_joined_room, which will only run if update_user_directory is True in the worker's config.

Because of this, the original implementation included a check for whether update_user_directory was enabled. If not, the endpoint would refuse to return any results - for fear that they may be old. However, the check did not take worker use cases into account; if a separate worker from that answering the /shared_rooms endpoint was updating the users_in_public_rooms and users_who_share_private_rooms tables, the /shared_rooms worker would incorrectly assume that the tables were not being updated.

The solution initially proposed above was to deprecate the update_user_directory: True|False config option, and replace it with one update_user_directory_on: WorkerName|None. This would allow workers to be able to tell the difference between the functionality being completely disabled (None) and running on a separate worker (WorkerName == self.worker_name). The /shared_rooms endpoint would then only raise on update_user_directory_on being None, granting sysadmins the flexibility to run user directory updates and handle this endpoint on separate workers.

A bit later, in #10339 (comment) a few comments down, it was proposed for None to instead mean "the main process". Updating the user directory would then always occur (either on the named worker, or the main process). Rejoice! The user directory tables would always be up to date. However, it was then noted in #11450 (comment) that it is useful to be able to disable the functionality for homeservers that don't need it. This could be achieved by setting update_user_directory_on to the name of a worker that does not exist.


Honestly, after having read everything over I think we've essentially arrived at where we started. The solution of update_user_directory_on: WorkerName|None (where None means the main process) grants us the ability for workers to know which other worker is handling user directory updates. Yet that is not actually helpful in the context of this issue -- the /shared_rooms endpoint only cares about whether it is being updated at all.

And at this point I feel that's something that it shouldn't care about. If we just remove the check, we gain the ability to run this endpoint across different workers, without the need for deprecations. If the user directory isn't running on a configured worker - which is not the default behaviour - then that was the choice of the sysadmin; and the docs should warn about that implication.

An option like update_user_directory_on: WorkerName may come in handy in the future if processes end up needing to replicate to the worker handling user directory updates -- but that's not the case today.

Sorry for the massive run-around.

@ShadowJonathan
Copy link
Contributor

ShadowJonathan commented Feb 14, 2022

Thanks for the summary, this is helpful with knowing where this issue stands.

I agree with the above points, and I'd wanna make a PR which simply removes the check.

I think at this point #11450 has run its course, and at this point it's no use to push it forward anymore, so I'm thinking of closing it on that note (if everyone agrees with this).

Instead, to resolve this issue, I'll open a simple PR which removes the check and adds an extra note/warning to the update_user_directory config docs.

How's that sound?

@anoadragon453
Copy link
Member

Sounds good to me.

@richvdh
Copy link
Member

richvdh commented Feb 14, 2022

AIUI, the current situation is that the user directory must be updated and searched on a single worker process. "Searching" means the current documented endpoint ^/_matrix/client/(api/v1|r0|v3|unstable)/user_directory/search$, as well as the currently undocumented endpoint ^/_matrix/client/unstable/uk.half-shot.msc2666/user/shared_rooms/.*$.

Could you be explicit as to whether this is a correct understanding? https://github.com/matrix-org/synapse/blob/develop/docs/workers.md#synapseappuser_dir says that /user_directory/search must be routed to the user_dir worker. If that is correct, I am nervous about simply removing the check that you propose: again, I don't understand why it would be ok to relax the constraint for shared_rooms but not /user_directory/search.

(OTOH, it would be far from the only place that the workers doc is misleading/wrong.)

@ShadowJonathan
Copy link
Contributor

ShadowJonathan commented Feb 14, 2022

From a quick look at the code, it doesnt seem like /user_directory/search necessarily requires to be handled by the worker that is updating it, so those docs are wrong with being so strict.

I imagine it was written this way to make sure that there's a running user directory updater somewhere (either on main or on the worker), as to make things more fool-proof. The only real thing these endpoints (search and shared_rooms) require is a running user directory updater, be it either on main or on the dedicated worker. I think documenting this could maybe help your concerns?

Imo, the documentation should mention that the preferred way to do this is to make sure those paths are handled by the dedicated worker, but that it can be handled by any other (if the admin knows what they're doing).

(Though the code confusingly sets the config update_user_directory runtime variable to True somewhere halfway setup, so this access will be True if it's the main process and update_user_directory is True, though then also True when the current worker_app is [...].user_dir, but False on any other worker)

@richvdh
Copy link
Member

richvdh commented Feb 15, 2022

right, understood. Please could you open an issue noting the incorrect documentation?

@richvdh
Copy link
Member

richvdh commented Feb 15, 2022

(it also means that #11450 probably is the right solution to that problem, even if it's not the right solution to this problem: you should be able to choose which worker updates the directory, without it having to be dedicated to that purpose. See also #10441)

@anoadragon453
Copy link
Member

anoadragon453 commented Feb 15, 2022

(it also means that #11450 probably is the right solution to that problem, even if it's not the right solution to this problem: you should be able to choose which worker updates the directory, without it having to be dedicated to that purpose. See also #10441)

Ignoring worker types, I'm not sure how one worker having update_user_directory: true and the others false is different from all workers having update_user_directory_on: WorkerName|None. Other than the ability for the workers to know which other worker is updating the user directory. update_user_directory can be used to choose which worker to update the user directory.

I think we can also deprecate and remove the synapse.app.user_dir worker type without needing to change this option. We just need to tell people to add update_user_directory: false to their main config, and update_user_directory: true to their worker config, and change the worker type from synapse.app.user_dir to synapse.app.generic_worker.

Edit: Ah, so the update_user_directory_on option would additionally be useful in this case. Still, I continue to believe removing worker types isn't a requirement to resolve this issue.

@ShadowJonathan
Copy link
Contributor

and update_user_directory: true to their worker config,

Actually, this will not work, as the config script will force the config option off for all other kinds of workers;

if config.worker.worker_app == "synapse.app.user_dir":
if config.server.update_user_directory:
sys.stderr.write(
"\nThe update_user_directory must be disabled in the main synapse process"
"\nbefore they can be run in a separate worker."
"\nPlease add ``update_user_directory: false`` to the main config"
"\n"
)
sys.exit(1)
# Force the pushers to start since they will be disabled in the main config
config.server.update_user_directory = True
else:
# For other worker types we force this to off.
config.server.update_user_directory = False

Unless we remove this bit, ofc, but we'd need a migration path for that, as it could then lead to all processes simultaneously believing they should update the user directory.


In any case, i'm currently interested in resolving this issue, so i'm still intending to submit a PR removing this check, adding extra documentation to update_user_directory, and also update workers.md wrt these details.

@anoadragon453
Copy link
Member

If we remove the check on update_user_directory: True in the shared rooms code, then that unbinds the shared_rooms API from needing to be run on a user directory worker. It could instead be run on a generic worker, while the user directory is updated on a separate, synapse.app.user_dir worker.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
T-Defect Bugs, crashes, hangs, security vulnerabilities, or other reported issues.
Projects
None yet
4 participants