Skip to content
This repository has been archived by the owner on Nov 25, 2024. It is now read-only.

Add NATS JetStream support #1866

Merged
merged 108 commits into from
Jan 5, 2022
Merged

Conversation

S7evinK
Copy link
Contributor

@S7evinK S7evinK commented May 30, 2021

This adds experimental support for NATS JetStream by implementing the required interfaces for sarama.

Currently uses my experimental NATS JetStream sarama implementation, seems like there are still a few issues in there, as e.g. receipts/read markers don't seem to always come through.

On a side note, I was able to join #dendrite-dev and still receive messages in other rooms.

Kafka (with 3 nodes, 1 zookeeper) is using about ~4G of memory.. NATS on the other side (also 3 nodes, no manager needed) - 120M. (As reported by Prometheus) Kafka/NATS are both idle.

Some screenshots from NATS:
image

image

Pull Request Checklist

  • I have added any new tests that need to pass to sytest-whitelist as specified in docs/sytest.md
  • Pull request includes a sign off

@S7evinK S7evinK marked this pull request as draft June 4, 2021 18:29
Keep typing events for at least one minute
@kegsay
Copy link
Member

kegsay commented Jun 7, 2021

Fantastic work on this, thanks @S7evinK for all of your contributions thus far! I'll try to block out some time to look at this, but it'll take a while to migrate fully to NATS. We still need to sketch out and profile whether the swap would be net positive for large deployments. The main drivers for wanting to swap away from Kafka/Naffka are:

  • Unbounded log growth - yes Kafka has a retention period but it's become clearer as the project has developed that we primarily want tiny retention periods, particularly for embedded use cases.
  • Inability to send ephemeral events (generic pubsub to allow us to request actions on other components and register callbacks). Being able to do this will open up more ways to prevent race conditions between components.

@S7evinK
Copy link
Contributor Author

S7evinK commented Jul 9, 2021

Even after some more testing and trying to fix things, still the same issue with receipts not coming through. Looks like, at least that's what i guess, that only m.fully_read make their way to Element.

After a few days in #dendrite:

# nats stream report
Obtaining Stream stats

+---------------------------------------------------------------------------------------------------------------+
|                                                 Stream Report                                                 |
+---------------------------------+---------+-----------+----------+---------+------+---------+-----------------+
| Stream                          | Storage | Consumers | Messages | Bytes   | Lost | Deleted | Replicas        |
+---------------------------------+---------+-----------+----------+---------+------+---------+-----------------+
| DendriteOutputSendToDeviceEvent | File    | 2         | 0        | 0 B     | 0    | 0       | nats-cluster-0* |
| DendriteOutputTypingEvent       | File    | 2         | 0        | 0 B     | 0    | 0       | nats-cluster-0* |
| DendriteOutputClientData        | File    | 1         | 57       | 9.7 KiB | 0    | 0       | nats-cluster-0* |
| DendriteOutputReceiptEvent      | File    | 2         | 535      | 164 KiB | 0    | 0       | nats-cluster-2* |
| DendriteOutputKeyChangeEvent    | File    | 2         | 798      | 711 KiB | 0    | 0       | nats-cluster-2* |
| DendriteOutputRoomEvent         | File    | 2         | 1,537    | 4.0 MiB | 0    | 0       | nats-cluster-1* |
+---------------------------------+---------+-----------+----------+---------+------+---------+-----------------+

…ntly this is upsetting tests that don't expect that"

This reverts commit 3686752.
Squashed commit of the following:

commit 13f9028
Author: Neil Alexander <[email protected]>
Date:   Tue Jan 4 15:47:14 2022 +0000

    Do the same for leave

commit e6be7f0
Author: Neil Alexander <[email protected]>
Date:   Tue Jan 4 15:33:42 2022 +0000

    Enforce state key matches sender

commit 85ede6d
Author: Neil Alexander <[email protected]>
Date:   Tue Jan 4 14:07:04 2022 +0000

    Fix panics on closed channel sends

commit 9755494
Author: Neil Alexander <[email protected]>
Date:   Tue Jan 4 13:38:22 2022 +0000

    Don't report any errors on `/send` to see what fun that creates

commit 3bb4f87
Author: Neil Alexander <[email protected]>
Date:   Tue Jan 4 13:00:26 2022 +0000

    Revert "Don't report event rejection errors via `/send`, since apparently this is upsetting tests that don't expect that"

    This reverts commit 3686752.

commit fe2673e
Author: Neil Alexander <[email protected]>
Date:   Tue Jan 4 12:09:34 2022 +0000

    Go 1.16 instead of Go 1.13 for upgrade tests and Complement

commit 3686752
Author: Neil Alexander <[email protected]>
Date:   Tue Jan 4 11:51:45 2022 +0000

    Don't report event rejection errors via `/send`, since apparently this is upsetting tests that don't expect that

commit b028dfc
Author: Neil Alexander <[email protected]>
Date:   Tue Jan 4 10:29:08 2022 +0000

    Send final event in `processEvent` synchronously (since this might stop Sytest from being so upset)
@neilalexander neilalexander marked this pull request as ready for review January 4, 2022 16:07
@neilalexander neilalexander requested a review from a team as a code owner January 4, 2022 16:07
Copy link
Member

@kegsay kegsay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two main complaints:

  • Phony changes are an unwelcome addition to this already enormous PR.
  • Scattered Ack() calls throughout consumer code makes me extremely nervous that we will forget to ACK in places and I don't know what happens then? I guess it will get stuck on that event which is obviously not ideal.

clientapi/routing/createroom.go Show resolved Hide resolved
dendrite-config.yaml Show resolved Hide resolved
dendrite-config.yaml Show resolved Hide resolved
docs/INSTALL.md Outdated Show resolved Hide resolved
federationapi/consumers/eduserver.go Outdated Show resolved Hide resolved
}
inbox, _ := r.workers.LoadOrStore(roomID, &phony.Inbox{})
inbox.(*phony.Inbox).Act(nil, func() {
_ = msg.InProgress()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we call InProgress twice for the same message?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was because of the fact that the inbox might not process the task until later, so we reset the delivery timer when we start working on the event as well. As above though in the other comment, I don't think we actually need these at all, so I will remove them.

defer roomserverInputBackpressure.With(prometheus.Labels{"room_id": roomID}).Dec()
var inputRoomEvent api.InputRoomEvent
if err := json.Unmarshal(msg.Data, &inputRoomEvent); err != nil {
_ = msg.Term()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we do this here but in other cases we just Ack() or Nack()?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was to do with trying to queue things up using request-reply, rather than putting things straight onto the inboxes themselves for synchronous requests. I never finished it and decided it would be better to do that in a separate PR later, but the Term here is pretty much equivalent here to an Ack.

roomserver/internal/input/input.go Show resolved Hide resolved
roomserver/internal/input/input.go Show resolved Hide resolved
roomserver/internal/input/input.go Show resolved Hide resolved
Copy link
Member

@kegsay kegsay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM - Let's do it!

neilalexander added a commit to matrix-org/sytest that referenced this pull request Jan 5, 2022
neilalexander added a commit to matrix-org/sytest that referenced this pull request Jan 5, 2022
@neilalexander neilalexander merged commit 161f145 into matrix-org:master Jan 5, 2022
@r3k2
Copy link

r3k2 commented Jan 6, 2022

Awesome! how do we change our current server/setup to test this?

@neilalexander
Copy link
Contributor

Take a look at the jetstream section in the version 2 sample config file and make sure you have a store_dir that is read-writable by the Dendrite process, and you should be able to just upgrade to it with any luck. Let me know if you have any problems.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants