Next Hop-based routing with fallback to managed flooding #2856

Open
wants to merge 60 commits into
base: master

Member

@GUVWAF GUVWAF commented Oct 2, 2023

Description

This adds a NextHopRouter for direct messages, which only relays if it is the next hop for a packet. The next hop is set by the current relayer of a packet, which bases this on information from a previous successful delivery to the destination via flooding.
Specifically, the PacketHistory keeps track of up to 3 relayers of each packet. When the ACK is delivered back to us via a node that also relayed the original packet, we use that node as the next hop for the destination from then on. This ensures that a next hop is only assigned when there is a two-way connection. Both the ReliableRouter and the NextHopRouter do retransmissions (the NextHopRouter only once). On the final retry, if no one actually relayed the packet, the next hop is reset in order to fall back to the FloodingRouter. Note that intermediate hops will therefore also do a single retransmission if the intended next hop didn't relay, which repairs changes in the middle of the route.
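
The learning rule described above can be sketched as follows. This is a minimal Python illustration, not the firmware code: `PacketRecord`, `learn_next_hop`, `should_relay` and the `NO_NEXT_HOP = 0` sentinel are hypothetical names chosen for the example, and the real flooding path applies further checks that are left out here.

```python
from dataclasses import dataclass, field

NO_NEXT_HOP = 0   # assumed sentinel: 0 means "no next hop known, use flooding"
MAX_RELAYERS = 3  # the description says up to 3 relayers are remembered per packet


@dataclass
class PacketRecord:
    """Hypothetical stand-in for one PacketHistory entry."""
    relayers: list = field(default_factory=list)  # last bytes of nodes heard relaying

    def add_relayer(self, relayer_byte: int) -> None:
        # Remember a relayer once, up to the per-packet limit.
        if relayer_byte not in self.relayers and len(self.relayers) < MAX_RELAYERS:
            self.relayers.append(relayer_byte)


def learn_next_hop(record: PacketRecord, ack_relayer_byte: int,
                   current_next_hop: int = NO_NEXT_HOP) -> int:
    """Return the next hop to store for the destination after an ACK arrives.

    The next hop is only updated when the ACK came back via a node that also
    relayed the original packet, i.e. when the link works in both directions.
    """
    if ack_relayer_byte in record.relayers:
        return ack_relayer_byte
    return current_next_hop


def should_relay(my_last_byte: int, packet_next_hop: int) -> bool:
    """Relay when no next hop is set (flooding) or when we are the next hop."""
    return packet_next_hop == NO_NEXT_HOP or packet_next_hop == my_last_byte
```

For example, if nodes `0x2A` and `0x7F` relayed the original packet and the ACK returns via `0x2A`, then `0x2A` becomes the next hop; an ACK returning via some other node leaves the stored value unchanged.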

It is backwards compatible with all 2.x versions of Meshtastic, because for those nodes it will always fall back to flooding.

Implementation details

The next_hop and relay_node are added to the unencrypted PacketHeader. Since we only used 14 bytes out of the 16-byte header (it got this size due to memory alignment), the two unused bytes are used to save the last byte of the NodeNum of the next hop and current relayer. When hop_start was added (version 2.3), these bytes were set to 0, so we can use these bytes safely when hop_start is set.
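
As a rough illustration of the header re-use, the Python sketch below packs a 16-byte header whose last two bytes carry the last byte of the next hop's and the relayer's NodeNum. The names and order of the first 14 bytes (`to`, `sender`, `packet_id`, `flags`, `channel`) and the little-endian byte order are assumptions made for this example; only the 14-of-16 split and the two trailing routing bytes come from the description above.

```python
import struct

# Assumed layout: three 32-bit fields, then four single bytes
# (flags, channel, next_hop, relay_node). 12 + 4 = 16 bytes in total.
HEADER_FORMAT = "<IIIBBBB"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)


def last_byte(node_num: int) -> int:
    """Only the least significant byte of a NodeNum is sent over the air."""
    return node_num & 0xFF


def pack_header(to: int, sender: int, packet_id: int, flags: int, channel: int,
                next_hop_num: int, relayer_num: int) -> bytes:
    """Build the header, filling the two formerly unused bytes."""
    return struct.pack(HEADER_FORMAT, to, sender, packet_id, flags, channel,
                       last_byte(next_hop_num), last_byte(relayer_num))


def unpack_routing_bytes(header: bytes) -> tuple:
    """Return (next_hop, relay_node) from a packed header."""
    *_, next_hop, relay_node = struct.unpack(HEADER_FORMAT, header)
    return next_hop, relay_node
```

Because the header size does not change, a receiver that ignores the last two bytes still parses the first 14 bytes exactly as before.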
The re-use of header space means that there is no additional over-the-air overhead, except that up to 2 retransmissions are needed when a change of next hop occurs. The benefit will therefore be most pronounced for rather static meshes. However, since the next hop is only set after a successful two-way connection has been established, and it falls back to flooding rather quickly, there is likely a benefit even for dynamic meshes.

In terms of memory overhead, the next_hop has to be stored in the NodeDB (only 1 byte), and there are 4 additional bytes required per packet in the PacketHistory to store the next_hop and three relayers.

Using only 1 byte for the next_hop means that there is a chance that the last bytes of two nodes match, in which case both will try to relay. However, that's not a big issue, as it would be similar to flooding. The chance that only the intended next hop relays depends on the number of nodes that can hear the packet. If there are 10 nodes (that would be a lot), the chance that only the next hop relays is 83.7% ((255/256*254/256*253/256*...*247/256)*100); for 5 nodes the chance is 96.1% ((255/256*254/256*253/256*252/256)*100).
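
The two percentages can be reproduced with a few lines of Python. The function names are made up for this sketch. Note that the product used above is the probability that all last bytes among the nodes are distinct; the second function shows the looser condition that no other node shares the byte of the intended next hop, which is an alternative reading added here and not part of the description.

```python
from math import prod


def p_all_distinct(n_nodes: int, space: int = 256) -> float:
    """Probability that n_nodes uniformly random last bytes are all different."""
    return prod((space - k) / space for k in range(n_nodes))


def p_no_clash_with_next_hop(n_nodes: int, space: int = 256) -> float:
    """Probability that none of the other n_nodes - 1 nodes shares the next hop's byte."""
    return ((space - 1) / space) ** (n_nodes - 1)


for n in (5, 10):
    print(n, round(100 * p_all_distinct(n), 1), round(100 * p_no_clash_with_next_hop(n), 1))
```

Under the assumption of uniformly random last bytes, the first function gives 96.1% for 5 nodes and 83.7% for 10 nodes, matching the figures above, while the looser condition gives roughly 96.5% for 10 nodes.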

Examples

With this, you get rid of the unnecessary rebroadcasts caused by flooding like the one from node 0 as shown below. (Note that an arrow to a node means that it received it (so there might be multiple arrows for one packet), but not necessarily that it was addressed.)
image
With the NextHopRouter, node 0 doesn't try to relay:
image

Furthermore, it solves this issue of the current implementation where the wrong node (1, because it has the lowest SNR) is relaying:
image
With the NextHopRouter, due to randomness in the order of rebroadcasting, at some point the route will succeed and 2 will be set as next hop from then on.
image

Notes for reviewers

NextHopRouter inherits from the FloodingRouter. Since it also requires retransmissions, this logic is now moved from the ReliableRouter to the NextHopRouter, and the ReliableRouter inherits from the NextHopRouter.
Also, since the Router and NextHopRouter need to have access to the PacketHistory as well, the Router now inherits from it instead of only the FloodingRouter.

@loodydo
Contributor

loodydo commented Oct 2, 2023

I have started looking. May take me some time to finish. I am not a statistician so I asked ChatGPT about the probability that the last byte of two nodes match. It came up with the result of 16.31%. Considering that this is a non-breaking change I would think that the risk of two nodes broadcasting is still worth it.

@github-actions
Contributor

github-actions bot commented Oct 3, 2023

🤖 Pull request artifacts

file commit
pr2856-firmware-2.2.10.44dc270.zip 44dc270

thebentern added a commit to meshtastic/artifacts that referenced this pull request Oct 3, 2023
@GUVWAF
Member Author

GUVWAF commented Oct 3, 2023

> I am not a statistician so I asked ChatGPT about the probability that the last byte of two nodes match. It came up with the result of 16.31%. Considering that this is a non-breaking change I would think that the risk of two nodes broadcasting is still worth it.

It depends on the number of nodes that can hear the packet (that is, the number of immediate neighbors). But indeed, the chance that only the intended next hop relays is a bit different from what I presented before. I've updated the description with two calculations.

@GUVWAF
Member Author

GUVWAF commented Oct 13, 2023

I pushed a new commit to resolve @loodydo’s comments. I also added the hop limit setting of the original transmitter to the header flags (using 3 of the 4 currently unused bits), so that we can determine the number of hops the packet has already traveled. This comes in handy for setting the next hop for immediate neighbors, and I think it’s also useful for display in the apps. Currently they always show the SNR/RSSI of nodes in the list, but this is actually the SNR/RSSI to the last relayer. When a node is not an immediate neighbor, we could display the number of hops towards that node instead.
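
A small Python sketch of how the hops already traveled could be derived from the flags byte. The bit positions used here (hop limit in the lowest 3 bits, the original hop limit in the highest 3 bits) are an assumption for illustration; the comment above only states that 3 of the 4 unused flag bits are used. The function and constant names are hypothetical.

```python
HOP_LIMIT_MASK = 0x07   # assumed: remaining hop limit in bits 0-2
HOP_START_MASK = 0xE0   # assumed: original hop limit (hop_start) in bits 5-7
HOP_START_SHIFT = 5
OTHER_BITS_MASK = 0x18  # remaining flag bits, passed through untouched


def encode_flags(hop_limit: int, hop_start: int, other_bits: int = 0) -> int:
    """Pack the remaining and original hop limit into one flags byte."""
    return ((hop_limit & HOP_LIMIT_MASK)
            | ((hop_start << HOP_START_SHIFT) & HOP_START_MASK)
            | (other_bits & OTHER_BITS_MASK))


def hops_traveled(flags: int):
    """Return hop_start - hop_limit, or None when hop_start looks unset."""
    hop_limit = flags & HOP_LIMIT_MASK
    hop_start = (flags & HOP_START_MASK) >> HOP_START_SHIFT
    # A zero or too-small hop_start is treated as "sender did not set it".
    if hop_start == 0 or hop_start < hop_limit:
        return None
    return hop_start - hop_limit
```

A result of 0 would indicate an immediate neighbor, which is the case where the sender itself can be stored as next hop. One limitation of this sketch is that a packet deliberately sent with a hop limit of 0 cannot be told apart from one where the field is unset.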

I also changed how the next_hop and relay_node are stored in the MeshPacket and NodeDB. Now only the last byte is stored, since only that byte is sent over the air. This also avoids assigning the wrong NodeNum when you happen to have multiple nodes with the same last byte in the NodeDB.

thebentern added a commit to meshtastic/artifacts that referenced this pull request Oct 13, 2023
@GUVWAF
Member Author

GUVWAF commented Nov 26, 2023

I think this PR is ready to merge, except that it’s a breaking change unfortunately.
I did a test and it seems that the two unused bytes are not set to zero with current firmware. This means that with this PR, a node would think someone had set the next hop for a specific node, and no one would relay the message.
We’ll have to wait for 3.0 with this.

Edit: The unused bytes are set to zero from 2.3 on, which is also when hop_start was introduced. So, if hop_start is set, we can safely use the bytes.

@GUVWAF GUVWAF added the enhancement (New feature or request) label and removed the requires-protos (A change in device that requires changes to protobufs) label Nov 19, 2024
@todd2982

I like the concept of this a lot. The only thing I'd like to mention that I don't see so far is nodes on the move. If a user receives a message while in a car/plane/rowboat/etc., the best route back could have changed rapidly between receiving the message and typing a reply. This is even more of an issue if the sender, or nodes along the original route, are also moving. I think some checks against GPS data for movement speed should go into deciding whether falling back to flood routing is required.

This may not be feasible with the default GPS polling intervals, but moving through the landscape should be taken into account somehow. I also understand that flood routing is an automatic fallback, but if we can smartly choose the best option we should.

@GUVWAF
Member Author

GUVWAF commented Nov 22, 2024

@todd2982 Thanks for your comment. It’s true that for moving nodes, the benefit is not so clear. However, the nodes would first have to set up the next hop based on a successful transmission where the ACK came back via the same node that relayed the original packet. If the next hop has to change afterwards, only two retransmissions (for the original transmitter) or one (for an intermediate hop) are needed in order to fall back to flooding. Since nodes only retransmit when nobody has rebroadcast yet, the wasted bandwidth is very limited, and would be much less than what can be gained by not using flooding, even if the next hop is only used once.
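
The fallback behaviour can be sketched as a small decision function. This is a hedged Python illustration with hypothetical names (`on_retransmit_timer`, `NO_NEXT_HOP`) and an assumed retry budget of 2 for the original transmitter and 1 for an intermediate hop, following the counts mentioned above; it is not the firmware implementation.

```python
NO_NEXT_HOP = 0  # assumed sentinel meaning "flood instead of addressing a next hop"


def on_retransmit_timer(retries_left: int, next_hop: int, heard_relay: bool):
    """Decide what to do when the retransmission timer fires.

    Returns a tuple (action, next_hop_to_use).
    """
    if heard_relay:
        # Someone rebroadcast the packet, so no retransmission is needed.
        return ("done", next_hop)
    if retries_left == 0:
        return ("give_up", next_hop)
    if retries_left == 1:
        # Final retry and nobody relayed: reset the next hop to fall back to flooding.
        return ("retransmit", NO_NEXT_HOP)
    return ("retransmit", next_hop)
```

With a budget of 2, the original transmitter first retries towards the stored next hop and then floods; with a budget of 1, an intermediate hop floods on its single retry.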

Indeed, using GPS data would be difficult, not only because not all nodes have GPS, but also because those that do might not transmit location data frequently enough. You would need to incorporate the relative movement (with some threshold) between nodes, since when you’re moving with a convoy of buggies in the desert, the next hop likely remains the same. And even when there is relative movement, the next hop might remain the same, e.g., when you’re driving around a router on a mountain top.

Lastly, periodic position transmissions used for tracking mobile nodes or gathering other sensory data are broadcast messages, so they do not use next-hop routing.

So, incorporating GPS data would require quite some memory and processing power for a potential minor gain, or even a negative gain, so I don’t think it’s worth it.

@todd2982

I understand. I just wanted to bring it up. Since I first read this on the project board that scenario was stuck in my head. GPS is what made sense to me, but I mainly just wanted to have the people smarter than me (I?) considering the same scenario.

I look forward to seeing how this works out!

@GUVWAF GUVWAF changed the title from "Next Hop-based routing with fallback to flooding" to "Next Hop-based routing with fallback to managed flooding" Jan 11, 2025
@GUVWAF GUVWAF added the 2.6 Planned for next point release label Jan 12, 2025
@fifieldt
Contributor

@GUVWAF , do you have any suggestions on how one may test this?

@GUVWAF
Member Author

GUVWAF commented Jan 13, 2025

@fifieldt You could flash it on a couple of devices and send some DMs or traceroutes. If all is well, these should be at least as reliable as before, and if so, it should also be more efficient, but that would be most pronounced if all nodes are updated.
Especially going from e.g. a direct connection to one with a hop in between would be interesting to test. In my testing it switched seamlessly.

If you want to see that it's really assigning the next hops, you will have to look at the serial logs.

@medentem
Contributor

@GUVWAF now that moving nodes and asymmetric links are modeled in the simulator, maybe we should test how this performs under those conditions...

@GUVWAF
Member Author

GUVWAF commented Jan 21, 2025

@medentem I did some crude simulations using the interactive simulator by letting it assign next hops and then removing the nodes (using remove <id>) to ensure it falls back to flooding after 1 retransmission. I’ve also hacked in asymmetric links (by increasing the Tx power for some nodes) to make sure it only assigns a next hop when you can both hear and transmit to the node.

Indeed I’ve not done large-scale simulations with moving nodes, which might be worth doing if I (or someone) finds the time. However, the benefit of using dedicated relayers for DMs is much more obvious. Even if you only use the next-hop once and the next time you have to fall back to flooding again because the node moved, the penalty is just one retransmission, whereas by not using flooding you’ll have likely saved multiple transmissions.

@wlockwood

wlockwood commented Jan 21, 2025

> I have started looking. May take me some time to finish. I am not a statistician so I asked ChatGPT about the probability that the last byte of two nodes match. It came up with the result of 16.31%. Considering that this is a non-breaking change I would think that the risk of two nodes broadcasting is still worth it.

I don't think that's accurate. If we assume that the node IDs are completely random (they're probably not!), the chance that two given nodes share a last byte would be 1 out of 256 (since one byte covers 0-255), which is about 0.4%. Given that these are hardware values assigned by the manufacturer, there's a good chance that they are not random and instead incrementing, but since this is the final byte and thus should change the most frequently if it's the least significant byte, it should be close enough to random for the purposes of this PR. ... I wanted to validate that, but I don't have a convenient data source to do so.

Labels: 2.6 (Planned for next point release), enhancement (New feature or request), pinned (Exclude from stale processing)