
Tracking: security: Limit ability of synthetic nodes to spread throughout network. Credit: Ziggurat Team #7824

Closed
4 of 9 tasks
mpguerra opened this issue Oct 25, 2023 · 27 comments
Assignees
Labels
A-network Area: Network protocol updates or fixes C-security Category: Security issues I-remote-node-overload Zebra can overload other nodes on the network I-remote-trigger Remote nodes can make Zebra do something bad S-needs-design Status: Needs a design decision S-needs-triage Status: A bug report needs triage

Comments

@mpguerra

mpguerra commented Oct 25, 2023

Motivation

"RT-S1 f1" from "Red Team report on Zcash testnet"

Zebra accepts peers from its peers much more easily when compared with the zcashd node. This behavior allows synthetic nodes (which do not do much in the network) to spread quite fast throughout the testnet network.
But at least Zebra doesn’t share all peers in the Addr message. It always shares a portion of its peers.

And

"RT-S1 f2" from "Red Team report on Zcash testnet"

Bad behavior
Zcashd also spreads synthetic nodes, but more slowly.
The node will always share all IPs to which it has connections.

Good behavior
The time to accept new peers is much longer because:

  • A zcashd node randomly processes a single node per Addr message.
  • A zcashd node spreads peers more carefully: it will never share more than one node from the same IP. Once it decides which node (identified by IP and port) will be shared, it always shares that same node. Even if zcashd has multiple connections to nodes from the same IP, it will only share a single node in the peer list.
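
For illustration, here is a minimal sketch of the per-IP selection described above. It is not zcashd's or Zebra's actual code; the helper name and approach are just illustrative:

use std::collections::HashMap;
use std::net::{IpAddr, SocketAddr};

/// Share at most one peer per IP, always picking the same peer for a given IP
/// so repeated queries return a stable choice. (Illustrative helper only.)
fn dedup_peers_by_ip(mut peers: Vec<SocketAddr>) -> Vec<SocketAddr> {
    // Sort first so the per-IP choice is deterministic across calls.
    peers.sort();
    let mut chosen: HashMap<IpAddr, SocketAddr> = HashMap::new();
    for peer in peers {
        // Keep the first (lowest) address seen for each IP.
        chosen.entry(peer.ip()).or_insert(peer);
    }
    chosen.into_values().collect()
}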

Complex Code or Requirements

There is nothing particularly complex here.

Potential Fixes

Before starting work, discuss the impact of each potential fix on this vulnerability with the team, and decide on 1-3 fixes that have the greatest impact.

Required Changes

These changes must be made to fix the issues:

To deal with the drawbacks of these other changes, make this small change in the same PR as #1869:

  • When limiting the number of taken addresses, keep the un-taken addresses in the connection cache in case we need more
    • Advantages: Zebra will find peers faster if we keep all of the addresses
    • Disadvantages: Addresses could become outdated
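
A minimal sketch of this change, assuming a simple per-connection cache type. The names and structure are illustrative rather than Zebra's actual types, and the real fix should also pick the taken addresses at random:

use std::collections::VecDeque;
use std::net::SocketAddr;

/// Per-connection cache of addresses gossiped by one peer (illustrative only).
struct ConnectionAddrCache {
    cached: VecDeque<SocketAddr>,
}

impl ConnectionAddrCache {
    /// Take at most `limit` addresses for the address book, keeping the
    /// un-taken addresses cached in case we need more later.
    fn take_limited(&mut self, limit: usize) -> Vec<SocketAddr> {
        let take = limit.min(self.cached.len());
        self.cached.drain(..take).collect()
    }
}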

Accepted Changes

These changes are useful but not strictly required:

Needs More Analysis

These changes were accepted, but they've since turned out to be more complicated than expected:

  • Always keep the latest unique addresses from each connected peer, regardless of which address message they were sent in
    • required analysis: before making this change, check how often the cache is truncated. If it is only rarely truncated, this change isn't needed.
    • benefits: we always have the latest addresses from each peer, so we only drop excess addresses. There are no duplicate addresses in the cache. This improves connectivity.
    • drawbacks: complicates the code
  • Remove Option from MetaAddr.{services,untrusted_last_seen}
    • required analysis: before making this change, list every MetaAddrChange that uses None for either of these fields, and decide on acceptable defaults
    • benefits: code is simpler
    • drawbacks: we have to supply defaults, but those defaults are probably correct (or at least acceptable)

Rejected Changes

Changes that aren't needed or don't work
  • When a peer is disconnected, send its cached addresses to the address book as NeverAttemptedAlternate (or a new status that is a lower priority than failed, or move NeverAttemptedAlternate to a lower priority than failed)
    • benefits: we don't drop addresses, so we can use them later if we can't connect to any other addresses
    • drawbacks: peers can fill our address book by connecting, sending addresses, then disconnecting
  • send any excess addresses to the address book as NeverAttemptedAlternate when they're received, but ignore them in the outbound connector until there haven't been any other candidates for a few minutes
    • benefits: never loses addresses, seeks new addresses from different peers before using all the addresses from one peer
    • drawbacks: imitates the existing reconnection attempt ordering, delays could cause hangs, complicates outbound connector

Testing

  • Check that we are only taking a few addresses from each peer
  • Check that repeated getaddr requests return the same results
  • Manual inspection of the version address sending code
  • Manual inspection of the address cache code
  • Existing integration testing

Related Work

@mpguerra mpguerra added C-enhancement Category: This is an improvement S-needs-triage Status: A bug report needs triage P-Medium ⚡ I-remote-node-overload Zebra can overload other nodes on the network A-network Area: Network protocol updates or fixes labels Oct 25, 2023
@mpguerra mpguerra added this to Zebra Oct 25, 2023
@github-project-automation github-project-automation bot moved this to 🆕 New in Zebra Oct 25, 2023
@teor2345 teor2345 added C-security Category: Security issues I-remote-trigger Remote nodes can make Zebra do something bad and removed C-enhancement Category: This is an improvement labels Oct 25, 2023
@teor2345 teor2345 changed the title feature: Limit ability of synthetic nodes to spread throughout network. Credit: Ziggurat Team security: Limit ability of synthetic nodes to spread throughout network. Credit: Ziggurat Team Oct 25, 2023
@teor2345

A good first step would be listing our existing defences against this attack. Note that some of Zebra's behaviour in the report is outdated.

@mpguerra who would you like to do that work?

@teor2345

Our previous analysis of this issue is in ticket #1869, where we listed our defences at the time, and decided they were good enough.

@teor2345

I think this was closed by mistake

@teor2345 teor2345 reopened this Oct 26, 2023
@teor2345 teor2345 added the S-needs-design Status: Needs a design decision label Oct 30, 2023
@mpguerra

mpguerra commented Nov 7, 2023

Hey team! Please add your planning poker estimate with Zenhub @arya2 @oxarbitrage @teor2345 @upbqdn

@teor2345

teor2345 commented Nov 7, 2023

I estimated this ticket assuming that we'd choose 1-3 potential fixes from the list of suggested fixes, and not do all of them.

@arya2 arya2 self-assigned this Nov 9, 2023
@arya2

arya2 commented Nov 9, 2023

  • Limit the number of addresses taken from any one peer to less than half the outbound connection limit
  • Keep the un-taken addresses from that peer in case we need more, always keep the latest N unique addresses from each connected peer

Could we do just these fixes here, or re-open #1869 for these fixes and convert this one to a tracking issue if we want to do more?

When a peer is disconnected, send its addresses as NeverAttemptedAlternate (or a new status that is a lower priority than failed, or move NeverAttemptedAlternate to a lower priority than failed)

It could also send any excess addresses as NeverAttemptedAlternate when they're received, but ignore them in the outbound connector until there haven't been any other candidates for a few minutes.

@teor2345

teor2345 commented Nov 9, 2023

When a peer is disconnected, send its addresses as NeverAttemptedAlternate (or a new status that is a lower priority than failed, or move NeverAttemptedAlternate to a lower priority than failed)

It could also send any excess addresses as NeverAttemptedAlternate when they're received, but ignore them in the outbound connector until there haven't been any other candidates for a few minutes.

If the goal is to delay connecting to these peers until we've tried other more reliable/secure peers, then marking them as alternate already does that. So I'm not sure I want to add more code that introduces similar behaviour using a different mechanism. That seems like it would make any bugs harder to diagnose.

impl Ord for PeerAddrState {
    /// `PeerAddrState`s are sorted in approximate reconnection attempt
    /// order, ignoring liveness.
    ///
    /// See [`CandidateSet`] and [`MetaAddr::cmp`] for more details.
    ///
    /// [`CandidateSet`]: super::peer_set::CandidateSet
    fn cmp(&self, other: &Self) -> Ordering {
        use Ordering::*;
        match (self, other) {
            _ if self == other => Equal,
            // We reconnect to `Responded` peers that have stopped sending messages,
            // then `NeverAttempted` peers, then `Failed` peers
            (Responded, _) => Less,
            (_, Responded) => Greater,
            (NeverAttemptedGossiped, _) => Less,
            (_, NeverAttemptedGossiped) => Greater,
            (NeverAttemptedAlternate, _) => Less,
            (_, NeverAttemptedAlternate) => Greater,
            (Failed, _) => Less,
            (_, Failed) => Greater,
            // These patterns are redundant, but Rust doesn't assume that `==` is reflexive,
            // so the first is still required (but unreachable).
            (AttemptPending, _) => Less,
            //(_, AttemptPending) => Greater,
        }
    }
}

I'm also cautious about adding more state to a system that's already complex. This part of the code has a history of hangs, so I'm particularly cautious of adding further delays.

We already use the version message and remote IP address of inbound connections to create alternate peers. So this change would also change the behaviour of those addresses.

@teor2345

teor2345 commented Nov 9, 2023

Could we do just these fixes here, or re-open #1869 for these fixes and convert this one to a tracking issue if we want to do more?

I've re-opened and updated #1869; some of the designs in it actually made this problem worse. (So I'm glad we waited this long to implement it.)

Before we start work or open tickets, let's:

  • add any other suggested fixes to the list
  • discuss the impact of each potential fix on this vulnerability with the team, and decide on 1-3 fixes that have the greatest impact

I've added your suggestion to the list, and added some benefits and drawbacks.

@teor2345

teor2345 commented Nov 9, 2023

What do you think of this suggestion?

Limit the number of addresses taken from any one peer to less than half the outbound connection limit

I can't see anywhere we limit the number of peers, so the current limit is the 1000 peer protocol limit.
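
For concreteness, here is one way such a limit could be derived. The constant and function names are hypothetical, not Zebra code:

/// The Zcash network protocol allows at most 1000 addresses per `addr` message.
const MAX_ADDRS_IN_MESSAGE: usize = 1_000;

/// Hypothetical per-peer limit: strictly less than half the outbound
/// connection limit, and never more than a single message can carry.
fn per_peer_addr_limit(outbound_connection_limit: usize) -> usize {
    (outbound_connection_limit / 2)
        .saturating_sub(1)
        .min(MAX_ADDRS_IN_MESSAGE)
}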

@arya2

arya2 commented Nov 10, 2023

What do you think of this suggestion?

Limit the number of addresses taken from any one peer to less than half the outbound connection limit

I can't see anywhere we limit the number of peers, so the current limit is the 1000 peer protocol limit.

It looks like the only strictly-required fix for this issue, and it's important that the addresses be selected at random.

@teor2345

Ok, I've marked that as a required change, and added some benefits and drawbacks.

These changes seem similar, so let's discuss them together. What do you think their benefits and drawbacks are?

  • Keep the un-taken addresses from that peer in case we need more
  • Always keep the latest N unique addresses from each connected peer

@teor2345 teor2345 self-assigned this Nov 13, 2023
@arya2

arya2 commented Nov 13, 2023

These changes seem similar, so let's discuss them together. What do you think their benefits and drawbacks are?

  • Keep the un-taken addresses from that peer in case we need more

Zebra will find peers faster if we keep all of the addresses, and will drop the NeverAttemptedAlternate addresses first when the address book is full.

May use more memory if stored per connection.

  • Always keep the latest N unique addresses from each connected peer

I'm not sure what this means.

When a peer is disconnected, send its addresses as NeverAttemptedAlternate (or a new status that is a lower priority than failed, or move NeverAttemptedAlternate to a lower priority than failed)

Why send the addresses when a peer is disconnected rather than when the addresses are received?

@teor2345

These changes seem similar, so let's discuss them together. What do you think their benefits and drawbacks are?

  • Keep the un-taken addresses from that peer in case we need more

Zebra will find peers faster if we keep all of the addresses, and will drop the NeverAttemptedAlternate addresses first when the address book is full.

Can you add this advantage to the ticket?
Are there any disadvantages?

May use more memory if stored per connection.

50 bytes for 1000 addresses for 200 connections is still only 10MB, so I think we're fine here.
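
(That is, 50 bytes × 1,000 addresses × 200 connections = 10,000,000 bytes, or about 10 MB.)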

  • Always keep the latest N unique addresses from each connected peer

I'm not sure what this means.

At the moment, we keep the last unsolicited address message from each peer, ignoring messages with only one address. (Unless we have no addresses at all.) We drop previously stored addresses when we get a new message.

Then if we query that peer, we take the entire message and add it to our address book.

But in this new design we're taking some addresses and leaving some behind. So we might want to add extra unsolicited addresses to the start of our stored addresses (like an OrderedMap), rather than replacing all the unused addresses. We'd need to limit the number of addresses though.
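
A minimal sketch of that idea, with a VecDeque standing in for the OrderedMap and a made-up size limit:

use std::collections::VecDeque;
use std::net::SocketAddr;

// Made-up limit for the sketch, not a Zebra config value.
const MAX_CACHED_ADDRS: usize = 1_000;

/// Most-recent-first cache of unique gossiped addresses from one peer.
struct RecentAddrCache {
    addrs: VecDeque<SocketAddr>,
}

impl RecentAddrCache {
    /// Add newly gossiped addresses to the front, dropping older duplicates
    /// and truncating the oldest entries once the cache is full.
    fn insert_front(&mut self, new_addrs: impl IntoIterator<Item = SocketAddr>) {
        for addr in new_addrs {
            // Keep only the most recent copy of each address.
            self.addrs.retain(|a| *a != addr);
            self.addrs.push_front(addr);
        }
        self.addrs.truncate(MAX_CACHED_ADDRS);
    }
}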

When a peer is disconnected, send its addresses as NeverAttemptedAlternate (or a new status that is a lower priority than failed, or move NeverAttemptedAlternate to a lower priority than failed)

Why send the addresses when a peer is disconnected rather than when the addresses are received?

Because our goal was to only put messages in our address book if we requested them from a peer, rather than letting peers add to our address book on their schedule. If peers can modify our internal address book by sending unsolicited address messages, that makes it harder to implement securely.

But sending addresses when the peer is disconnected also has this issue, so maybe we shouldn't do this at all. Because a peer could connect, send addresses, and disconnect to get them in our address book.

@teor2345

teor2345 commented Nov 14, 2023

So I think that just leaves these proposed changes for discussion:

I think we agree the address book query limit is an essential part of this ticket. Did you want to add advantages and disadvantages in the ticket?

Also, should it be per-connection or an overall limit?
Edit: or both?

@arya2

arya2 commented Nov 14, 2023

Can you add this advantage to the ticket?

Yep, added.

Are there any disadvantages?

Nothing noteworthy, a little complexity.

But in this new design we're taking some addresses and leaving some behind. So we might want to add extra unsolicited addresses to the start of our stored addresses (like an OrderedMap), rather than replacing all the unused addresses. We'd need to limit the number of addresses though.

Ohh.

Advantages:

  • obstructs a straightforward path for an adversary to fill the address book? (are there other constraints that already prevent this?)
  • occasionally avoids starving outbound connector of address candidates

Both of those seem crucial. I think this is also a required change, but perhaps for another high-priority issue.

Disadvantages:

  • Nothing noteworthy, a little bit more complexity

Because our goal was to only put messages in our address book if we requested them from a peer, rather than letting peers add to our address book on their schedule. If peers can modify our internal address book by sending unsolicited address messages, that makes it harder to implement securely.

But sending addresses when the peer is disconnected also has this issue, so maybe we shouldn't do this at all. Because a peer could connect, send addresses, and disconnect to get them in our address book.

Ah, good idea.

I think it would be okay to send them immediately if the address book, when it receives a new NeverAttemptedAlternate address, only inserts it when there is another address with the same PeerAddrState that can be removed. It should drop the new address when the address book is full and there are no other addresses with the same PeerAddrState.
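
A rough sketch of that insertion rule, using simplified stand-in types (a flat HashMap and a hard limit) rather than Zebra's AddressBook:

use std::collections::HashMap;
use std::net::SocketAddr;

#[derive(Clone, Copy, PartialEq, Eq)]
enum PeerAddrState {
    Responded,
    NeverAttemptedGossiped,
    NeverAttemptedAlternate,
    Failed,
    AttemptPending,
}

/// When the book is full, only insert a new `NeverAttemptedAlternate` address
/// if an existing entry in the same state can be evicted; otherwise drop the
/// new address. (Illustrative only.)
fn insert_alternate(
    book: &mut HashMap<SocketAddr, PeerAddrState>,
    limit: usize,
    new_addr: SocketAddr,
) {
    if book.len() < limit {
        book.insert(new_addr, PeerAddrState::NeverAttemptedAlternate);
        return;
    }

    // Look for an existing entry in the same state that we can evict.
    let evict = book
        .iter()
        .find(|(_, state)| **state == PeerAddrState::NeverAttemptedAlternate)
        .map(|(addr, _)| *addr);

    if let Some(evict) = evict {
        book.remove(&evict);
        book.insert(new_addr, PeerAddrState::NeverAttemptedAlternate);
    }
    // Otherwise the book is full and has no same-state entry, so the new
    // address is dropped.
}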

I think we agree the address book query limit is an essential part of this ticket.

Yes, I think so.

Did you want to add advantages and disadvantages in the ticket?

I can't seem to think of any disadvantages, but I added the advantages.

Also, should it be per-connection or an overall limit?

I think it should be per-connection. I don't think the default values should be a problem for systems that are already able to run Zebra, and it's a relatively trivial amount of memory that I don't think will surprise users providing their own peerset_initial_target_size.

Edit: or both?

Or both, but as a separate issue, per-connection is enough for now.

@teor2345

But in this new design we're taking some addresses and leaving some behind. So we might want to add extra unsolicited addresses to the start of our stored addresses (like an OrderedMap), rather than replacing all the unused addresses. We'd need to limit the number of addresses though.

obstructs a straightforward path for an adversary to fill the address book? (are there other constraints that already prevent this?)

The current defence against an adversary filling the address book is:

  1. We only use addresses from a peer if we randomly choose that peer for an address request (we cache previous addresses but don't use them until requested)
  2. We rate-limit how often we request new addresses
  3. We only request new addresses on demand if we need more peers

These defences are already implemented.
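
For illustration, a hedged sketch of defences 2 and 3; the interval is a placeholder, not Zebra's actual rate-limit constant:

use std::time::{Duration, Instant};

// Placeholder interval for the sketch, not Zebra's real rate-limit constant.
const MIN_GETADDR_INTERVAL: Duration = Duration::from_secs(60);

/// Tracks when we last asked a peer for more addresses.
struct AddrRequestLimiter {
    last_request: Option<Instant>,
}

impl AddrRequestLimiter {
    /// Only allow a new address request when we actually need more peers and
    /// the previous request was long enough ago (defences 2 and 3 above).
    fn can_request(&mut self, need_more_peers: bool) -> bool {
        let ready = self
            .last_request
            .map_or(true, |last| last.elapsed() >= MIN_GETADDR_INTERVAL);

        if need_more_peers && ready {
            self.last_request = Some(Instant::now());
            true
        } else {
            false
        }
    }
}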

And let's assume that we implement taking only a few addresses separately. So that advantage has already been assigned to that change.

So I think that the remaining advantage for storing a list of recent addresses is improved connectivity. Because we're keeping more addresses available if we need them and decide to query for them.

Also, should it be per-connection or an overall limit?
I think it should be per-connection. I don't think the default values should be a problem for systems that are already able to run Zebra, and it's a relatively trivial amount of memory that I don't think will surprise users providing their own peerset_initial_target_size.

Edit: or both?
Or both, but as a separate issue, per-connection is enough for now.

I'm not sure it is, there are easy to implement attacks that evade a per-connection limit. Can you think of one?

@teor2345

and will drop the NeverAttemptedAlternate addresses first when the address book is full.

I removed this part because it doesn't match Zebra's current behaviour. It is based on an idea from the ticket that we decided not to do. Currently these addresses are kept in the per-connection cache.

@teor2345

I added a new potential fix:

Put addresses from the peer's version message and remote IP in the connection cache, don't send them directly to the address book

Currently a peer can fill Zebra's address book by repeatedly connecting and sending a different IP in each version message. This seems like a trivial attack that's worth preventing.

It also removes the NeverAttemptedAlternate status (because we're not sending any peers with that status any more, we're just treating them as if they were gossiped) which simplifies the code a lot.

And we can remove the Option on the last seen time, because we'll have to fake a time for these IPs (gossiped peers always have times), and the current time is reasonable for a current connection.
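
A rough sketch of what this could look like, using illustrative types and the current system time as the last-seen value; none of this is Zebra's actual code:

use std::net::SocketAddr;
use std::time::SystemTime;

/// Illustrative per-connection cache entry; these are not Zebra's real types.
struct CachedAddr {
    addr: SocketAddr,
    untrusted_last_seen: SystemTime,
}

struct ConnectionCache {
    addrs: Vec<CachedAddr>,
}

impl ConnectionCache {
    /// Cache an address learned from the peer's `version` message or its
    /// remote IP, instead of sending it straight to the address book.
    fn cache_version_addr(&mut self, addr: SocketAddr) {
        self.addrs.push(CachedAddr {
            addr,
            // The current time is a reasonable last-seen value for a peer we
            // are connected to right now.
            untrusted_last_seen: SystemTime::now(),
        });
    }
}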

@teor2345

@arya2 let's schedule a meeting to make the final decisions on what we're doing for this ticket? I think we have enough information and analysis now.

@arya2

arya2 commented Nov 15, 2023

Agreed.

I'm not sure it is, there are easy to implement attacks that evade a per-connection limit. Can you think of one?

Oh. It's going to evict an existing address from the address book to insert the new address, rather than dropping the new address when it has a lower connection state.

Yeah, I can think of one, and that's a problem. An overall limit is a good idea.

@teor2345

Hmm, I think we might be talking about different changes here. #7823 is about rate-limiting how often peers can request addresses from Zebra.

It's good to have a strict per-connection limit before the request is sent to the inbound service, because that stops a single connection overloading the node. But reconnecting will reset that limit, so we also need a less strict overall limit in the inbound service.

Or we could rate-limit getaddr requests per connection, with a constant timer starting when the connection is successfully negotiated? That way we wouldn't need a global rate-limit, because reconnecting would still encounter the same time limit. Hmm, but then you could still do it in parallel across multiple peers.

Maybe we do need a global limit.

@teor2345

Put addresses from the peer's version message and remote IP in the connection cache, don't send them directly to the address book

Currently a peer can fill Zebra's address book by repeatedly connecting and sending a different IP in each version message. This seems like a trivial attack that's worth preventing.

A similar reconnection attack can be used to fill the address book via version messages or remote IPs, and NeverAttemptedAlternate.

@teor2345

teor2345 commented Nov 15, 2023

Hmm, I think we might be talking about different changes here. #7823 is about rate-limiting how often peers can request addresses from Zebra.

It's good to have a strict per-connection limit before the request is sent to the inbound service, because that stops a single connection overloading the node. But reconnecting will reset that limit, so we also need a less strict overall limit in the inbound service.

Or we could rate-limit getaddr requests per connection, with a constant timer starting when the connection is successfully negotiated? That way we wouldn't need a global rate-limit, because reconnecting would still encounter the same time limit. Hmm, but then you could still do it in parallel across multiple peers.

Maybe we do need a global limit.

Here's a much simpler design that still meets the goal of limiting how many of its addresses Zebra gives out:

  • In the inbound service, generate a random getaddr response once, and return a copy of it to every peer that makes a request
  • When the rate-limit time is reached, generate a new random response and return a copy to every peer

This means it doesn't matter how many connections or how many reconnections an adversary makes, they will still get the same set of addresses until we decide to change them. A similar design is used by DNS seeders to avoid revealing all their addresses.
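
A minimal sketch of that design, with placeholder constants and an abstract sampling function; this is not Zebra's inbound service:

use std::net::SocketAddr;
use std::time::{Duration, Instant};

// Placeholder values for the sketch; Zebra would pick its own interval and size.
const REFRESH_INTERVAL: Duration = Duration::from_secs(60 * 60);
const RESPONSE_SIZE: usize = 100;

/// One randomized `getaddr` response, shared by every peer until it expires.
struct CachedAddrResponse {
    addrs: Vec<SocketAddr>,
    generated_at: Instant,
}

impl CachedAddrResponse {
    /// Return the cached response, regenerating it only when the refresh
    /// interval has elapsed. `sample_addresses` stands in for however the
    /// inbound service randomly samples its address book.
    fn respond(
        &mut self,
        sample_addresses: impl Fn(usize) -> Vec<SocketAddr>,
    ) -> Vec<SocketAddr> {
        if self.addrs.is_empty() || self.generated_at.elapsed() >= REFRESH_INTERVAL {
            self.addrs = sample_addresses(RESPONSE_SIZE);
            self.generated_at = Instant::now();
        }

        // Every peer (and every reconnection) gets the same copy in between
        // refreshes, so reconnecting doesn't reveal any extra addresses.
        self.addrs.clone()
    }
}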

@teor2345 teor2345 changed the title security: Limit ability of synthetic nodes to spread throughout network. Credit: Ziggurat Team Tracking: security: Limit ability of synthetic nodes to spread throughout network. Credit: Ziggurat Team Nov 16, 2023
@teor2345
Copy link
Contributor

teor2345 commented Nov 16, 2023

I updated "Always keep the latest unique addresses from each connected peer, regardless of which address message they were sent in" with:

  • required analysis: before making this change, check how often the cache is truncated. If it is only rarely truncated, this change isn't needed.

If we're not dropping a lot of addresses from the cache, there's no point switching to unique addresses, because the address book makes addresses unique anyway.

@mpguerra

Un-scheduling: tracking issues should not be scheduled in a sprint; the issues that they are tracking should be scheduled instead.

@teor2345

I think this change is also more complicated than discussed, because there is still some code using None for these fields:

Remove Option from MetaAddr.{services,untrusted_last_seen}

  • required analysis: before making this change, list every MetaAddrChange that uses None for either of these fields, and decide on acceptable defaults

I've moved these both to a "more analysis needed" section. If we think they're worth doing, let's make optional tickets with needs-investigation tags?

@teor2345

I think we can close this as sufficiently complete.
