Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Direct file transfer is slow #6599

Open
Stebalien opened this issue Aug 21, 2019 · 5 comments
Open

Direct file transfer is slow #6599

Stebalien opened this issue Aug 21, 2019 · 5 comments
Labels
kind/bug A bug in existing code (including security flaws) topic/meta Topic meta

Comments

@Stebalien
Copy link
Member

Stebalien commented Aug 21, 2019

This is a meta-issue explaining why direct file transfer may be slower than tools like SCP and the related issues.

This issue assumes the following setup:

  • Two directly connected nodes.
  • One node has a large file or directory tree.
  • The other node is trying to pin that large file or directory tree.

Explicitly, this issue is not about finding peers that have content (#6383).

There may or may not be other connected nodes that have the content.

Background

IPFS chunks files/directories into smaller pieces and links them together into a merkledag (a generalized merkle-tree). We call these pieces "blocks".

To fetch a file or a directory tree, IPFS currently uses a protocol called bitswap to ask for each block in the file or directory tree. Specifically, IPFS (currently) uses bitswap to tell a subset of our connected peers when we "want" a block. The set of blocks we're looking for is called our "wantlist". If the peer has the block, they send it to us. At the moment, peers that don't have the block stay silent.

Dag Traversal

We have two issues with dag traversal that may limit throughput.

ipfs get

When calling ipfs get to fetch a file, go-ipfs will traverse the merkledag in-order (approximately). We do a bit of pre-fetching but our prefetching algorithm isn't optimal: ipfs/go-ipld-format#53

ipfs pin

When calling ipfs pin add, traverse the graph in parallel with 32 worker goroutines, each fetching one block at a time. Unfortunately, this means we can only request 32 blocks at a time when pinning a file. Given 256KiB blocks, that's a maximum throughput of 8MiB/RTT where RTT is the latency between the two nodes. Assuming an RTT of ~100ms, that's a maximum throughput of 80MiB/s (or ~670mbps). In other words, it won't max out a gigabit connection.

Unfortunately, this is also an optimistic upper-bound. If we're downloading a large directory tree of small files, these blocks will be much smaller (e.g., 1KiB). In that case, our throughput will be more like 2-3mbps.

Bitswap Issues

Bitswap works on a block-by-block basis. That is, a layer on-top-of bitswap will ask bitswap for a block, deserialize the block, determine if that block is some kind of intermediate block, and then ask bitswap for that block's children (if any).

Round Trips

One unfortunate side effect is that traversing a path (/ipfs/QmFoobar/a/b/c/d) requires one round trip per path component. The path resolution logic will ask bitswap for the block identified by the CID (content identifier) QmFoobar, deserialize the block into a directory, lookup "a" within that directory, find the associated content identifier (let's say QmA), and then ask bitswap for QmA. We then repeat this process for b, c, and d.

A more efficient protocol would simply send the entire path to the remote peer and ask for the entire file all at once. We are currently working on such a protocol (https://github.com/ipfs/go-graphsync) but it may be a while until it's feature complete enough to be included in go-ipfs.

Duplicate Blocks

STATUS: We have now fixed the protocol limitations here. We can now ask if a peer has a block and ask peers to tell us immediately if they don't have a block. Now, when we try to download a block, we ask:

  1. A single peer for the block, and ask them to tell us if they don't have it.
  2. Other peers if they have the block.

This means:

  1. If the peer from step 1 doesn't have the block, we can look elsewhere in 1RTT instead of waiting for some timeout.
  2. Because we ask other peers if they have the block in parallel, the second request for a block should almost always succeed.
  3. Because we can ask peers if they have a block, we can avoid downloading the block multiple times.

Previous Text:

When fetching blocks from more than one peer, bitswap will occasionally fetch the same block from multiple peers in case one doesn't respond. We do this because:

  1. Bitswap doesn't currently have any form of acknowledgement. That is, if peer A tells peer B that it "wants" a block, peer B will only respond (by sending the block) if and when it gets the block. It won't tell peer A that it doesn't have the block. This means that we can't currently tell the difference between a peer that's taking a while to respond and a peer that doesn't have the block and will never respond.
  2. When asking a peer for a block, we have no way to know if the peer actually has it. We can guess based on whether or not the peer had a related block but that peer may only have part of the directory tree/file that we're trying to download.

We're currently considering augmenting the protocol with additional messages to solve both of these issues.

Finally, the current logic that decides how many peer from which we should request a block is will over-optimize for asking multiple peers (ipfs/go-bitswap#120). If multiple peers have the block we're looking for, we'll have a 50% duplicate block overhead in the best case scenario.

Wantlist Size Limitations

When asking a single peer for a set of blocks, we currently only ask for at most 32 at a time. We do this for several reasons:

  1. We don't want to force the peer to keep a large "wantlist" in memory.
  2. We want to be able to efficiently parallelize retrieving data from multiple peers without getting too many duplicate blocks. If we send our entire wantlist to a single peer, we'd have to cancel part of that wantlist before sending it to a new, faster peer. We can do this, but we don't have any way to know when the remote peer received the cancel and what blocks may currently be in transit.

We're considering adding a protocol extension (insert link here when we finish the writeup) to better determine the correct wantlist size per peer.

Worker Limitations

Bitswap currently sends out 6 blocks in parallel using 6 workers to avoid consuming too much memory.

We should probably bump this up, possibly scaling with available memory (ipfs/boxo#116).

Libp2p

Another issue is overhead in libp2p itself.

Small Packets

At the moment, go-libp2p doesn't do a good job of combining small writes into larger TCP/IP packets. This can significantly limit the available network throughput.

We have fixes for this issue but had to temporarily revert them as they exposed a bug in secio (libp2p/go-libp2p#644) that would have lead to network interoperability issues. Once the network has upgraded, we'll turn this feature back on.

STATUS: We've tried to revive these fixes but ran into other performance issues in go's scheduler.

Slow Stream Multiplexers

The default go-libp2p stream multiplexer, yamux, is a bit slow. This comes from the fact that it has a fixed send window per stream so bandwidth is pretty limited on a per-stream basis.

However, due to how bitswap works, I'm doubtful that this is an issue:

  1. The default window size is currently 16MiB.
  2. Each message is at most 2MiB.

This means the fixed window shouldn't matter when sending blocks.

It's also unlikely that this preventing us from sending wantlist entries fast enough. On connections with a 100ms latency, we should be able to request ~1,677,720 blocks per second. At 1KiB per block (low bound), that's ~1GiB per second (8x a gigabit link).

UPDATE 1: Yamux is an issue, at least on a 10gbps link.

UPDATE 2:

  1. We have fixed some issues in how windows are managed that should significantly improve the issue.
  2. We have fixed some bugs that could cause yamux connections to die when under heavy load.
@Stebalien Stebalien added kind/bug A bug in existing code (including security flaws) topic/meta Topic meta labels Aug 21, 2019
@aschmahmann
Copy link
Contributor

as they exposed a bug in secio (#6599)

@Stebalien the link in there is for this issue, not the correct issue.

All
Due to the way bitswap works, I'm doubtful that this is slowing bitswap down at all. Specifically,

Looks like Github cut you off there 😦

@Stebalien
Copy link
Member Author

(thanks)

@bam80
Copy link

bam80 commented Feb 27, 2021

Just curious what is the situation for now?

@Stebalien
Copy link
Member Author

We have:

  • Improved the stream multiplexers.
  • Updated the bitswap protocol to allow asking for who has the block. This means:
    • We should generally get blocks in one round-trip.
    • We should get very little duplicate data (we usually ask for blocks from exactly one peer).
    • Worst case, we should get blocks in 2 round-trips (as long as we're connected to a peer with the block).

However, there's still a lot of room for improvement.

@Stebalien
Copy link
Member Author

I've updated the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug A bug in existing code (including security flaws) topic/meta Topic meta
Projects
None yet
Development

No branches or pull requests

3 participants