DHT announcements are not happening when using a large blockstore #9722

Closed
3 tasks done
acejam opened this issue Mar 14, 2023 · 14 comments
Labels: kind/bug (A bug in existing code, including security flaws); kind/test (Testing work); need/analysis (Needs further analysis before proceeding); P1 (High: Likely tackled by core team if no one steps up)

@acejam

acejam commented Mar 14, 2023

Checklist

Installation method

ipfs-update or dist.ipfs.tech

Version

Kubo version: 0.18.1
Repo version: 13
System version: amd64/linux
Golang version: go1.19.1

Config

{
  "API": {
    "HTTPHeaders": {}
  },
  "Addresses": {
    "API": "/ip4/0.0.0.0/tcp/5001",
    "Announce": [
      "/ip4/my.public.ip.address/tcp/4001",
      "/ip4/my.public.ip.address/udp/4001/quic"
    ],
    "AppendAnnounce": [],
    "Gateway": "/ip4/0.0.0.0/tcp/8080",
    "NoAnnounce": [
      "/ip4/10.0.0.0/ipcidr/8",
      "/ip4/100.64.0.0/ipcidr/10",
      "/ip4/169.254.0.0/ipcidr/16",
      "/ip4/172.16.0.0/ipcidr/12",
      "/ip4/192.0.0.0/ipcidr/24",
      "/ip4/192.0.2.0/ipcidr/24",
      "/ip4/192.168.0.0/ipcidr/16",
      "/ip4/198.18.0.0/ipcidr/15",
      "/ip4/198.51.100.0/ipcidr/24",
      "/ip4/203.0.113.0/ipcidr/24",
      "/ip4/240.0.0.0/ipcidr/4",
      "/ip6/100::/ipcidr/64",
      "/ip6/2001:2::/ipcidr/48",
      "/ip6/2001:db8::/ipcidr/32",
      "/ip6/fc00::/ipcidr/7",
      "/ip6/fe80::/ipcidr/10"
    ],
    "Swarm": [
      "/ip4/0.0.0.0/tcp/4001",
      "/ip4/0.0.0.0/udp/4001/quic",
      "/ip4/0.0.0.0/udp/4001/quic-v1",
      "/ip4/0.0.0.0/udp/4001/quic-v1/webtransport"
    ]
  },
  "AutoNAT": {
    "ServiceMode": "disabled"
  },
  "Bootstrap": [
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmNnooDu7bfjPFoTZYxMNLWUQJyrVwtbZg5gBMjTezGAJN",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmQCU2EcMqAqQPR2i9bChDtGNJchTbq5TbXJJ16u19uLTa",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmbLHAnMoJPWSCR5Zhtx6BHJX9KiKNN6tpvbUcqanj75Nb",
    "/dnsaddr/bootstrap.libp2p.io/p2p/QmcZf59bWwK5XFi76CZX8cbJ4BhTzzA3gU1ZjYZcYW3dwt",
    "/ip4/104.131.131.82/tcp/4001/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ",
    "/ip4/104.131.131.82/udp/4001/quic/p2p/QmaCpDMGvV2BGHeYERUEnRQAwe3N8SzbUtfsmvsqQLuvuJ"
  ],
  "DNS": {
    "Resolvers": {}
  },
  "Datastore": {
    "BloomFilterSize": 268435456,
    "GCPeriod": "1h",
    "HashOnRead": false,
    "Spec": {
      "mounts": [
        {
          "child": {
            "path": "blocks",
            "shardFunc": "/repo/flatfs/shard/v1/next-to-last/3",
            "sync": false,
            "type": "flatfs"
          },
          "mountpoint": "/blocks",
          "prefix": "flatfs.datastore",
          "type": "measure"
        },
        {
          "child": {
            "compression": "none",
            "path": "datastore",
            "type": "levelds"
          },
          "mountpoint": "/",
          "prefix": "leveldb.datastore",
          "type": "measure"
        }
      ],
      "type": "mount"
    },
    "StorageGCWatermark": 90,
    "StorageMax": "45TB"
  },
  "Discovery": {
    "MDNS": {
      "Enabled": false
    }
  },
  "Experimental": {
    "AcceleratedDHTClient": true,
    "FilestoreEnabled": false,
    "GraphsyncEnabled": false,
    "Libp2pStreamMounting": false,
    "P2pHttpProxy": false,
    "StrategicProviding": false,
    "UrlstoreEnabled": false
  },
  "Gateway": {
    "APICommands": [],
    "HTTPHeaders": {
      "Access-Control-Allow-Headers": [
        "X-Requested-With",
        "Range",
        "User-Agent"
      ],
      "Access-Control-Allow-Methods": [
        "GET"
      ],
      "Access-Control-Allow-Origin": [
        "*"
      ]
    },
    "NoDNSLink": false,
    "NoFetch": true,
    "PathPrefixes": [],
    "PublicGateways": null,
    "RootRedirect": "",
    "Writable": false
  },
  "Identity": {
    "PeerID": "12D3KooWNorHNz1DghmWGhCPRUVvs45vsdHi5CFnwVjaWo8wKevw"
  },
  "Internal": {
    "Bitswap": {
      "EngineBlockstoreWorkerCount": 2500,
      "EngineTaskWorkerCount": 500,
      "MaxOutstandingBytesPerPeer": 1048576,
      "TaskWorkerCount": 500
    }
  },
  "Ipns": {
    "RecordLifetime": "",
    "RepublishPeriod": "",
    "ResolveCacheSize": 128
  },
  "Migration": {
    "DownloadSources": [],
    "Keep": ""
  },
  "Mounts": {
    "FuseAllowOther": false,
    "IPFS": "/ipfs",
    "IPNS": "/ipns"
  },
  "Peering": {
    "Peers": [
      "list_of_peers_using_internal_dns"
    ]
  },
  "Pinning": {
    "RemoteServices": {}
  },
  "Plugins": {
    "Plugins": null
  },
  "Provider": {
    "Strategy": ""
  },
  "Pubsub": {
    "DisableSigning": false,
    "Router": ""
  },
  "Reprovider": {
    "Interval": "12h",
    "Strategy": "all"
  },
  "Routing": {
    "Routers": null,
    "Methods": null
  },
  "Swarm": {
    "AddrFilters": [
      "/ip4/10.0.0.0/ipcidr/8",
      "/ip4/100.64.0.0/ipcidr/10",
      "/ip4/169.254.0.0/ipcidr/16",
      "/ip4/172.16.0.0/ipcidr/12",
      "/ip4/192.0.0.0/ipcidr/24",
      "/ip4/192.0.2.0/ipcidr/24",
      "/ip4/192.168.0.0/ipcidr/16",
      "/ip4/198.18.0.0/ipcidr/15",
      "/ip4/198.51.100.0/ipcidr/24",
      "/ip4/203.0.113.0/ipcidr/24",
      "/ip4/240.0.0.0/ipcidr/4",
      "/ip6/100::/ipcidr/64",
      "/ip6/2001:2::/ipcidr/48",
      "/ip6/2001:db8::/ipcidr/32",
      "/ip6/fc00::/ipcidr/7",
      "/ip6/fe80::/ipcidr/10"
    ],
    "ConnMgr": {
      "GracePeriod": "20s",
      "HighWater": 2000,
      "LowWater": 1500,
      "Type": "basic"
    },
    "DisableBandwidthMetrics": true,
    "DisableNatPortMap": true,
    "RelayClient": {
      "Enabled": false
    },
    "RelayService": {
      "Enabled": false
    },
    "ResourceMgr": {},
    "Transports": {
      "Multiplexers": {},
      "Network": {},
      "Security": {}
    }
  }
}

Description

I have observed that DHT announcements are not happening when pinning content from a node that has a large blockstore. I'm using the AcceleratedDHTClient on 0.18.1, but previously saw this behavior on 0.17 as well. I'm seeing this across multiple servers, and the pinned content is being stored long term.

ipfs bitswap provide runs for about 5 minutes and does not return any error codes. However, when querying ipfs dht findprovs for any randomly selected CID, no peer IDs are returned. If I manually run ipfs dht provide <cid>, a peer ID is properly returned. However, it is my understanding that this command invokes the non-Accelerated DHT client.

DHT announcement information:

$ ipfs stats provide
TotalProvides:          299M (299,055,054)
AvgProvideDuration:     3µs
LastReprovideDuration:  5m37.700645s
LastReprovideBatchSize: 99M (99,685,018)

Blockstore information:

NumObjects: 99685018
RepoSize:   8.8 TB
StorageMax: 45 TB
RepoPath:   /data
Version:    fs-repo@13
@acejam added the kind/bug and need/triage labels on Mar 14, 2023
@lidel
Member

lidel commented Mar 20, 2023

(reposting comment from Slack) @acejam noted this is still a problem in 0.19.0-rc1

@lidel added the kind/test, need/analysis, and P1 labels and removed the need/triage label on Mar 20, 2023
@lidel moved this to 🥞 Todo in IPFS Shipyard Team on Mar 21, 2023
@acejam
Author

acejam commented Mar 21, 2023

Quick update: We are starting to see more content drop out of the DHT. This appears to be causing some impact to our production services.

@yiannisbot
Member

yiannisbot commented Mar 23, 2023

Not sure if you've discussed this elsewhere, but have you tried (temporarily) disabling the resource manager (or increasing the limits there)? Just to make sure it's not because of this.

@guillaumemichel

guillaumemichel commented Mar 23, 2023

The context is being cancelled by a 5-minute timeout. The timeout value is defined at

var cRouters []*routinghelpers.ParallelRouter
for _, v := range routers {
	cRouters = append(cRouters, &routinghelpers.ParallelRouter{
		Timeout:     5 * time.Minute,
		IgnoreError: true,
		Router:      v.Routing,
	})
}
or
routers = append(routers, &routinghelpers.ParallelRouter{
	Router:       dhtRouting,
	IgnoreError:  false,
	Timeout:      5 * time.Minute, // https://github.com/ipfs/kubo/pull/9475#discussion_r1042501333
	ExecuteAfter: 0,
})

This was probably introduced with #9475


The quick fix is to temporarily increase these Timeout values (e.g. to 10 * time.Hour).

@acejam
Author

acejam commented Mar 23, 2023

@guillaumemichel I can confirm that bumping this value to 10 * time.Hour fixed this issue for us. Thank you for the suggestion. I made this change using version 0.19.

Records are now being properly announced, and the stats output now shows:

TotalProvides:          99M (99,685,018)
AvgProvideDuration:     53µs
LastReprovideDuration:  1h28m18.881264s
LastReprovideBatchSize: 99M (99,685,018)

@guillaumemichel

I don't understand why an arbitrary constant timeout is defined for all Content Routers. IMO this timeout should be content router specific, if it is needed at all. We need to make sure that the fullrt provide operation doesn't time out, or has a large enough timeout value.

@aschmahmann
Contributor

aschmahmann commented Mar 23, 2023

Yeah, I think there are two main issues here:

  1. There's a hard-coded timeout despite there being two different layers of timeouts definable in the config https://github.com/ipfs/kubo/blob/1f5763f7877a59e933fa729afdfbf26d253b0fe0/docs/config.md#routingrouters-parameters
    • If this were not the case, this would have been resolvable with a config change rather than a code change
  2. The default limit for the *Many operations should not be the same as for the individual ones
    • Note: perhaps reasonably the *Many operations are not exposed in the config at all, which is nice for reducing user complexity but means the defaults have to actually be good
    • Or at least the out-of-the-box defaults here should be better (e.g. with ProvideMany having a timeout matching the reprovide interval or the default one if reproviding was turned off)

@BigLep
Contributor

BigLep commented Mar 24, 2023

2023-03-24 maintainer conversation
Plan to include this for 0.19.1: #9754

  • We're not just fixing ProvideMany:
    func (c *Composer) ProvideMany(ctx context.Context, keys []multihash.Multihash) error {
  • @guseggert suggested allowing 0 to mean "no timeout" and then having a default of 0 for all operations
  • @guseggert will take / @lidel review

We think the above will stop the bleeding and be good enough for the foreseeable future.

@guseggert
Contributor

Made the change to the router composers in libp2p/go-libp2p-routing-helpers#72; after merging, the next step is to plumb it into Kubo.

@BigLep moved this from 🥞 Todo to 🏃‍♀️ In Progress in IPFS Shipyard Team on Mar 28, 2023
@guseggert
Contributor

guseggert commented Mar 30, 2023

I've merged a fix into Kubo here: a09c8df

This is hard to reproduce, so would you be able to test it out and see if the issue is fixed for you?

If we can get this validated, we can include it in v0.19.1 which is being released soon. See #9754

@BigLep
Contributor

BigLep commented Mar 31, 2023

@acejam: are you able to confirm an improvement here? I'd like to include this with 0.19.1 on Monday.

@acejam
Author

acejam commented Apr 1, 2023

@BigLep We're testing the merged fix now.

@acejam
Author

acejam commented Apr 1, 2023

@BigLep @guseggert I can confirm that the merged fix appears to work. We have been running 55587d8, and provides using the AcceleratedDHTClient no longer time out after 5 minutes:

/ # ipfs stats provide
TotalProvides:          99M (99,707,542)
AvgProvideDuration:     68µs
LastReprovideDuration:  1h53m31.764363s
LastReprovideBatchSize: 99M (99,707,541)

Thank you!

@BigLep
Contributor

BigLep commented Apr 6, 2023

Thanks @acejam .

Closing since this was released in 0.19.1.
