Warning
This repo was archived because Protocol Labs no longer operates Hydra Boosters for the IPFS network.
For more information, see: https://discuss.ipfs.tech/t/dht-hydra-peers-dialling-down-non-bridging-functionality-on-2022-12-01/15567
This repo contains the Terraform infrastructure that Protocol Labs uses for running Hydras, and operational runbooks and tooling.
You are free to look at and use this code, but Protocol Labs will provide no support for it and will not guarantee backwards compatibility.
Grafana dashboard: https://protocollabs.grafana.net/d/lNGQTv9Zz/hydra-boosters
Hydras can run out of memory if there is some I/O causing backpressure, such as excessive retries or slow connections. In the past we've seen this when the DB gets overloaded and starts slowing down. If the rate of incoming queries from peers is faster than the rate at which they can be completed, goroutines start piling up and eventually the host runs out of memory.
To mitigate quickly, reduce the number of heads per host and add more hosts to make up for it. This will reduce the number of incoming queries on each host.
To find what's consuming all the memory, take a pprof dump (see the section below).
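A minimal sketch of grabbing a heap and goroutine dump over HTTP, assuming the hydra process exposes the standard Go `net/http/pprof` handlers (the port here is a placeholder; check the task definition for the actual address):

```sh
# Inside the container (see the SSH section below).
# Port and paths are assumptions: this only works if the standard
# net/http/pprof handlers are served on this port.
curl -sS "http://localhost:8888/debug/pprof/heap" -o /tmp/heap.pprof
curl -sS "http://localhost:8888/debug/pprof/goroutine?debug=2" -o /tmp/goroutines.txt

# Copy the files out using the presigned-URL workflow described below,
# then inspect the heap dump locally with: go tool pprof heap.pprof
```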
The hydra hosts use ECS container health checks, which periodically cURL the process to check for liveness. If the health check fails a few times in a row, ECS will recycle the container.
This can happen if, for example, the host is running very hot on CPU and is unresponsive to the health check. In that case, reduce the number of heads per host and add hosts to make up for it, which will reduce the number of incoming queries on each host. See the section below on scaling the fleet.
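To confirm that recycling is being triggered by failed health checks, the ECS service events can be inspected with the AWS CLI (a sketch; the service name is a placeholder):

```sh
# List the services in the cluster, then look at the most recent events for one of them.
aws ecs list-services --cluster hydra-test --region us-east-2
aws ecs describe-services --region us-east-2 --cluster hydra-test \
  --services <service_name> --query 'services[0].events[:10]'
```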
The Grafana dashboard has a panel that shows the most recent error messages. You can customize the query to search for other logs (documentation).
Otherwise, the logs are stored in CloudWatch.
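For example, recent logs can be tailed with the AWS CLI (v2); the log group name is a placeholder, check the ECS task definition for the real one:

```sh
# Follow the last hour of logs for the hydra containers.
aws logs tail <log_group_name> --follow --since 1h --region us-east-2
```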
Some AWS-specific metrics are stored in CloudWatch (such as DynamoDB metrics), and some are scraped from the Hydra nodes into Prometheus. Both can be searched using Grafana.
The DynamoDB table has two capacities, read and write. They both have auto-scaling enabled. Throttling can occur if:

- There is a sudden burst in traffic
  - Auto-scaling should raise the limit within a few seconds and throttling should stop
- The auto-scaling upper limit has been reached
  - Increase the upper limit (the current targets can be inspected with the command sketch after this list)
- The DynamoDB upper limit has been reached
  - Open a quota increase request to increase the capacity limit of the table
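A sketch for inspecting the current auto-scaling targets and throttle metrics with the AWS CLI (the table name and time range are placeholders):

```sh
# Current auto-scaling limits (MinCapacity/MaxCapacity) for DynamoDB tables.
aws application-autoscaling describe-scalable-targets \
  --service-namespace dynamodb --region us-east-2

# Throttled requests against the table over a given window.
aws cloudwatch get-metric-statistics --region us-east-2 \
  --namespace AWS/DynamoDB --metric-name ThrottledRequests \
  --dimensions Name=TableName,Value=<table_name> \
  --start-time <start_iso8601> --end-time <end_iso8601> \
  --period 300 --statistics Sum
```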
To scale up the fleet, adjust the Terraform variables.
Then deploy the change (see below for deploying Terraform).
Engage a netops engineer.
Deployment permissions are restricted, so engage a netops engineer.
Open a PR with the change, get it approved, push it, and then apply it with Terraform locally:
```sh
terraform apply
```
To set up your environment for running Terraform:
- Get your IAM user credentials and add them to `~/.aws/credentials` using the `hydra-boosters` profile:

  ```
  [hydra-boosters]
  aws_access_key_id = ...
  aws_secret_access_key = ...
  ```
- Install direnv
- Install asdf
- Install the Terraform plugin with asdf: `asdf plugin add terraform`
- Switch to the directory containing `main.tf` (direnv should prompt you to allow)
- Install Terraform: `asdf install`
- `terraform init` / `terraform plan` / `terraform apply` as usual
ECS can inject an SSM agent into any running container so that you can effectively "SSH" into it.
- Set up your credentials for an IAM user/role that has SSM permissions
- Install AWS CLI
- Install the Session Manager plugin for AWS CLI
- Find the ECS task ID that you want to SSH into, either in the AWS Console (steps below) or with the CLI sketch after this list:
  - Log in to the AWS Console
  - Go to ECS
  - Select the us-east-2 region
  - Select Clusters -> hydra-test
  - Select the Tasks tab
  - The Task ID is the UUID in the first column
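Alternatively, a sketch for listing the task IDs with the AWS CLI:

```sh
# The task ID is the UUID after the last '/' in each task ARN.
aws ecs list-tasks --cluster hydra-test --region us-east-2
```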
Then start an interactive shell in the container:

```sh
export TASK_ID=<task_id> CONTAINER=<hydra|grafana-agent>
aws ecs execute-command --region us-east-2 --task $TASK_ID --cluster hydra-test --container $CONTAINER --command '/bin/bash' --interactive
```
If you need to exfiltrate some data from a container, run the `presigned-url.py <bucket> <key>` command to generate an S3 presigned URL, which can be used to upload a file using cURL:
- Locally with your configured AWS credentials:

  ```sh
  python presigned-url.py <bucket> dropbox/dump.tar.gz
  ```
- Inside the container:

  ```sh
  curl -T <local_file> '<presigned_url>'
  ```
Hydra heads receive broadcasts from peers about content they are providing. These drive the write traffic to DynamoDB.
Hydra heads also receive queries from IPFS peers looking for providers of content. These drive the read traffic to DynamoDB.
If a head receives a query for content that is not cached in DynamoDB, it performs "provider prefetching" to attempt to pre-cache the providers of the content for subsequent queries. The head immediately returns an empty set back to the peer and places the CID into a bounded queue for "provider prefetching". The queue is shared across all heads of a hydra. If the queue is full, then no prefetching is done for the CID, and it is discarded.
A goroutine worker pool processes the queue. A worker pulls a CID off the queue, performs a DHT query to find the providers of the CID, and caches them in DynamoDB. The query is limited by a timeout.
The Hydras aggregate Prometheus metrics, which are scraped by the Grafana Agent sidecar and pushed to Grafana Cloud. The credentials for doing this are stored in Secrets Manager and injected as environment variables by ECS.
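When debugging from inside a hydra container, the raw Prometheus metrics can also be inspected directly; this assumes the metrics endpoint is served on port 8888, which is a placeholder to check against the task definition:

```sh
# From inside the hydra container (see the SSH section above).
curl -sS http://localhost:8888/metrics | head -n 40
```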
The Hydras do not directly publish CloudWatch metrics, but ECS, DynamoDB, etc. do, which we use in our dashboards.
We don't want to use new Peer IDs every time a host restarts or recycles. So we deterministically set each host's peer ID...but this requires having a separate ECS task definition for each host, and so a separate ECS service for each host.
Ideally each host would "lease" offsets (lock them in a DB), which would enable us to autoscale more gracefully, but that hasn't been implemented.
DynamoDB is used to cache provider records. We used to cache these in a hosted PostgreSQL DB, but scaling that was complicated (details). To reduce the operational burden, we switched to DynamoDB, which has the following benefits for hydras:
- Horizontally scalable
- Has TTL built-in, removing the need for running periodic GCs ourselves
- No hidden dragons with lock contention, query performance, connection limits, etc.
A single DHT put translates into a single PutItem request to DynamoDB.
A DHT query translates to >=1 Query requests to DynamoDB. The hydras follow the Query pagination until a pre-configured limit is reached, and then truncate the rest of the providers and return the list. The results are always sorted by TTL, so the most-recently-added providers are at the front of the list, and we truncate the least-recently-added providers. This is an important limit to prevent the hydras from spending too much time paginating queries (which would increase the latency for a query response and also create an availability risk).