
Expand storage for Mainnet fleet #219

Closed
jakubgs opened this issue Jan 3, 2025 · 6 comments

jakubgs commented Jan 3, 2025

We are currently low on storage for EL nodes on the nimbus.mainnet fleet. Storage usage on the /docker volume ranges from 77% up to 99% on some nodes.

We need to:

1. Request an extension of the existing storage from InnovaHosting.
   - Preferably with the same kind, or at least the same size, of NVMe.
2. Back up existing node data either locally or remotely, or re-sync from scratch.
   - If re-syncing is chosen, the BNs will need an additional EL while the sync happens.
3. Re-create the RAID0 array using the HP SmartArray CLI tool.
   - We don't care about data safety since this can all be re-synced.
4. Restore node data backups, or re-sync.

You can see notes on a previous task like this here:


jakubgs commented Jan 3, 2025

Current state:

| Hostname                            | Volume  | Size | Used | Avail | Use% |
|-------------------------------------|---------|------|------|-------|------|
| erigon-01.ih-eu-mda1.nimbus.mainnet | /docker | 3.5T | 2.6T |  736G | 78% |
| erigon-02.ih-eu-mda1.nimbus.mainnet | /docker | 3.5T | 2.5T |  783G | 77% |
| geth-01.ih-eu-mda1.nimbus.mainnet   | /docker | 3.5T | 1.8T |  1.5T | 55% |
| geth-02.ih-eu-mda1.nimbus.mainnet   | /docker | 3.5T | 1.7T |  1.7T | 51% |

I have purged the data of one Geth node each on geth-01 and geth-02, as they were at 99% today and I wanted to buy us some time.
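The figures in the table above can be gathered per host with a quick `df` check. A minimal sketch; /docker only exists on the fleet hosts, so it falls back to `/` when run elsewhere:

```shell
# Report usage of the /docker volume, as in the table above.
# Off the fleet /docker won't exist, so fall back to / for the demo.
vol=/docker
[ -d "$vol" ] || vol=/
usage=$(df -P "$vol" | awk 'NR==2 { sub(/%/, "", $5); print $5 }')
echo "usage of ${vol}: ${usage}%"
# 77% was the lowest usage that triggered this issue.
if [ "$usage" -ge 77 ]; then
  echo "WARNING: ${vol} is low on space"
fi
```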

@markoburcul

Storage for host erigon-02.ih-eu-mda1.nimbus.mainnet after installation of the new 3.8TB drive:

```
% lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda      8:0    0 372.6G  0 disk 
└─sda1   8:1    0 372.6G  0 part /docker
                                 /
sdb      8:16   0   2.9T  0 disk /data
                                 /mnt/sdb
sdc      8:32   0     7T  0 disk /mnt/sdc
                                 /docker
```

The erigon stable node is synchronised, but the unstable one is not yet. It will probably catch up during the day; I am monitoring it.
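Sync state can also be checked straight from the node's JSON-RPC `eth_syncing` answer rather than a dashboard. A sketch: the curl pipeline and port 8545 are assumptions, while the parsing itself runs anywhere:

```shell
# eth_syncing returns false once the node is synced, or a progress object
# while it is still catching up. Classify the response with plain shell.
check_sync() {
  case "$1" in
    *'"result":false'*) echo "synced" ;;
    *)                  echo "still syncing" ;;
  esac
}

# On the host you would feed it live output, e.g. (port 8545 is an assumption):
#   check_sync "$(curl -s -X POST -H 'Content-Type: application/json' \
#     --data '{"jsonrpc":"2.0","method":"eth_syncing","params":[],"id":1}' \
#     http://localhost:8545)"
check_sync '{"jsonrpc":"2.0","id":1,"result":false}'                   # prints: synced
check_sync '{"jsonrpc":"2.0","id":1,"result":{"currentBlock":"0x1a"}}' # prints: still syncing
```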

@markoburcul

Done with geth-01.ih-eu-mda1.nimbus.mainnet host:

```
% lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
sda      8:0    0 372.6G  0 disk 
└─sda1   8:1    0 372.6G  0 part /docker
                                 /
sdb      8:16   0   2.9T  0 disk /mnt/sdb
                                 /data
sdc      8:32   0     7T  0 disk /mnt/sdc
                                 /docker
```

The /data volume served as temporary storage for the node/data/geth/chaindata directory.
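Before the copy it is worth confirming the chaindata actually fits on the temporary volume. A sketch of that check; a throwaway directory stands in for the real paths (/docker/geth-mainnet-stable/node/data/geth/chaindata and /data) so the snippet runs anywhere:

```shell
# Compare the size of the directory to move against the free space on the
# destination volume. Temp paths stand in for the real ones here.
src=$(mktemp -d)
dd if=/dev/zero of="$src/sample.ldb" bs=1024 count=16 2>/dev/null
need_kb=$(du -sk "$src" | awk '{print $1}')
avail_kb=$(df -Pk "$(dirname "$src")" | awk 'NR==2 {print $4}')
echo "need ${need_kb}K, have ${avail_kb}K"
if [ "$avail_kb" -gt "$need_kb" ]; then
  echo "enough space for the copy"
fi
rm -rf "$src"
```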

My procedure was as follows (all as the root user for simplicity):

```shell
# Shut down BNs and ELs
systemctl stop beacon-node-mainnet-stable
systemctl stop beacon-node-mainnet-unstable
docker compose -f /docker/geth-mainnet-stable/docker-compose.yml down
docker compose -f /docker/geth-mainnet-stable/docker-compose.exporter.yml down
docker compose -f /docker/geth-mainnet-unstable/docker-compose.yml down
docker compose -f /docker/geth-mainnet-unstable/docker-compose.exporter.yml down

# Create dirs for temporary data
mkdir -p /data/el_bak
mkdir -p /data/el_chaindata

# Save the folder structure with config files
rsync -av --progress --exclude 'geth-mainnet-stable/node/data/' --exclude 'geth-mainnet-unstable/node/data/' /docker /data/el_bak/

# Save chaindata
rsync -av --progress /docker/geth-mainnet-stable/node/data/geth/chaindata/ /data/el_chaindata/

# See the drives
ssacli ctrl slot=0 pd all show
ssacli ctrl slot=0 ld all show status

# See which processes have files open in the /docker directory
lsof +D /docker

# For me wazuh-agent and rsyslog had to be shut down
systemctl stop wazuh-agent
systemctl stop rsyslog

# In my case logical drive 4 was mounted on /docker
umount /docker
ssacli ctrl slot=0 ld 4 delete

# See the unassigned drives and create a logical drive from them
ssacli ctrl slot=0 pd all show status
ssacli ctrl slot=0 create type=ld raid=0 drives=1I:1:9,1I:1:10

# Check the created drives
ssacli ctrl slot=0 ld all show status
lsblk

# Create a filesystem on the logical volume, in my case device sdc
mkfs.ext4 /dev/sdc

# Make the mount permanent
echo "UUID=$(blkid -s UUID -o value /dev/sdc) /docker ext4 defaults 0 0" | tee -a /etc/fstab

# Mount the device on /docker
mkdir -p /docker
mount -a

# Check the mounted device
lsblk

# Return the folder structure and config files
# (the backup step above copied the /docker directory itself into /data/el_bak)
rsync -av --progress /data/el_bak/docker/ /docker/

# Start and stop the geth container so the data folder is populated
docker compose -f /docker/geth-mainnet-stable/docker-compose.yml up -d
docker compose -f /docker/geth-mainnet-stable/docker-compose.yml down

# Remove the chaindata dir and recreate it with the right ownership
rm -r /docker/geth-mainnet-stable/node/data/geth/chaindata
mkdir /docker/geth-mainnet-stable/node/data/geth/chaindata
chown dockremap:dockremap /docker/geth-mainnet-stable/node/data/geth/chaindata

# Copy back chaindata
rsync -av --progress /data/el_chaindata/ /docker/geth-mainnet-stable/node/data/geth/chaindata

# Finally start the node, exporter and beacon node
docker compose -f /docker/geth-mainnet-stable/docker-compose.yml up -d
docker compose -f /docker/geth-mainnet-stable/docker-compose.exporter.yml up -d
systemctl start beacon-node-mainnet-stable
```

The rsync of chaindata takes roughly an hour. Afterwards I watched the Grafana dashboard to make sure the node is synced.
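One extra safety step before any reboot is validating the /etc/fstab line written above. A sketch checking the entry field by field; the UUID below is a made-up placeholder so the snippet runs anywhere:

```shell
# An fstab entry has six fields: device, mount point, fs type, options,
# dump, pass. The UUID is a placeholder; on the host you would check the
# real line with: grep /docker /etc/fstab
line='UUID=0f3a1b2c-4d5e-6f70-8192-a3b4c5d6e7f8 /docker ext4 defaults 0 0'
out=$(echo "$line" | awk '
  $1 ~ /^UUID=/ && $2 == "/docker" && $3 == "ext4" && NF == 6 {
    print "fstab entry looks sane"
  }')
echo "$out"
```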


jakubgs commented Feb 4, 2025

Very nice notes.

PRO-TIP: rsync supports an `--info=progress2` flag that shows progress for the whole transfer rather than for individual files.

@markoburcul

Erigon keeps both chaindata and snapshots among the data we want to preserve, so almost all of the commands above stay the same, except for the creation of el_chaindata and el_snapshots, where we back up those directories separately.

```shell
mkdir -p /data/el_bak
mkdir -p /data/el_chaindata
mkdir -p /data/el_snapshots

# Stop one erigon container and the beacon node attached to it
docker compose -f /docker/erigon-mainnet-stable/docker-compose.yml down
systemctl stop beacon-node-mainnet-stable.service

# Rsync the folder structure and configs (data directories excluded)
rsync -av --info=progress2 --exclude 'erigon-mainnet-stable/data' --exclude 'erigon-mainnet-unstable/data' /docker/ /data/el_bak

# Separately save chaindata and snapshots
rsync -av --info=progress2 /docker/erigon-mainnet-stable/data/chaindata/ /data/el_chaindata/
rsync -av --info=progress2 /docker/erigon-mainnet-stable/data/snapshots/ /data/el_snapshots/
```

Afterwards the steps are similar to the ones above, plus returning the data to the right place and cleaning up.
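Before deleting the logical drive it is worth sanity-checking that the backups are complete. A sketch comparing file counts between source and backup; temp dirs with made-up file names stand in for the real chaindata and el_chaindata paths:

```shell
# Compare file counts between a source directory and its backup before
# destroying the source. Temp dirs stand in for
# /docker/erigon-mainnet-stable/data/chaindata and /data/el_chaindata.
src=$(mktemp -d) && bak=$(mktemp -d)
printf 'x' > "$src/000001.dat"
printf 'y' > "$src/000002.dat"
cp -a "$src/." "$bak/"
src_count=$(find "$src" -type f | wc -l | tr -d ' ')
bak_count=$(find "$bak" -type f | wc -l | tr -d ' ')
if [ "$src_count" -eq "$bak_count" ]; then
  echo "file counts match: $src_count"
fi
rm -rf "$src" "$bak"
```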

@markoburcul

The Erigon mainnet nodes are fully synced now, and with this the storage upgrade is done. I will upload an example script to infra-nimbus which can serve as a reference for future disk upgrades.
