2021 Unidata Equipment Grant Omnibus #277

Open
5 tasks
akrherz opened this issue Jul 7, 2021 · 11 comments

akrherz commented Jul 7, 2021

The IEM is 🙏 to have received an equipment grant from Unidata. The grant is purchasing a Dell R7525 with NVMe drives, whose capacity should let me rule the world :) The setup of the server will attempt to follow what Let's Encrypt did.

The proposal outlined a number of deliverables, so this issue is an omnibus tracking these items and more.

  • Remove or greatly relax the software throttles on the METAR, SHEF downloads.
  • Create more web services to expose the SHEF archives.
  • Add code support for various web services within siphon.
  • Provide CONUS scale aggregates of climodat/COOP archives.

For my reporting benefit, here is a timeline of how things have progressed thus far.

  • 3 May 2021 - Unidata announced funding of proposal.
  • 30 Jun 2021 - UCAR/ISU signed off on grant paperwork.
  • 6 Jul 2021 - Workday tag created and ready for spending.
  • 6 Jul 2021 - Dell purchase order submitted.

So this issue will collect random items to help with my eventual delivery of a:

  • blog post to Unidata.

akrherz self-assigned this Jul 7, 2021

akrherz commented Jul 20, 2021

Server was delivered late on 19 July 2021 and set up in its final resting place on the morning of 20 July 2021. RHEL 8.4 was installed on the 500 GB root RAID1 SSD. The first decision point is whether or not to run kernel-ml to support my legacy InfiniBand network. Let's try it and run a test to see if it is fast enough to matter. Moving a 5 GB empty file via scp:

| network | rate | time |
|---|---|---|
| ipoib | 327 MB/s | 15s |
| iem1 | 111 MB/s | 46s |
| iem0 | 112 MB/s | 46s |

  • gonna stick with kernel-ml as the infiniband network is likely useful to keep around.

akrherz commented Jul 20, 2021

Just to establish some crude baselines

mkfs.xfs /dev/nvme0n1
mount /dev/nvme0n1 /mnt/test
# time dd if=/dev/zero of=test.iso count=10M
10485760+0 records in
10485760+0 records out
5368709120 bytes (5.4 GB, 5.0 GiB) copied, 8.39105 s, 640 MB/s

real	0m8.392s
user	0m1.019s
sys	0m7.345s
hdparm -Tt /dev/nvme0n1

/dev/nvme0n1:
 Timing cached reads:   21058 MB in  2.00 seconds = 10544.23 MB/sec
 Timing buffered disk reads: 10270 MB in  3.00 seconds = 3423.18 MB/sec
  • Let's install PostgreSQL 13 to start benchmarking with it
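
One caveat on the `dd` baseline above: with `count=10M` and no `bs=`, `dd` uses its default 512-byte block size, so most of the 8.4 s is system-call time (`sys 7.345s`) and the 640 MB/s figure may understate what the NVMe device can do. A minimal sketch of a rerun with larger blocks follows; the `write_test` helper and the 1 MiB block size are my assumptions, not something run on this server.

```shell
#!/bin/sh
# Bytes written by dd = count * block size (dd defaults to 512-byte blocks).
dd_total_bytes() {
    # $1 = count, $2 = block size in bytes
    echo $(( $1 * $2 ))
}

# Hypothetical rerun helper: write $2 blocks of 1 MiB to file $1 and flush
# the data to disk before dd reports its timing (conv=fsync).
write_test() {
    dd if=/dev/zero of="$1" bs=1M count="$2" conv=fsync
}

# Size of the count=10M run shown above.
dd_total_bytes $((10 * 1024 * 1024)) 512

# Example (not executed here): write_test /mnt/test/test.iso 5120
```

Writing zeros is still a crude test, since a compressing filesystem can collapse them, but it is comparable to the numbers already collected.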

akrherz commented Jul 20, 2021

Discussion with local ZFS expert SN

  • You can probably set the logbias to throughput
  • primarycache=metadata, since postgres maintains a cache
  • xattr=sa (and probably acltype=posixacl), atime=off and then decide if you even want relatime
  • then for compression the default lz4 will probably serve you perfectly well
  • blog about perf
  • adjust your ashift accordingly
  • directio support
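
The property suggestions above could be applied roughly as sketched below. This is a dry run by default and only prints the commands; the pool name `tank` and the `run` and `apply_suggestions` helpers are assumptions for illustration. Note that `ashift` is not a dataset property and has to be chosen when the pool or vdev is created, so it is left out here.

```shell
#!/bin/sh
# Dry run by default: print each command; set APPLY=1 to really execute it.
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

# POOL is an assumption; substitute the real pool or dataset name.
POOL="${POOL:-tank}"

apply_suggestions() {
    run zfs set logbias=throughput "$POOL"
    run zfs set primarycache=metadata "$POOL"   # postgres keeps its own cache
    run zfs set xattr=sa "$POOL"
    run zfs set acltype=posixacl "$POOL"
    run zfs set atime=off "$POOL"
    run zfs set compression=lz4 "$POOL"
}

apply_suggestions
```

Each of these is a suggestion to benchmark, not a settled choice.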

akrherz commented Jul 22, 2021

Yesterday was spent doing lots of ZFS tests. It seems like we can make it work out. Some more systematic tests today.

The pgbench commands below do not use ideal connection and thread settings, but I just want a baseline.

/usr/pgsql-13/bin/pgbench -i -s 50 example
/usr/pgsql-13/bin/pgbench -c 10 -j 8 -t 100000 example

Note that the baseline tests for the current machines running postgresql in the IEM cluster were not isolated, but under routine load. The afos q1 and afos q2 tests were some basic queries that I want the database to perform well for.

| zpool | label | pgb load [s] | pgb latency [ms] | pgb [tps] | afos q1 [s] | afos q2 [s] |
|---|---|---|---|---|---|---|
| N/A | metvm4 (v12) | 9.44 | 2.027 | 4,934 | - | - |
| N/A | metvm6 (v12) | 6.81 | 0.810 | 12,349 | 0.003 | 85.707 |
| N/A | metvm1 (v12) | 12.37 | 1.657 | 6,036 | - | - |
| N/A | metvm7 (v13) | 15.40 | 2.626 | 3,808 | - | - |
| N/A | laptop (v13) | 7.53 | 8.211 | 1,127 | - | - |
| N/A | IRIS RHEV NVMe (v13) | - | 9.212 | 1,085 | - | - |
| N/A | IRIS RHEV SSD (v13) | - | 12.768 | 783 | - | - |
| N/A | XFS single | 5.80 | 0.514 | 19,471 | - | - |
| N/A | XFS raid10 | 5.55 | 0.551 | 18,140 | - | - |
| raidz2 | lz4_128K | 7.75 | 0.702 | 14,246 | 0.010 | 150.197 |
| raidz2 | lz4_8K | 9.52 | 0.704 | 14,209 | 0.013 | 352.795 |
| raidz2 | off_128K | 7.16 | 0.677 | 14,776 | 0.011 | 96.898 |
| raidz2 | off_128K_metadata | 24.71 | 0.772 | 12,948 | 0.012 | (gave up) |
| zmirror | off_128K | 6.37 | 0.566 | 17,663 | 0.003 | 79.678 |
| zmirror | lz4_128K | 6.31 | 0.595 | 16,815 | 0.005 | 135.83 |
| zmirror | lz4_64K | 6.46 | 0.577 | 17,339 | 0.004 | 143.89 |
| zmirror | lz4_32K | 6.82 | 0.585 | 17,093 | 0.005 | 181.32 |
| zmirror | off_8K | 7.94 | 0.634 | 15,780 | 0.004 | 268.01 |
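
As a sanity check on the table, the latency and tps columns are tied together: with `-c 10` clients, each client finishes about `1000 / latency_ms` transactions per second. A small sketch of that arithmetic follows; the `tps_from_latency` helper is hypothetical, and it assumes every row was run with the same 10 clients.

```shell
#!/bin/sh
# Approximate tps from pgbench's client count and average latency in ms.
tps_from_latency() {
    # $1 = number of clients (-c), $2 = average latency in ms
    awk -v c="$1" -v ms="$2" 'BEGIN { printf "%.0f\n", c * 1000 / ms }'
}

tps_from_latency 10 0.577   # zmirror lz4_64K row; the table reports 17,339
tps_from_latency 10 0.810   # metvm6 row; the table reports 12,349
```

The small differences from the reported values come from the latency being rounded to three decimals.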

I probably need to start moving this process along and get back to other work, so we are drawing lines in the sand with these decisions:

  • ZFS is a viable option with performance as good as XFS out of the box. Yes, a tuned XFS + RAID10 may perform better, but compression is a requirement to make this project work.
  • zmirror should perform better for the most common workloads vs raidz2 and offer redundancy. I am not concerned with the drop in available storage space with this choice.
  • recordsize=64K seems to be a decent middle ground between throughput and tps. Will next try to tune postgresql against that.

In general, I am not attempting to squeeze 5-10% of performance out of this setup, but just get something that does not go up in flames under load. It would also be good to continue to move the needle to the right as additional choices are made, like cache settings.
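
For the record, the layout decided above could be created along these lines. This is a dry-run sketch that only prints commands: the four device names are placeholders (the real drive count and names are not stated here), and the `run` and `create_pool_and_dataset` helpers are hypothetical. The pool name `tank` and dataset name `tank/pg13data_lz4_64K` are the ones used on this server.

```shell
#!/bin/sh
# Dry run by default: print each command; set APPLY=1 to really execute it.
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "+ $*"; fi
}

# $1..$4 = block devices, paired into two mirror vdevs (striped mirrors).
create_pool_and_dataset() {
    run zpool create tank mirror "$1" "$2" mirror "$3" "$4"
    run zfs create -o recordsize=64K -o compression=lz4 tank/pg13data_lz4_64K
}

# Placeholder device names, for illustration only.
create_pool_and_dataset /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1
```

Mirrors give up half the raw capacity, which matches the trade-off accepted above.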

akrherz commented Jul 22, 2021

We have standardized on recordsize=64K and compression=lz4. So now we iterate and rerun the tests above. Note that these are one-shot runs, so some of these are noisy due to warm caches, etc.

| change | pgb load [s] | pgb latency [ms] | pgb [tps] | afos q1 [s] | afos q2 [s] | OK? |
|---|---|---|---|---|---|---|
| baseline | 6.46 | 0.577 | 17,339 | 0.004 | 143.89 | |
| add zWAL recordsize=8K,compression=off | 7.02 | 0.607 | 16,470 | 0.010 | 147.91 | |
| set zWAL recordsize=64K | 7.28 | 0.605 | 16,541 | 0.001 | 142.46 | 0️⃣ |
| set zWAL recordsize=8K, set PG full_page_writes=off | 7.40 | 0.541 | 18,469 | 0.009 | 147.57 | 👍 |
| set PG shared_buffers=16G | 7.12 | 0.543 | 18,428 | 0.010 | 168.73 | 👎 |
| set PG shared_buffers=2G | 7.01 | 0.547 | 18,271 | 0.010 | 150.19 | 0️⃣ |
| set PG shared_buffers=4G | 7.10 | 0.548 | 18,236 | 0.009 | 149.24 | moving on |
| set PG max_parallel_workers_per_gather=16 et al | 7.05 | 0.544 | 18,374 | 0.009 | 57.85 | 👍 |
| set PG fsync=off for funzies | 6.59 | 0.426 | 23,479 | 0.010 | 55.14 | reverting |
| set PG random_page_cost=0.4 | 7.03 | 0.553 | 18,090 | 0.010 | 57.64 | reverting for now |
| set zfs logbias=throughput | 7.12 | 0.580 | 17,238 | 0.010 | 56.87 | reverting |
| set zfs relatime=on | 6.89 | 0.557 | 17,953 | 0.002 | 58.31 | 👍 |
| set zWAL primarycache=metadata | 7.33 | 0.548 | 18,239 | 0.002 | 58.88 | 👍 |
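
Pulling the PostgreSQL side of these runs together, the settings that were not reverted would look something like the fragment printed below. This is a sketch under assumptions: I am taking the 👍 rows plus the last `shared_buffers` value as the keepers, writing `4G` in PostgreSQL's `4GB` unit syntax, and the "et al" parallel settings are not listed because their values are not recorded here. The `pg_conf_fragment` helper is hypothetical.

```shell
#!/bin/sh
# Print a postgresql.conf fragment with the settings assumed to be kept.
pg_conf_fragment() {
    cat <<'EOF'
# Assumed keepers from the tuning runs; review before use.
full_page_writes = off
shared_buffers = 4GB
max_parallel_workers_per_gather = 16
EOF
}

pg_conf_fragment
```

Changing `shared_buffers` needs a server restart, so these are best applied together.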

So I am not getting much of anywhere at the moment. Stepping back for a second, perhaps now is a good time to move the goalposts and run a more relevant pgbench setup:

$ /usr/pgsql-13/bin/pgbench -S -M prepared -t 100000 -c 32 -j 32 example
starting vacuum...end.
transaction type: <builtin: select only>
scaling factor: 50
query mode: prepared
number of clients: 32
number of threads: 32
number of transactions per client: 100000
number of transactions actually processed: 3200000/3200000
latency average = 0.066 ms
tps = 481409.762442 (including connections establishing)
tps = 481914.500144 (excluding connections establishing)
$ /usr/pgsql-13/bin/pgbench -M prepared -t 100000 -c 32 -j 32 example
starting vacuum...end.
transaction type: <builtin: TPC-B (sort of)>
scaling factor: 50
query mode: prepared
number of clients: 32
number of threads: 32
number of transactions per client: 100000
number of transactions actually processed: 3200000/3200000
latency average = 0.700 ms
tps = 45715.978777 (including connections establishing)
tps = 45720.701627 (excluding connections establishing)
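
Since these runs get repeated after every change, a tiny filter to pull the headline number out of the pgbench report can save some squinting. A minimal sketch, assuming the PostgreSQL 13 output format shown above; `extract_tps` is a hypothetical helper name.

```shell
#!/bin/sh
# Read pgbench output on stdin and print the tps value that excludes
# connection establishment time.
extract_tps() {
    awk '/^tps = .*excluding connections/ { print $3 }'
}

# Usage with two lines copied from the TPC-B run above.
printf '%s\n' \
    'tps = 45715.978777 (including connections establishing)' \
    'tps = 45720.701627 (excluding connections establishing)' | extract_tps
```

In practice it would be used as `pgbench ... example | extract_tps`.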

zfs get all tank/pg13wal

```
NAME          PROPERTY              VALUE                          SOURCE
tank/pg13wal  type                  filesystem                     -
tank/pg13wal  creation              Thu Jul 22 12:12 2021          -
tank/pg13wal  used                  1.01G                          -
tank/pg13wal  available             2.17T                          -
tank/pg13wal  referenced            1.01G                          -
tank/pg13wal  compressratio         1.00x                          -
tank/pg13wal  mounted               yes                            -
tank/pg13wal  quota                 none                           default
tank/pg13wal  reservation           none                           default
tank/pg13wal  recordsize            8K                             local
tank/pg13wal  mountpoint            /var/lib/pgsql/13/data/pg_wal  local
tank/pg13wal  sharenfs              off                            default
tank/pg13wal  checksum              on                             default
tank/pg13wal  compression           off                            local
tank/pg13wal  atime                 off                            inherited from tank
tank/pg13wal  devices               on                             default
tank/pg13wal  exec                  on                             default
tank/pg13wal  setuid                on                             default
tank/pg13wal  readonly              off                            default
tank/pg13wal  zoned                 off                            default
tank/pg13wal  snapdir               hidden                         default
tank/pg13wal  aclmode               discard                        default
tank/pg13wal  aclinherit            restricted                     default
tank/pg13wal  createtxg             3724                           -
tank/pg13wal  canmount              on                             default
tank/pg13wal  xattr                 sa                             inherited from tank
tank/pg13wal  copies                1                              default
tank/pg13wal  version               5                              -
tank/pg13wal  utf8only              off                            -
tank/pg13wal  normalization         none                           -
tank/pg13wal  casesensitivity       sensitive                      -
tank/pg13wal  vscan                 off                            default
tank/pg13wal  nbmand                off                            default
tank/pg13wal  sharesmb              off                            default
tank/pg13wal  refquota              none                           default
tank/pg13wal  refreservation        none                           default
tank/pg13wal  guid                  1024111154508093058            -
tank/pg13wal  primarycache          metadata                       local
tank/pg13wal  secondarycache        all                            default
tank/pg13wal  usedbysnapshots       0B                             -
tank/pg13wal  usedbydataset         1.01G                          -
tank/pg13wal  usedbychildren        0B                             -
tank/pg13wal  usedbyrefreservation  0B                             -
tank/pg13wal  logbias               latency                        local
tank/pg13wal  objsetid              5830                           -
tank/pg13wal  dedup                 off                            default
tank/pg13wal  mlslabel              none                           default
tank/pg13wal  sync                  standard                       default
tank/pg13wal  dnodesize             legacy                         default
tank/pg13wal  refcompressratio      1.00x                          -
tank/pg13wal  written               1.01G                          -
tank/pg13wal  logicalused           1.01G                          -
tank/pg13wal  logicalreferenced     1.01G                          -
tank/pg13wal  volmode               default                        default
tank/pg13wal  filesystem_limit      none                           default
tank/pg13wal  snapshot_limit        none                           default
tank/pg13wal  filesystem_count      none                           default
tank/pg13wal  snapshot_count        none                           default
tank/pg13wal  snapdev               hidden                         default
tank/pg13wal  acltype               off                            default
tank/pg13wal  context               none                           default
tank/pg13wal  fscontext             none                           default
tank/pg13wal  defcontext            none                           default
tank/pg13wal  rootcontext           none                           default
tank/pg13wal  relatime              on                             inherited from tank
tank/pg13wal  redundant_metadata    all                            default
tank/pg13wal  overlay               on                             default
tank/pg13wal  encryption            off                            default
tank/pg13wal  keylocation           none                           default
tank/pg13wal  keyformat             none                           default
tank/pg13wal  pbkdf2iters           0                              default
tank/pg13wal  special_small_blocks  0                              default
```

zfs get all tank/pg13data_lz4_64K

```
NAME                   PROPERTY              VALUE                  SOURCE
tank/pg13data_lz4_64K  type                  filesystem             -
tank/pg13data_lz4_64K  creation              Thu Jul 22 10:28 2021  -
tank/pg13data_lz4_64K  used                  367G                   -
tank/pg13data_lz4_64K  available             2.17T                  -
tank/pg13data_lz4_64K  referenced            367G                   -
tank/pg13data_lz4_64K  compressratio         1.35x                  -
tank/pg13data_lz4_64K  mounted               yes                    -
tank/pg13data_lz4_64K  quota                 none                   default
tank/pg13data_lz4_64K  reservation           none                   default
tank/pg13data_lz4_64K  recordsize            64K                    local
tank/pg13data_lz4_64K  mountpoint            /var/lib/pgsql/13      local
tank/pg13data_lz4_64K  sharenfs              off                    default
tank/pg13data_lz4_64K  checksum              on                     default
tank/pg13data_lz4_64K  compression           lz4                    local
tank/pg13data_lz4_64K  atime                 off                    inherited from tank
tank/pg13data_lz4_64K  devices               on                     default
tank/pg13data_lz4_64K  exec                  on                     default
tank/pg13data_lz4_64K  setuid                on                     default
tank/pg13data_lz4_64K  readonly              off                    default
tank/pg13data_lz4_64K  zoned                 off                    default
tank/pg13data_lz4_64K  snapdir               hidden                 default
tank/pg13data_lz4_64K  aclmode               discard                default
tank/pg13data_lz4_64K  aclinherit            restricted             default
tank/pg13data_lz4_64K  createtxg             1427                   -
tank/pg13data_lz4_64K  canmount              on                     default
tank/pg13data_lz4_64K  xattr                 sa                     inherited from tank
tank/pg13data_lz4_64K  copies                1                      default
tank/pg13data_lz4_64K  version               5                      -
tank/pg13data_lz4_64K  utf8only              off                    -
tank/pg13data_lz4_64K  normalization         none                   -
tank/pg13data_lz4_64K  casesensitivity       sensitive              -
tank/pg13data_lz4_64K  vscan                 off                    default
tank/pg13data_lz4_64K  nbmand                off                    default
tank/pg13data_lz4_64K  sharesmb              off                    default
tank/pg13data_lz4_64K  refquota              none                   default
tank/pg13data_lz4_64K  refreservation        none                   default
tank/pg13data_lz4_64K  guid                  4354124984647248473    -
tank/pg13data_lz4_64K  primarycache          all                    default
tank/pg13data_lz4_64K  secondarycache        all                    default
tank/pg13data_lz4_64K  usedbysnapshots       0B                     -
tank/pg13data_lz4_64K  usedbydataset         367G                   -
tank/pg13data_lz4_64K  usedbychildren        0B                     -
tank/pg13data_lz4_64K  usedbyrefreservation  0B                     -
tank/pg13data_lz4_64K  logbias               latency                local
tank/pg13data_lz4_64K  objsetid              5533                   -
tank/pg13data_lz4_64K  dedup                 off                    default
tank/pg13data_lz4_64K  mlslabel              none                   default
tank/pg13data_lz4_64K  sync                  standard               default
tank/pg13data_lz4_64K  dnodesize             legacy                 default
tank/pg13data_lz4_64K  refcompressratio      1.35x                  -
tank/pg13data_lz4_64K  written               367G                   -
tank/pg13data_lz4_64K  logicalused           497G                   -
tank/pg13data_lz4_64K  logicalreferenced     497G                   -
tank/pg13data_lz4_64K  volmode               default                default
tank/pg13data_lz4_64K  filesystem_limit      none                   default
tank/pg13data_lz4_64K  snapshot_limit        none                   default
tank/pg13data_lz4_64K  filesystem_count      none                   default
tank/pg13data_lz4_64K  snapshot_count        none                   default
tank/pg13data_lz4_64K  snapdev               hidden                 default
tank/pg13data_lz4_64K  acltype               off                    default
tank/pg13data_lz4_64K  context               none                   default
tank/pg13data_lz4_64K  fscontext             none                   default
tank/pg13data_lz4_64K  defcontext            none                   default
tank/pg13data_lz4_64K  rootcontext           none                   default
tank/pg13data_lz4_64K  relatime              on                     inherited from tank
tank/pg13data_lz4_64K  redundant_metadata    all                    default
tank/pg13data_lz4_64K  overlay               on                     default
tank/pg13data_lz4_64K  encryption            off                    default
tank/pg13data_lz4_64K  keylocation           none                   default
tank/pg13data_lz4_64K  keyformat             none                   default
tank/pg13data_lz4_64K  pbkdf2iters           0                      default
tank/pg13data_lz4_64K  special_small_blocks  0                      default
```
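
Most of the `zfs get all` listings above are defaults; the interesting rows are the ones whose SOURCE is `local`. A minimal sketch of a filter that keeps only those follows. The `local_props` helper is hypothetical and works on saved output; `zfs get -s local all <dataset>` should give a similar view directly.

```shell
#!/bin/sh
# Read `zfs get all` output on stdin; print property=value for rows whose
# last column (SOURCE) is exactly "local".
local_props() {
    awk '$NF == "local" { print $2 "=" $3 }'
}

# Usage with a few rows copied from the tank/pg13wal listing.
local_props <<'EOF'
tank/pg13wal creation Thu Jul 22 12:12 2021 -
tank/pg13wal recordsize 8K local
tank/pg13wal compression off local
tank/pg13wal atime off inherited from tank
tank/pg13wal primarycache metadata local
EOF
```

Rows inherited from `tank` are skipped on purpose, since their last field is the parent name, not `local`.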

akrherz commented Jul 22, 2021

After a review by a colleague, two changes were made:

  1. Ran zfs set relatime=off tank
  2. Gave 200 GB to the ZFS ARC.

The most recent benchmark numbers did not change.
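
How the ARC size was set is not recorded here. On Linux, OpenZFS takes the ceiling in bytes through the `zfs_arc_max` module parameter, so one possible way is sketched below. It assumes "200 GB" means 200 GiB, and the `arc_max_bytes` and `arc_modprobe_line` helpers are hypothetical.

```shell
#!/bin/sh
# Convert a size in GiB to bytes for zfs_arc_max.
arc_max_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 ))
}

# Print a line suitable for a modprobe config file such as
# /etc/modprobe.d/zfs.conf (applies when the zfs module next loads).
arc_modprobe_line() {
    echo "options zfs zfs_arc_max=$(arc_max_bytes "$1")"
}

arc_modprobe_line 200
```

The same byte value can also be written to `/sys/module/zfs/parameters/zfs_arc_max` to change a running system.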

akrherz commented Jul 22, 2021

🚀 coop production database now running on this host.

akrherz commented Jul 23, 2021

🚀 hads production database is now on the new hardware and the compression savings are glorious. 2.1TB -> 600 GB
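
To put a number on those savings, here is the ratio arithmetic as a small sketch. It takes 1 TB as 1000 GB, and the `ratio` helper is hypothetical. The second call recomputes the compressratio of `tank/pg13data_lz4_64K` from its logicalused and used values reported earlier in this issue.

```shell
#!/bin/sh
# Ratio of uncompressed size to on-disk size, to two decimals.
ratio() {
    # $1 = logical (uncompressed) size, $2 = size on disk, same units
    awk -v a="$1" -v b="$2" 'BEGIN { printf "%.2f\n", a / b }'
}

ratio 2100 600   # hads: 2.1 TB -> 600 GB
ratio 497 367    # tank/pg13data_lz4_64K: logicalused / used
```

Part of the hads reduction may also come from the fresh load removing table bloat, so this is an upper bound on what lz4 alone contributed.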

akrherz commented Aug 9, 2021

Time passes and some depression sets in. I am sort of in no-man's land awaiting postgresql 14 to drop, wanting to rearrange the database ducks to align performance to the databases that are mentioned in the proposal, and adding the new services. The new server is performing great and without known issues, so that's good. I just don't get the warm fuzzies of being able to conquer the world with this thing.

akrherz commented Sep 8, 2021

Since I am conveniently lazy, I am going to drag my feet a bit longer and await the PostgreSQL 14 release due by the end of September.

akrherz commented Sep 16, 2021

PostgreSQL 14 is scheduled to be released on Sept 30.
