Host Monitoring Station (HMS) is my home-brew standalone monitoring system. It collects system metrics and displays them as graphs on a web page. HMS can ONLY monitor the metrics of the local host; it is NOT a distributed monitoring system. I built it to give myself a simple and easy way to check system performance on the several servers in my home.
HMS is a very lightweight monitoring system and can run with minimal configuration effort.
HMS uses RRDtool for local time-series database (TSDB) storage and the Flask WSGI framework for the front-end web application.
HMS consists of the following components:
- RRD Databases Bootstrap Utility
- System Metrics Poller
- HMS Web Application
The RRD Databases Bootstrap Utility helps users bootstrap the RRD database schemas.
The System Metrics Poller collects system metrics and writes the values to the local RRDtool TSDB.
The HMS Web Application is the front-end web application that displays the RRD graphs. Users can run it with any WSGI server. I ship a uWSGI configuration file that can be used directly if users would like to use uWSGI as the WSGI server.
All source code is located under the src directory. Please DO NOT change any filename or subdirectory name.
├── hms
│   ├── arp.py
│   ├── cpu.py
│   ├── disk.py
│   ├── graph.py
│   ├── __init__.py
│   ├── memory.py
│   ├── network.py
│   ├── os.py
│   ├── tcp.py
│   ├── udp.py
│   └── utils.py
├── hms_bootstrap_rrd.py
├── hms_metrics_poller.py
├── hms_web.py
├── hms_web_uwsgi.ini
├── static
│   ├── config
│   │   └── hms.yaml
│   └── rrd_graph
│       └── placeholder
└── templates
    └── hms.html
The hms directory is the core module package of HMS. This package includes all the functions and classes needed to collect metrics and generate RRD graphs.
hms_bootstrap_rrd.py is the RRD Databases Bootstrap Utility.
hms_metrics_poller.py is the System Metrics Poller.
hms_web.py is the HMS Web Application.
hms_web_uwsgi.ini is a uWSGI configuration file that can be used to run the HMS web application directly.
The static directory stores the HMS configuration file and the generated RRD graphs.
The templates directory holds the template used to render the HMS web page.
HMS is written in Python 3.
The following Python packages are needed:
flask
importlib.util (part of the Python standard library)
markupsafe
rrdtool
uWSGI + python3 plugin [optional]
yaml (provided by the PyYAML package)
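Before running HMS, it can be handy to confirm that the modules listed above are importable. The following is a minimal sketch, not part of HMS: the helper names `missing_modules` and `report` are made up for illustration, and uWSGI is left out because it is optional and runs as a separate server.

```python
import importlib.util

# Import names taken from the package list above.
REQUIRED_MODULES = ["flask", "markupsafe", "rrdtool", "yaml"]


def missing_modules(names):
    """Return the module names that cannot be found by this interpreter."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def report(names=REQUIRED_MODULES):
    """Build a one-line, human-readable summary of the check."""
    missing = missing_modules(names)
    if missing:
        return "missing modules: " + ", ".join(missing)
    return "all required modules found"
```

Calling `print(report())` with the Python interpreter that will run HMS shows which modules, if any, still need to be installed.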
To keep installation and configuration easy, I did not publish HMS as a separate installable package. Users can clone the whole repository and set a few parameters to start running HMS. All commands should be run under the src directory.
Please follow the instructions below to set up and run HMS:
- Clone the whole repository in a directory.
$ git clone https://github.com/meow-watermelon/host-monitoring-station.git
- Configure the HMS configuration file src/static/config/hms.yaml. In this file, please set the RRD_DB_PATH variable to the directory where the RRD databases should be saved. Please ignore the other variables for now, as they may be used in future versions.
- Bootstrap the RRD databases. Please use the hms_bootstrap_rrd.py utility to bootstrap the RRD databases. Usage:
$ ./hms_bootstrap_rrd.py -h
usage: hms_bootstrap_rrd.py [-h] --dir DIR [--step STEP] [--component COMPONENT]
Host Monitoring Station RRD Database Bootstrap Tool
options:
-h, --help show this help message and exit
--dir DIR RRD database directory
--step STEP RRD database step (default: 1m)
--component COMPONENT
Components to be bootstrapped (default: os,cpu,memory,disk,network,tcp,udp,arp)
The default RRD database step is 1 minute, which is the recommended value for HMS. Please do not change it unless you know what you are doing. Collecting and writing metrics every minute is reasonable for a local monitoring system.
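To make the role of the step concrete, here is a rough sketch of how a step value translates into RRDtool arguments. It is not the code in hms_bootstrap_rrd.py: the GAUGE data source type, the heartbeat of twice the step, the single one-year AVERAGE archive, and the file name os.rrd are all assumptions chosen for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def rra_rows(step_seconds, retention_seconds=SECONDS_PER_YEAR):
    """Number of rows needed to keep one sample per step for the retention period."""
    return retention_seconds // step_seconds


def build_create_args(db_path, ds_names, step_seconds=60):
    """Build an argument list in the form accepted by rrdtool.create()."""
    heartbeat = 2 * step_seconds  # assumption: tolerate one missed update
    args = [db_path, "--step", str(step_seconds)]
    # One data source per metric: DS:name:type:heartbeat:min:max
    args += [f"DS:{name}:GAUGE:{heartbeat}:0:U" for name in ds_names]
    # One archive that stores every sample: RRA:cf:xff:steps:rows
    args.append(f"RRA:AVERAGE:0.5:1:{rra_rows(step_seconds)}")
    return args


# Usage (requires the rrdtool Python binding; the path is hypothetical):
#   import rrdtool
#   rrdtool.create(*build_create_args("/var/lib/hms/os.rrd", ["loadavg_1min"]))
```

With a 60-second step, one year of data needs 525600 rows. Changing the step changes both the row count and how often the poller has to write.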
- Set up the system metrics poller. Each run of the poller performs one cycle: it collects the metrics and writes the values to the RRD databases. Usage:
$ ./hms_metrics_poller.py -h
usage: hms_metrics_poller.py [-h] --config CONFIG
Host Monitoring Station Metrics Poller
options:
-h, --help show this help message and exit
--config CONFIG Host Monitoring Station config file
The interval between polls MUST match the step chosen when bootstrapping the RRD databases. For example, if the step of the RRD databases is 1 minute, then the metrics poller must be triggered every minute. Here is an example of how I run the poller in a bash terminal:
while true; do ./hms_metrics_poller.py --config static/config/hms.yaml; sleep 60; done
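The bash loop above sleeps 60 seconds after each run, so the real period is 60 seconds plus the poller's run time. If that drift matters, one alternative is to sleep until the next interval boundary instead. This is only a sketch: `seconds_until_next_tick` and `run_poller_forever` are hypothetical helpers, not part of HMS.

```python
import subprocess
import time


def seconds_until_next_tick(now, interval=60):
    """Seconds to wait from `now` (epoch seconds) until the next interval boundary."""
    return interval - (now % interval)


def run_poller_forever(config="static/config/hms.yaml", interval=60):
    """Run the poller once per interval, aligned to wall-clock boundaries."""
    while True:
        time.sleep(seconds_until_next_tick(time.time(), interval))
        # check=False: a failed poll should not stop the loop.
        subprocess.run(["./hms_metrics_poller.py", "--config", config], check=False)


# Usage, from the src directory:
#   run_poller_forever()
```

The `interval` argument should equal the RRD step in seconds.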
- Set up the RRD graph retention policy. RRD graphs are generated on demand and each one is used only once, so there is no point in keeping them after they have been displayed in the HMS web application. Users can simply use cron to delete graph files based on their modification time. Here is an example crontab entry I use on my laptop:
* * * * * find /home/ericlee/Projects/git/host-monitoring-station/src/static/rrd_graph -type f -name '*.png' -mmin +1 -exec rm -rf '{}' \;
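For hosts where cron is not convenient, the same retention idea can be sketched in Python. This is an illustrative helper, not part of HMS. It roughly mirrors the find command above, except that it only looks at PNG files directly inside the given directory and takes the age threshold in seconds.

```python
import time
from pathlib import Path


def remove_stale_graphs(graph_dir, max_age_seconds=60, now=None):
    """Delete *.png files in graph_dir that were modified more than max_age_seconds ago.

    Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    removed = []
    for png in Path(graph_dir).glob("*.png"):
        if png.is_file() and now - png.stat().st_mtime > max_age_seconds:
            png.unlink()
            removed.append(png.name)
    return sorted(removed)


# Usage, from the src directory:
#   remove_stale_graphs("static/rrd_graph")
```

Files without a .png suffix, such as the placeholder file, are left alone.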
- Once the metrics poller is running, the RRD databases will contain system metrics that can be displayed in the HMS web application. All RRD graphs are in PNG format. The default HTTP service port of the HMS web application is 4080 and the web server stats port is 4081. Users can adjust those parameters in the hms_web_uwsgi.ini file. To start the HMS web application, please run the following command under the src directory:
$ uwsgi hms_web_uwsgi.ini
Once the HMS web application has started, users can access the metrics graphs via http://127.0.0.1:4080/hms. By default, graphs are 900 x 300 pixels and display the last 8 hours of metrics. Users can query historical data and choose a different graph size by using URL query parameters, which are covered in the following section.
The HMS web application supports 3 query parameters:
size: RRD graph size. The default is medium, which is 900 x 300 pixels. The other options are small (600 x 200 pixels) and large (1200 x 400 pixels).
start: RRD query start timestamp. The default is end-8h, which is 8 hours before the end time.
end: RRD query end timestamp. The default is now, which is the current time.
For more information about the start and end keywords, please read the rrdgraph manual.
If the start and/or end values supplied by the user do not form a valid time span, HMS uses the default values for the start and end parameters.
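As an illustration of how the three parameters fit together, the sketch below maps size names to pixel dimensions and builds a query URL. It is not taken from hms_web.py. The helper names are hypothetical, and falling back to medium for an unknown size is an assumption, since only the fallback for start and end is documented.

```python
from urllib.parse import urlencode

# Sizes and defaults as documented above.
GRAPH_SIZES = {"small": (600, 200), "medium": (900, 300), "large": (1200, 400)}
DEFAULT_SIZE = "medium"


def resolve_size(name):
    """Return (width, height) for a size name; unknown names fall back to medium (assumption)."""
    return GRAPH_SIZES.get(name, GRAPH_SIZES[DEFAULT_SIZE])


def build_hms_url(size="medium", start="end-8h", end="now",
                  base="http://127.0.0.1:4080/hms"):
    """Build an HMS URL with explicit size, start, and end query parameters."""
    return base + "?" + urlencode({"size": size, "start": start, "end": end})
```

For example, `build_hms_url(size="large", start="end-24h")` produces a URL that requests large graphs covering the 24 hours before the current time.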
Category | Metric Name | Unit | Description |
---|---|---|---|
OS | loadavg_1min | n/a | 1 min load average |
OS | loadavg_5min | n/a | 5 min load average |
OS | loadavg_15min | n/a | 15 min load average |
OS | num_used_fd | count | number of occupied file descriptors |
OS | num_total_procs | count | number of total processes |
OS | num_running_procs | count | number of running processes |
OS | num_blocked_procs | count | number of blocked processes (e.g. I/O blocked) |
OS | num_zombie_procs | count | number of zombie processes |
OS | context_switch | count/second | number of context switches per second |
CPU | cpu_freq | kHz | CPU current running frequency |
Memory | memory_total | kB | total memory |
Memory | memory_free | kB | free memory |
Memory | memory_avail | kB | available memory |
Memory | buffer | kB | buffer |
Memory | cache | kB | cache |
Memory | swap_total | kB | total swap space |
Memory | swap_free | kB | free swap space |
Memory | page_tables | kB | page tables size |
Memory | minor_page_faults | count/second | number of minor page faults per second |
Memory | major_page_faults | count/second | number of major page faults per second |
Disk | read_io | count/second | number of read I/Os per second |
Disk | write_io | count/second | number of write I/Os per second |
Disk | read_merge | count/second | number of read I/Os merged per second |
Disk | write_merge | count/second | number of write I/Os merged per second |
Disk | read_sector | sector/second | number of sectors read per second |
Disk | write_sector | sector/second | number of sectors written per second |
Disk | in_flight | count/second | number of I/Os in flight per second |
Network | rx_bytes | byte/second | number of good received bytes per second |
Network | tx_bytes | byte/second | number of good transmitted bytes per second |
Network | rx_dropped | packet/second | number of packets received but dropped per second |
Network | tx_dropped | packet/second | number of packets dropped in transmission per second |
Network | rx_errors | packet/second | number of bad packets received per second |
Network | tx_errors | packet/second | number of bad packets transmitted per second |
Network | collisions | count/second | number of collisions per second |
IPv4/IPv6 TCP | ESTABLISHED | count | number of ESTABLISHED state sockets |
IPv4/IPv6 TCP | SYN_SENT | count | number of SYN_SENT state sockets |
IPv4/IPv6 TCP | SYN_RECV | count | number of SYN_RECV state sockets |
IPv4/IPv6 TCP | FIN_WAIT1 | count | number of FIN_WAIT1 state sockets |
IPv4/IPv6 TCP | FIN_WAIT2 | count | number of FIN_WAIT2 state sockets |
IPv4/IPv6 TCP | TIME_WAIT | count | number of TIME_WAIT state sockets |
IPv4/IPv6 TCP | CLOSE | count | number of CLOSE state sockets |
IPv4/IPv6 TCP | CLOSE_WAIT | count | number of CLOSE_WAIT state sockets |
IPv4/IPv6 TCP | LAST_ACK | count | number of LAST_ACK state sockets |
IPv4/IPv6 TCP | LISTEN | count | number of LISTEN state sockets |
IPv4/IPv6 TCP | CLOSING | count | number of CLOSING state sockets |
IPv4/IPv6 TCP | NEW_SYN_RECV | count | number of NEW_SYN_RECV state sockets |
UDP | InDatagrams | datagram/second | number of UDP datagrams delivered per second |
UDP | OutDatagrams | datagram/second | number of UDP datagrams sent per second |
UDP | InErrors | datagram/second | number of received UDP datagrams that could not be delivered per second |
UDP | NoPorts | datagram/second | number of received UDP datagrams for which there was no application at the destination port per second |
ARP | arp_cache_entries | count | number of ARP cache entries |
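The twelve TCP state names in the table match the Linux kernel's TCP state codes, which appear as hexadecimal values in the st column of /proc/net/tcp and /proc/net/tcp6. The sketch below shows one way such counts could be derived from that file format. Whether hms/tcp.py reads these files is an assumption, and `count_tcp_states` is a hypothetical helper.

```python
# Kernel TCP state codes as shown in the "st" column of /proc/net/tcp.
TCP_STATES = {
    "01": "ESTABLISHED",
    "02": "SYN_SENT",
    "03": "SYN_RECV",
    "04": "FIN_WAIT1",
    "05": "FIN_WAIT2",
    "06": "TIME_WAIT",
    "07": "CLOSE",
    "08": "CLOSE_WAIT",
    "09": "LAST_ACK",
    "0A": "LISTEN",
    "0B": "CLOSING",
    "0C": "NEW_SYN_RECV",
}


def count_tcp_states(proc_net_tcp_text):
    """Count sockets per TCP state from the text of /proc/net/tcp or /proc/net/tcp6."""
    counts = {name: 0 for name in TCP_STATES.values()}
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) > 3:
            state = TCP_STATES.get(fields[3].upper())
            if state is not None:
                counts[state] += 1
    return counts


# Usage on a Linux host:
#   with open("/proc/net/tcp") as f:
#       print(count_tcp_states(f.read()))
```

States with no sockets are reported as 0, so every state always has a value to write.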
I saved some example screenshots in the screenshots directory for reference.
- The UI is ugly! I know that, and I'm really not a UI/UX expert.
- None of the applications write logs so far. I will add a logging facility in the next version.
- I would like to add more metrics in a future version, but the current metrics are sufficient for my own use. If you have any suggestions on metrics, please open an issue.
- Better exception handling. The current version swallows some exceptions to keep the application running smoothly. I may write some custom exception classes in a future version to make debugging easier.
- If a new disk device or network interface is added to the host, the graphs won't display metrics for the newly added device, because the current version does not support dynamic data source adjustment. This feature will be added soon.
0.0.1
* initial commit
0.0.2 - 08/07/2022
* [issue#2] - add CPU frequency metric + graph feature
0.0.3 - 08/11/2022
* [issue#3] - use one RRA to store 1 year data metrics
0.0.4 - 08/13/2022
* [issue#1] - fix start / end time span range invalid issue
0.0.5 - 08/31/2022
* [issue#6] - add TCP metrics
0.0.6 - 09/04/2022
* [issue#4] - allow hms_bootstrap_rrd.py to bootstrap one or more data sources
0.0.7 - 09/05/2022
* [issue#8] - fix inaccurate num_total_procs metric issue
0.0.8 - 09/05/2022
* [issue#9] - catch exceptions if updating RRD database fails
0.0.9 - 09/08/2022
* [issue#7] - add UDP metrics
0.0.10 - 12/22/2022
* [issue#10] - fix start and end parameters exception
0.0.11 - 01/28/2024
* [issue#13] - add ARP cache entries metric
0.0.12 - 12/01/2024
* [issue#15] - add page tables metric
0.0.13 - 12/21/2024
* [issue#17] - add minor + major page faults counts