Skip to content

Analyses your database queries and schema and suggests indices and schema improvements

License

Notifications You must be signed in to change notification settings

macbre/index-digest

Repository files navigation

index-digest

PyPI Docker Hub Coverage Status

Analyses your database queries and schema and suggests indices improvements. You can use index-digest as your database linter. The goal is to provide the user with actionable reports instead of just a list of statistics and schema details. Inspired by Percona's pt-index-usage.

NEW You can also use index-digest as GitHub's Action.

What this tool does

index-digest does the following:

  • it checks the schema of all tables in a given database and suggests improvements (e.g. removal of redundant indices, adding a primary key to ease replication, dropping tables with just a single column or no rows)
  • if provided with SQL queries log (via --sql-log option) it:
    • checks if all tables, columns and indices are used by these queries
    • reports text columns with character set different than utf
    • reports queries that do not use indices
    • reports queries that use filesort, temporary file or full table scan
    • reports queries that are not quite kosher (e.g. LIKE "%foo%", INSERT IGNORE, SELECT *, HAVING clause, high OFFSET in pagination queries)
  • if run with --analyze-data switch it:
    • reports tables with old data (by querying for MIN() value of time column) where data retency can be reviewed
    • reports tables with not up-to-date data (by querying for MAX() value of time column)
  • if run with --check-empty-databases switch it:
    • report empty databases on the current MySQL server

This tool supports MySQL 5.7, 8.0, 8.1, Percona Server 8.0 and MariaDB 10.1, 10.2, 10.5, 10.6 and runs under Python 3.8+.

Results can be reported in a human-readable form, as YAML or sent to syslog and later aggregated & processed using ELK stack.

Requirements & install

From pypi

pip install indexdigest

From git

git clone [email protected]:macbre/index-digest.git && cd index-digest
sudo apt-get install libmysqlclient-dev python3-dev virtualenv

virtualenv -ppython3 env
source env/bin/activate
make install

When using MacOS, you should follow this mysql_config installation steps.

Running tests

We assume that the test database is running locally on port 53306. You can use the following to test your changes locally before pushing them (this one uses MySQL 8.0.20):

docker run --rm -p 53306:3306 --health-cmd="mysqladmin ping" --health-interval=10s --health-timeout=5s --health-retries=3 -e "MYSQL_ALLOW_EMPTY_PASSWORD=yes" -e "MYSQL_DATABASE=index_digest" --name=index_digest_mysql mysql:8.0.22 "--default-authentication-plugin=mysql_native_password"

Wait until the server is up and running.

mysql --protocol=tcp --port=53306 -u root --password="" -v < setup.sql
./sql/populate.sh
mysql --protocol=tcp --port=53306 -uindex_digest -pqwerty index_digest -v -e '\s; SHOW TABLES; SHOW DATABASES;'

make test

Using Docker

See https://hub.docker.com/r/macbre/index-digest/

$ docker run --network=host -t macbre/index-digest:latest mysql://index_digest:qwerty@debian/index_digest  | head -n 20
------------------------------------------------------------
Found 61 issue(s) to report for "index_digest" database
------------------------------------------------------------
MySQL v5.7.22 at debian
index-digest v1.2.0
------------------------------------------------------------
redundant_indices → table affected: 0004_id_foo

✗ "idx" index can be removed as redundant (covered by "PRIMARY")

  - redundant: UNIQUE KEY idx (item_id, foo)
  - covered_by: PRIMARY KEY (item_id, foo)
  - schema: CREATE TABLE `0004_id_foo` (
      `item_id` int(9) NOT NULL AUTO_INCREMENT,
      `foo` varbinary(16) NOT NULL DEFAULT '',
      PRIMARY KEY (`item_id`,`foo`),
      UNIQUE KEY `idx` (`item_id`,`foo`)
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1
  - table_data_size_mb: 0.015625
  - table_index_size_mb: 0.015625
...

How to run it?

$ index_digest -h
index_digest

Analyses your database queries and schema and suggests indices improvements.

Usage:
  index_digest DSN [--sql-log=<file>] [--format=<formatter>] [--analyze-data] [--checks=<checks> | --skip-checks=<skip-checks>] [--tables=<tables> | --skip-tables=<skip-tables>]
  index_digest (-h | --help)
  index_digest --version

Options:
  DSN               Data Source Name of database to check
  --sql-log=<file>  Text file with SQL queries to check against the database
  --format=<formatter>  Use a given results formatter (plain, syslog, yaml)
  --analyze-data    Run additional checks that will query table data (can be slow!)
  --checks=<list>   Comma-separated lists of checks to report
  --skip-checks=<list> Comma-separated lists of checks to skip from report
  --tables=<list>   Comma-separated lists of tables to report
  --skip-tables=<list> Comma-separated lists of tables to skip from report
  -h --help         Show this screen.
  --version         Show version.

Examples:
  index_digest mysql://username:password@localhost/dbname
  index_digest mysql://index_digest:qwerty@localhost/index_digest --sql-log=sql.log
  index_digest mysql://index_digest:qwerty@localhost/index_digest --skip-checks=non_utf_columns
  index_digest mysql://index_digest:qwerty@localhost/index_digest --analyze-data --checks=data_too_old,data_not_updated_recently
  index_digest mysql://index_digest:qwerty@localhost/index_digest --analyze-data --skip-tables=DATABASECHANGELOG,DATABASECHANGELOGLOCK

Visit <https://github.com/macbre/index-digest>

SQL query log

It's a text file with a single SQL query in each line (no line breaks are allowed). Lines that do start with -- (SQL comment) are ignored. The file can be generated using query-digest when --sql-log output mode is selected.

An example:

-- A comment
select * from 0002_not_used_indices order by id
select * from 0002_not_used_indices where foo = 'foo' and id = 2
select count(*) from 0002_not_used_indices where foo = 'foo'
/* foo bar */ select * from 0002_not_used_indices where bar = 'foo'
INSERT  IGNORE INTO `0070_insert_ignore` VALUES ('123', 9, '2017-01-01');

MySQL's slow query log needs to be pre-processed first (to remove comments and timestamps):

cat mysql-slow.log | egrep -v '^(SET timestamp|#|throttle: )' > queries.log

Then you can run index_digest --sql-log=queries.log ....

Formatters

index-digest can return results in various formats (use --format to choose one).

plain

Emits human-readable report to a console. You can disable colored and bold text by setting env variable ANSI_COLORS_DISABLED=1.

syslog

Pushes JSON-formatted messages via syslog, so they can be aggregated using ELK stack. Use SYSLOG_IDENT env variable to customize syslog's ident messages are sent with (defaults to index-digest).

Dec 28 15:59:58 debian index-digest[17485]: {"meta": {"version": "index-digest v0.1.0", "database_name": "index_digest", "database_host": "debian", "database_version": "MySQL v5.7.20"}, "report": {"type": "redundant_indices", "table": "0004_id_foo", "message": "\"idx\" index can be removed as redundant (covered by \"PRIMARY\")", "context": {"redundant": "UNIQUE KEY idx (id, foo)", "covered_by": "PRIMARY KEY (id, foo)", "schema": "CREATE TABLE `0004_id_foo` (\n  `id` int(9) NOT NULL AUTO_INCREMENT,\n  `foo` varbinary(16) NOT NULL DEFAULT '',\n  PRIMARY KEY (`id`,`foo`),\n  UNIQUE KEY `idx` (`id`,`foo`)\n) ENGINE=InnoDB DEFAULT CHARSET=latin1", "table_data_size_mb": 0.015625, "table_index_size_mb": 0.015625}}}

yaml

Outputs YML file with results and metadata.

Checks

You can select which checks should be reported by the tool by using --checks command line option. Certain checks can also be skipped via --skip-checks option. Refer to index_digest --help for examples.

Number of checks: 24

  • redundant_indices: reports indices that are redundant and covered by other
  • non_utf_columns: reports text columns that have characters encoding set to latin1 (utf is the way to go)
  • missing_primary_index: reports tables with no primary or unique key (see MySQL bug #76252 and Wikia/app#9863). Primary keys can be enforced on MySQL config level since 8.0.13 (via sql_require_primary_key variable).
  • test_tables: reports tables that seem to be test leftovers (e.g. some_guy_test_table)
  • single_column: reports tables with just a single column
  • empty_tables: reports tables with no rows
  • generic_primary_key: reports tables with a primary key on id column (a more meaningful name should be used)
  • use_innodb: reports table using storage engines different than InnoDB (a default for MySQL 5.5+ and MariaDB 10.2+)
  • low_cardinality_index: reports indices with low cardinality

Additional checks performed on SQL log

You need to provide SQL log file via --sql-log option

Additional checks performed on tables data

You need to use --analyze-data command line switch. Please note that these checks will query your tables. These checks can take a while if queried columns are not indexed.

  • data_too_old: reports tables that have really old data, maybe it's worth checking if such long data retention is actually needed (defaults to three months threshold, can be customized via INDEX_DIGEST_DATA_TOO_OLD_THRESHOLD_DAYS env variable)
  • data_not_updated_recently: reports tables that were not updated recently, check if it should be up-to-date (defaults a month threshold, can be customized via INDEX_DIGEST_DATA_NOT_UPDATED_RECENTLY_THRESHOLD_DAYS env variable)

Additional checks performed across database on the current MySQL server

You need to use --check-empty-databases command line switch.

  • empty_database: reports databases that have no BASE TABLE tables (as provided by information_schema.TABLES)

An example report

$ index_digest mysql://index_digest:qwerty@localhost/index_digest --sql-log sql/0002-not-used-indices-log 
------------------------------------------------------------
Found 85 issue(s) to report for "index_digest" database
------------------------------------------------------------
MySQL v5.7.21 at debian
index-digest v1.0.0
------------------------------------------------------------
redundant_indices → table affected: 0004_id_foo

✗ "idx" index can be removed as redundant (covered by "PRIMARY")

  - redundant: UNIQUE KEY idx (id, foo)
  - covered_by: PRIMARY KEY (id, foo)
  - schema: CREATE TABLE `0004_id_foo` (
      `id` int(9) NOT NULL AUTO_INCREMENT,
      `foo` varbinary(16) NOT NULL DEFAULT '',
      PRIMARY KEY (`id`,`foo`),
      UNIQUE KEY `idx` (`id`,`foo`)
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1
  - table_data_size_mb: 0.015625
  - table_index_size_mb: 0.015625

------------------------------------------------------------
redundant_indices → table affected: 0004_id_foo_bar

✗ "idx_foo" index can be removed as redundant (covered by "idx_foo_bar")

  - redundant: KEY idx_foo (foo)
  - covered_by: KEY idx_foo_bar (foo, bar)
  - schema: CREATE TABLE `0004_id_foo_bar` (
      `id` int(9) NOT NULL AUTO_INCREMENT,
      `foo` varbinary(16) NOT NULL DEFAULT '',
      `bar` varbinary(16) NOT NULL DEFAULT '',
      PRIMARY KEY (`id`),
      KEY `idx_foo` (`foo`),
      KEY `idx_foo_bar` (`foo`,`bar`),
      KEY `idx_id_foo` (`id`,`foo`)
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1
  - table_data_size_mb: 0.015625
  - table_index_size_mb: 0.046875

------------------------------------------------------------
missing_primary_index → table affected: 0034_querycache

✗ "0034_querycache" table does not have any primary or unique index

  - schema: CREATE TABLE `0034_querycache` (
      `qc_type` varbinary(32) NOT NULL,
      `qc_value` int(10) unsigned NOT NULL DEFAULT '0',
      `qc_namespace` int(11) NOT NULL DEFAULT '0',
      `qc_title` varchar(255) CHARACTER SET latin1 COLLATE latin1_bin NOT NULL DEFAULT '',
      KEY `qc_type` (`qc_type`,`qc_value`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8

------------------------------------------------------------
test_tables → table affected: 0075_some_guy_test_table

✗ "0075_some_guy_test_table" seems to be a test table

  - schema: CREATE TABLE `0075_some_guy_test_table` (
      `id` int(9) NOT NULL AUTO_INCREMENT,
      `name` varchar(255) NOT NULL,
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8

------------------------------------------------------------
single_column → table affected: 0074_bag_of_ints

✗ "0074_bag_of_ints" has just a single column

  - schema: CREATE TABLE `0074_bag_of_ints` (
      `id` int(9) NOT NULL AUTO_INCREMENT,
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8

------------------------------------------------------------
empty_tables → table affected: 0089_empty_table

✗ "0089_empty_table" table has no rows, is it really needed?

  - schema: CREATE TABLE `0089_empty_table` (
      `id` int(9) NOT NULL AUTO_INCREMENT,
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1

------------------------------------------------------------
generic_primary_key → table affected: 0094_generic_primary_key

✗ "0094_generic_primary_key" has a primary key called id, use a more meaningful name

  - schema: CREATE TABLE `0094_generic_primary_key` (
      `id` int(9) NOT NULL AUTO_INCREMENT,
      `foo` varchar(16) NOT NULL DEFAULT '',
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1

------------------------------------------------------------
use_innodb → table affected: 0036_use_innodb_myisam

✗ "0036_use_innodb_myisam" uses MyISAM storage engine

  - schema: CREATE TABLE `0036_use_innodb_myisam` (
      `item_id` int(9) NOT NULL AUTO_INCREMENT,
      `foo` int(8) DEFAULT NULL,
      PRIMARY KEY (`item_id`)
    ) ENGINE=MyISAM DEFAULT CHARSET=latin1
  - engine: MyISAM

------------------------------------------------------------
not_used_indices → table affected: 0002_not_used_indices

✗ "test_id_idx" index was not used by provided queries

  - not_used_index: KEY test_id_idx (test, id)

------------------------------------------------------------
not_used_tables → table affected: 0020_big_table

✗ "0020_big_table" table was not used by provided queries

  - schema: CREATE TABLE `0020_big_table` (
      `id` int(9) NOT NULL AUTO_INCREMENT,
      `val` int(9) NOT NULL,
      `text` char(5) NOT NULL,
      PRIMARY KEY (`id`),
      KEY `text_idx` (`text`)
    ) ENGINE=InnoDB AUTO_INCREMENT=100001 DEFAULT CHARSET=utf8
  - table_size_mb: 5.03125
  - rows_estimated: 100405

------------------------------------------------------------
insert_ignore → table affected: 0070_insert_ignore

✗ "INSERT IGNORE INTO `0070_insert_ignore` VALUES (9,..." query uses a risky INSERT IGNORE

  - query: INSERT IGNORE INTO `0070_insert_ignore` VALUES (9, '123', '2017-01-01');
  - schema: CREATE TABLE `0070_insert_ignore` (
      `id` int(9) NOT NULL,
      `text` char(5) NOT NULL,
      `time` datetime DEFAULT NULL,
      UNIQUE KEY `id` (`id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8

------------------------------------------------------------
non_utf_columns → table affected: 0032_latin1_table

✗ "name" text column has "latin1" character set defined

  - column: name
  - column_character_set: latin1
  - column_collation: latin1_swedish_ci
  - schema: CREATE TABLE `0032_latin1_table` (
      `item_id` int(9) NOT NULL AUTO_INCREMENT,
      `name` varchar(255) DEFAULT NULL,
      `utf8_column` varchar(255) CHARACTER SET utf8 COLLATE utf8_polish_ci NOT NULL,
      `ucs2_column` varchar(255) CHARACTER SET ucs2 DEFAULT NULL,
      `utf8mb4_column` varchar(255) CHARACTER SET utf8mb4 DEFAULT NULL,
      `utf16_column` varchar(255) CHARACTER SET utf16 DEFAULT NULL,
      `utf32_column` varchar(255) CHARACTER SET utf32 DEFAULT NULL,
      `binary_column` varbinary(255) DEFAULT NULL,
      `latin_blob` blob,
      PRIMARY KEY (`item_id`)
    ) ENGINE=InnoDB DEFAULT CHARSET=latin1

------------------------------------------------------------

(...)

------------------------------------------------------------
queries_using_filesort → table affected: 0020_big_table

✗ "SELECT val, count(*) FROM 0020_big_table WHERE id ..." query used filesort

  - query: SELECT val, count(*) FROM 0020_big_table WHERE id BETWEEN 10 AND 20 GROUP BY val
  - explain_extra: Using where; Using temporary; Using filesort
  - explain_rows: 11
  - explain_filtered: None
  - explain_key: PRIMARY

------------------------------------------------------------
queries_using_temporary → table affected: 0020_big_table

✗ "SELECT val, count(*) FROM 0020_big_table WHERE id ..." query used temporary

  - query: SELECT val, count(*) FROM 0020_big_table WHERE id BETWEEN 10 AND 20 GROUP BY val
  - explain_extra: Using where; Using temporary; Using filesort
  - explain_rows: 11
  - explain_filtered: None
  - explain_key: PRIMARY

------------------------------------------------------------
queries_using_full_table_scan → table affected: 0020_big_table

✗ "SELECT * FROM 0020_big_table" query triggered full table scan

  - query: SELECT * FROM 0020_big_table
  - explain_rows: 9041

------------------------------------------------------------
selects_with_like → table affected: 0020_big_table

✗ "SELECT * FROM 0020_big_table WHERE text LIKE '%00'" query uses LIKE with left-most wildcard

  - query: SELECT * FROM 0020_big_table WHERE text LIKE '%00'
  - explain_extra: Using where
  - explain_rows: 100623

------------------------------------------------------------
select_star → table affected: bar

✗ "SELECT t.* FROM bar AS t" query uses SELECT *

  - query: SELECT t.* FROM bar AS t;

------------------------------------------------------------
having_clause → table affected: sales

✗ "SELECT s.cust_id,count(s.cust_id) FROM SH.sales s ..." query uses HAVING clause

  - query: SELECT s.cust_id,count(s.cust_id) FROM SH.sales s GROUP BY s.cust_id HAVING s.cust_id != '1660' AND s.cust_id != '2'

(...)

------------------------------------------------------------
low_cardinality_index → table affected: 0020_big_table

✗ "num_idx" index on "num" column has low cardinality, check if it is needed

  - column_name: num
  - index_name: num_idx
  - index_cardinality: 2
  - schema: CREATE TABLE `0020_big_table` (
      `item_id` int(9) NOT NULL AUTO_INCREMENT,
      `val` int(9) NOT NULL,
      `text` char(5) NOT NULL,
      `num` int(3) NOT NULL,
      PRIMARY KEY (`item_id`),
      KEY `text_idx` (`text`),
      KEY `num_idx` (`num`)
    ) ENGINE=InnoDB AUTO_INCREMENT=100001 DEFAULT CHARSET=utf8
  - value_usage: 33.24788541334185

(...)

------------------------------------------------------------
data_too_old → table affected: 0028_data_too_old

✗ "0028_data_too_old" has rows added 184 days ago, consider changing retention policy

  - diff_days: 184
  - data_since: 2017-08-17 12:03:44
  - data_until: 2018-02-17 12:03:44
  - date_column_name: timestamp
  - schema: CREATE TABLE `0028_data_too_old` (
      `item_id` int(8) unsigned NOT NULL AUTO_INCREMENT,
      `cnt` int(8) unsigned NOT NULL,
      `timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
      PRIMARY KEY (`item_id`)
    ) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=latin1
  - rows: 4
  - table_size_mb: 0.015625

------------------------------------------------------------
data_not_updated_recently → table affected: 0028_data_not_updated_recently

✗ "0028_data_not_updated_recently" has the latest row added 40 days ago, consider checking if it should be up-to-date

  - diff_days: 40
  - data_since: 2017-12-29 12:03:44
  - data_until: 2018-01-08 12:03:44
  - date_column_name: timestamp
  - schema: CREATE TABLE `0028_data_not_updated_recently` (
      `item_id` int(8) unsigned NOT NULL AUTO_INCREMENT,
      `cnt` int(8) unsigned NOT NULL,
      `timestamp` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP,
      PRIMARY KEY (`item_id`)
    ) ENGINE=InnoDB AUTO_INCREMENT=4 DEFAULT CHARSET=latin1
  - rows: 3
  - table_size_mb: 0.015625

------------------------------------------------------------
high_offset_selects → table affected: page

✗ "SELECT /* CategoryPaginationViewer::processSection..." query uses too high offset impacting the performance

  - query: SELECT /* CategoryPaginationViewer::processSection */  page_namespace,page_title,page_len,page_is_redirect,cl_sortkey_prefix  FROM `page` INNER JOIN `categorylinks` FORCE INDEX (cl_sortkey) ON ((cl_from = page_id))  WHERE cl_type = 'page' AND cl_to = 'Spotify/Song'  ORDER BY cl_sortkey LIMIT 927600,200
  - limit: 200
  - offset: 927600

------------------------------------------------------------
empty_database → table affected: index_digest_empty

✗ "index_digest_empty" database has no tables

------------------------------------------------------------
Queries performed: 100

Success stories

Want to add your entry here? Submit a pull request

Read more

Slides