[DPE-4114] Test: Scale to zero units #347
base: main
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:

@@            Coverage Diff             @@
##             main     #347      +/-  ##
=========================================
- Coverage   80.31%   79.94%    -0.37%
=========================================
  Files          10       10
  Lines        2301     2169      -132
  Branches      376      344       -32
=========================================
- Hits         1848     1734      -114
+ Misses        369      368        -1
+ Partials       84       67       -17

☔ View full report in Codecov by Sentry.
@BalabaDmintri Thank you for the contribution. We need to add a ticket number and a PR body describing the What? Where? Why? in more detail.
Re: PR. It is overall a good start!!! We need to cover all scaling cases:
re-scaling back with and without a drive. E.g. the default hostpath storage will remove the drive, so re-scaling will start on an empty disk. At the same time, btrfs will keep the disk available, so re-scaling will happen with a disk attached (it could be the default disk or a manually provided one). It could also be wrongly provided...
I mean we have these cases:
- user wants to restore the same storage:
  remove-unit postgresql/3 && add-unit -n 2
  should reuse the old storage since the user didn't specify one (it can be a new disk, e.g. hostpath). => should be OK: a long SST/clone process for the 2nd+ node, but 0->1 is a new DB init as there is no disk.
- user wants to provide a specific, correct storage => should be OK automatically, just fast resilvering.
- user made a mistake and specified the wrong storage (another cluster) => charm blocked. Do NOT corrupt foreign disks!!!
- user made a mistake and specified the wrong storage (empty/corrupted/no psql data on it) => block to be on the safe side; force the user to remove the disk and re-scale without any disks.
- ...
I believe we should test all the cases above (a sketch of the first case follows below).
P.S. The charm should be safe and block bootstrap if a foreign cluster is found on the disk, etc. (a probable improvement is required here).
P.P.S. On top of this: it can be a disk from a different charm revision OR a different PostgreSQL version. We need to test that too. Whether to block or accept is a question for @7annaba3l :-D
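A minimal sketch of the first case (scale to zero, then re-add units reusing the old storage), assuming the pytest-operator ops_test fixture and the helpers already used in this PR (add_unit_with_storage, check_writes, APP_NAME); get_storage_id is a hypothetical helper that maps a unit to its storage ID:

import logging

logger = logging.getLogger(__name__)


async def test_scale_to_zero_and_reuse_storage(ops_test) -> None:
    """Sketch only: scale to zero, then re-add units with their old storage."""
    app = ops_test.model.applications[APP_NAME]

    # Remember which storage each unit currently owns.
    storage_ids = [await get_storage_id(ops_test, unit.name) for unit in app.units]

    # Scale the application down to zero units.
    logger.info("scaling database to zero units")
    for unit in list(app.units):
        await unit.destroy()
    await ops_test.model.block_until(lambda: len(app.units) == 0, timeout=1000)

    # Scale back up, re-attaching the previously used storage.
    logger.info("scaling database back up with the old storage")
    for storage in storage_ids:
        await add_unit_with_storage(ops_test, app=APP_NAME, storage=storage)
    await ops_test.model.wait_for_idle(apps=[APP_NAME], status="active", timeout=3000)

    # The data written before scaling down should still be readable.
    await check_writes(ops_test)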
# Scale the database to three units.
for store_id in storage_id_list:
    await add_unit_with_storage(ops_test, storage=store_id, app=APP_NAME)
JFMI, should we use the ops lib directly? The helper refers to this issue, which is now resolved:
Note: this function exists as a temporary solution until this issue is resolved:
https://github.com/juju/python-libjuju/issues/695
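For context, a simplified sketch of the CLI-based workaround being discussed (the actual add_unit_with_storage helper in this repo may differ in details); it shells out to juju add-unit --attach-storage through pytest-operator's ops_test.juju() instead of calling the library:

async def add_unit_with_storage_via_cli(ops_test, app: str, storage: str):
    """Add a unit and re-attach an existing storage volume via the juju CLI."""
    return_code, stdout, stderr = await ops_test.juju(
        "add-unit", app, "--attach-storage", storage
    )
    assert return_code == 0, f"juju add-unit failed: {stderr}"
    # Wait until the new unit settles before the test continues.
    await ops_test.model.wait_for_idle(apps=[app], timeout=2000)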
Yes. We should remove the workaround and use the methods provided by the lib.
IIRC this is only available in libjuju 3
BTW, @BalabaDmintri, you need to sign the CLA and fix the lint tests here:
Please check https://github.com/canonical/postgresql-operator/blob/main/CONTRIBUTING.md
    channel="edge",
)

# Deploy the continuous writes application charm if it wasn't already deployed.
This part can be removed, as the continuous writes application is already deployed by test_build_and_deploy.
)

if wait_for_apps:
    await ops_test.model.wait_for_idle(status="active", timeout=3000)
After the above comment is handled, this line can be moved close to the deployment of the PostgreSQL application.
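For illustration, a hedged sketch of the suggested ordering, keeping the idle-wait right next to the deployment it guards (DATABASE_CHARM is a placeholder for whatever this test actually deploys):

# Hypothetical ordering: wait immediately after deploying instead of much later in the test.
await ops_test.model.deploy(
    DATABASE_CHARM,             # placeholder for the charm deployed by this test
    application_name=APP_NAME,
    num_units=1,
)
if wait_for_apps:
    await ops_test.model.wait_for_idle(apps=[APP_NAME], status="active", timeout=3000)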
# Scale the database to three units.
for store_id in storage_id_list:
    await add_unit_with_storage(ops_test, storage=store_id, app=APP_NAME)
Yes. We should remove the workaround and use the methods provided by the lib.
# Scale the database to one unit.
logger.info("scaling database to one unit")
await add_unit_with_storage(ops_test, storage=primary_storage, app=APP_NAME)
After the unit starts, we should check if the data on the storage has been actually restored.
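A hedged sketch of such a check; get_unit_address and get_password are assumed helpers, and the table and user names are placeholders to be replaced with whatever the continuous writes application actually creates:

import psycopg2


async def assert_data_restored(ops_test, unit_name: str, expected_rows: int) -> None:
    """Sketch: verify the re-attached storage still contains the previously written data."""
    host = await get_unit_address(ops_test, unit_name)  # assumed helper
    password = await get_password(ops_test)             # assumed helper
    with psycopg2.connect(
        f"dbname=postgres user=operator password={password} host={host}"
    ) as connection, connection.cursor() as cursor:
        cursor.execute("SELECT COUNT(*) FROM continuous_writes;")
        row_count = cursor.fetchone()[0]
    # The table written before the scale-down must still be there and populated.
    assert row_count >= expected_rows, "data on the re-attached storage was not restored"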
# Scale the database to three units.
for store_id in storage_id_list:
    await add_unit_with_storage(ops_test, storage=store_id, app=APP_NAME)
await check_writes(ops_test)
After the 2nd and 3rd units start, we need to check that the data on them is restored from WAL (not via backup/restore).
Maybe @dragomirp or @marceloneppel knows how to check this.
ha_test.helpers.reused_replica_storage() and ha_test.helpers.reused_full_cluster_recovery_storage() should do the trick.
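A hedged usage sketch of those helpers; the import path and argument names are assumptions and should be checked against tests/integration/ha_tests/helpers.py:

# Assumed import path and signatures; verify against the actual helpers module.
from tests.integration.ha_tests.helpers import (
    reused_full_cluster_recovery_storage,
    reused_replica_storage,
)


async def verify_storage_reuse(ops_test, first_unit: str, other_units: list) -> None:
    """Sketch: assert that re-added units reused their disks instead of being re-seeded."""
    # The 0 -> 1 unit should come back through full cluster recovery on its old disk...
    assert await reused_full_cluster_recovery_storage(ops_test, first_unit)
    # ...while the 2nd and 3rd units should catch up from WAL on their reused storage.
    for unit_name in other_units:
        assert await reused_replica_storage(ops_test, unit_name)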
Discussed this today on the sync. For the history:
Some examples I have in mind:
@marceloneppel Can you run the actions?
async def get_db_connection(ops_test, dbname, is_primary=True, replica_unit_name=""):
    unit_name = await get_primary(ops_test, APP_NAME)
You may add the type hint for the returned values and a docstring to make the output even easier to understand. The same also applies to the other functions you created in this file.
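A hedged sketch of what that could look like for the snippet above (the return type and the docstring wording are assumptions about what the function actually does and returns):

async def get_db_connection(
    ops_test, dbname: str, is_primary: bool = True, replica_unit_name: str = ""
) -> str:
    """Build a connection string for the requested database.

    Args:
        ops_test: the pytest-operator fixture for the current test model.
        dbname: name of the database to connect to.
        is_primary: connect to the primary unit when True, otherwise to the
            replica given by replica_unit_name.
        replica_unit_name: replica unit to target when is_primary is False.

    Returns:
        A psycopg2-style connection string (assumed return type).
    """
    unit_name = await get_primary(ops_test, APP_NAME)
    ...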
LGTM, this is something we were missing. Tnx!
logger.info("database scaling up to two units using third-party cluster storage") | ||
new_unit = await add_unit_with_storage( | ||
ops_test, app=app, storage=second_storage, is_blocked=True | ||
) |
nit: IMHO, it is worth checking that we are blocked with the right message (foreign disk).
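A hedged sketch of such a check, continuing the snippet above; it assumes add_unit_with_storage returns the libjuju Unit object and that the blocked message mentions the foreign disk (the exact wording is an assumption):

# Inside the same async test as the snippet above.
assert new_unit.workload_status == "blocked"
assert "foreign" in new_unit.workload_status_message.lower(), (
    f"unexpected blocked message: {new_unit.workload_status_message}"
)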
@dragomirp Can you check?
Hi @BalabaDmitri, can you resync with main again to retrigger the CI and pick up some fixes? Sorry for asking again.
Hi @BalabaDmitri, the new test seems to fail frequently with:
(e.g. here)
I've also seen it fail the assertion for writes continuing:
(e.g. here)
Please take another look.
Issue #445
Solution
Test coverage of the following cases:
Cases not covered by the test:
Implementation