[DPE-4114] Test: Scale to zero units #347
base: main
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅
Additional details and impacted files:

@@            Coverage Diff             @@
##             main     #347      +/-  ##
=========================================
- Coverage   80.31%   79.94%    -0.37%
=========================================
  Files          10       10
  Lines        2301     2169      -132
  Branches      376      344       -32
=========================================
- Hits         1848     1734      -114
+ Misses        369      368        -1
+ Partials       84       67       -17

☔ View full report in Codecov by Sentry.
@BalabaDmintri Thank you for the contribution. We need to add a ticket number and a PR body describing the What? Where? Why? in more detail.
Re: PR. It is overall a good start!!! We need to cover all scaling cases:
re-scaling back with and without a drive. E.g. the default hostpath storage will remove the drive, so re-scaling will start on an empty disk. At the same time, btrfs will keep the disk available, so re-scaling will happen with a disk attached (it could be the default disk or a manually provided one). It could also be wrongly provided...
I mean we have these cases:
- user wants to restore the same storage:
  remove-unit postgresql/3 && add-unit -n 2
  should reuse the old storage since the user didn't specify one (it can be a new disk, e.g. hostpath). => should be OK: a long SST/clone process for the 2nd+ node, but 0->1 is a new DB init as there is no disk.
- user wants to provide a specific, correct storage => should be OK automatically, just fast resilvering.
- user made a mistake and specified the wrong storage (another cluster) => charm blocked. Do NOT corrupt foreign disks!!!
- user made a mistake and specified the wrong storage (empty/corrupted/no psql data on it) => block to be on the safe side; force the user to remove the disk and re-scale without any disks.
- ...
I believe we should test all the cases above (a sketch of the first case follows below).
P.S. The charm should be safe and block bootstrap if a foreign cluster is found on the disk, etc. (a probable improvement is required here).
P.P.S. On top of this: it can be a disk from a different charm revision OR a different PostgreSQL version. We need to test that too. Whether to block or accept is a question for @7annaba3l :-D
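A minimal sketch of the first case (scale to zero, then re-add units reusing the old storage), assuming the pytest-operator ops_test fixture and the helpers already used in this PR (add_unit_with_storage, check_writes, APP_NAME); get_storage_id is a hypothetical helper that maps a unit to its storage ID:

import logging

logger = logging.getLogger(__name__)


async def test_scale_to_zero_and_reuse_storage(ops_test) -> None:
    """Sketch only: scale to zero, then re-add units with their old storage."""
    app = ops_test.model.applications[APP_NAME]

    # Remember which storage each unit currently owns.
    storage_ids = [await get_storage_id(ops_test, unit.name) for unit in app.units]

    # Scale the application down to zero units.
    logger.info("scaling database to zero units")
    for unit in list(app.units):
        await unit.destroy()
    await ops_test.model.block_until(lambda: len(app.units) == 0, timeout=1000)

    # Scale back up, re-attaching the previously used storage.
    logger.info("scaling database back up with the old storage")
    for storage in storage_ids:
        await add_unit_with_storage(ops_test, app=APP_NAME, storage=storage)
    await ops_test.model.wait_for_idle(apps=[APP_NAME], status="active", timeout=3000)

    # The data written before scaling down should still be readable.
    await check_writes(ops_test)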
# Scale the database to three units.
for store_id in storage_id_list:
    await add_unit_with_storage(ops_test, storage=store_id, app=APP_NAME)
JFMI, should we use the ops lib directly? The helper refers to this issue, which is now resolved:
Note: this function exists as a temporary solution until this issue is resolved:
https://github.com/juju/python-libjuju/issues/695
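For context, a simplified sketch of the CLI-based workaround being discussed (the actual add_unit_with_storage helper in this repo may differ in details); it shells out to juju add-unit --attach-storage through pytest-operator's ops_test.juju() instead of calling the library:

async def add_unit_with_storage_via_cli(ops_test, app: str, storage: str):
    """Add a unit and re-attach an existing storage volume via the juju CLI."""
    return_code, stdout, stderr = await ops_test.juju(
        "add-unit", app, "--attach-storage", storage
    )
    assert return_code == 0, f"juju add-unit failed: {stderr}"
    # Wait until the new unit settles before the test continues.
    await ops_test.model.wait_for_idle(apps=[app], timeout=2000)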
Yes. We should remove the workaround and use the methods provided by the lib.
IIRC this is only available in libjuju 3
BTW, @BalabaDmintri, you need to sign the CLA and fix the lint tests here:
Please check https://github.com/canonical/postgresql-operator/blob/main/CONTRIBUTING.md
    channel="edge",
)

# Deploy the continuous writes application charm if it wasn't already deployed.
This part can be removed, as the continuous writes application is already deployed by test_build_and_deploy.
)

if wait_for_apps:
    await ops_test.model.wait_for_idle(status="active", timeout=3000)
After the above comment is handled, this line can be moved close to the deployment of the PostgreSQL application.
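For illustration, a hedged sketch of the suggested ordering, keeping the idle-wait right next to the deployment it guards (DATABASE_CHARM is a placeholder for whatever this test actually deploys):

# Hypothetical ordering: wait immediately after deploying instead of much later in the test.
await ops_test.model.deploy(
    DATABASE_CHARM,             # placeholder for the charm deployed by this test
    application_name=APP_NAME,
    num_units=1,
)
if wait_for_apps:
    await ops_test.model.wait_for_idle(apps=[APP_NAME], status="active", timeout=3000)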
# Scale the database to three units.
for store_id in storage_id_list:
    await add_unit_with_storage(ops_test, storage=store_id, app=APP_NAME)
Yes. We should remove the workaround and use the methods provided by the lib.
# Scale the database to one unit.
logger.info("scaling database to one unit")
await add_unit_with_storage(ops_test, storage=primary_storage, app=APP_NAME)
After the unit starts, we should check if the data on the storage has been actually restored.
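A hedged sketch of such a check; get_unit_address and get_password are assumed helpers, and the table and user names are placeholders to be replaced with whatever the continuous writes application actually creates:

import psycopg2


async def assert_data_restored(ops_test, unit_name: str, expected_rows: int) -> None:
    """Sketch: verify the re-attached storage still contains the previously written data."""
    host = await get_unit_address(ops_test, unit_name)  # assumed helper
    password = await get_password(ops_test)             # assumed helper
    with psycopg2.connect(
        f"dbname=postgres user=operator password={password} host={host}"
    ) as connection, connection.cursor() as cursor:
        cursor.execute("SELECT COUNT(*) FROM continuous_writes;")
        row_count = cursor.fetchone()[0]
    # The table written before the scale-down must still be there and populated.
    assert row_count >= expected_rows, "data on the re-attached storage was not restored"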
# Scale the database to three units.
for store_id in storage_id_list:
    await add_unit_with_storage(ops_test, storage=store_id, app=APP_NAME)
await check_writes(ops_test)
After the 2nd and 3rd units start, we need to check that the data on them is restored from WAL (not via backup/restore).
Maybe @dragomirp or @marceloneppel knows how to check this.
ha_test.helpers.reused_replica_storage() and ha_test.helpers.reused_full_cluster_recovery_storage() should do the trick.
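A hedged usage sketch of those helpers; the import path and argument names are assumptions and should be checked against tests/integration/ha_tests/helpers.py:

# Assumed import path and signatures; verify against the actual helpers module.
from tests.integration.ha_tests.helpers import (
    reused_full_cluster_recovery_storage,
    reused_replica_storage,
)


async def verify_storage_reuse(ops_test, first_unit: str, other_units: list) -> None:
    """Sketch: assert that re-added units reused their disks instead of being re-seeded."""
    # The 0 -> 1 unit should come back through full cluster recovery on its old disk...
    assert await reused_full_cluster_recovery_storage(ops_test, first_unit)
    # ...while the 2nd and 3rd units should catch up from WAL on their reused storage.
    for unit_name in other_units:
        assert await reused_replica_storage(ops_test, unit_name)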
Discussed this today on the sync. For the history:
Some examples I have in mind:
@marceloneppel Can you run the actions?
async def get_db_connection(ops_test, dbname, is_primary=True, replica_unit_name=""):
    unit_name = await get_primary(ops_test, APP_NAME)
You may add the type hint for the returned values and a docstring to make the output even easier to understand. The same also applies to the other functions you created in this file.
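A hedged sketch of what that could look like for the snippet above (the return type and the docstring wording are assumptions about what the function actually does and returns):

async def get_db_connection(
    ops_test, dbname: str, is_primary: bool = True, replica_unit_name: str = ""
) -> str:
    """Build a connection string for the requested database.

    Args:
        ops_test: the pytest-operator fixture for the current test model.
        dbname: name of the database to connect to.
        is_primary: connect to the primary unit when True, otherwise to the
            replica given by replica_unit_name.
        replica_unit_name: replica unit to target when is_primary is False.

    Returns:
        A psycopg2-style connection string (assumed return type).
    """
    unit_name = await get_primary(ops_test, APP_NAME)
    ...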
LGTM, this is something we were missing. Tnx!
logger.info("database scaling up to two units using third-party cluster storage") | ||
new_unit = await add_unit_with_storage( | ||
ops_test, app=app, storage=second_storage, is_blocked=True | ||
) |
nit: IMHO, it is worth checking that we are blocked with the right message (foreign disk).
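A hedged sketch of such a check, continuing the snippet above; it assumes add_unit_with_storage returns the libjuju Unit object and that the blocked message mentions the foreign disk (the exact wording is an assumption):

# Inside the same async test as the snippet above.
assert new_unit.workload_status == "blocked"
assert "foreign" in new_unit.workload_status_message.lower(), (
    f"unexpected blocked message: {new_unit.workload_status_message}"
)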
@dragomirp Can you check?
Hi @BalabaDmitri, can you resync with main again to retrigger the CI and pick up some fixes? Sorry for asking again.
Hi @BalabaDmitri, the new test seems to fail frequently with:
(e.g. here)
I've also seen it fail the assertion for writes continuing:
(e.g. here)
Please take another look.
Issue #445
Solution
Test coverage of the following cases:
Cases not covered by the test:
Implementation