Nova services adoption (no extra cell) #176

Closed
wants to merge 3 commits into from


bogdando
Contributor

@bogdando bogdando commented Oct 9, 2023

first split (just quick and dirty Nova adoption) done #191
ffu split done #192 - the new commit on top cf59540
pre/post checks changes extracted here #193 - commit 2cc55f3


Note about remapping cell DB names from OSP cells naming scheme
to the NG scheme with the superconductor layout.

Add a step to rename default cell as cell1, and to delete stale
Nova services records from cell1 DB during initial databases import,
to properly transition it into a superconductor layout later on.
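
As a rough illustration of the renaming step described above, a DB import loop could map each source database name to its target name before loading it. This is only a sketch: the function name `target_db_name` is hypothetical, and it assumes the source default cell database is called `nova`, which this description does not state. `nova_cell1` is the target name used by the DB check discussed later in this PR.

```bash
# Hypothetical helper: map a source (OSP) database name to the name it
# should have in the target deployment.
# Assumption: the source default cell DB is named "nova".
target_db_name() {
  case "$1" in
    nova) echo "nova_cell1" ;;   # default cell becomes cell1
    *)    echo "$1" ;;           # every other DB keeps its name
  esac
}
```

With this in place, an import loop could call `target_db_name "$db"` for each dumped database and only the default cell DB would change its name.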

Adjust minor gaps in the dependencies adoption docs (Placement,
Nova cells DB, OVN etc.)

Address the switch to the service overrides spec instead of
externalEndpoints, where it is missing on the path to Nova adoption.

Remove Nova Metadata secret creation workarounds from the EDPM
adoption docs and test suites.

Provide workaround for renaming 'default' cell's DB during adoption.

Add test suites for Nova CP services adoption.

Update EDPM adoption docs and tests to execute Nova compute post-FFU.

Add missing nova and libvirt services for the edpm adoption tests.

Verify no dataplane disruptions during the adoption and upgrade
process.

Verify Nova services still control pre-created VM workload after
FFU/adoption is done.

Update and fix the composition of the services pre-check list to
execute it before stopping services.

Update and fix the composition of the list of the services to be
stopped (cannot pull data from stopped services).

Stop Nova services in stop_openstack_services instead of edpm_adoption
(that was too late to do that).

Get services topology specific configuration in
pull_openstack_configuration. Add missing role for that as well.

Also note about cleaning up delorean repos for tripleo standalone dev
env.

@softwarefactory-project-zuul

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/data-plane-adoption for 176,3bf28621f91f582433f56ab3f197efa68b018eea

@softwarefactory-project-zuul

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/data-plane-adoption for 176,584b48713a32fc46c11ce3d1a8a7c260790409bd

@bogdando bogdando changed the title from "WIP Document Nova services adoption with an extra cell" to "WIP Document Nova services adoption (no extra cell)" Oct 10, 2023
@softwarefactory-project-zuul

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/data-plane-adoption for 176,908f41b4d0a40d42e200e6c4366cb19b4aae7c3d

@bogdando bogdando requested a review from SeanMooney October 10, 2023 15:59
@softwarefactory-project-zuul

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/data-plane-adoption for 176,827b289d11bf8a9b2eab60e07216831315e04989

@softwarefactory-project-zuul

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/data-plane-adoption for 176,a4903bb4fd2b936a77a900b6835a96c208109595

@fao89

This comment was marked as resolved.

@softwarefactory-project-zuul

Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/data-plane-adoption for 176,3def92ce9592c68ab4d9ad562591ac62b0a466fb

@bogdando
Contributor Author

For dataplane, we would need to drop the metadata workaround: https://github.com/openstack-k8s-operators/data-plane-adoption/blob/main/docs/openstack/edpm_adoption.md#procedure---edpm-adoption

and add libvirt and nova to OpenStackDataPlaneNodeSet.services

done

@bogdando bogdando requested a review from jistr October 11, 2023 14:55
@bogdando bogdando force-pushed the OSPRH-338 branch 4 times, most recently from 7a45acc to d24d826 Compare October 12, 2023 13:12

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/dcbaff645f6b4dff89affddb666903fd

data-plane-adoption-github-rdo-centos-9-crc-single-node FAILURE in 1h 07m 15s
data-plane-adoption-github-rdo-centos-9-extracted-crc FAILURE in 1h 24m 03s

@jistr
Contributor

jistr commented Oct 31, 2023

There is a problem with a check for the adopted VM instance - nova CLI always shows it is running, even if there is no more qemu process for it running on the EDPM compute. And start/stop actions do nothing. This needs to be investigated, also this extra check needs to be documented

FWIW, if you'd like to merge this in multiple chunks, e.g. first ensuring that the control plane comes up fine and data plane Ansible executes successfully, and then have another story tracking "Make sure the workload survives undamaged", i think that would be fine too.

@bogdando bogdando force-pushed the OSPRH-338 branch 3 times, most recently from 9f9be89 to 222e02e Compare October 31, 2023 15:56
@bogdando
Contributor Author

> There is a problem with a check for the adopted VM instance - nova CLI always shows it is running, even if there is no more qemu process for it running on the EDPM compute. And start/stop actions do nothing. This needs to be investigated, also this extra check needs to be documented
>
> FWIW, if you'd like to merge this in multiple chunks, e.g. first ensuring that the control plane comes up fine and data plane Ansible executes successfully, and then have another story tracking "Make sure the workload survives undamaged", i think that would be fine too.

I'll do my best to split this into commits

@bogdando
Contributor Author

> There is a problem with a check for the adopted VM instance - nova CLI always shows it is running, even if there is no more qemu process for it running on the EDPM compute. And start/stop actions do nothing. This needs to be investigated, also this extra check needs to be documented
>
> FWIW, if you'd like to merge this in multiple chunks, e.g. first ensuring that the control plane comes up fine and data plane Ansible executes successfully, and then have another story tracking "Make sure the workload survives undamaged", i think that would be fine too.

This is a single unit of work according to the Nova team's feedback
@gibizer @SeanMooney

I can split this PR into commits to make it simpler to review, but I cannot split the Jira stories.

FFU is the only target state we agreed to accept; we cannot stop in intermediate states.

@bogdando bogdando closed this Oct 31, 2023
@bogdando bogdando reopened this Oct 31, 2023
@bogdando
Contributor Author

An update: this now works in my testing env. I'm going to respin it from the beginning just to confirm I didn't break the MariaDB-related checks (I had to move them to the pull OpenStack configuration steps, before we stop the TripleO services).

Then I will switch to recomposing the commits without introducing functional changes. Doing both recomposition AND functional changes would make this a nightmare to maintain until it is merged.


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/4b4a338d4ee6472b9e6901ba8c5d7969

data-plane-adoption-github-rdo-centos-9-crc-single-node RETRY_LIMIT in 7m 50s
data-plane-adoption-github-rdo-centos-9-extracted-crc FAILURE in 1h 20m 39s

pinikomarov pushed a commit to pinikomarov/dataplane-operator that referenced this pull request Nov 4, 2023
"a lowercase RFC 1123 subdomain must consist of lower case alphanumeric characters, '-' or '.', and must start and end with an alphanumeric character "
ref:  openstack-k8s-operators/data-plane-adoption#176 (comment)

When Label wasn't provided it was breaking the AEE deploy

Signed-off-by: Fabricio Aguiar <[email protected]>
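
The validation rule quoted in this commit message can be checked up front, before a name or label reaches the API. A minimal sketch, assuming the pattern that Kubernetes documents for DNS subdomain names and its 253-character limit; the function name `is_rfc1123_subdomain` is hypothetical and is not part of dataplane-operator.

```bash
# Hypothetical helper: succeed only if $1 is a lowercase RFC 1123 subdomain
# (lowercase alphanumerics, '-' or '.', starting and ending alphanumeric).
is_rfc1123_subdomain() {
  # Assumed length limit for DNS subdomain names.
  [ "${#1}" -le 253 ] || return 1
  # LC_ALL=C keeps the [a-z] range from matching uppercase in some locales.
  printf '%s\n' "$1" | LC_ALL=C grep -Eq \
    '^[a-z0-9]([-a-z0-9]*[a-z0-9])?(\.[a-z0-9]([-a-z0-9]*[a-z0-9])?)*$'
}
```

An empty value fails this check, which matches the failure mode described above where the label was not provided.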
@bogdando
Contributor Author

bogdando commented Nov 6, 2023

> There is a problem with a check for the adopted VM instance - nova CLI always shows it is running, even if there is no more qemu process for it running on the EDPM compute. And start/stop actions do nothing. This needs to be investigated, also this extra check needs to be documented
>
> FWIW, if you'd like to merge this in multiple chunks, e.g. first ensuring that the control plane comes up fine and data plane Ansible executes successfully, and then have another story tracking "Make sure the workload survives undamaged", i think that would be fine too.

first split (just quick and dirty Nova adoption) done #191
ffu split done #192 - the new commit on top cf59540
pre/post checks changes extracted here #193 - commit 2cc55f3

@jistr @SeanMooney @GIBI PTAL


Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://review.rdoproject.org/zuul/buildset/465fc5da92ec425f916cdc8986a86100

data-plane-adoption-github-rdo-centos-9-extracted-crc FAILURE in 1h 30m 48s

docs/openstack/edpm_adoption.md
```bash
oc exec -it mariadb-openstack-cell1 -- mysql --user=root --password=${PODIFIED_DB_ROOT_PASSWORD} \
-e "select a.version from nova_cell1.services a join nova_cell1.services b where a.version!=b.version and a.binary='nova-compute';"
```
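
The query above joins the `services` table with itself and returns a row for every `nova-compute` record whose version differs from another service record, so an empty result is the passing case. A minimal sketch of turning that into a pass/fail step, assuming the query output is piped in without column headers (for example with the mysql client's `--skip-column-names` option, and probably without `-t` on `oc exec`); the function name `no_version_mismatch` is hypothetical.

```bash
# Hypothetical helper: succeed only when stdin contains no non-blank lines,
# i.e. the version comparison query returned no mismatching rows.
no_version_mismatch() {
  ! grep -q '[^[:space:]]'
}
```

The `oc exec ... -e "select ..."` command would then be piped into `no_version_mismatch`, so that a test suite fails as soon as a mismatching row is printed.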
Contributor

At this point you can observe that the compute was able to report its status to the new control plane, so the service is now UP:

[gibi@osp-dev-01 ~]$ openstack compute service list
+--------------------------------------+----------------+------------------------+----------+---------+-------+----------------------------+
| ID                                   | Binary         | Host                   | Zone     | Status  | State | Updated At                 |
+--------------------------------------+----------------+------------------------+----------+---------+-------+----------------------------+
| 0954e21c-9022-4718-a570-a7d3eb0fd79f | nova-conductor | nova-cell0-conductor-0 | internal | enabled | up    | 2023-11-09T09:51:43.000000 |
| c3ad2def-18ed-49e5-8af4-a7c1a0840171 | nova-scheduler | nova-scheduler-0       | internal | enabled | up    | 2023-11-09T09:51:39.000000 |
| a7a20d50-b85d-4321-a576-5a12fea9bc8f | nova-compute   | standalone.localdomain | nova     | enabled | up    | 2023-11-09T09:51:41.000000 |
| 9eb053f9-a404-4b02-92a8-d2a5fe339849 | nova-conductor | nova-cell1-conductor-0 | internal | enabled | up    | 2023-11-09T09:51:45.000000 |
+--------------------------------------+----------------+------------------------+----------+---------+-------+----------------------------+
[gibi@osp-dev-01 ~]$ openstack hypervisor list
+--------------------------------------+------------------------+-----------------+-----------------+-------+
| ID                                   | Hypervisor Hostname    | Hypervisor Type | Host IP         | State |
+--------------------------------------+------------------------+-----------------+-----------------+-------+
| d3d2be51-a0b9-4538-a298-62280a52fece | standalone.localdomain | QEMU            | 192.168.122.100 | up    |
+--------------------------------------+------------------------+-----------------+-----------------+-------+

Contributor Author

I don't get this, sorry. What is expected to change along these lines?

Contributor

I mean you can add a check here that shows the compute is UP from the nova-api perspective
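
One possible shape for such a check, parsing the table format shown in the output above: a sketch only, where the function name `compute_services_up` is hypothetical and the column positions are assumed to match the `openstack compute service list` table pasted in this thread.

```bash
# Hypothetical helper: read `openstack compute service list` table output on
# stdin and succeed only if at least one nova-compute row exists and every
# nova-compute row reports State "up".
compute_services_up() {
  awk -F'|' '
    $3 ~ /nova-compute/ {
      seen = 1
      state = $7                      # 7th field is the State column
      gsub(/[[:space:]]/, "", state)  # strip table padding
      if (state != "up") bad = 1
    }
    END { if (seen && !bad) exit 0; exit 1 }
  '
}
```

It would be used as `openstack compute service list | compute_services_up`. Asking the client for machine-readable output (for example `-f value -c State`) would avoid parsing the table, at the cost of not matching the output shown here.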

* Verify if Nova services control the existing VM instance:

```bash
openstack server list | grep -qF '| test | ACTIVE |' && openstack server stop test
```

Contributor

Something is wrong: at this point nova-compute produces a stack trace

2023-11-09 09:55:54.826 2 DEBUG oslo_concurrency.lockutils [None req-b77b398c-9fc4-49fe-95c8-1f6761293777 1d1bd1b129a54c88a4232738e354fbb3 ad151be8d46d451b82f31b39d674565f - - default default] Lock "4191c6c5-7c94-4715-ab88-64b27a7ad2c6" "released" by "nova.compute.manager.ComputeManager.stop_instance.<locals>.do_stop_instance" :: held 3.115s inner /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:423
2023-11-09 09:55:54.978 2 DEBUG oslo_concurrency.lockutils [None req-b77b398c-9fc4-49fe-95c8-1f6761293777 1d1bd1b129a54c88a4232738e354fbb3 ad151be8d46d451b82f31b39d674565f - - default default] Acquiring lock "compute_resources" by "nova.compute.resource_tracker.ResourceTracker.update_usage" inner /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:404
2023-11-09 09:55:54.980 2 DEBUG oslo_concurrency.lockutils [None req-b77b398c-9fc4-49fe-95c8-1f6761293777 1d1bd1b129a54c88a4232738e354fbb3 ad151be8d46d451b82f31b39d674565f - - default default] Lock "compute_resources" acquired by "nova.compute.resource_tracker.ResourceTracker.update_usage" :: waited 0.002s inner /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:409
2023-11-09 09:55:55.070 2 DEBUG nova.compute.provider_tree [None req-b77b398c-9fc4-49fe-95c8-1f6761293777 1d1bd1b129a54c88a4232738e354fbb3 ad151be8d46d451b82f31b39d674565f - - default default] Inventory has not changed in ProviderTree for provider: d3d2be51-a0b9-4538-a298-62280a52fece update_inventory /usr/lib/python3.9/site-packages/nova/compute/provider_tree.py:180
2023-11-09 09:55:55.098 2 DEBUG nova.scheduler.client.report [None req-b77b398c-9fc4-49fe-95c8-1f6761293777 1d1bd1b129a54c88a4232738e354fbb3 ad151be8d46d451b82f31b39d674565f - - default default] Inventory has not changed for provider d3d2be51-a0b9-4538-a298-62280a52fece based on inventory data: {'VCPU': {'total': 8, 'reserved': 0, 'min_unit': 1, 'max_unit': 8, 'step_size': 1, 'allocation_ratio': 16.0}, 'MEMORY_MB': {'total': 19744, 'reserved': 512, 'min_unit': 1, 'max_unit': 19744, 'step_size': 1, 'allocation_ratio': 1.0}, 'DISK_GB': {'total': 69, 'reserved': 1, 'min_unit': 1, 'max_unit': 69, 'step_size': 1, 'allocation_ratio': 1.0}} set_inventory_for_provider /usr/lib/python3.9/site-packages/nova/scheduler/client/report.py:940
2023-11-09 09:55:55.103 2 DEBUG oslo_concurrency.lockutils [None req-b77b398c-9fc4-49fe-95c8-1f6761293777 1d1bd1b129a54c88a4232738e354fbb3 ad151be8d46d451b82f31b39d674565f - - default default] Lock "compute_resources" "released" by "nova.compute.resource_tracker.ResourceTracker.update_usage" :: held 0.123s inner /usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py:423
2023-11-09 09:55:55.104 2 INFO nova.compute.manager [None req-b77b398c-9fc4-49fe-95c8-1f6761293777 1d1bd1b129a54c88a4232738e354fbb3 ad151be8d46d451b82f31b39d674565f - - default default] [instance: 4191c6c5-7c94-4715-ab88-64b27a7ad2c6] Successfully reverted task state from powering-off on failure for instance.
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server [None req-b77b398c-9fc4-49fe-95c8-1f6761293777 1d1bd1b129a54c88a4232738e354fbb3 ad151be8d46d451b82f31b39d674565f - - default default] Exception during message handling: nova.exception.InstanceNotFound: Instance 4191c6c5-7c94-4715-ab88-64b27a7ad2c6 could not be found.
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/host.py", line 690, in _get_domain
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server     return conn.lookupByUUIDString(instance.uuid)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/eventlet/tpool.py", line 193, in doit
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server     result = proxy_call(self._autowrap, f, *args, **kwargs)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/eventlet/tpool.py", line 151, in proxy_call
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server     rv = execute(f, *args, **kwargs)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/eventlet/tpool.py", line 132, in execute
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server     six.reraise(c, e, tb)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/six.py", line 709, in reraise
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server     raise value
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/eventlet/tpool.py", line 86, in tworker
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server     rv = meth(*args, **kwargs)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server   File "/usr/lib64/python3.9/site-packages/libvirt.py", line 5008, in lookupByUUIDString
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server     raise libvirtError('virDomainLookupByUUIDString() failed')
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server libvirt.libvirtError: Domain not found: no domain with matching uuid '4191c6c5-7c94-4715-ab88-64b27a7ad2c6'
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server 
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server During handling of the above exception, another exception occurred:
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server 
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server Traceback (most recent call last):
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/oslo_messaging/rpc/server.py", line 165, in _process_incoming
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server     res = self.dispatcher.dispatch(message)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/oslo_messaging/rpc/dispatcher.py", line 309, in dispatch
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server     return self._do_dispatch(endpoint, method, ctxt, args)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/oslo_messaging/rpc/dispatcher.py", line 229, in _do_dispatch
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server     result = func(ctxt, **new_args)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/exception_wrapper.py", line 71, in wrapped
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server     _emit_versioned_exception_notification(
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/oslo_utils/excutils.py", line 227, in __exit__
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server     self.force_reraise()
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server     raise self.value
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/exception_wrapper.py", line 63, in wrapped
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server     return f(self, context, *args, **kw)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 186, in decorated_function
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server     LOG.warning("Failed to revert task state for instance. "
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/oslo_utils/excutils.py", line 227, in __exit__
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server     self.force_reraise()
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/oslo_utils/excutils.py", line 200, in force_reraise
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server     raise self.value
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 157, in decorated_function
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/compute/utils.py", line 1439, in decorated_function
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 203, in decorated_function
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server     return function(self, context, *args, **kwargs)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 3381, in stop_instance
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server     do_stop_instance()
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/oslo_concurrency/lockutils.py", line 414, in inner
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server     return f(*args, **kwargs)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 3369, in do_stop_instance
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server     self._power_off_instance(instance, clean_shutdown)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/compute/manager.py", line 3076, in _power_off_instance
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server     self.driver.power_off(instance, timeout, retry_interval)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py", line 4099, in power_off
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server     self._clean_shutdown(instance, timeout, retry_interval)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/driver.py", line 4059, in _clean_shutdown
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server     guest = self._host.get_guest(instance)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/host.py", line 674, in get_guest
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server     return libvirt_guest.Guest(self._get_domain(instance))
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server   File "/usr/lib/python3.9/site-packages/nova/virt/libvirt/host.py", line 694, in _get_domain
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server     raise exception.InstanceNotFound(instance_id=instance.uuid)
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server nova.exception.InstanceNotFound: Instance 4191c6c5-7c94-4715-ab88-64b27a7ad2c6 could not be found.
2023-11-09 09:55:55.107 2 ERROR oslo_messaging.rpc.server 
2023-11-09 09:55:56.053 2 DEBUG ovsdbapp.backend.ovs_idl.vlog [-] [POLLIN] on fd 19 __log_wakeup /usr/lib64/python3.9/site-packages/ovs/poller.py:263

Then I cannot start the instance up again

[gibi@osp-dev-01 ~]$ openstack server start test
Cannot 'start' instance 4191c6c5-7c94-4715-ab88-64b27a7ad2c6 while it is in vm_state active (HTTP 409) (Request-ID: req-eb72c115-a9da-4e47-acbc-2457ddbf7607)
command terminated with exit code 1
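
The failure above is not visible from the `openstack server stop` call itself, because the command returns once the API accepts the request and the error only shows up in the nova-compute log. A hedged sketch of a follow-up check that polls for the expected status instead of trusting the exit code; `wait_for_status` and `POLL_INTERVAL` are hypothetical names, not part of the adoption docs or tests.

```bash
# Hypothetical helper: run a status command repeatedly until it prints the
# wanted value, or give up after a number of attempts.
#   $1 = wanted status, $2 = max attempts, remaining args = command to run
wait_for_status() {
  want=$1
  tries=$2
  shift 2
  i=0
  while [ "$i" -lt "$tries" ]; do
    [ "$("$@")" = "$want" ] && return 0
    i=$((i + 1))
    sleep "${POLL_INTERVAL:-1}"   # seconds between attempts
  done
  return 1
}
```

After stopping the server, a test could run `wait_for_status SHUTOFF 30 openstack server show test -f value -c status` and fail if the instance stays ACTIVE, which is the situation reported in this comment.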

Contributor Author

this is weird, I haven't observed that during my testing :)

Contributor

I need to go and reproduce it. I have a feeling that the cleanup of the old libvirt services was incomplete in my case


Merge Failed.

This change or one of its cross-repo dependencies was unable to be automatically merged with the current state of its repository. Please rebase the change and upload a new patchset.
Warning:
Error merging github.com/openstack-k8s-operators/data-plane-adoption for 176,96cb19c55f49082692bb5a99fca65a72622db47e

Signed-off-by: Bohdan Dobrelia <[email protected]>
Signed-off-by: Bohdan Dobrelia <[email protected]>
Signed-off-by: Bohdan Dobrelia <[email protected]>
@bogdando
Contributor Author

Please move the review to the PRs split out of this one

@bogdando bogdando closed this Nov 13, 2023
@bogdando bogdando deleted the OSPRH-338 branch November 13, 2023 14:52