When rpm package installation fails during Import Cluster operation, information about this problem is not provided #627

Closed
mbukatov opened this issue Sep 20, 2017 · 9 comments

@mbukatov
Contributor

Description

When rpm package installation fails during the Import Cluster operation, neither the failure itself nor any details about its cause are reported. One can only guess what happened from the errors that follow as a consequence.

Version

Latest snapshot build from the master branch (part of the upcoming 1.5.2 version):

tendrl-api-1.5.2-20170919T145901.a6634f7.noarch                                 
tendrl-api-httpd-1.5.2-20170919T145901.a6634f7.noarch                           
tendrl-commons-1.5.2-20170919T200626.2f02c26.noarch                             
tendrl-grafana-plugins-1.5.2-20170920T123413.1e2a7e6.noarch                     
tendrl-monitoring-integration-1.5.2-20170920T123413.1e2a7e6.noarch              
tendrl-node-agent-1.5.2-20170919T185441.f88ff0f.noarch                          
tendrl-notifier-1.5.2-20170920T065358.f156955.noarch                            
tendrl-ui-1.5.2-20170920T143459.2bcbd93.noarch 

Steps to reproduce

  1. Prepare machines with a GlusterFS cluster, including a gluster volume (I used nightly builds and volume_usmqe_alpha_distrep_4x2.create.conf)
  2. Install Tendrl via tendrl-ansible there, using snapshot builds
  3. Log into the Tendrl web interface as an admin user
  4. Pick one storage server node and break the rpm repo file of the tendrl repo there by changing baseurl to something invalid (e.g. baseurl=file://root/tendrl-repo-which-does-not-exist/).
  5. Verify that installation of tendrl-gluster-integration is not possible on the selected machine.
  6. Try to import the cluster via Tendrl web ui
  7. Wait until the import task finishes (it's expected to fail)
  8. Check the details in task details page.
  9. Also fetch the data from the jobs/${JOB_ID}/messages API call:
$ curl "${TENDRL_SERVER}/api/1.0/jobs/${JOB_ID}/messages" -H "Authorization: Bearer ${TENDRL_TOKEN}" > import_cluster.json
$ jq '.' import_cluster.json > import_cluster.pretty.json
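
For anyone who prefers to script step 9, here is a minimal Python sketch of the same request. It assumes the endpoint and Bearer token header exactly as used in the curl command above. The function names are hypothetical and not part of Tendrl.

```python
import json
import urllib.request


def build_messages_request(server, job_id, token):
    """Build the GET request for the job messages endpoint used in step 9."""
    url = f"{server.rstrip('/')}/api/1.0/jobs/{job_id}/messages"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def fetch_messages(server, job_id, token):
    """Fetch and decode the messages of one job (performs a network call)."""
    request = build_messages_request(server, job_id, token)
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def save_pretty(messages, path):
    """Write the messages as indented JSON, roughly what `jq '.'` produces."""
    with open(path, "w") as handle:
        json.dump(messages, handle, indent=2)
        handle.write("\n")
```

Calling `save_pretty(fetch_messages(server, job_id, token), "import_cluster.pretty.json")` would then stand in for the two shell commands.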

Actual Results

Even though the installation of the tendrl-gluster-integration package failed on one storage server, the page gives no direct indication of it. The first error is about a missing configuration file, which is a consequence of the missing package:

screenshot_20170920_221644

Another way to confirm that no information about the installation failure is reported is to grep the pretty-printed task messages.

No error related to the package name:

$ grep -i tendrl-gluster-integration import_cluster.pretty.json
      "message": "Installing tendrl-gluster-integration on Node 176a2ed6-4798-47c2-afbe-c66a32b7cd51"
      "message": "Installing tendrl-gluster-integration on Node 6681d694-76d4-4fd7-b025-f2d9e10825e4"
      "message": "Installing tendrl-gluster-integration on Node 5304d61c-8e64-4153-9e6e-200e4bca8284"
      "message": "Installing tendrl-gluster-integration on Node ca9f277e-fe37-4cdf-a020-09069c79aa83"
      "message": "Generating configuration for tendrl-gluster-integration on Node ca9f277e-fe37-4cdf-a020-09069c79aa83"
      "message": "Generating configuration for tendrl-gluster-integration on Node 5304d61c-8e64-4153-9e6e-200e4bca8284"
      "message": "Running tendrl-gluster-integration on Node 5304d61c-8e64-4153-9e6e-200e4bca8284"
      "message": "Generating configuration for tendrl-gluster-integration on Node 176a2ed6-4798-47c2-afbe-c66a32b7cd51"
      "message": "Running tendrl-gluster-integration on Node 176a2ed6-4798-47c2-afbe-c66a32b7cd51"
      "message": "Generating configuration for tendrl-gluster-integration on Node 6681d694-76d4-4fd7-b025-f2d9e10825e4"
      "message": "Running tendrl-gluster-integration on Node 6681d694-76d4-4fd7-b025-f2d9e10825e4"

All errors reported by Tendrl (only 2 in this case) are merely consequences of the installation failure:

$ grep -i error import_1_fail_single.pretty.json
      "message": "Failure in Job d9385c31-32b0-4be0-88b5-ef67ca2b833e Flow tendrl.flows.ImportCluster with error:\nTraceback (most recent call last):\n  File \"/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py\", line 212, in process_job\n    the_flow.run()\n  File \"/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py\", line 81, in run\n    raise ex\nIOError: [Errno 2] No such file or directory: '/etc/tendrl/gluster-integration/gluster-integration_logging.yaml'\n"
    "priority": "error",
      "message": "Failure in Job b2b8bc97-0dfc-4bac-9f46-8b53153e3c9b Flow tendrl.flows.ImportCluster with error:\nTraceback (most recent call last):\n  File \"/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py\", line 221, in process_job\n    raise FlowExecutionFailedError(_msg)\nFlowExecutionFailedError: Cannot mark job as 'finished', current job status invalid\n"
    "priority": "error",

Expected Results

Tendrl should report an error stating that installation of the package failed, with details explaining why (based on the yum error) if possible.
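
To illustrate the kind of reporting meant here, a generic Python sketch follows. It is not Tendrl's code and does not reflect how Tendrl installs packages. The function name and message format are hypothetical. It only shows the idea of checking the exit status of a tool and passing its own error output along.

```python
import subprocess


def run_and_report(command, description):
    """Run a command and return None on success, or an error message
    that includes the tool's own output on failure."""
    result = subprocess.run(command, capture_output=True, text=True)
    if result.returncode == 0:
        return None
    # Prefer stderr, fall back to stdout if the tool wrote nothing there.
    details = (result.stderr or result.stdout).strip()
    return f"{description} failed (exit code {result.returncode}): {details}"
```

With something like `run_and_report(["yum", "-y", "install", "tendrl-gluster-integration"], "Installing tendrl-gluster-integration")`, the yum error text would end up in the returned message instead of being lost.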

Additional Details

This problem was discussed at a daily meeting of Tendrl developers and is reported here with a full reproducer, as agreed with @r0h4n.

shtripat pushed a commit to shtripat/bridge_common that referenced this issue Oct 17, 2017
While installation of gluster-integration, modified the logic to
check for any errors reported by ansible and populate it back to
the job status.

tendrl-bug-id: Tendrl/node-agent#627
Signed-off-by: Shubhendu <[email protected]>
@shtripat
Member

@mbukatov can you please verify this and mark it as closed?

@mbukatov
Contributor Author

Using snapshot builds:

# rpm -qa | grep tendrl | sort
tendrl-api-1.5.3-20171013T082716.a2f3b3f.noarch
tendrl-api-httpd-1.5.3-20171013T082716.a2f3b3f.noarch
tendrl-commons-1.5.3-20171026T083412.fb4b67f.noarch
tendrl-grafana-plugins-1.5.3-20171026T085652.a850a43.noarch
tendrl-grafana-selinux-1.5.3-20171013T090621.ffb1b7f.noarch
tendrl-monitoring-integration-1.5.3-20171026T085652.a850a43.noarch
tendrl-node-agent-1.5.3-20171026T102825.8999d5d.noarch
tendrl-notifier-1.5.3-20171011T200310.3c01717.noarch
tendrl-selinux-1.5.3-20171013T090621.ffb1b7f.noarch
tendrl-ui-1.5.3-20171027T071719.c55fbe7.noarch

I reproduced the issue and see that the problem is detected and reported correctly:

screenshot_20171030_173833

For comparison, the same grep as used in the original report:

$ grep -i error verify_import_1_fail_single.pretty.json
      "message": "Could not install tendrl-gluster-integration on Node feed6426-0474-4f58-92b4-2c178b9256c3Error: https://copr-be.cloud.fedoraproject.org/results/tendrl/tendrl/epel-7-x86_64-noo/repodata/repomd.xml: [Errno 14] HTTPS Error 404 - Not Found\nTrying other mirror.\nTo address this issue please refer to the below knowledge base article \n\nhttps://access.redhat.com/articles/1320623\n\nIf above article doesn't help to resolve this issue please create a bug on https://bugs.centos.org/\n\nError: Nothing to do\n"
    "priority": "error",
    "priority": "error",
      "message": "Failure in Job 0b95d368-4842-43c4-a85f-7de72fcd9163 Flow tendrl.flows.ImportCluster with error:\nTraceback (most recent call last):\n  File \"/usr/lib/python2.7/site-packages/tendrl/commons/jobs/__init__.py\", line 218, in process_job\n    the_flow.run()\n  File \"/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/__init__.py\", line 84, in run\n    raise ex\nAtomExecutionFailedError: Atom Execution failed. Error: Error executing atom: tendrl.objects.Cluster.atoms.ImportCluster on flow: Import existing Gluster Cluster\n"
    "priority": "error",

And the full message as returned by the API:

  {
    "publisher": "node_agent",
    "job_id": "0b95d368-4842-43c4-a85f-7de72fcd9163",
    "timestamp": "2017-10-30T16:31:56.048181+00:00",
    "caller": {
      "function": "import_gluster",
      "line_no": 71,
      "filename": "/usr/lib/python2.7/site-packages/tendrl/commons/flows/import_cluster/gluster_help.py"
    },
    "payload": {
      "message": "Could not install tendrl-gluster-integration on Node feed6426-0474-4f58-92b4-2c178b9256c3Error: https://copr-be.cloud.fedoraproject.org/results/tendrl/tendrl/epel-7-x86_64-noo/repodata/repomd.xml: [Errno 14] HTTPS Error 404 - Not Found\nTrying other mirror.\nTo address this issue please refer to the below knowledge base article \n\nhttps://access.redhat.com/articles/1320623\n\nIf above article doesn't help to resolve this issue please create a bug on https://bugs.centos.org/\n\nError: Nothing to do\n"
    },
    "priority": "error",
    "parent_id": null,
    "node_id": "feed6426-0474-4f58-92b4-2c178b9256c3",
    "cluster_id": null,
    "flow_id": "2f94a48a-05d7-408c-b400-e27827f4edef",
    "message_id": "52bdaee3-c2ae-4b33-ae00-35a552c88e16"
  },
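
The `grep -i error` check can also be done on the parsed JSON. This is a minimal Python sketch. It assumes the messages file is a JSON array of objects shaped like the one above, with a `priority` field and the text under `payload.message`. The function names are hypothetical.

```python
import json


def error_messages(messages):
    """Return the payload text of every message whose priority is "error"."""
    return [
        entry.get("payload", {}).get("message", "")
        for entry in messages
        if entry.get("priority") == "error"
    ]


def load_error_messages(path):
    """Load a saved messages file and return its error texts."""
    with open(path) as handle:
        return error_messages(json.load(handle))
```

Unlike grep, this matches on the `priority` field only, so informational messages that merely contain the word "error" are not picked up.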

The only problem I notice with the error message is that the node on which the installation failed is referenced by its ID feed6426-0474-4f58-92b4-2c178b9256c3, which is not immediately readable for an admin reading the error.

@mbukatov
Contributor Author

@shtripat How could I translate the node ID used in the message to a hostname? Is it described somewhere?

@mbukatov
Contributor Author

Also, after the error, I see that the cluster looks fine in the cluster list of the Tendrl UI:

screenshot_20171030_175912

But some values related to the affected node are missing, such as the "Brick Status" widget on the Host Dashboard:

screenshot_20171030_180537

Is this ok?

I have an additional question: what should I do when an error like this happens during cluster import, assuming I'm able to find the root cause and fix it, as in this case?

@Tendrl/tendrl-qe @Tendrl/tendrl-core

@shtripat
Member

> @shtripat How could I translate the node ID used in the message to a hostname? Is it described somewhere?

Showing the node's FQDN instead of the node ID in this message would require code changes. For the time being, you can look up nodes/{node-id}/NodeContext/fqdn to find the node on which the installation failed.
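
As a rough illustration of that correlation, here is a Python sketch that finds node IDs in a message and annotates them with their FQDN. Only the key layout nodes/{node-id}/NodeContext/fqdn comes from this thread. The `lookup` callable stands in for however you read that key and is an assumption, as are the function names.

```python
import re

# Matches the lowercase UUIDs used as node IDs in the job messages.
UUID_RE = re.compile(r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}")


def fqdn_key(node_id):
    """Key under which the node's FQDN is stored."""
    return f"nodes/{node_id}/NodeContext/fqdn"


def annotate_node_ids(message, lookup):
    """Append the FQDN after every UUID for which lookup(key) returns a value.

    UUIDs that are not known node IDs (job IDs, for example) are left as-is.
    """
    def replace(match):
        node_id = match.group(0)
        fqdn = lookup(fqdn_key(node_id))
        return f"{node_id} ({fqdn})" if fqdn else node_id

    return UUID_RE.sub(replace, message)
```

For a quick check, `lookup` can simply be the `get` method of a dict that maps keys to hostnames.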

I will send a separate PR for the same anyway.

Regarding "Also, after the error, I see that the cluster looks fine in cluster list of tendrl ui": as long as the provisioner node and a few other nodes got gluster integration installed, cluster details are pushed to the central store and are visible in the UI (with a few details missing for the node on which the installation failed). But the import is correctly marked as failed.

@brainfunked @r0h4n @nthomas-redhat I remember we talked about tagging the cluster with an error if the import failed and showing that in the UI. Comments?

@shtripat
Member

@mbukatov I sent Tendrl/commons#768 to show the node's FQDN in log messages instead of node IDs.

@r0h4n
Contributor

r0h4n commented Nov 2, 2017

Fixed: Tendrl/commons@33ac94f

@mbukatov
Contributor Author

mbukatov commented Nov 2, 2017

Using snapshot builds:

[root@mbukatov-usm1-server ~]# rpm -qa | grep tendrl | sort
tendrl-api-1.5.3-20171102T102141.e899dff.noarch
tendrl-api-httpd-1.5.3-20171102T102141.e899dff.noarch
tendrl-commons-1.5.3-20171101T103313.c987736.noarch
tendrl-grafana-plugins-1.5.3-20171101T130858.f752f23.noarch
tendrl-grafana-selinux-1.5.3-20171013T090621.ffb1b7f.noarch
tendrl-monitoring-integration-1.5.3-20171101T130858.f752f23.noarch
tendrl-node-agent-1.5.3-20171101T112542.0d676e6.noarch
tendrl-notifier-1.5.3-20171030T164233.702f1a5.noarch
tendrl-selinux-1.5.3-20171013T090621.ffb1b7f.noarch
tendrl-ui-1.5.3-20171102T121438.a0b889b.noarch

I reproduced the issue and see that the problem is detected and reported correctly, with the full hostname of the affected node:

error
Could not install tendrl-gluster-integration on Node mbukatov-usm1-gl3.example.comError: https://copr-be.cloud.fedoraproject.org/results/tendrl/tendrl/epel-7-x86_64-reproducing-issue/repodata/repomd.xml: [Errno 14] HTTPS Error 404 - Not Found Trying other mirror. To address this issue please refer to the below knowledge base article https://access.redhat.com/articles/1320623 If above article doesn't help to resolve this issue please create a bug on https://bugs.centos.org/ Error: Nothing to do
02 Nov 2017 01:57:47 

So I consider this verified.

@mbukatov
Contributor Author

mbukatov commented Nov 2, 2017

Fixed in current master branch, as noted in #627 (comment)
