Reflect availability and health of ODP service #789

forman · 2018-10-01T09:59:48Z

Expected behavior

Cate Desktop should "know" if the CCI ODP service is available and should clearly display its health status to users.

Actual behavior

Users receive error messages when downloading and accessing data (usually connection time-out errors). To users it appears as if Cate was not working correctly.

Steps to reproduce the problem

Download or access ODP data sources while ODP services are down.

Specifications

Cate 1.0 - 2.0.dev20

forman · 2018-10-01T11:12:48Z

Added label "external" because resolution requires new ODP web service.

forman · 2018-10-16T14:14:18Z

Here is some example JSON that could be returned by the health care service:

{
  "services": {
    "CSW": {
      "status": "OK"
    },
    "WCS": {
      "status": "OK"
    },
    "ESGF": {
      "status": "OK"
    },
    "OPENDAP": {
      "status": "SLOW",
      "reason": "..."
    },
    "HTTP": {
      "status": "OK"
    },
    "FTP": {
      "status": "DOWN",
      "reason": "..."
    }
  },
  "announcements": [
    {
      "published": "2018-12-06T10:20:13",
      "status": "DOWNTIME",
      "services":  ["CSW"],
      "period":  ["2019-01-01", "2019-01-03"],
      "title": "Catalogue Service Downtime",
      "description": "The ODP CSW will be down from 2019-01-01 to 2019-01-03 for maintenance reasons."
    },
    {
      "published": "2018-11-23T14:06:31",
      "period":  ["2019-02-10", "2019-02-12"],
      "services":  ["OPENDAP", "CSW", "WCS", "ESGF"],
      "status": "LOWBANDWIDTH",
      "title": "Service Migration",
      "description": "All ODP services will be moved to new infrastructure. From 2019-02-10 to 2019-02-12 you may observe low bandwidth."
    }
  ]
}
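
A client could reduce the "services" part of such a response to one overall indicator plus the list of affected services. The following is only a minimal sketch under the response format proposed above; the function name `overall_status` and the severity ordering OK < SLOW < DOWN are illustrative assumptions, not part of any existing ODP or Cate API.

```python
import json

# Assumed severity ranking for the status values used in the example JSON.
SEVERITY = {"OK": 0, "SLOW": 1, "DOWN": 2}


def overall_status(response_text):
    """Return (worst_status, affected) for a health response.

    `affected` maps each non-OK service name to its reported reason
    (or None if no reason was given).
    """
    services = json.loads(response_text).get("services", {})
    worst = "OK"
    affected = {}
    for name, info in services.items():
        status = info.get("status")
        if status not in SEVERITY:
            status = "DOWN"  # treat a missing or unknown status as DOWN
        if status != "OK":
            affected[name] = info.get("reason")
        if SEVERITY[status] > SEVERITY[worst]:
            worst = status
    return worst, affected


sample = """
{"services": {"CSW": {"status": "OK"},
              "OPENDAP": {"status": "SLOW", "reason": "..."},
              "FTP": {"status": "DOWN", "reason": "..."}}}
"""
worst, affected = overall_status(sample)
print(worst, sorted(affected))
```

Treating an unknown status as "DOWN" is a deliberately pessimistic choice for this sketch; a real client might prefer a separate "UNKNOWN" state.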

cpaulcox · 2018-10-17T07:38:00Z

Is the services section meant to be populated as a result of polling the origin servers? If so, then:

Why not poll directly? It will be simpler and less reliant on the availability of the "health care service". Making this intermediary service highly available will incur further cost and complexity - "who monitors the monitor".
Polling alone will not be able to determine a reason for being "DOWN". If this were updated manually, it would be unlikely to be done in a timely fashion, because reasons for failure are often only known after detailed root cause analysis once the systems are back online. In general I'd also view this as bad security practice, as it could result in information leakage that would aid potential attackers.
Polling is unlikely to be able to identify likely connection timeouts unless the target is completely offline, due to differences in network paths and inherent latency jitter on the Internet, and because polling requests are typically simple HEAD requests for a static page.
Slowness will also be hard to determine reliably, again due to latency issues, which are often route-specific and highly transient.
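
To make the direct-polling option concrete, here is a minimal sketch of a single HEAD-request probe using only the Python standard library. The names `http_head` and `poll`, the thresholds, and the three-way OK/SLOW/DOWN classification are assumptions for illustration. As the points above note, a result like this reflects only the network path from the probing machine and cannot supply a reason for a failure.

```python
import time
import urllib.request


def http_head(url, timeout):
    """Send a HEAD request; typical failures (URLError, timeouts) raise OSError."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout):
        pass


def poll(url, probe=http_head, timeout=5.0, slow_after=2.0, clock=time.monotonic):
    """Classify one endpoint as "OK", "SLOW" or "DOWN" from a single probe.

    `probe` and `clock` are injectable so the logic can be tested offline.
    """
    start = clock()
    try:
        probe(url, timeout)
    except OSError:
        # Covers connection failures, timeouts and HTTP error responses.
        return "DOWN"
    elapsed = clock() - start
    return "SLOW" if elapsed > slow_after else "OK"
```

A single slow probe says little given the latency jitter described above, so a real monitor would likely aggregate several probes before reporting "SLOW".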

forman · 2018-10-18T11:03:24Z

Our aim is to use some RESTful meta-service API that we can use from the CCI Toolbox. Again, we don't care about how this will be implemented on the server side. Timeouts on the clients may have various reasons - we want to know what the status is on the server side.

forman · 2018-10-18T11:06:28Z

For example we just received a mail from Alison saying

Just to let you know that there was an issue with the ESGF update that we deployed yesterday, and to fix it, the OPeNDAP (and other ESGF access e.g. HTTP, WMS) will need to be taken offline this afternoon. I’ll let you know as soon as it’s back up and running, but it may be down all afternoon unfortunately. The portal front end and anonymous ftp download should be unaffected.

This is the stuff that we would like to pass over to our users in advance.
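
If such notices arrived as announcement entries in the JSON format proposed earlier in this thread, a client could select those that are still relevant before a user starts a download. This is a rough sketch only; the function name `upcoming_announcements` is hypothetical, and the entries are assumed to carry a `period` of two ISO dates and a `services` list as in that example.

```python
from datetime import date


def upcoming_announcements(announcements, today, service=None):
    """Return announcements whose period has not ended yet, soonest first.

    If `service` is given, keep only announcements that list it.
    """
    selected = []
    for item in announcements:
        start, end = (date.fromisoformat(d) for d in item["period"])
        if end < today:
            continue  # the announced period is already over
        if service is not None and service not in item.get("services", []):
            continue
        selected.append((start, item))
    selected.sort(key=lambda pair: pair[0])
    return [item for _, item in selected]


announcements = [
    {"status": "LOWBANDWIDTH", "services": ["OPENDAP", "CSW", "WCS", "ESGF"],
     "period": ["2019-02-10", "2019-02-12"], "title": "Service Migration"},
    {"status": "DOWNTIME", "services": ["CSW"],
     "period": ["2019-01-01", "2019-01-03"], "title": "Catalogue Service Downtime"},
]
for item in upcoming_announcements(announcements, date(2018, 12, 20), service="CSW"):
    print(item["period"][0], item["title"])
```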

cpaulcox · 2018-10-18T11:35:40Z

I still don't understand how you expect the services section to be updated? If you want to know the status on the server side it suggests a manual update, which as I've mentioned before won't be workable for unscheduled outages, or some integration on-site with the opendap servers.

forman · 2018-10-18T15:03:58Z

> I still don't understand how you expect the services section to be updated?

I don't know. I expect some experts will find a solution.

E.g. using https://www.nagios.org/

JanisGailis · 2018-10-19T11:02:15Z

Just to chime in on this discussion a bit. Here are a few examples of how widely used and known services convey status information to their users:

http://status.gandi.net/timeline
https://status.twitterstat.us/#
https://status.status.io/

How exactly the status of a particular system of a particular service is determined and updated is of course specific to each system. From the users' perspective, however, a trusted, machine-readable channel is provided.

forman · 2018-10-19T14:13:24Z

Thanks @JanisGailis !

cpaulcox · 2018-10-22T08:24:57Z

Interesting. The gandi.net one illustrates a couple of points that I'm trying to make above.

The incidents are in the past. I would have expected live incident reporting to be more useful.
Their reporting is leaking information about their services and tech stack. This feed should be secured IMHO.

forman · 2018-11-27T09:28:57Z

I'm going to address this now by separating network errors from others, so the GUI can show a different error dialog.
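
One way to separate network errors from others is to check the exception, and any exception it was raised from, against a set of connectivity-related types. The sketch below is an assumption about how this could be done with the Python standard library, not a description of what Cate actually does; `NETWORK_ERRORS`, `is_network_error` and `dialog_title` are hypothetical names.

```python
import socket
import urllib.error

# Exception types treated as network problems in this sketch (an assumption).
NETWORK_ERRORS = (urllib.error.URLError, socket.timeout, ConnectionError, TimeoutError)


def is_network_error(exc):
    """True if `exc`, or any exception it was raised from, is a network error."""
    seen = set()
    while exc is not None and id(exc) not in seen:
        if isinstance(exc, NETWORK_ERRORS):
            return True
        seen.add(id(exc))
        # Follow explicit ("raise ... from") and implicit exception chaining.
        exc = exc.__cause__ or exc.__context__
    return False


def dialog_title(exc):
    """Pick a dialog title so the GUI can present network failures differently."""
    return "Network problem" if is_network_error(exc) else "Error"
```

Walking the exception chain matters because data access libraries often wrap a low-level timeout in a more generic error before it reaches the GUI.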

forman · 2018-11-29T14:46:56Z

Now showing the following error dialogs:

forman added gui ds feature_request ux enhancement external labels Oct 1, 2018

forman added the in_progress label Nov 27, 2018

forman removed the in_progress label Nov 29, 2018
